dask
Here are 152 public repositories matching this topic...
MCVE Code Sample
# Your code here
import numpy as np
import xarray as xr
data = np.zeros((10, 4))
example_xr = xr.DataArray(data, coords=[range(10), list("abcd")], dims=["x", "y"])  # second coordinate and dims completed for illustration; the original snippet was truncated here

User @codyschank had noticed that for small datasets, stumpy.stomp._stomp is faster than stumpy.stump. Here are some very rough timing calculations from my 2-core laptop:
length stomp stump stomp/stump stump/stomp
0 128 0.006628 0.018066 0.366867 2.725782
1 256 0.
Hi @jmcarpenter2,
Dear Swifter Folks,
Recently, I found that swifter is 5-10x slower than vanilla pandas apply in cases where the operation is not vectorized (in my case, text preprocessing).
The experiment is like this:
import pandas as pd
import swifter
def clean_text(text):
    text = text.strip()
    text = text.replace(' ', '_')
    return text

In some workloads with highly compressible data, we would like to automatically trade some computation time for more in-memory storage. Dask workers store data in a MutableMapping (the abstract superclass of dict). So in principle, all we would need to do is make a MutableMapping subclass that overrides the __getitem__ and __setitem__ methods to compress and decompress data on demand.
This would be an i
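The idea above can be sketched with the standard library alone. This is a minimal illustration, not dask's actual worker storage; the class name, and the choice of zlib plus pickle for the codec, are made up for the example:

```python
import pickle
import zlib
from collections.abc import MutableMapping


class CompressedStore(MutableMapping):
    """Dict-like store that compresses values on __setitem__ and
    decompresses them on __getitem__ (illustrative sketch only)."""

    def __init__(self):
        self._data = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        self._data[key] = zlib.compress(pickle.dumps(value))

    def __getitem__(self, key):
        return pickle.loads(zlib.decompress(self._data[key]))

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = CompressedStore()
store["x"] = [0] * 10_000  # highly compressible value
```

Because MutableMapping supplies the rest of the dict interface (`get`, `items`, `update`, ...) from these five methods, the caller sees an ordinary mapping while values live in memory compressed.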
Hello. I am trying to migrate my project from basic logging to something more advanced and someone recommended this module through reddit. I have been through the quick-start guide and other available documentation and have some very basic questions about the API.
How can I parse the logs and format them for stdout?
Is there a way to stream what's being written to the log, just like the
Hello,
I haven't tested append() yet, and I was wondering whether duplicates are removed when an append is performed.
I had a look at the collection.py script, and the following pandas code is used:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
After a look at the pandas documentation, I understand that duplicate lines are removed and only the last occurrence is kept.
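As a sanity check of that reading, here is a minimal pandas-only sketch of the same pattern (dd.concat follows pandas semantics here; the frames and column names are made up):

```python
import pandas as pd

# "current" data plus an append that repeats one row verbatim
current = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
new = pd.DataFrame({"id": [2, 3], "val": [20, 30]})

# Rows identical across *all* columns are dropped, keeping the last occurrence
combined = pd.concat([current, new]).drop_duplicates(keep="last")
```

Note that drop_duplicates compares whole rows: if the same id reappeared with a different val, both rows would survive; deduplicating on the key alone would need drop_duplicates(subset=["id"], keep="last").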
I'm using nearest_s2d in combination with the add_matrix_NaNs 'hack' documented in #15, but I'm getting 'smearing' of the nearest value all the way to the border with nearest_s2d, when I would instead expect it to behave like bilinear, with the 'outside' values missing.
Is this the intended behaviour? Can I work around this somehow by masking the output again?
This was marked skipped as part of #307.
Now ._call is a static method of Job so the test should be moved to test_job.py and adapted.
Now that airspeed-velocity/asv#449 is fixed, we could do a proper test of our benchmarks during CI.
For the CLI, the current default log level has dask_ec2 set to DEBUG and paramiko set to WARNING. While keeping the default as is, the addition of the following logging options would be helpful:
--quiet, -q: both dask_ec2 and paramiko have a log level of WARNING
--verbose, -v: both dask_ec2 and paramiko have a log level of DEBUG
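The two flags could be wired up roughly like this. The flag names come from the request above; everything else (parser setup, function name) is illustrative, not dask_ec2's actual CLI code:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--quiet", "-q", action="store_true",
                   help="set dask_ec2 and paramiko to WARNING")
group.add_argument("--verbose", "-v", action="store_true",
                   help="set dask_ec2 and paramiko to DEBUG")


def configure_logging(args):
    # No flag: keep the current defaults (dask_ec2 at DEBUG, paramiko at WARNING)
    if args.quiet:
        level = logging.WARNING
    elif args.verbose:
        level = logging.DEBUG
    else:
        return
    for name in ("dask_ec2", "paramiko"):
        logging.getLogger(name).setLevel(level)


configure_logging(parser.parse_args(["--quiet"]))
```

Making the pair mutually exclusive means argparse itself rejects `-q -v`, so the log-level logic never sees a contradictory combination.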
For large datasets where computing the summary may be expensive, it would be useful to compute only part of it, be able to explore it, and then compute other parts of it without recomputing the initial report.
The selection of which parts to compute could be by:
- columns in the dataset,
- metrics, or
- row ranges.
- Identify all of the parts, create an outline
- Branch/PR just for part 1
  - pair programming review w/ @kmpaul
  - get xdev team to read/add missing content/create PR
  - repeat
- Create Nikola md page for part 1
- Reach out to beta testers of part 1
- Post and announcement
- Repeat for part 2
References:
- #105 -- for 0 to 30 tutori
I think some convenience functions to download recent datasets like SubX would be nice.
(We could include CMIP6 in the same way.)
Over in ECCO-GROUP/ECCOv4-py#6, @ifenty reported some difficulty in using open_mdsdataset to read data from ECCO. Some of this is likely due to our lousy error messages (see #126), but it's also likely related to overall deficiencies in our documentation.
This is what I wrote in that thread.
I think a big source of confusion is that the user-facing part of xmitgcm is designed not to read j
Nitpicking the docs
Daskos needs a logo
I'd prefer a logo in the style of the Mesos family, like here http://aurora.apache.org/ and here http://mesos.apache.org/.
Same triangular tiling, but with a capital D colorized like Dask's logo https://github.com/dask
- Replace HLG monkey patching with dask's ensure_dict
- Generation of API documentation fails
- Delayed example in the README uses incorrect keyword args

- Currently the layout is not quite right (gitter chat shows up in the middle of the README).
- It does clearly mention Kitware tools.
- Flow is not quite right, as it is too detailed in a few places.


If you join a Dask DataFrame on a categorical column, the resulting Dask DataFrame column is still category dtype. However, the moment you .compute() the resulting Dask DataFrame, the column has the wrong dtype, not categorical. Tested on Dask 2.14.0 and Pandas 1.0.3.
In this example the category type looks like a float, so after .compute(), the dtype is float.
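For reference, this is the pandas behavior the computed Dask result would be expected to match: merging on keys that share the same CategoricalDtype preserves the category dtype. A minimal pandas-only sketch (no Dask; the frames are made up for illustration):

```python
import pandas as pd

# Both key columns share one CategoricalDtype whose categories look like floats
cat = pd.CategoricalDtype(categories=[1.0, 2.0])
left = pd.DataFrame({"key": pd.Series([1.0, 2.0], dtype=cat), "a": [10, 20]})
right = pd.DataFrame({"key": pd.Series([1.0, 2.0], dtype=cat), "b": [30, 40]})

# In pandas, the merge key stays categorical rather than decaying to float
merged = left.merge(right, on="key")
```

The reported bug is that the Dask equivalent keeps the category dtype lazily but loses it on .compute(), decaying to float here.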