dask
Here are 152 public repositories matching this topic...
MCVE Code Sample
# Your code here
import numpy as np
import xarray as xr
data = np.zeros((10, 4))
example_xr = xr.DataArray(data, coords=[range(10), list("abcd")], dims=["x", "y"])  # second coordinate and dims completed for illustration; the original snippet was truncated here

User @codyschank had noticed that for small datasets, stumpy.stomp._stomp is faster than stumpy.stump. Here are some very rough timing calculations from my 2-core laptop:
length stomp stump stomp/stump stump/stomp
0 128 0.006628 0.018066 0.366867 2.725782
1 256 0.
Hi @jmcarpenter2,
Dear Swifter Folks,
Recently, I found that swifter is 5-10x slower than vanilla pandas apply in cases where the operation is not vectorized (in my case, text preprocessing).
The experiment is like this:
import pandas as pd
import swifter
def clean_text(text):
    text = text.strip()
    text = text.replace(' ', '_')
    return text

In some workloads with highly compressible data, we would like to automatically trade some computation time for more in-memory storage. Dask workers store data in a MutableMapping (the abstract superclass of dict). So in principle, all we would need to do is make a MutableMapping subclass that overrides the __getitem__ and __setitem__ methods to compress and decompress data on demand.
This would be an i
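The idea above can be sketched with the standard library alone. This is a minimal illustration, not dask's actual worker storage; the class name, and the choice of zlib plus pickle for the codec, are made up for the example:

```python
import pickle
import zlib
from collections.abc import MutableMapping


class CompressedStore(MutableMapping):
    """Dict-like store that compresses values on __setitem__ and
    decompresses them on __getitem__ (illustrative sketch only)."""

    def __init__(self):
        self._data = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        self._data[key] = zlib.compress(pickle.dumps(value))

    def __getitem__(self, key):
        return pickle.loads(zlib.decompress(self._data[key]))

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = CompressedStore()
store["x"] = [0] * 10_000  # highly compressible value
```

Because MutableMapping supplies the rest of the dict interface (`get`, `items`, `update`, ...) from these five methods, the caller sees an ordinary mapping while values live in memory compressed.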
Hello. I am trying to migrate my project from basic logging to something more advanced and someone recommended this module through reddit. I have been through the quick-start guide and other available documentation and have some very basic questions about the API.
How can I parse the logs and format them for stdout?
Is there a way to stream what's being written to the log, just like the
Hello,
I haven't tested append() yet, and I was wondering whether duplicates are removed when an append is performed.
I had a look at the collection.py script, and the following pandas code is used:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
After a look at the pandas documentation, I understand that duplicate lines are removed and only the last occurrence is kept.
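As a sanity check of that reading, here is a minimal pandas-only sketch of the same pattern (dd.concat follows pandas semantics here; the frames and column names are made up):

```python
import pandas as pd

# "current" data plus an append that repeats one row verbatim
current = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
new = pd.DataFrame({"id": [2, 3], "val": [20, 30]})

# Rows identical across *all* columns are dropped, keeping the last occurrence
combined = pd.concat([current, new]).drop_duplicates(keep="last")
```

Note that drop_duplicates compares whole rows: if the same id reappeared with a different val, both rows would survive; deduplicating on the key alone would need drop_duplicates(subset=["id"], keep="last").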
I'm using nearest_s2d in combination with the add_matrix_NaNs 'hack' documented in #15, but I'm getting 'smearing' of the nearest value all the way to the border with nearest_s2d, when I would instead expect it to behave like bilinear, with the 'outside' values missing.
Is this the intended behaviour? Can I work around this somehow by masking the output again?
This was marked skipped as part of #307.
Now ._call is a static method of Job so the test should be moved to test_job.py and adapted.
Now that airspeed-velocity/asv#449 is fixed, we could do a proper test of our benchmarks during CI.
For the CLI, the current default log level has dask_ec2 set to DEBUG and paramiko set to WARNING. While keeping the default as is, the addition of the following logging options would be helpful:
--quiet, -q: both dask_ec2 and paramiko have a log level of WARNING
--verbose, -v: both dask_ec2 and paramiko have a log level of DEBUG
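The two flags could be wired up roughly like this. The flag names come from the request above; everything else (parser setup, function name) is illustrative, not dask_ec2's actual CLI code:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--quiet", "-q", action="store_true",
                   help="set dask_ec2 and paramiko to WARNING")
group.add_argument("--verbose", "-v", action="store_true",
                   help="set dask_ec2 and paramiko to DEBUG")


def configure_logging(args):
    # No flag: keep the current defaults (dask_ec2 at DEBUG, paramiko at WARNING)
    if args.quiet:
        level = logging.WARNING
    elif args.verbose:
        level = logging.DEBUG
    else:
        return
    for name in ("dask_ec2", "paramiko"):
        logging.getLogger(name).setLevel(level)


configure_logging(parser.parse_args(["--quiet"]))
```

Making the pair mutually exclusive means argparse itself rejects `-q -v`, so the log-level logic never sees a contradictory combination.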
For large datasets where computing the summary may be expensive, it would be useful to compute only part of it, be able to explore it, and then compute other parts of it without recomputing the initial report.
The selection of which parts to compute could be by:
- columns in the dataset,
- metrics, or
- row ranges.
- Identify all of the parts, create an outline
- Branch/PR just for part 1
  - pair programming review w/ @kmpaul
  - get xdev team to read/add missing content/create PR
  - repeat
- Create Nikola md page for part 1
- Reach out to beta testers of part 1
- Post and announcement
- Repeat for part 2
References:
- #105 -- for 0 to 30 tutori
I think some convenience functions to download recent datasets like SubX would be nice.
(We could include CMIP6 in the same way.)
Over in ECCO-GROUP/ECCOv4-py#6, @ifenty reported some difficulty in using open_mdsdataset to read data from ECCO. Some of this is likely due to our lousy error messages (see #126), but it's also likely related to overall deficiencies in our documentation.
This is what I wrote in that thread.
I think a big source of confusion is that the user-facing part of xmitgcm is designed not to read j
Nitpicking the docs
Daskos needs a logo
I'd prefer a logo in the style of the Mesos family, like here http://aurora.apache.org/ and here http://mesos.apache.org/.
Same triangular tiling, but with a capital D colorized like Dask's logo https://github.com/dask
- Replace HLG monkey patching with dask's ensure_dict
- Generation of API documentation fails
- Delayed example in the README uses incorrect keyword args

- Currently the layout is not quite right (gitter chat shows up in the middle of the README).
- It does clearly mention Kitware tools.
- Flow is not quite right, as it is too detailed in a few places.


If you join a Dask DataFrame on a categorical column, the resulting Dask DataFrame column is still category dtype. However, the moment you .compute() the resulting Dask DataFrame, the column has the wrong dtype, not categorical. Tested on Dask 2.14.0 and Pandas 1.0.3.
In this example the category type looks like a float, so after .compute(), the dtype is float.
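For reference, this is the pandas behavior the computed Dask result would be expected to match: merging on keys that share the same CategoricalDtype preserves the category dtype. A minimal pandas-only sketch (no Dask; the frames are made up for illustration):

```python
import pandas as pd

# Both key columns share one CategoricalDtype whose categories look like floats
cat = pd.CategoricalDtype(categories=[1.0, 2.0])
left = pd.DataFrame({"key": pd.Series([1.0, 2.0], dtype=cat), "a": [10, 20]})
right = pd.DataFrame({"key": pd.Series([1.0, 2.0], dtype=cat), "b": [30, 40]})

# In pandas, the merge key stays categorical rather than decaying to float
merged = left.merge(right, on="key")
```

The reported bug is that the Dask equivalent keeps the category dtype lazily but loses it on .compute(), decaying to float here.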