Error when applying .map_partitions to a column of a dask dataframe

Question

I recently decided to be more adventurous and explore Dask dataframes. I am trying to apply a function to one column of a dataframe, using the following syntax:

import pandas as pd
import dask.dataframe as dd
import dask.array as da

df_data = pd.DataFrame({'Column 1': [300,300,450,500,500,750,600,300, 150],'Column 2': [100,130,230,200,300,350,600,550,530], 'Column 3': [250, 300, 400, 500, 700,350, 750, 550, 600]})

def TestFunc(x):
    y = x*2 + abs(x/2 - x*3)
    return y

dd_data = dd.from_pandas(df_data, npartitions = 1)
data_test = dd.map_partitions(TestFunc,dd_data['Column 1'])
data_test.compute()

Naturally, this is a simple example that I made up to show what I have been doing. This code works well; the problem is in the real situation I am facing. I have a more complex dataframe where I want to apply a function to one column. The function is the following:

import numpy as np

def GetID(phase):
    nDataPoints = len(phase)
    myRanges = np.deg2rad(np.arange(0,360,6))
    phase[phase>np.deg2rad(354+3)] = 0
    ID = np.array([])
    for i in np.arange(0,nDataPoints):
        val = abs(myRanges-phase[i])
        iID = np.argmin(val)
        ID = np.append(ID, iID+1)
    return ID

I am able to apply the function to the column with .map_partitions. The problem is that when I then call .compute() to see the numerical results, I get the error KeyError: 0. I don't understand why my simpler example works while the real case fails.

I hope I managed to be succinct and precise. I would really appreciate your help on this one! Suggestions on what to look into are also welcome.

MRocklin · Accepted Answer · 2019-11-03 15:30:41Z

I recommend trying your function on a normal Pandas dataframe to verify that it is working correctly:

GetID(df.compute())
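
As a rough sketch of that first check, the snippet below runs a copy of GetID from the question on two small pandas Series. The sample values and the names phase_default and phase_shifted are made up for illustration. One possible cause of KeyError: 0, which is an assumption here and not something the question confirms, is that phase[i] looks values up by index label, so it fails on any Series whose index does not contain the label 0.

```python
import numpy as np
import pandas as pd


def GetID(phase):
    # Copy of the function from the question, unchanged in behaviour.
    nDataPoints = len(phase)
    myRanges = np.deg2rad(np.arange(0, 360, 6))
    phase[phase > np.deg2rad(354 + 3)] = 0
    ID = np.array([])
    for i in np.arange(0, nDataPoints):
        val = abs(myRanges - phase[i])  # label-based lookup on a Series
        iID = np.argmin(val)
        ID = np.append(ID, iID + 1)
    return ID


# Hypothetical phase values in radians with the default index 0..2.
phase_default = pd.Series(np.deg2rad([0.0, 90.0, 359.0]))
ids_default = GetID(phase_default.copy())

# Same values, but with an index that does not contain the label 0.
phase_shifted = pd.Series(np.deg2rad([0.0, 90.0, 359.0]), index=[10, 11, 12])
try:
    GetID(phase_shifted.copy())
    error = None
except KeyError as exc:
    error = exc

print(ids_default)
print(repr(error))
```

If the second call fails the same way on your data, the function itself, not Dask, is the place to look.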

If that works, I would next try the single-threaded scheduler, along with the pdb module, to investigate the traceback:

df.map_partitions(GetID).compute(scheduler='single-threaded')

This is easy to do if you are in IPython with the %debug magic.
