
I have the following problem. I have two dataframes: one filled with 0's, whose column names are the attributes (which I know from a separate text file), and one whose values are those attribute names, with NaN's padding each row. Now I want to set 1's in the dataframe of 0's wherever a row of the second dataframe contains the attribute.

[Screenshots: the second data frame (rows of attribute codes padded with NaN's), the first data frame (all zeros, one column per attribute), and the desired result (the first data frame with 1's set where a row's codes match the columns).]
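For illustration, a miniature (hypothetical) version of the setup might look like this:

import numpy as np
import pandas as pd

# second dataframe: each row lists the attributes that apply, padded with NaN's
df = pd.DataFrame([['fl', 'nc', np.nan],
                   ['hi', np.nan, np.nan],
                   ['ct', 'fl', 'hi']])

# first dataframe: all zeros, one column per known attribute
codes = ['ct', 'fl', 'hi', 'nc']
Final_Df = pd.DataFrame(0, index=df.index, columns=codes)

# desired result after filling in the 1's:
#    ct  fl  hi  nc
# 0   0   1   0   1
# 1   0   0   1   0
# 2   1   1   1   0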

for index, row in df.iterrows():
    for element in row:
        if pd.isnull(element):
            break
        # row `index` holds the value `element`, so set the column
        # named `element` from 0 to 1 in that row
        Final_Df.at[index, element] = 1

This is the code I am using to achieve that: df is the second dataframe with the NaN values, and Final_Df is the first dataframe with the 0's. Is there a way to achieve this faster, without iterrows, since the dataset is large? Any help will be appreciated, and sorry if the question is badly worded. Thanks in advance!

4 Answers


The idea is to create a dictionary for each row in a list comprehension, pass the result to the DataFrame constructor, replace missing values with 0, and finally use DataFrame.reindex to remove the NaN column, fix the column order, and add columns for codes that never occur, filled with 0:

import numpy as np
import pandas as pd

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

Final_Df = (pd.DataFrame([dict.fromkeys(x, 1) for x in df.to_numpy()])
              .fillna(0)
              .astype(np.int8)
              .reindex(codes, axis=1, fill_value=0))
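For intuition, dict.fromkeys(x, 1) maps every value in a row (including the NaN padding) to 1; the NaN key then becomes a NaN-named column, which the final reindex drops. A minimal sketch on one hypothetical row:

import numpy as np

row = np.array(['fl', 'nc', np.nan], dtype=object)  # hypothetical row from df.to_numpy()
print(dict.fromkeys(row, 1))
# {'fl': 1, 'nc': 1, nan: 1}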

Another solution:

Use get_dummies, take the max per group of duplicate columns (so the values are always 0 or 1), and then use DataFrame.reindex to fix the column order and add columns for codes that never occur, filled with 0:

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

df = (pd.get_dummies(df, prefix='', prefix_sep='')
        .max(axis=1, level=0)
        .reindex(codes, axis=1, fill_value=0))
print(df)
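As an aside, the level argument of DataFrame aggregations such as max is deprecated in newer pandas releases (1.3 and later), so on a recent version the same collapse of duplicate one-hot columns can be written with groupby; a sketch of the equivalent chain:

df = (pd.get_dummies(df, prefix='', prefix_sep='')
        .T.groupby(level=0).max().T       # same effect as .max(axis=1, level=0)
        .reindex(codes, axis=1, fill_value=0))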

1 Comment

Thanks, the second solution helped me a lot and is easier to understand. It also reduced my execution time from 6 seconds to 1.

To test my solution, I used the following DataFrame with a smaller number of codes:

    0   1    2    3    4    5    6    7    8    9
0  fl  nc  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  fl  nc  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  ct  dc   fl   hi   il   ky   la   md   mi   ms
3  ct  dc   fl   il   ky   la   md   mi   ms   nc
4  hi  pr  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5  hi  il   ky   md   mi   ms  NaN  NaN  NaN  NaN
6  hi  il   ky   la   mi   ms  NaN  NaN  NaN  NaN
7  ct  la   md   ms   nc  NaN  NaN  NaN  NaN  NaN
8  dc  md   mi   ms   nc  NaN  NaN  NaN  NaN  NaN
9  dc  md   mi   nc  NaN  NaN  NaN  NaN  NaN  NaN

To create Final_Df I started with a list of codes:

import numpy as np
import pandas as pd

codes = ['ca', 'ct', 'dc', 'fl', 'hi', 'il', 'ky', 'la', 'md', 'mi', 'ms', 'nc', 'pr']

and created Final_Df (full of zeroes) the following way:

Final_Df = pd.DataFrame(0, index=df.index, columns=codes)

I also need a dictionary to translate codes into column numbers, with -1 for NaN (these values will be omitted):

codeToInd = { code: ind for ind, code in enumerate(codes) }
codeToInd[np.nan] = -1

The first step of the actual computation is to translate df into ind, a NumPy array:

ind = np.vectorize(codeToInd.get)(df)

The result is:

array([[ 3, 11, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 3, 11, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 1,  2,  3,  5,  6,  7,  8,  9, 10, 11],
       [ 4, 12, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 4,  5,  6,  8,  9, 10, -1, -1, -1, -1],
       [ 4,  5,  6,  7,  9, 10, -1, -1, -1, -1],
       [ 1,  7,  8, 10, 11, -1, -1, -1, -1, -1],
       [ 2,  8,  9, 10, 11, -1, -1, -1, -1, -1],
       [ 2,  8,  9, 11, -1, -1, -1, -1, -1, -1]])

One more preparatory step is to extract the underlying NumPy array from Final_Df:

finDfVal = Final_Df.values

The actual processing (setting 1's in the proper cells) is then performed with the following loop:

for r, c in np.argwhere(ind >= 0):
    finDfVal[r, ind[r, c]] = 1

After that Final_Df contains:

   ca  ct  dc  fl  hi  il  ky  la  md  mi  ms  nc  pr
0   0   0   0   1   0   0   0   0   0   0   0   1   0
1   0   0   0   1   0   0   0   0   0   0   0   1   0
2   0   1   1   1   1   1   1   1   1   1   1   0   0
3   0   1   1   1   0   1   1   1   1   1   1   1   0
4   0   0   0   0   1   0   0   0   0   0   0   0   1
5   0   0   0   0   1   1   1   0   1   1   1   0   0
6   0   0   0   0   1   1   1   1   0   1   1   0   0
7   0   1   0   0   0   0   0   1   1   0   1   1   0
8   0   0   1   0   0   0   0   0   1   1   1   1   0
9   0   0   1   0   0   0   0   0   1   1   0   1   0
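As a side note, the Python-level loop over np.argwhere can be replaced by a single fancy-indexing assignment, which avoids iterating pair by pair; a sketch assuming the same ind and finDfVal arrays as above:

rows, cols = np.nonzero(ind >= 0)      # coordinates of the valid (non-NaN) codes
finDfVal[rows, ind[rows, cols]] = 1    # set all the 1's in one vectorized step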

Execution speed

Using %timeit, I compared the execution time of my code with yours, and even on this very limited data sample mine ran about 7 times faster.

I think that on a bigger DataFrame the difference should be even greater. Let me know the execution times of your code and mine.

In case of an error

To check what is going on, try to create a DataFrame counterpart of ind, just for display:

df.applymap(lambda x: codeToInd[x])

If everything is OK, a DataFrame with translated codes should be printed.

But in case of any missing value in codeToInd a KeyError exception is raised, showing the missing value. Add this missing value to codes and repeat the whole procedure.
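Note also that two distinct NaN objects do not compare equal (nan != nan), so a NaN that is not the identical np.nan object used as the dictionary key fails the lookup; codeToInd.get then returns None, which makes np.vectorize fail with a TypeError. A more defensive sketch that maps anything unknown (including such stray NaN's) to -1, at the price of silently masking genuinely missing codes:

ind = np.vectorize(lambda x: codeToInd.get(x, -1))(df)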

4 Comments

On the np.vectorize line I get this error: TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
It actually doesn't convert the last column with NaN's into -1, but it does for the rest; I can see that after using applymap.
Start with a test run on my test data. I suppose there is some disorder in your source data, e.g. NaN stored in the DataFrame as a string instead of a "true" NaN (which is actually a special case of float).
Actually, they are true NaN's; I checked and the type shows as float, but I honestly don't know what's going on. I will try your code with your test data and let you know. Thank you ^^

I would suggest using Pandas vectorization. This tutorial is a good starting point:

https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
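For this particular problem, one possible vectorized formulation (a sketch, assuming the df with code values and the codes list from the other answers) builds each 0/1 output column with isin, so no Python-level row loop is needed:

import pandas as pd

# one column per code: does the code appear anywhere in the row?
Final_Df = pd.DataFrame({c: df.isin([c]).any(axis=1).astype('int8')
                         for c in codes})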

1 Comment

I checked this but didn't find a good way to implement any of this.

If it is not important that rows or columns that are all NaN are included, you can also try the following:

  • Melt your dataframe:
# >>> df
#      0    1    2    3    4
# 0  NaN  NaN   ct  NaN  NaN
# 1  NaN  NaN  NaN  NaN  NaN
# 2   ta  NaN  NaN   ga  NaN
# 3  NaN  NaN  NaN  NaN  NaN
# 4  NaN  NaN  NaN  NaN  NaN

import pandas as pd

molten = pd.melt(df.T)

# >>> molten
#     variable value
# 0          0   NaN
# 1          0   NaN
# 2          0    ct
# 3          0   NaN
# 4          0   NaN
# 5          1   NaN 
  • Use pandas.crosstab to tabulate the entries:
tab = pd.crosstab(molten["variable"], molten["value"])

# >>> tab
# value     ct  ga  ta
# variable
# 0          1   0   0
# 2          0   1   1

1 Comment

Thanks for your answer, but I do need the NaN values as 0's.
