2

Suppose I have a 750x750 matrix placed in a DataFrame, say df.

df=

        c1   c2   c3  ... c750
c1      5    2    5   ...   3 
c2      3    1    5   ...   80
c3      4    2    7   ...   10
.       .    .    .   ...   .
.       .    .    .   ...   .
.       .    .    .   ...   .
c750    8    3    5   ...   1

I want to find out the 4 highest-value containing column for each row, which I can easily do it by:

a = df.values
a.sort(axis=1)
sorted_table = a[:,-4::]
b = a[:,::-1]

However, the result I get is just a list, without the index and column name.

[[ 98.      29.      15.      10.]
 [ 93.      91.      75.      60.]
 [ 48.      21.      17.      10.]
.
.
.
...]

What should I do if I want to know which column name is the sorted-values referring to?

I would like to display:

 df=

c1      c512    c20    c57     c310 
c2      c317    c133   c584    c80
c3      c499    c289   c703    c100
.       .    .    .   ...    .
.       .    .    .   ...    .
.       .    .    .   ...    .
c750    c89    c31    c546     c107

where

  c512 is referring  to 98

  c20 is referring to 29

  c57 is referring to 15

and so and so.
3
  • You could use df.apply(myfunc, axis=1) instead of df.sort. This will allow you to manipulate the column names together with their values. Commented Nov 29, 2015 at 5:49
  • Do you have an example of the output you desire? A problem I see is that a column might have several rows with highest values so sorting by that row might display the way you want. How do you want to display which column names belong to each row with the highest values? Commented Nov 29, 2015 at 22:50
  • @Jarad, I would want to display the data as updated above. Hope you can give me some ideas for it. Commented Nov 30, 2015 at 5:02

1 Answer 1

1

I doubt this is the best answer but I think it works. I hate using for loops in pandas but I couldn't think of a pandas way to do it.

import pandas as pd
import numpy as np

#array_size = 10

#--- Generate Data and create toy Dataframe ---
array_size = 750
np.random.seed(1)
data = np.random.randint(0, 1000000, array_size**2)
data = data.reshape((array_size, array_size))
df = pd.DataFrame(data, columns=['c'+str(i) for i in range(1, (array_size)+1)])
df.index = df.columns

#--- Transpose the dataframe to more familiarly sort by columns instead of rows ---
df = df.T

#--- Rank values in dataframe using max method where highest value is rank 1 ---
df = df.rank(method='max', ascending=False)

#--- Create empty dataframe to put data into ---
new_df = pd.DataFrame()

#--- For loop for each column to get top ranks less than 5, sort them, reset index, drop i column
for i in df.columns:
  s = df[i][df[i] < 5].sort_values().reset_index().drop(i, axis=1)
  new_df = pd.concat([new_df, s.T])

#--- The new_df index will say 'index', this reassigns the transposed column names to new_df's index
new_df.index = df.columns
print(new_df)

Outputs:

         0     1     2     3
c1    c479  c545  c614  c220
c2    c249  c535  c231  c680
c3    c657  c603  c137  c740
c4    c674  c424  c426  c127
...    ...   ...   ...   ...
c747  c251  c536  c321  c296
c748   c55  c383  c437  c103
c749  c138  c495  c299  c295
c750  c178  c556  c491  c445
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.