Revisions to Multiple Pandas Ranking Operations within a Loop - Better Optimization and Performance

Tweeted twitter.com/StackCodeReview/status/1303800364899225604

occurred Sep 9, 2020 at 21:00

edited body

Source Link

edited Sep 9, 2020 at 13:00

171
4

I have implemented the following code, which works as intended. However, I would like to improve its performance and efficiency.

Code in Question

import pandas as pd
from scipy.stats import norm

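# df is a data frame with 40,000 rows and 25 columns
for indx in df.index:
    # Rank the row twice; NaN values receive the highest ranks in both cases
    matrix_ordered_first = df.loc[indx].rank(method='first', na_option='bottom')
    matrix_ordered_avg = df.loc[indx].rank(method='average', na_option='bottom')
    # Entries that are exactly 0 take the average rank instead of the 'first' rank
    matrix_ordered_first.loc[df.loc[indx] == 0] = matrix_ordered_avg
    # Scale the ranks into (0, 1) and map them through the normal quantile function
    matrix_computed = norm.ppf(matrix_ordered_first / (len(df.columns) + 1))
    df.loc[indx] = matrix_computed.T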

A peek at the dataframe

Here is a partial view of my dataframe df:

s             s1         s2         s3         s4  ...       s21        s23       s24       s25
0            NaN   5.406999   5.444658   4.640154  ...  4.633389   5.517850       NaN  6.121492
1            NaN   2.147866   1.758245   1.274754  ...  1.465129   1.200157       NaN  1.789203
2       2.872652   5.492498   2.547415   3.754654  ...  3.686420   1.540947  4.405961  1.715685
3            NaN  46.316837  27.197062  72.910797  ...       NaN  46.812153       NaN       NaN
4       1.365775   1.329316   1.852473   1.208155  ...  1.489296   1.313321  1.462968  1.249645

[5 rows x 25 columns]
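
For anyone who wants to time the loop without the real data, a frame of the same shape can be generated roughly as follows. This is only a sketch: the distribution, the fraction of NaN and zero entries, the seed, and the assumption that the columns are named s1 through s25 are all made up for illustration and are not taken from the actual data set.

```python
import numpy as np
import pandas as pd

# All numbers below are illustrative assumptions, not properties of the real data.
rng = np.random.default_rng(0)
n_rows, n_cols = 40_000, 25

# Positive, skewed values, loosely resembling the sample rows shown above.
values = rng.lognormal(mean=1.0, sigma=1.0, size=(n_rows, n_cols))

# Sprinkle in missing values and exact zeros, since the loop treats both specially.
values[rng.random((n_rows, n_cols)) < 0.20] = np.nan
values[rng.random((n_rows, n_cols)) < 0.01] = 0.0

# Hypothetical column names s1..s25 with the columns axis named "s".
df = pd.DataFrame(values, columns=[f"s{i}" for i in range(1, n_cols + 1)])
df.columns.name = "s"
```

Using a smaller n_rows makes quick experiments cheaper while keeping the same column layout.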

Explanation

The code above is part of a long Python script, and it runs slower than the other parts of the program. It iterates over the data frame row by row. For each row, it performs a chain of pandas ranking operations followed by a statistical test equivalent to the "one-tail test". Finally, it transposes the result, which is then written back as a row of the data frame.
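
The steps described above can be sketched as two small functions: one that follows the row-by-row description, and one that applies the same ranking to every row at once through the axis=1 argument of DataFrame.rank. The function names and the toy frame are hypothetical, and the second function is only one possible direction under the assumption that rows are independent of each other. It has not been timed on the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm


def transform_rowwise(df):
    """Row-by-row version of the steps described above."""
    out = df.copy()
    n_cols = len(df.columns)
    for indx in df.index:
        row = df.loc[indx]
        first = row.rank(method='first', na_option='bottom')
        average = row.rank(method='average', na_option='bottom')
        # Entries that are exactly 0 take the average rank; the rest keep 'first'.
        ranks = first.where(row != 0, average)
        out.loc[indx] = norm.ppf(ranks.to_numpy() / (n_cols + 1))
    return out


def transform_all_rows(df):
    """Same steps, ranking across the columns of every row in one call."""
    first = df.rank(axis=1, method='first', na_option='bottom')
    average = df.rank(axis=1, method='average', na_option='bottom')
    ranks = first.where(df != 0, average)
    scores = norm.ppf(ranks.to_numpy() / (df.shape[1] + 1))
    return pd.DataFrame(scores, index=df.index, columns=df.columns)


# Tiny made-up frame containing NaNs and zeros (not the real data).
toy = pd.DataFrame({
    's1': [np.nan, 0.0, 2.0],
    's2': [5.4, 0.0, 1.0],
    's3': [np.nan, 3.0, 0.0],
    's4': [4.6, 1.0, 0.0],
})
rowwise = transform_rowwise(toy)
all_rows = transform_all_rows(toy)
print(np.allclose(rowwise.to_numpy(), all_rows.to_numpy()))
```

Both functions return a new frame instead of overwriting df in place, which makes it easy to compare their outputs before swapping one for the other.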

How can I improve this block of code in terms of efficiency, speed, and performance?

On a separate note, I am not experienced in pandas, so my code may seem amateurish, and I would appreciate your guidance.

Thank you so much in advance.

edited tags

Link

edited Sep 9, 2020 at 5:35

aBiologist

171
4

added 121 characters in body

Source Link

edited Sep 9, 2020 at 5:26

aBiologist

171
4

added 19 characters in body

Source Link

edited Sep 9, 2020 at 5:13

aBiologist

171
4

added 19 characters in body

Source Link

edited Sep 9, 2020 at 5:06

aBiologist

171
4

deleted 3 characters in body

Source Link

edited Sep 9, 2020 at 4:56

aBiologist

171
4

edited title

Source Link

edited Sep 9, 2020 at 4:50

aBiologist

171
4

edited title

Link

edited Sep 9, 2020 at 4:43

aBiologist

171
4

Source Link

asked Sep 9, 2020 at 4:37

aBiologist

171
4
