This code was an answer to my own question on SO; however, I am looking at the line global X and wondering if there is a better way to do this. I try to minimize the number of global declarations in my code in order to avoid namespace collisions. I'm considering changing this code to use multiprocessing.shared_memory, but I would like some feedback on the code below 'as is'.
The purpose of this code is to compute, in parallel, Pearson's product-moment correlation coefficient for all pairs of random variables. The columns of the NumPy array X index the variables, and the rows index the samples.
$$r_{x,y} = \frac{\sum_{i=1}^{n}(x_i- \bar{x})(y_i- \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i- \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i- \bar{y})^2}}$$
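As a quick sanity check (my own snippet, not part of the original code), the formula above can be verified against scipy.stats.pearsonr on random data:

    import numpy as np
    from scipy.stats import pearsonr

    # Compare the textbook formula above with scipy's implementation
    rng = np.random.default_rng(0)
    x, y = rng.random(1000), rng.random(1000)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
    r_scipy, _ = pearsonr(x, y)   # pearsonr returns (r, p-value)
    assert np.isclose(num / den, r_scipy)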
This is actual code that should run on your machine (e.g., Python 3.6).
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from multiprocessing import Pool
X = np.random.random(100000*10).reshape((100000, 10))

def function(cols):
    # cols is a pair of column indices; pearsonr returns (r, p-value)
    result = X[:, cols]
    x, y = result[:, 0], result[:, 1]
    result = pearsonr(x, y)
    return result

def init():
    global X  # the line in question

if __name__ == '__main__':
    with Pool(initializer=init, processes=4) as P:
        print(P.map(function, combinations(range(X.shape[1]), 2)))
Beyond the global X question, any constructive feedback and suggestions are welcome.
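For reference, here is a minimal sketch of the multiprocessing.shared_memory approach mentioned at the top. It assumes Python 3.8+ (where shared_memory was introduced, so it would not run on the 3.6 setup above); the names SHAPE, DTYPE, and _shm are my own:

    from itertools import combinations
    from multiprocessing import Pool
    from multiprocessing import shared_memory
    import numpy as np
    from scipy.stats import pearsonr

    SHAPE = (100000, 10)   # same data size as above (hypothetical constants)
    DTYPE = np.float64

    def init(shm_name):
        # Attach each worker to the existing shared block; keep a reference
        # to the SharedMemory object so its buffer is not garbage-collected
        global X, _shm
        _shm = shared_memory.SharedMemory(name=shm_name)
        X = np.ndarray(SHAPE, dtype=DTYPE, buffer=_shm.buf)

    def function(cols):
        i, j = cols
        return pearsonr(X[:, i], X[:, j])

    if __name__ == '__main__':
        nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
        shm = shared_memory.SharedMemory(create=True, size=nbytes)
        X = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        X[:] = np.random.random(SHAPE)
        with Pool(initializer=init, initargs=(shm.name,), processes=4) as P:
            print(P.map(function, combinations(range(SHAPE[1]), 2)))
        shm.close()
        shm.unlink()

Note that the worker still assigns a module-level X: with Pool, the initializer-plus-global pattern is the usual way to give workers per-process state. The difference is that here the underlying buffer is genuinely shared between processes rather than copied into each one.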
global is never needed in Python (the legitimate exceptions are truly rare). Even more strongly, one can say that global will never help with any kind of shared-memory or parallelism problem, because it doesn't address the real issue: namely, what happens when two processes/threads operate on the same data at the same time.
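To illustrate that point, here is a minimal demonstration of my own (not from the thread). With the fork start method, each worker gets a copy-on-write copy of X; with spawn, each worker re-creates X from the module. Either way, a worker's writes never reach the parent or its siblings:

    from multiprocessing import Pool
    import numpy as np

    X = np.zeros(3)

    def write(i):
        X[i] = 1           # mutates this worker's own copy of X
        return X.tolist()

    if __name__ == '__main__':
        with Pool(2) as P:
            print(P.map(write, range(3)))
        print(X)           # the parent still sees [0. 0. 0.]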