This code was an answer to my own question on SO; however, I am looking at the line global X and wondering if there is a better way to do this. I try to minimize the number of global declarations in my code in order to avoid namespace collisions. I'm considering changing this code to use multiprocessing.shared_memory, but I would like some feedback on the code below 'as is'.
The purpose of this code is to compute, in parallel, Pearson's product-moment correlation coefficient for all pairs of random variables. The columns of the NumPy array X index the variables, and the rows index the samples.
$$r_{x,y} = \frac{\sum_{i=1}^{n}(x_i- \bar{x})(y_i- \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i- \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i- \bar{y})^2}}$$
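As a quick sanity check (my own snippet, not part of the original code), the formula above can be verified against scipy.stats.pearsonr on random data:

    import numpy as np
    from scipy.stats import pearsonr

    # Compare the textbook formula above with scipy's implementation
    rng = np.random.default_rng(0)
    x, y = rng.random(1000), rng.random(1000)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
    r_scipy, _ = pearsonr(x, y)   # pearsonr returns (r, p-value)
    assert np.isclose(num / den, r_scipy)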
This is actual code that should run on your machine (e.g., Python 3.6).
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from multiprocessing import Pool
X = np.random.random(100000*10).reshape((100000, 10))

def function(cols):
    # cols is a pair of column indices; pearsonr returns (r, p-value)
    result = X[:, cols]
    x, y = result[:, 0], result[:, 1]
    result = pearsonr(x, y)
    return result

def init():
    global X  # the line in question

if __name__ == '__main__':
    with Pool(initializer=init, processes=4) as P:
        print(P.map(function, combinations(range(X.shape[1]), 2)))
Beyond the global X question, any constructive feedback and suggestions are welcome.
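For reference, here is a minimal sketch of the multiprocessing.shared_memory approach mentioned at the top. It assumes Python 3.8+ (where shared_memory was introduced, so it would not run on the 3.6 setup above); the names SHAPE, DTYPE, and _shm are my own:

    from itertools import combinations
    from multiprocessing import Pool
    from multiprocessing import shared_memory
    import numpy as np
    from scipy.stats import pearsonr

    SHAPE = (100000, 10)   # same data size as above (hypothetical constants)
    DTYPE = np.float64

    def init(shm_name):
        # Attach each worker to the existing shared block; keep a reference
        # to the SharedMemory object so its buffer is not garbage-collected
        global X, _shm
        _shm = shared_memory.SharedMemory(name=shm_name)
        X = np.ndarray(SHAPE, dtype=DTYPE, buffer=_shm.buf)

    def function(cols):
        i, j = cols
        return pearsonr(X[:, i], X[:, j])

    if __name__ == '__main__':
        nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
        shm = shared_memory.SharedMemory(create=True, size=nbytes)
        X = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        X[:] = np.random.random(SHAPE)
        with Pool(initializer=init, initargs=(shm.name,), processes=4) as P:
            print(P.map(function, combinations(range(SHAPE[1]), 2)))
        shm.close()
        shm.unlink()

Note that the worker still assigns a module-level X: with Pool, the initializer-plus-global pattern is the usual way to give workers per-process state. The difference is that here the underlying buffer is genuinely shared between processes rather than copied into each one.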
global is never needed in Python (the legitimate exceptions are truly rare). Even more strongly, one can say that global will never help with any kind of shared-memory or parallelism problem, because it doesn't address the real issue: namely, what happens when two processes/threads operate on the same data at the same time.
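To illustrate that point, here is a minimal demonstration of my own (not from the thread). With the fork start method, each worker gets a copy-on-write copy of X; with spawn, each worker re-creates X from the module. Either way, a worker's writes never reach the parent or its siblings:

    from multiprocessing import Pool
    import numpy as np

    X = np.zeros(3)

    def write(i):
        X[i] = 1           # mutates this worker's own copy of X
        return X.tolist()

    if __name__ == '__main__':
        with Pool(2) as P:
            print(P.map(write, range(3)))
        print(X)           # the parent still sees [0. 0. 0.]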