- make each calculated ttest faster
cProfile indicates that ttest() dominates the running time.
(Thank you for the reprex, BTW!)
I don't see any mathematical trick that would let us run
fewer tests, e.g. by noticing that some of them are identical.
Adjusting the permutations= parameter produced
no useful effect. So, tl;dr: "no".
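If you want to reproduce that profiling result yourself, here is a minimal sketch; main() is a hypothetical stand-in for whatever driver function runs the loop in your reprex:

    import cProfile
    import pstats

    # main() is a hypothetical stand-in for the code that drives the ttest loop.
    cProfile.run('main()', 'ttest.prof')
    pstats.Stats('ttest.prof').sort_stats('cumulative').print_stats(10)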
- parallelize this process to run on many different dataframes at once
That's straightforward, given that memory is no object here.
Your code already has a nice looping structure thanks to itertools.product().
The standard solution is the multiprocessing module.
We'll need a slightly different loop structure:
    import itertools
    from multiprocessing import Pool

    import numpy as np
    from scipy import stats


    def ttest(rna_pos, rna_neg, pr):
        t, p = stats.ttest_ind(rna_pos, rna_neg)
        return '{}\t{}\t{}\n'.format(t, p, pr)


    if __name__ == '__main__':
        # One task per (rnadf row, cnvdf_mask row) pair.
        work = [(np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna()),
                 np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna()),
                 str(pr),
                 )
                for pr in itertools.product(rnadf.index, cnvdf_mask.index)]

        # The __main__ guard matters under the "spawn" start method
        # (Windows, macOS), where each worker re-imports this module.
        with open('out_tab.txt', 'w') as f, Pool() as pool:
            for result in pool.starmap(ttest, work):
                f.write(result)
This will try to burn all cores.
Each worker process has its own interpreter and its own GIL.
It's worth noting that the cost to serialize / deserialize
the args and results should be much less than the cost of running ttest().
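If profiling ever showed that overhead mattering, Pool's chunksize argument batches several tasks per round trip; the 64 below is an arbitrary guess you'd tune:

    with open('out_tab.txt', 'w') as f, Pool() as pool:
        # chunksize batches tasks per IPC round trip; 64 is an arbitrary guess.
        for result in pool.starmap(ttest, work, chunksize=64):
            f.write(result)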
Delaying big imports until you're down in the
child process can also be helpful.
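One way to arrange that, assuming scipy is the heavy import in this script, is a function-local import in the worker:

    def ttest(rna_pos, rna_neg, pr):
        # scipy is imported on the first call in each worker process;
        # sys.modules caching makes subsequent calls cheap.
        from scipy import stats

        t, p = stats.ttest_ind(rna_pos, rna_neg)
        return '{}\t{}\t{}\n'.format(t, p, pr)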
If you want to use the CPUs of several hosts, then Dask or Vaex are happy to help.
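For instance, a minimal dask.distributed sketch, assuming a scheduler is already running at a hypothetical address and workers on the other hosts have connected to it:

    from dask.distributed import Client

    client = Client('tcp://scheduler-host:8786')  # hypothetical scheduler address

    # Ship each prepared task to the cluster and stream results back in order.
    futures = [client.submit(ttest, pos, neg, pr) for pos, neg, pr in work]
    with open('out_tab.txt', 'w') as f:
        for result in client.gather(futures):
            f.write(result)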