  1. make each calculated ttest faster

cProfile indicates that ttest() dominates the running time. (Thank you for the reprex, BTW!)
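
For reference, here is a minimal sketch of one way to reproduce that measurement; it assumes your script's work is wrapped in a main() function, and the profile.out filename is just a placeholder:

import cProfile
import pstats

# Run the whole analysis under the profiler and keep the stats on disk.
cProfile.run('main()', 'profile.out')

# Show the ten entries with the largest cumulative time; ttest() and
# its callees sit at the top.
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)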

I don't see any mathematical tricks that would let us skip tests on the grounds that some of them are identical. Adjusting the permutations= parameter produced no useful effect. So, tl;dr: "no".

  2. parallelize this process to run on many different dataframes at once

That's straightforward, given that memory is no longer a constraint. Your code already has a nice loop structure thanks to itertools.product().

The standard solution uses the multiprocessing module. We'll need a slightly different loop structure.

import itertools
from multiprocessing import Pool

import numpy as np
from scipy import stats


def ttest(rna_pos, rna_neg, pr):
    # Worker: one independent t-test, result formatted as a TSV line.
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)


# Materialise the argument tuples up front; rnadf and cnvdf_mask come
# from your existing code.
work = [(np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna()),
         np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna()),
         str(pr),
        )
        for pr in itertools.product(rnadf.index, cnvdf_mask.index)]

if __name__ == '__main__':  # needed where workers are spawned, e.g. Windows
    with open('out_tab.txt', 'w') as f:
        with Pool() as pool:
            for result in pool.starmap(ttest, work):
                f.write(result)

This will try to keep all cores busy. Each worker process has its own interpreter and its own GIL. Note that the cost of serialising / deserialising the arguments and results should be much less than the cost of running ttest(). Delaying big imports until you're down in the child process can also help.
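
For instance, the scipy import can move inside the worker so that only the child processes pay for it (a sketch; the code above imports it at module level for simplicity):

def ttest(rna_pos, rna_neg, pr):
    # Imported in the child process the first time it runs a task;
    # the parent process never needs scipy at all.
    from scipy import stats
    t, p = stats.ttest_ind(rna_pos, rna_neg)
    return '{}\t{}\t{}\n'.format(t, p, pr)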

If you want to use the CPUs of several hosts, then Dask or Vaex are happy to help.
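
As a rough sketch of the Dask route, reusing ttest() and work from above (the scheduler address is a placeholder for wherever your cluster lives):

from dask.distributed import Client

# Connect to a scheduler whose workers may be spread across several hosts.
client = Client('tcp://scheduler-host:8786')

# starmap-style: unpack the work tuples into parallel argument lists.
futures = client.map(ttest, *zip(*work))

with open('out_tab.txt', 'w') as f:
    for result in client.gather(futures):
        f.write(result)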

J_H