- make each calculated ttest faster
cProfile indicates that ttest() dominates the running time.
(Thank you for the reprex, BTW!)
I don't see any mathematical trick that would let us run
fewer tests, e.g. by noticing that some of them are identical.
Adjusting the permutations= parameter produced
no useful effect. So, tl;dr: "no".
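If you want to reproduce that profiling result yourself, here is a minimal sketch; main() is a hypothetical stand-in for whatever driver function runs the loop in your reprex:

    import cProfile
    import pstats

    # main() is a hypothetical stand-in for the code that drives the ttest loop.
    cProfile.run('main()', 'ttest.prof')
    pstats.Stats('ttest.prof').sort_stats('cumulative').print_stats(10)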
- parallelize this process to run on many different dataframes at once
That's straightforward, given that memory is no object here.
Your code already has a nice looping structure thanks to itertools.product().
The standard solution is the multiprocessing module.
We'll need a slightly different loop structure:
    import itertools
    from multiprocessing import Pool

    import numpy as np
    from scipy import stats


    def ttest(rna_pos, rna_neg, pr):
        t, p = stats.ttest_ind(rna_pos, rna_neg)
        return '{}\t{}\t{}\n'.format(t, p, pr)


    if __name__ == '__main__':
        # One task per (rnadf row, cnvdf_mask row) pair.
        work = [(np.array(rnadf.loc[pr[0]][cnvdf_mask.loc[pr[1]]].dropna()),
                 np.array(rnadf.loc[pr[0]][~cnvdf_mask.loc[pr[1]]].dropna()),
                 str(pr),
                 )
                for pr in itertools.product(rnadf.index, cnvdf_mask.index)]

        # The __main__ guard matters under the "spawn" start method
        # (Windows, macOS), where each worker re-imports this module.
        with open('out_tab.txt', 'w') as f, Pool() as pool:
            for result in pool.starmap(ttest, work):
                f.write(result)
This will try to burn all cores.
Each worker process has its own interpreter and its own GIL.
It's worth noting that the cost to serialize / deserialize
the args and results should be much less than the cost of running ttest().
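If profiling ever showed that overhead mattering, Pool's chunksize argument batches several tasks per round trip; the 64 below is an arbitrary guess you'd tune:

    with open('out_tab.txt', 'w') as f, Pool() as pool:
        # chunksize batches tasks per IPC round trip; 64 is an arbitrary guess.
        for result in pool.starmap(ttest, work, chunksize=64):
            f.write(result)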
Delaying big imports until you're down in the
child process can also be helpful.
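One way to arrange that, assuming scipy is the heavy import in this script, is a function-local import in the worker:

    def ttest(rna_pos, rna_neg, pr):
        # scipy is imported on the first call in each worker process;
        # sys.modules caching makes subsequent calls cheap.
        from scipy import stats

        t, p = stats.ttest_ind(rna_pos, rna_neg)
        return '{}\t{}\t{}\n'.format(t, p, pr)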
If you want to use the CPUs of several hosts, then Dask or Vaex are happy to help.
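For instance, a minimal dask.distributed sketch, assuming a scheduler is already running at a hypothetical address and workers on the other hosts have connected to it:

    from dask.distributed import Client

    client = Client('tcp://scheduler-host:8786')  # hypothetical scheduler address

    # Ship each prepared task to the cluster and stream results back in order.
    futures = [client.submit(ttest, pos, neg, pr) for pos, neg, pr in work]
    with open('out_tab.txt', 'w') as f:
        for result in client.gather(futures):
            f.write(result)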