deleted 45 characters in body; edited tags
200_success

I'm testing Python 3 code to perform a Monte Carlo simulation based on the result of a statistical test.

I'm working with numpy to generate the array of values, and I have this working code.

I know that np.vectorize is not good for speed, because behind the scenes it is a for loop, but I currently don't know another method for applying a function element-wise in numpy.

So, using numpy or pandas or both, is there any way I could improve this in execution time and syntax?

The final idea is to return a numpy array with the simulated p-values to append it to the original dataframe.

Kako

Monte Carlo Simulation of P-Value

I'm testing code to perform a Monte Carlo simulation based on the result of a statistical test.

I currently have the result of the statistical test in a pandas dataframe, like this.

Dataframe A
+-----+-------+
| id  | f_res |
+-----+-------+
|   1 | 4.22  |
|   2 | 5.25  |
|   3 | 3.3   |
|   4 | 2.5   |
|   5 | 1.9   |
|   6 | 9.3   |
+-----+-------+

My idea is this: for each value in f_res, pass it to a function that draws multiple values from a noncentral chi-squared distribution, counts how many of the drawn values are greater than the original value, and divides that count by the total number of values drawn.

I'm working with numpy to generate the array of values, and I have this working code.

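import numpy as np
import pandas as pd

total_sample = 100  # number of draws per input value

def monte_carlo(x, tot_sample):
    # Draw tot_sample values with noncentrality x, then return the
    # fraction of those draws that are greater than x.
    gen_dist = np.random.noncentral_chisquare(df=1, nonc=x, size=tot_sample)
    compare = gen_dist > x
    return np.divide(np.sum(compare), tot_sample)

# Random stand-in data with a single flat column named 'A'.
df = pd.DataFrame(np.random.randint(0, 10, 625527), columns=['A'])
b = df['A'].to_numpy()  # 1-D array of input values
g = np.vectorize(monte_carlo)

x = g(b, total_sample)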

I know that np.vectorize is not good for speed, because behind the scenes it is a for loop, but I currently don't know another method for applying a function element-wise in numpy.
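One possible way to avoid the per-element call is to let numpy broadcast the noncentrality parameter against the requested output shape. The following is only a minimal sketch under some assumptions: the function name monte_carlo_vectorized and the seed argument are hypothetical, the input is assumed to be a 1-D array, and it uses np.random.default_rng instead of the legacy np.random functions in the code above.

```python
import numpy as np

def monte_carlo_vectorized(values, tot_sample, seed=None):
    """Estimate, for each value, the fraction of draws that exceed it."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)  # assumed to be 1-D
    # nonc has shape (n, 1) and broadcasts against size=(n, tot_sample),
    # so row i holds tot_sample draws with noncentrality values[i].
    draws = rng.noncentral_chisquare(
        df=1, nonc=values[:, None], size=(values.size, tot_sample)
    )
    # Compare each row against its own value and average along the row.
    return (draws > values[:, None]).mean(axis=1)

# The f_res values from Dataframe A.
f_res = np.array([4.22, 5.25, 3.3, 2.5, 1.9, 9.3])
p_values = monte_carlo_vectorized(f_res, tot_sample=100, seed=0)
```

Note that this builds an intermediate array of shape (n, tot_sample), so for very large inputs the memory cost may matter.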

Currently, analyzing a dataframe of about 625,000 items with a total sample of 100 takes approximately 20 s. line_profiler shows the following.

Total time: 19.909 s
File: test3.py
Function: test at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile
     7                                           def monte_carlo(x, tot_sample):
     8    625528     33866564     54.1     58.2      gen_dist = np.random.noncentral_chisquare(df=1, nonc=x, size=tot_sample)
     9    625528      4621263      7.4      7.9      compare = gen_dist > x
    10    625528     19704575     31.5     33.9      return np.divide(np.sum(compare), tot_sample)

So, using numpy or pandas or both, is there any way I could improve this in execution time and syntax?

The final idea is to return a numpy array with the simulated p-values to append it to the original dataframe.
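As an illustration of that last step, the sketch below attaches the simulated p-values to the dataframe as a new column, processing the input in chunks so the intermediate array of draws stays bounded in size. The names simulated_p_values, chunk_size, and p_sim are hypothetical, and np.random.default_rng is an assumption rather than what the code above uses.

```python
import numpy as np
import pandas as pd

def simulated_p_values(values, tot_sample, chunk_size=100_000, seed=None):
    """Fraction of noncentral chi-squared draws exceeding each value."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)  # assumed to be 1-D
    out = np.empty(values.size)
    for start in range(0, values.size, chunk_size):
        # Column vector of shape (m, 1) for the current chunk.
        chunk = values[start:start + chunk_size, None]
        draws = rng.noncentral_chisquare(
            df=1, nonc=chunk, size=(chunk.shape[0], tot_sample)
        )
        out[start:start + chunk_size] = (draws > chunk).mean(axis=1)
    return out

# Dataframe A from the question.
df_a = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "f_res": [4.22, 5.25, 3.3, 2.5, 1.9, 9.3],
})
# A small chunk_size is used here only to exercise the chunking.
df_a["p_sim"] = simulated_p_values(
    df_a["f_res"].to_numpy(), tot_sample=100, chunk_size=4, seed=0
)
```

Because the result is a plain 1-D array in the same order as the rows, assigning it to a new column keeps each p-value next to the f_res value it was computed from.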

Thanks in advance for your time.