Finding highly correlated variables in a dataframe by evaluating its correlation matrix's values

Question

I read data from Excel into a Pandas DataFrame, so that every column represents a different variable, and every row represents a different sample. I made the function below to identify potential highly correlated variables within a DataFrame, with 'high correlation' being determined by the given "threshold" input when calling the function.

import pandas as pd

def find_highly_correlated_variables(dataframe, threshold):
    '''
    Parameters
    ----------
    dataframe :  pandas.DataFrame
    threshold :  float, representing minimal absolute value for correlation between variables to be selected
    
    Output
    ------
    string :  reading how no highly correlated variables have been found, if none have been found
    list   :  containing pair(s) of highly correlated variables if one or more have been found
    '''
    
    # Initialization of variables and lists to work with.
    df = dataframe
    th = threshold
    column_names = list(df.columns.values)
    highly_correlated_indices = []
    highly_correlated_variables = []
    
    # Correlation matrix is created, so that correlation values can be accessed easily.
    correlation_array = df.corr().values.tolist()
    
    for i_column, column in enumerate(correlation_array):
        for i_element, element in enumerate(column):
            if (abs(element) >= th) & (abs(element) != 1.0):
                # Prevent duplicate information from being added.
                if [i_column, i_element] not in highly_correlated_indices:
                    highly_correlated_indices.append([i_element, i_column])
    
    # 'Translate' element and column indices into the variable names.
    for indices in highly_correlated_indices:
        highly_correlated_variables.append([column_names[indices[0]], column_names[indices[1]]])

    
    if len(highly_correlated_indices) == 0:
        print("No highly correlated variables found.")  
    else:
        return highly_correlated_variables

I know that nested for loops are not ideal with regard to time complexity, so I tried to solve it using the 'zip' function, along these lines: for index, (column, element) in enumerate(zip(correlation_array, column)). However, I got stuck trying to make such a solution work.

For that reason, I was quite curious whether it would be possible to improve that part of the code so that it speeds up the process compared to what it is now.

I wouldn't mind hearing other tips for improvement of course (e.g. maybe some parts can be more compact), so please don't hesitate to share such thoughts with me.

Can you provide sample data?

– Reinderien, Commented Jun 1, 2023 at 23:08

J_H · Accepted Answer · 2023-06-01 23:24:11Z

nested [interpreted] for loops are not ideal with regard to time complexity

You meant regarding "time elapsed".

The big-O complexity was already settled by the time df.corr() had computed the pairwise correlations.

But profiling will reveal that time spent in interpreted bytecode tends to dominate time spent in numpy's compiled C code, as you were observing.
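One way to check this claim on your own machine is a small timing comparison. The sketch below is only illustrative: the data shape, the threshold, and the helper names count_loop and count_vectorized are assumptions made for the example, not part of the question's code. Both helpers count the same entries, one in an interpreted loop and one with a NumPy expression.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed toy data: 200 samples of 60 random variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 60)))
corr = df.corr().abs().to_numpy()
th = 0.1  # arbitrary threshold, chosen only for the demonstration


def count_loop():
    """Count qualifying entries one element at a time in the interpreter."""
    n = 0
    for row in corr.tolist():
        for value in row:
            if value >= th and value != 1.0:
                n += 1
    return n


def count_vectorized():
    """Count the same entries with a single compiled NumPy expression."""
    return int(((corr >= th) & (corr != 1.0)).sum())


loop_seconds = timeit.timeit(count_loop, number=20)
vectorized_seconds = timeit.timeit(count_vectorized, number=20)
print(f"loop: {loop_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s")
```

The absolute numbers depend on the machine and on the matrix size, so treat the output as a rough indication rather than a benchmark.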

    highly_correlated_indices = []
    ...    
    for i_column, column in ...:
        for i_element, element in ...:
            if (abs(element) >= th) & (abs(element) != 1.0):
                if [i_column, i_element] not in highly_correlated_indices:
                    highly_correlated_indices.append(...)

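This looks quadratic at first blush. But it's worse, it's cubic or beyond: the in test scans a list that grows with the number of pairs already found, so each check costs time proportional to its length. Rather than a list you wanted a set there (holding tuples, since lists are not hashable), so the in test could complete in O(1) expected time.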
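A minimal sketch of that set-based membership test follows. The function name find_pairs_with_set is hypothetical, and there is one deliberate difference from the question's code, labeled here as an assumption: the diagonal is skipped by position (i != j) rather than by comparing the value against 1.0, so perfectly correlated distinct variables are reported as well.

```python
import pandas as pd


def find_pairs_with_set(dataframe, threshold):
    """Hypothetical variant of the question's loop that uses a set for lookups."""
    corr = dataframe.corr()
    values = corr.to_numpy()
    names = list(corr.columns)
    seen = set()  # holds (i, j) tuples, because lists cannot be hashed
    for i, row in enumerate(values):
        for j, value in enumerate(row):
            # Assumption: skip the diagonal by position, not by value.
            if i != j and abs(value) >= threshold and (j, i) not in seen:
                seen.add((i, j))
    # Translate index pairs into variable names, in a stable order.
    return [(names[i], names[j]) for i, j in sorted(seen)]
```

This keeps the nested loops, so it only removes the cost of the list scan; the interpreter still visits every element of the matrix.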

You are right that, instead of the interpreter examining one value at a time, it would be desirable to do a vectorized broadcast across the matrix. Here is one approach:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((8, 3)))
c = df.corr().abs()
th = .4
z = c[(c != 1) & (c > th)]

>>> np.round(z, 2)
    0     1     2
0 NaN   NaN   NaN
1 NaN   NaN  0.44
2 NaN  0.44   NaN

I found it convenient to round to two places, but clearly you don't have to.

At this point you can readily iterate over the positive columns:

>>> np.round(z.sum(axis=0), 2)
0    0.00
1    0.44
2    0.44
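If what you ultimately want is the list of variable pairs, as in the question, one possible way to get it without Python-level loops over the matrix is to mask the strict upper triangle. This is a sketch under assumptions, not the method used above: the function name correlated_pairs is hypothetical, and it uses >= for the threshold to match the question's code.

```python
import numpy as np
import pandas as pd


def correlated_pairs(dataframe, threshold):
    """Return (name_a, name_b, |r|) for each pair with |r| >= threshold."""
    corr = dataframe.corr().abs()
    values = corr.to_numpy()
    # k=1 keeps the strict upper triangle, so the diagonal and the
    # mirrored lower half are excluded by position, not by value.
    # NaN entries fail the comparison and are therefore left out.
    mask = np.triu(values >= threshold, k=1)
    rows, cols = np.nonzero(mask)
    names = corr.columns
    return [(names[i], names[j], float(values[i, j])) for i, j in zip(rows, cols)]


# Same kind of toy data as above; the values are random.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((8, 3)))
print(correlated_pairs(df, 0.4))
```

Excluding the diagonal by position avoids relying on the diagonal being exactly 1.0 in floating point, and it does not discard distinct variables that happen to be perfectly correlated.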
