I read data from Excel into a pandas DataFrame, so that every column represents a different variable and every row represents a different sample. I wrote the function below to identify potentially highly correlated variables within a DataFrame, with 'high correlation' determined by the "threshold" argument passed to the function.
import pandas as pd


def find_highly_correlated_variables(dataframe, threshold):
    '''
    Parameters
    ----------
    dataframe : pandas.DataFrame
    threshold : float, representing minimal absolute value for correlation between variables to be selected

    Output
    ------
    string : reading how no highly correlated variables have been found, if none have been found
    list : containing pair(s) of highly correlated variables if one or more have been found
    '''
    # Initialization of variables and lists to work with.
    df = dataframe
    th = threshold
    column_names = list(df.columns.values)
    highly_correlated_indices = []
    highly_correlated_variables = []

    # Correlation matrix is created, so that correlation values can be accessed easily.
    correlation_array = df.corr().values.tolist()
    for i_column, column in enumerate(correlation_array):
        for i_element, element in enumerate(column):
            if (abs(element) >= th) & (abs(element) != 1.0):
                # Prevent duplicate information from being added.
                if [i_column, i_element] not in highly_correlated_indices:
                    highly_correlated_indices.append([i_element, i_column])

    # 'Translate' element and column indices into the variable names.
    for indices in highly_correlated_indices:
        highly_correlated_variables.append([column_names[indices[0]], column_names[indices[1]]])

    if len(highly_correlated_indices) == 0:
        print("No highly correlated variables found.")
    else:
        return highly_correlated_variables
I know that nested for loops are not ideal with regard to time complexity, so I tried to replace them using the 'zip' function, along these lines:
    for index, (column, element) in enumerate(zip(correlation_array, column)):
However, I got stuck trying to make that approach work.
For that reason, I am curious whether that part of the code can be improved so that it runs faster than it does now.
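On the 'zip' attempt: zip walks its arguments in lockstep, pairing the i-th item of one sequence with the i-th item of the other, so it never produces every (row, column) combination the nested loops visit. A minimal sketch of a single loop that does visit each pair of columns exactly once could use itertools.combinations from the standard library. The function name unique_pairs_above and the sample data are hypothetical, and unlike the original function this sketch keeps pairs with a correlation of exactly 1.0 and returns an empty list when nothing is found.

```python
from itertools import combinations

import pandas as pd


def unique_pairs_above(dataframe, threshold):
    """List [name_a, name_b] pairs whose absolute correlation is >= threshold."""
    corr = dataframe.corr()
    pairs = []
    # combinations yields each unordered pair of column names exactly once,
    # so neither a duplicate check nor a diagonal check is needed.
    for name_a, name_b in combinations(corr.columns, 2):
        if abs(corr.loc[name_a, name_b]) >= threshold:
            pairs.append([name_a, name_b])
    return pairs


# Made-up data purely for illustration.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
result = unique_pairs_above(sample, 0.9)
```

This still inspects every pair of columns, so the amount of work grows quadratically with the number of columns; it mainly removes the bookkeeping around duplicates.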
I wouldn't mind hearing other tips for improvement of course (e.g. maybe some parts can be more compact), so please don't hesitate to share such thoughts with me.
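As an illustration of what a more compact, loop-free version might look like, here is a rough sketch that assumes NumPy is available alongside pandas. It masks the correlation matrix so that only the entries strictly above the diagonal remain, then reads off the positions that pass the threshold. The function name find_highly_correlated_pairs and the example data are hypothetical. Note that it behaves slightly differently from the function above: pairs with a correlation of exactly 1.0 are kept, each pair is ordered as the columns appear in the DataFrame, and an empty list is returned instead of a printed message.

```python
import numpy as np
import pandas as pd


def find_highly_correlated_pairs(dataframe, threshold):
    """Return [name_a, name_b] pairs whose absolute correlation is >= threshold."""
    corr = dataframe.corr().abs()
    # Boolean mask keeping only entries strictly above the diagonal, so each
    # pair appears once and self-correlations are skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Row and column positions of the entries that pass the threshold.
    rows, cols = np.where(upper & (corr.to_numpy() >= threshold))
    names = corr.columns
    return [[names[r], names[c]] for r, c in zip(rows, cols)]


# Made-up data purely for illustration.
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = find_highly_correlated_pairs(example, 0.9)
```

The comparison and masking run inside NumPy rather than in Python-level loops, which is where a speed-up would be expected for wide DataFrames; computing the correlation matrix itself is unchanged.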