Suppose I have the following DataFrame:
xx yy tt
0 2.8 1.0 1.0
1 85.0 4.48 6.5
2 2.1 8.0 1.0
3 8.0 1.0 0.0
4 9.0 2.54 1.64
5 5.55 7.25 3.15
6 1.66 0.0 4.0
7 3.0 7.11 1.98
8 1.0 0.0 4.65
9 1.87 2.33 0.0
What I want to do is write a for loop that iterates over every point in the DataFrame and calculates the Euclidean distance to all the other points. For instance, the loop would start at point a and get the distances from a to points b, c, d, ..., n. Then it would move to point b and get the distances to points a, c, d, ..., n, and so on.
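To make the distance step concrete, here is a minimal sketch for a single point. It assumes the DataFrame is called df and has the columns xx, yy and tt; the helper name distances_from is hypothetical, and only the first four rows of the sample data are used to keep it short.

```python
import numpy as np
import pandas as pd

# First four rows of the sample data above
df = pd.DataFrame({
    "xx": [2.8, 85.0, 2.1, 8.0],
    "yy": [1.0, 4.48, 8.0, 1.0],
    "tt": [1.0, 6.5, 1.0, 0.0],
})

def distances_from(df, i):
    """Euclidean distances from the row labelled i to every other row (hypothetical helper)."""
    cols = ["xx", "yy", "tt"]
    # Subtract the coordinates of point i from every row at once
    diff = df[cols].to_numpy() - df.loc[i, cols].to_numpy(dtype=float)
    dist = np.sqrt((diff ** 2).sum(axis=1))
    # Drop the point itself so only distances to the other points remain
    return pd.Series(dist, index=df.index).drop(i)

d0 = distances_from(df, 0)
print(d0)
```

Calling distances_from(df, i) for each i in df.index would give the per-point distance vectors that the loop is supposed to produce.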
Once I have the distances, I want a value_counts() of the distance values. To save memory, though, I can't just call value_counts() on all the results of this for loop at once: my real DataFrame is too big, and I would run out of memory.
So my idea is to apply value_counts() to the distance vector of one point at a time. For point a, this gives a two-column DataFrame with the values and their respective counts. Then, when the loop reaches point b and gets its distances, I want to compare the new values with the value_counts() DataFrame built up in the previous iterations and check for repeated values. For every repeated value, I want to += its counter; any values that are not there yet, I want to append() as new rows to the counts DataFrame.
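The accumulation described above could look roughly like the sketch below. It is only an illustration under assumptions: the function name merge_counts and the toy distance values are made up, and it keeps the running counts in a Series indexed by distance value rather than in a two-column DataFrame.

```python
import pandas as pd

def merge_counts(total, new):
    """Add the counts in `new` to `total` (hypothetical helper).

    Values present on only one side are treated as having a count of 0
    on the other side, so new values end up as new rows.
    """
    return total.add(new, fill_value=0).astype("int64")

# Running total, starting empty, with a float index for the distance values
total = pd.Series(dtype="int64", index=pd.Index([], dtype="float64"))

# Toy distance vectors standing in for the distances of points a and b
first = pd.Series([0.0, 0.0, 1.5, 2.0]).value_counts()
second = pd.Series([0.0, 2.0, 2.0, 3.5]).value_counts()

total = merge_counts(total, first)
total = merge_counts(total, second)
print(total)
```

Only the running total and one distance vector need to be in memory at a time. Because the values are floats, two distances share a row only if they are exactly equal; rounding first (for example dist.round(6)) is one option if near-equal values should be counted together.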
This is what I've got so far:
import numpy as np
import pandas as pd

counts = pd.DataFrame()
for index, row in df.iterrows():
    # Create a vector containing all the distances from each point to the others
    dist = pd.Series(np.sqrt((row.xx - df.xx)**2 + (row.yy - df.yy)**2 + (row.tt - df.tt)**2))
    # Get a counter for every value in the distances vector
    counter = pd.Series(dist.value_counts(sort = True)).reset_index().rename(columns = {'index': 'values', 0:'counts'})
    # Check if the new values are in the counter df, if so, add +1 to each repeated value
    if index in counter['values']:
        counter['counts'][index] += 1
    else:
        # If no repeated values, then append new rows to the counter df
        counts = counts.append((index,row))
The expected result would be something like:
# These are the value counts for point a and its distances:
values counts
0 0.000000 644589
1 0.005395 1
2 0.005752 1
3 0.016710 1
4 0.023043 1
5 0.012942 1
6 0.020562 1
Now in the iteration over point b:
values counts
0 0.000000 644595 # Value repeated 6 times, so add +6 to the counter
1 0.005395 1
2 0.005752 1
3 0.016710 3 # Value repeated twice, so add +2 to the counter
4 0.023043 1
5 0.012942 1
6 0.020562 1
7 0.025080 1 # New value, so append a new row with value and counter
8 0.022467 1 # New value, so append a new row with value and counter
However, if you add print(counts) at the end of the loop to check what it is doing, you'll see an empty DataFrame, and that's why I'm asking this question. Why does this code give an empty DataFrame, and how can I get it to work the way I want?
If anything is unclear or you need more information, please don't hesitate to ask.
Thanks in advance