
Suppose I have the following dataframe:

     xx      yy      tt
0   2.8     1.0     1.0
1   85.0    4.48    6.5
2   2.1     8.0     1.0
3   8.0     1.0     0.0
4   9.0     2.54    1.64
5   5.55    7.25    3.15
6   1.66    0.0     4.0
7   3.0     7.11    1.98
8   1.0     0.0     4.65
9   1.87    2.33    0.0

What I want to do is create a for loop that iterates over all points in the df and calculates the Euclidean distance to all the other points. For instance, the loop would iterate over point a and get the distances from point a to points b, c, d...n. Then it would go to point b and get the distances to points a, c, d...n, and so on.

Once I have the distances, I want a value_counts() of the distance values. To save memory, though, I can't just call value_counts() on all the results of this for loop at once: my real df is too big, and I would run out of memory.

So my idea is to apply value_counts() to the distance vector of each point. This gives a two-column dataframe with the values and their respective counts. Then, when the loop moves on to point b and gets all its distances, I want to compare the new values with the value_counts() df from the first iteration and check for repeated values. If a value is repeated, I want to += its counter; values that are not repeated should be appended as new rows to the counts df.

This is what i've got so far:

import numpy as np
import pandas as pd

counts = pd.DataFrame()

for index, row in df.iterrows():

    dist = pd.Series(np.sqrt((row.xx - df.xx)**2 + (row.yy - df.yy)**2 + (row.tt - df.tt)**2)) # Create a vector containing all the distances from each point to the others

    counter = pd.Series(dist.value_counts(sort = True)).reset_index().rename(columns = {'index': 'values', 0:'counts'}) # Get a counter for every value in the distances vector

    if index in counter['values']:
        counter['counts'][index] += 1 # Check if the new values are in the counter df, if so, add +1 to each repeated value

    else:

        counts = counts.append((index,row)) # If no repeated values, then append new rows to the counter df

The expected result would be something like:

# These are the value counts for point a and its distances:

    values  counts
0   0.000000    644589
1   0.005395    1
2   0.005752    1
3   0.016710    1
4   0.023043    1
5   0.012942    1
6   0.020562    1

Now in the iteration over point b:

       values   counts
0   0.000000    644595  # Value repeated 6 times, so add +6 to the counter
1   0.005395    1
2   0.005752    1
3   0.016710    3  # Value repeated twice, so add +2 to the counter
4   0.023043    1
5   0.012942    1
6   0.020562    1
7   0.025080    1  # New value, so append a new row with value and counter
8   0.022467    1  # New value, so append a new row with value and counter

However, if you add print(counts) at the end of the loop to check what it is doing, you'll see an empty dataframe, and that's why I'm asking this question. Why is this code giving an empty df, and how can I get it to work the way I want?

If something is not clear or you need more information, please do not hesitate to ask.

Thanks in advance

  • It's because your loop never reaches the else branch; that's why your dataframe is empty. Commented Mar 18, 2019 at 10:35
  • Hmm, what is combination? Is it a special library? Commented Mar 18, 2019 at 11:45
  • No, it is the df. Give me a second and I will edit the question so it is clearer. Commented Mar 18, 2019 at 11:46
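
One likely reason the else branch is never reached: `in` applied to a pandas Series tests the index labels, not the stored values. A minimal sketch of that behaviour, where the `values` Series is a hypothetical stand-in for counter['values'] and not data from the question:

```python
import pandas as pd

# Hypothetical distance values; the default index labels are 0, 1, 2.
values = pd.Series([0.5, 2.5, 7.1])

# `in` on a Series looks at the index labels.
by_label = 2 in values            # True: 2 is an index label
by_value_wrong = 2.5 in values    # False: 2.5 is a value, not a label

# To test membership among the values, use isin() (or `in values.values`).
by_value = bool(values.isin([2.5]).any())   # True

print(by_label, by_value_wrong, by_value)
```

So a check like `index in counter['values']` is true whenever the row number happens to be a valid label of `counter`, regardless of which distances it holds.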

1 Answer

If I understand you correctly, you want the number of occurrences of each distance value.

So I suggest creating a dict whose keys are the distance values and whose values are the counts:

data = """
   xx      yy      tt
2.8     1.0     1.0
85.0    4.48    6.5
2.1     8.0     1.0
8.0     1.0     0.0
9.0     2.54    1.64
5.55    7.25    3.15
1.66    0.0     4.0
3.0     7.11    1.98
1.0     0.0     4.65
1.87    2.33    0.0
"""

import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

dico = {}                           # maps each distance value to its count
for index, row in df.iterrows():
    # Vector of distances from the current point to every point (itself included)
    dist = np.sqrt((row.xx - df.xx) ** 2 + (row.yy - df.yy) ** 2 +
                   (row.tt - df.tt) ** 2)

    for f in dist:                  # iterate through the distances
        if f in dico:               # distance already seen?
            dico[f] += 1            # yes: increment its count
        else:
            dico[f] = 1             # no: create the key with a count of 1

print(dico)

output:

{0.0: 10,
82.45726408267497: 2, 
7.034912934784623: 2, 
5.295280917949491: 2, 
6.4203738208923635: 2, 
7.158735921934822: 2, 
3.361487765856065: 2, 
6.191324575565393: 2, 
4.190763653560053: 2, 
1.9062528688503002: 2, 
83.15678204452118: 2, 
77.35218419669867: 2, 
76.17993961667337: 2, 
79.56882492534372: 2, 
    :
    :
7.511863949779708: 2,
0.9263368717696604: 2, 
4.633896848226123: 2, 
7.853725230742415: 2, 
5.295819105671946: 2, 
5.273357564208974: 2}

Each value has a count of at least 2 because the distances form a symmetric matrix: distance(point0, point1) equals distance(point1, point0), and so on.
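
The same running tally can also be sketched with collections.Counter, which handles the "increment if present, otherwise insert" step itself. The function name `distance_counts`, the `decimals` parameter, and the three-point `sample` frame are assumptions made for illustration. Rounding is added because exact float equality is fragile; it is not part of the original approach.

```python
from collections import Counter

import numpy as np
import pandas as pd


def distance_counts(points: pd.DataFrame, decimals: int = 6) -> pd.Series:
    """Tally how often each (rounded) pairwise distance occurs."""
    coords = points[["xx", "yy", "tt"]].to_numpy(dtype=float)
    tally = Counter()
    for row in coords:
        # Distances from this point to every point, itself included.
        dist = np.sqrt(((coords - row) ** 2).sum(axis=1))
        # Counter.update adds 1 per occurrence and creates missing keys.
        tally.update(np.round(dist, decimals).tolist())
    return pd.Series(dict(tally), name="counts").sort_index()


# Hypothetical sample: two points lie at distance 5 from the first one.
sample = pd.DataFrame({"xx": [0.0, 3.0, 0.0],
                       "yy": [0.0, 4.0, 0.0],
                       "tt": [0.0, 0.0, 5.0]})
counts = distance_counts(sample)
print(counts)   # 0.0 occurs 3 times, 5.0 occurs 4 times
```

Only one distance vector is held in memory at a time, but the tally itself still grows with the number of distinct rounded distances, so a coarser `decimals` may be needed for a very large df.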


4 Comments

Hi again Frenchy. This is a bit closer to what I wanted, but does it compare the new count values to the previous ones and add them to the dict if they are not already there? Also, remember that if a new value is already in the dict, you just have to add +1 to that value's counter. Are these two conditions fulfilled? Thank you very much.
I have added comments to the program; is it OK? I did it based on what I understood (sorry for my English). With 600000 rows, the execution time will be long...
OK, that's great. I understand everything now. Thank you very much for your answer. It helped a lot! And no worries about the English :)
Happy to help!
