Calculating edges of a social network

Question

I am working on an open chess data set (~15500 rows after cleaning), from which I create nodes and edges. However, the way I create the edges takes a long time.

A sample of my nodes tibble:

	player
1	bougris
2	a-00
3	ischia
...	...

The per_game tibble (sample picture not reproduced here) has the columns white_id, black_id, winner and total_matches.

The way I do it:

I iterate over each node/player in the nodes tibble,
searching per_game for the games where he played as black,
and exchanging the values of column white_id with black_id, while changing the winner in the winner column (using the swap() method I created).
Then, with the calc_victories() method, I group all games of the specific player by every opponent he has faced, and calculate how many times he won or lost against that opponent (I store it in player_results).
Then I append player_results to a tibble with all previous players' results.
Finally, I delete the node I just handled from the per_game tibble, in both the black and white columns.

Here is my code:

for(i in seq_len(nrow(nodes))){
    # Exchange values of white with black column, only where black_id is the specific player
    per_game[per_game$black_id == nodes[[1]][i], c('white_id', 'black_id', 'winner')] <-
        per_game[per_game$black_id == nodes[[1]][i], c('black_id', 'white_id', swap('winner'))]
    
    # Calculate the victories, for each opponent of the specific player
    player_results <- calc_victories(nodes[[1]][i])
    
    # Append the player's matches with the rest.
    all_results <- rbind(all_results, player_results)
    
    # Delete all matches with the specific player, either if he/she is black or white
    per_game <- subset(per_game, white_id != nodes[[1]][i] & black_id != nodes[[1]][i])
}
all_results

Here are the functions calc_victories() and swap():

# A method to group the matches for a player, and sum his victories against each opponent
calc_victories <- function(i='-') {
    player_results <- per_game %>% 
        filter(white_id==i | black_id==i) %>% # Finds matches with the specific player
        group_by(white_id, black_id) %>%
        rename(player1=white_id, player2=black_id) %>%
        summarise_at(vars(total_matches), list(victories = sum)) %>% # Summarize total matches
        arrange(desc(victories)) %>% # Sorts descending
        ungroup()

    return (player_results)
}

# A method to change the winner, because of white-black column exchange
swap <- function(winner='draw') {
    if(winner=='black'){
        gets = 'white'
    } else if(winner=='white'){
        gets = 'black'
    } else {
        return (winner)
    }
    return(gets)
}

The code takes about 5 minutes to handle all nodes. I think this is mainly because I iterate over each node. Maybe I should use something like map, but I am not sure. Thank you.

Hussain Alsalman · Accepted Answer · 2021-04-19 00:15:18Z

General tips

1. Avoid writing your own loops. This is especially true in R, because chances are there is a more elegant way to solve the problem using highly optimized, vectorized functions.
2. Avoid the rbind function inside loops, especially if you are using it as an appending mechanism. On every call it creates a new data frame with a number of rows equal to the number of rows of the old one plus the number of rows of the new one, and then copies both into it. This can be really expensive with a large dataset, because the rows already collected are copied again on every iteration. A better solution is to create a container of the right size (for example a list) once, outside the loop, and populate it with the results using the index i, e.g. all_results[[i]] = new_results inside the loop.
3. Avoid subset inside loops. The function's documentation states that it is only a convenience function intended for interactive use, and it recommends [ for subsetting instead:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Moreover, subset has the same problem as rbind in tip #2: on every pass through the loop you write a new data frame, minus the rows just processed.
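
Tips 2 and 3 can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual pipeline: the toy per_game tibble, the players vector and the per-player summary (games_played) are made-up stand-ins. The results are collected in a list that is preallocated to the right length and combined once with dplyr's bind_rows(), and rows are selected with [ rather than subset().

```r
library(dplyr)

# Toy stand-in for per_game (hypothetical data, same column names as in the question)
per_game <- tibble(
  white_id      = c("a", "a", "b", "c"),
  black_id      = c("b", "c", "c", "a"),
  winner        = c("white", "black", "draw", "white"),
  total_matches = c(2, 1, 1, 3)
)
players <- c("a", "b", "c")  # hypothetical node list

# Preallocate a list once instead of growing a data frame with rbind()
results <- vector("list", length(players))

for (i in seq_along(players)) {
  p <- players[i]
  # Standard `[` subsetting instead of subset()
  games <- per_game[per_game$white_id == p | per_game$black_id == p, ]
  # Illustrative per-player summary: total games in which the player took part
  results[[i]] <- tibble(player = p, games_played = sum(games$total_matches))
}

# Combine all pieces in a single step after the loop
all_results <- bind_rows(results)
all_results
```

Because the list is filled in place and bound only once, the rows collected so far are not copied again on every iteration.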

Alternative solution

The problem can be solved with much less code than provided. It also only involves the per_game table, which is where the real data is.

First you need to break the winner column down into three columns, one for each possible outcome (black, draw, white). Since you seem familiar with the tidyverse family of packages, I will use functions from these packages.

library("tidyr")
library("dplyr")

per_game %>% 
    # Break the winner column into three columns (black, draw, white) using `spread()` from tidyr
    spread(winner, total_matches) %>% 
    # Add the black and white columns into a new column called `victories`
    rowwise() %>% # ensures the summation is executed per row
    mutate(victories = sum(black, white, na.rm = TRUE))

# the results should look like this 

# Rowwise: 
   white_id           black_id           black  draw white victories
   <chr>              <chr>              <dbl> <dbl> <dbl>     <dbl>
 1 --jim--            "voodooo"              1    NA    NA         1
 2 -1-jedi_knighl_-1- "erik123678"          NA    NA     1         1
 3 -1-jedi_knighl_-1- "kaY\\ran0098"         2    NA     2         4
 4 -1-jedi_knighl_-1- "pav1ngt"              1    NA    NA         1
 5 -mati-             "astronavtearl101"    NA    NA     1         1
 6 -pavel-            "hishamtheman"        NA    NA     1         1
 7 1063314            "dccbc,ss"            NA    NA     1         1
 8 1111112222         "crusova_ 33"          1    NA    NA         1
 9 1111112222         "steelviper"           1    NA    NA         1
10 1240100948         "aa22bb"               1    NA    NA         1
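
The same reshaping can be tried on a small made-up table. The sketch below is a variant of the pipeline above rather than the exact code of this answer: it assumes tidyr's pivot_wider() (the newer replacement for spread()), fills missing outcomes with 0 instead of NA, and adds the columns with vectorised + rather than rowwise(). The toy data and player names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for per_game (hypothetical data, same column names as in the question)
per_game <- tibble(
  white_id      = c("a", "a", "a", "b"),
  black_id      = c("b", "b", "c", "c"),
  winner        = c("white", "black", "draw", "black"),
  total_matches = c(2, 1, 1, 3)
)

edges <- per_game %>%
  # One column per outcome; outcomes that never occurred for a pairing become 0
  pivot_wider(
    names_from  = winner,
    values_from = total_matches,
    values_fill = 0
  ) %>%
  # Vectorised column addition, so no rowwise() is needed
  mutate(victories = black + white)

edges
```

Each row of edges is one white/black pairing with its outcome counts. As in the pipeline above, a pairing and its colour-reversed counterpart remain separate rows.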
