0

I've got some problems filtering for duplicate elements in a string. My data look similar to this:

idvisit     path
1           1,16,23,59,16
2           2,14,19,14
3           5,19,23
4           10,21
5           23,27,29,23

I have a column containing a unique ID and a column containing a path for web page navigation. The path column contains some cases where a page was accessed twice or more, with different pages between these accesses. I want to filter() for the rows where a page occurs twice or more and at least one other page lies between the two accesses, so the data should look like this:

idvisit     path
1           1,16,23,59,16
2           2,14,19,14
5           23,27,29,23

I just want to keep the rows that match these conditions. I really don't know how to handle a string when the repeated page can be any of many different numbers.

3 Answers

1

You can filter by comparing the number of elements in each path with the number of unique elements. A path with duplicated entries has more elements than unique values, i.e.

df1[sapply(strsplit(as.character(df1$path), ','), function(i) length(unique(i)) != length(i)),]
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23
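
For a self-contained check, here is a minimal base-R sketch. The data frame `df1` is rebuilt from the question's sample (assuming `path` is a character column), and `has_separated_repeat` is a hypothetical helper name, not a function from any package. Unlike the plain length comparison, it also applies the question's stricter condition that at least one other page sits between two visits to the same page.

```r
# Sample data rebuilt from the question (assumed character, not factor)
df1 <- data.frame(
  idvisit = 1:5,
  path = c("1,16,23,59,16", "2,14,19,14", "5,19,23", "10,21", "23,27,29,23"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if some page occurs more than once and its
# occurrences are not all adjacent, i.e. another page lies between them
has_separated_repeat <- function(path) {
  pages <- strsplit(path, ",", fixed = TRUE)[[1]]
  # Positions at which each distinct page occurs
  positions <- split(seq_along(pages), pages)
  any(vapply(positions,
             function(pos) (max(pos) - min(pos) + 1) > length(pos),
             logical(1)))
}

keep <- vapply(df1$path, has_separated_repeat, logical(1), USE.NAMES = FALSE)
df1[keep, ]  # keeps idvisit 1, 2 and 5
```

On the sample data this selects the same rows as the length comparison. The two would differ only for a path such as `7,7,3`, where the repeat is immediate.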

2 Comments

That works perfectly! Thanks a lot :) Do you have any useful link where I can learn more about this string stuff by myself? I searched the web a lot but did not find any useful hints for my problem.
Well, most of what I learned, I learned from here, to be honest.
1

We can try

library(data.table)
lst <- strsplit(as.character(df1$path), ",")  # as.character() in case path is a factor
df1[lengths(lst) != sapply(lst, uniqueN),]
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23

Or an option using tidyverse

library(tidyverse)
separate_rows(df1, path) %>% 
     group_by(idvisit) %>% 
     filter(n_distinct(path) != n()) %>% 
     summarise(path = toString(path))

2 Comments

That doesn't handle the case where the same page appears twice with no other page in between.
@ira Updated the post
0

You could try regular expressions too with grepl:

# (^|,) and (,|$) make the backreference match whole page numbers only
df[grepl('(^|,)([0-9]+),.*,\\2(,|$)', as.character(df$path), perl = TRUE),]
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23
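
To see why matching whole page numbers matters, here is a small sketch comparing two patterns. The test paths `3,5,31` and `7,7,3` are made up for illustration, and the anchored `strict` pattern is my assumption about what the question wants rather than something taken from it. Both patterns need two commas between the occurrences, so an immediate repeat does not match.

```r
# First path is from the question; the other two are made-up edge cases
paths <- c("1,16,23,59,16", "3,5,31", "7,7,3")

# Unanchored: the captured digits may be only part of a page number
loose <- ".*([0-9]+),.*,\\1"

# Anchored (assumed variant): the repeated page must be a whole
# comma-delimited number, at the start/end or surrounded by commas
strict <- "(^|,)([0-9]+),.*,\\2(,|$)"

loose_hits  <- grepl(loose,  paths, perl = TRUE)
strict_hits <- grepl(strict, paths, perl = TRUE)

loose_hits   # TRUE  TRUE FALSE: "3,5,31" matches because "3" prefixes "31"
strict_hits  # TRUE FALSE FALSE
```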
