Filtering Rows matching String condition in R

Question

I've got some problems filtering for duplicate elements in a string. My data look similar to this:

idvisit     path
1           1,16,23,59,16
2           2,14,19,14
3           5,19,23
4           10,21
5           23,27,29,23

I have a column containing a unique ID and a column containing a path of web page navigation. The path column contains some cases where a page was accessed twice or more, with some different pages in between those accesses. I just want to filter() the rows where a page occurs twice or more and at least one other page is between the two accesses, so the data should look like this:

idvisit     path
1           1,16,23,59,16
2           2,14,19,14
5           23,27,29,23

I just want to keep the rows that match these conditions. I really don't know how to handle a string like this, where the repeated number can be any of many different values.

Sotos · Accepted Answer · 2017-02-22 10:54:45Z

1

You can filter based on the number of elements in each string. A path with duplicated entries has more elements than it has unique elements, i.e.

df1[sapply(strsplit(as.character(df1$path), ','), function(i) length(unique(i)) != length(i)),]
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23
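
For readers who want to run this end to end, here is a self-contained sketch of the same idea. It assumes the data frame is called df1 and that path is stored as character; the data are typed in from the question's table, and `pages` and `has_repeat` are illustrative names. Note that this check flags any repeated page, including back-to-back repeats.

```r
# Example data reconstructed from the table in the question
df1 <- data.frame(
  idvisit = 1:5,
  path = c("1,16,23,59,16", "2,14,19,14", "5,19,23", "10,21", "23,27,29,23"),
  stringsAsFactors = FALSE
)

# Split each path into its individual pages
pages <- strsplit(df1$path, ",", fixed = TRUE)

# TRUE where a path has more elements than distinct elements
has_repeat <- vapply(
  pages,
  function(p) length(unique(p)) != length(p),
  logical(1)
)

# Keep only the rows whose path contains a repeated page
result <- df1[has_repeat, ]
print(result)
```

Using vapply() instead of sapply() is a stylistic choice here: it guarantees a logical vector even when the input is empty.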

2 Comments

Sebastian Ettner Over a year ago

That works perfectly! Thanks a lot :) Do you have any useful links where I can learn more about this string stuff by myself? I searched the web a lot but did not find any useful hints for my problem.

Sotos Over a year ago

Well, most of what I learned, I learned from here, to be honest.

akrun · Accepted Answer · 2017-02-22 11:06:01Z

1

We can try

library(data.table)
lst <- strsplit(as.character(df1$path), ",")  # as.character() guards against a factor column
df1[lengths(lst) != sapply(lst, uniqueN),]
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23

Or an option using tidyverse

library(tidyverse)
separate_rows(df1, path) %>% 
     group_by(idvisit) %>% 
     filter(n_distinct(path) != n()) %>% 
     summarise(path = toString(path))

edited Feb 22, 2017 at 11:06

answered Feb 22, 2017 at 10:45

2 Comments

ira Over a year ago

That doesn't handle the case where the same page appears twice in a row, with no other page in between.

akrun Over a year ago

@ira Updated the post
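
The case raised in this comment thread (the same page twice in a row, with no other page in between) can be handled by collapsing consecutive repeats before checking for duplicates. The following is a minimal base R sketch of that idea, not necessarily what the updated answer did; the helper name `has_gapped_repeat` and the example paths are made up for illustration.

```r
# Hypothetical helper: TRUE if some page recurs with a different page in between
has_gapped_repeat <- function(path) {
  pages <- strsplit(path, ",", fixed = TRUE)[[1]]
  # rle() collapses each run of identical consecutive pages into one value
  collapsed <- rle(pages)$values
  # Any duplicate left after collapsing must be separated by another page
  anyDuplicated(collapsed) > 0
}

# Made-up paths for illustration (not from the question)
paths <- c("1,16,23,59,16", "5,19,23", "7,7,9", "7,7,9,7")
flags <- vapply(paths, has_gapped_repeat, logical(1), USE.NAMES = FALSE)
print(flags)
```

With a data frame like the one in the question, the same helper could be applied as `df1[vapply(as.character(df1$path), has_gapped_repeat, logical(1)), ]`.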

Sandipan Dey · Accepted Answer · 2017-02-22 11:05:56Z

0

You could try regular expressions too with grepl:

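# (^|,) and (,|$) make the captured number a whole page ID, not part of a longer one
df[grepl('(^|,)([0-9]+),.*,\\2(,|$)', as.character(df$path), perl = TRUE),]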
#  idvisit          path
#1       1 1,16,23,59,16
#2       2    2,14,19,14
#5       5   23,27,29,23
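
To see why it matters whether the captured number is anchored at comma boundaries, the sketch below compares an unanchored and an anchored pattern on a few made-up paths (not from the question). The names `loose` and `strict` are illustrative, and perl = TRUE is used here only as a cautious choice for the backreference.

```r
# Made-up paths: only the first one has a page that recurs with another page in between
paths <- c("16,23,16", "12,5,2", "1,5,12", "16,16")

# Unanchored: the captured digits may be only part of a longer number,
# e.g. the "2" in "12" is treated as a repeat of the page "2"
loose <- grepl("([0-9]+),.*,\\1", paths, perl = TRUE)

# Anchored: the captured number must start at ^ or after a comma,
# and its repeat must end at a comma or at $
strict <- grepl("(^|,)([0-9]+),.*,\\2(,|$)", paths, perl = TRUE)

print(data.frame(paths, loose, strict))
```

Both patterns require two commas between the occurrences, so a back-to-back repeat such as "16,16" is not matched by either.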
