I am reading external data using read.table() in R like:

student_record <- read.table("Address of data",fill = TRUE,col.names=c("student_id","name"))

The student ID is a 20-character string of the form STU01000001010001001, and I want to keep only the rows whose student ID satisfies all of the following conditions:

(0-2 = STU) AND
(5-9 != 11111) AND
(10-11 != (00 or 10)) AND
(12-17 != 111111) AND
(18-19 = 04)

Here 0, 2, and so on are zero-based character indices in the student ID. How can I filter records using conditions like these?

I executed this after read.table() to filter:

stu_record <- student_record[grepl("^STU.{2}(?!11111).(?!(00|10)).(?!111111).04", student_record[,1], perl=T),]

but the output is not correct: everything gets filtered out and I get an empty data frame.

When I executed this:

stu_record <- student_record[grepl("^STU.{2}(?!11111).(?!(00|10)).(?!111111)04", student_record[,1], perl=T),]  

then I do get records, but they are not correct: I see IDs such as STU13120600500000002, which should not appear because the last two characters should be 04.

UPDATE: A few of the rows I see after executing the above command are shown below. (The IDs are not filtered correctly: the last two digits should be 04, but I see 01.)

       student_id         Name    
  "STU01115000000000001"  "A"   
  "STU01115000000000001"  "B"   
  "STU01115000000000001"  "C"   
  "STU01115000000000001"  "D"   
  "STU01115000000000001"  "E"   
  "STU01115000000000001"  "F"   
  "STU01115000000000001"  "G"   
  "STU01115000000000001"  "H"   
  "STU01115000000000001"  "I"

Some of the IDs that should have been kept but were filtered out are:

      "STU01155000000000004"  "F"   
      "STU01135000000000004"  "G"   
      "STU01145000000000004"  "H"   
      "STU01125000000000004"  "I"

NOTE: Some positions in the string have no condition. For example, indices 3 and 4 are not filtered, so they can be anything.

3 Answers

2

This should work. I made up a test string.

string <- c("STU0100010", "STU0100010", "STU0300010", "STU0100090")

grepl("^STU(?!01).*(?!01|90)$", string, perl = T)
[1] FALSE FALSE  TRUE FALSE

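The pattern matches strings that start with STU and are not immediately followed by 01 (a negative lookahead assertion). Note that the trailing (?!01|90)$ does not test the last two characters: a lookahead inspects what comes after the current position, and at the end of the string nothing follows, so it always succeeds. In this test vector every FALSE comes from the STU01 prefix. To reject strings that end in 01 or 90, a negative lookbehind such as (?<!01|90)$ is needed instead.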
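
A quick way to check how the trailing assertion behaves is to add a string that passes the prefix test but ends in 90. The sketch below assumes one hypothetical extra test string, STU0300090, and compares the lookahead with a negative lookbehind, which looks at the characters before the current position rather than after it.

```r
# "STU0300090" is a hypothetical extra string: the prefix is fine, the ending is 90.
string <- c("STU0100010", "STU0300010", "STU0100090", "STU0300090")

# Lookahead at the end of the string: nothing follows, so it always succeeds.
with_lookahead <- grepl("^STU(?!01).*(?!01|90)$", string, perl = TRUE)

# Lookbehind at the end of the string: checks the two characters before it.
with_lookbehind <- grepl("^STU(?!01).*(?<!01|90)$", string, perl = TRUE)

print(with_lookahead)   # FALSE  TRUE FALSE  TRUE
print(with_lookbehind)  # FALSE  TRUE FALSE FALSE
```

Only the lookbehind version rejects the last string, so that is the form to use when the ending matters.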

1

Using df from @digEmAll

df[grepl("^STU.(?!01).{2}(?!(01|90))", df[,1], perl=T),]
#    student_id name
#1 STUx1000xx    A
#3 STU01008bb    C

2 Comments

I modified the post above based on your answer, but the output still does not come out correct. Can you take a look at it? Also, instead of a 10-character string I now have a 20-character string.
@user2966197: I think that in the newly updated df, substr(df[,1], 11,12) gives #[1] "00" "00" "00" "00" "00" "00" "00" "00" "00". So none of the elements satisfies the third condition.
1

You can use the substr function:

# example data
df <- 
data.frame(
student_id=c('STUx1000xx','STU00110xx','STU01008bb','STU01090aa'),
name=c('A','B','C','D'),stringsAsFactors=F)

# > df
#   student_id name
# 1 STUx1000xx    A
# 2 STU00110xx    B
# 3 STU01008bb    C
# 4 STU01090aa    D

# create filter using substr function
condition <- substr(df$student_id,1,3) == 'STU' &
             substr(df$student_id,5,6) != '01' &
             substr(df$student_id,7,8) != '01' &
             substr(df$student_id,7,8) != '90' 

filtered <- df[condition,]

# > filtered
#   student_id name
# 1 STUx1000xx    A
# 3 STU01008bb    C

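EDIT:

With the updated 20-character IDs, the condition becomes the following (substr uses 1-based positions, so each index from the question is shifted up by one):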

condition <- substr(df$student_id,1,3) == 'STU' &
             substr(df$student_id,6,10) != '11111' &
             substr(df$student_id,11,12) != '00' &
             substr(df$student_id,11,12) != '10' &
             substr(df$student_id,13,18) != '111111' &
             substr(df$student_id,19,20) == '04'
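
To see the updated condition in action, here is a minimal sketch on made-up data. All IDs except the last are hypothetical and were constructed so that each one violates at most one rule; the names sample_df, by_substr, by_regex and pattern are illustrative only. It also includes a regex version for comparison, under the assumption that each lookahead must be followed by a quantifier covering the whole field it guards (.{5}, .{2}, .{6}) rather than a single dot.

```r
# Hypothetical 20-character IDs, plus one ID taken from the question.
sample_df <- data.frame(
  student_id = c(
    "STU01222220533333304",  # satisfies every condition
    "STU01111110533333304",  # characters 6-10 are 11111
    "STU01222220033333304",  # characters 11-12 are 00
    "STU01222221033333304",  # characters 11-12 are 10
    "STU01222220511111104",  # characters 13-18 are 111111
    "STU01222220533333301",  # does not end in 04
    "STU01155000000000004"   # from the question: characters 11-12 are 00
  ),
  stringsAsFactors = FALSE
)

id <- sample_df$student_id

# substr() positions are 1-based, so index i in the question becomes i + 1.
by_substr <- substr(id, 1, 3) == "STU" &
  substr(id, 6, 10) != "11111" &
  !(substr(id, 11, 12) %in% c("00", "10")) &
  substr(id, 13, 18) != "111111" &
  substr(id, 19, 20) == "04"

# Regex version: each lookahead checks a field, then a quantifier consumes it.
pattern <- "^STU.{2}(?!11111).{5}(?!00|10).{2}(?!111111).{6}04$"
by_regex <- grepl(pattern, id, perl = TRUE)

print(identical(by_substr, by_regex))
print(sample_df[by_substr, , drop = FALSE])
```

Only the first row passes. The ID taken from the question is rejected because its characters 11-12 are 00, which the third condition excludes.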

9 Comments

I modified the post above regarding the string length and the conditions. I tried your approach with the new conditions, but it returns an empty frame (all records are deleted). I executed condition <- substr(student_record$student_id,1,3) == 'STU' & substr(student_record$student_id,6,10) != '11111' & substr(student_record$student_id,11,12) != '00' & substr(student_record$student_id,11,12) == '10' & substr(student_record$student_id,13,18) == '111111' & substr(data_emp_mat$series_id,19,20) == '04'
@user2966197: please modify your original post and add a small (say 10 rows) example of your data.frame and tell what's your expected result and what you actually get...
I added some of the rows that I get after executing the command mentioned in the main post.
@user2966197: mmh... maybe I wasn't clear enough... you should post some lines of your original data.frame (say 3 lines that you expect to be filtered and 3 that you expected to pass the condition) and then write your expected result of the filtering process on this sample data...
I have mentioned some data which should have been there but got filtered out.