17

I have a dataframe (df) with a column (Col2) like this:

Col1                 Col2                   Col3
  1   C607989_booboobear_Nation               A
  2   C607989_booboobear_Nation               B
  3   C607989_booboobear_Nation               C
  4   C607989_booboobear_Nation               D
  5   C607989_booboobear_Nation               E
  6   C607989_booboobear_Nation               F

I want to extract just the number in Col2, like this:

Col1              Col2                    Col3
  1              607989                     A
  2              607989                     B
  3              607989                     C
  4              607989                     D
  5              607989                     E
  6              607989                     F

I have tried things like:

gsub("^.*?_","_",df$Col2)

but it's not working.

3 Answers

14

If your string is not too fancy/complex, it might be easiest to do something like:

gsub("C([0-9]+)_.*", "\\1", df$Col2)
# [1] "607989" "607989" "607989" "607989" "607989" "607989"

The pattern matches a "C", followed by one or more digits, then an underscore, and then anything else. The digits are captured with (), and the replacement is set to that capture group (\\1), so the whole string is replaced by just the digits.
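A minimal sketch of the same capture-group idea on a self-contained vector, assuming every value has the form C<digits>_<text>. The vector x, the second sample value, and the ^ and $ anchors are illustrative additions, not part of the original answer. Anchoring forces the pattern to describe the whole string, and as.numeric() converts the result if numbers rather than character strings are wanted.

```r
# Hypothetical sample data in the same shape as df$Col2
x <- c("C607989_booboobear_Nation", "C12_other_Name")

# Keep only the digits captured by the parentheses
ids <- sub("^C([0-9]+)_.*$", "\\1", x)
ids

# Convert to numeric if the column should hold numbers
as.numeric(ids)
```

Note that sub() returns a string unchanged when the pattern does not match, so values that do not follow the assumed form would pass through as-is.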


1 Comment

@Ananda Mahto: this could be a bit easier for someone to decipher if the call were dat2 <- gsub("_.*$", "", gsub("^C", "", dat$Col2)). The inner call removes the C at the start, and the outer call removes everything from the first underscore to the end.
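The two-step approach from this comment can be sketched with the intermediate result made visible. The vector x and the variable names are hypothetical, and sub() is used in place of gsub() on the assumption that each pattern needs to match only once per string.

```r
# Hypothetical sample data in the same shape as dat$Col2
x <- c("C607989_booboobear_Nation", "C12_other_Name")

# Step 1: drop the leading "C"
no_prefix <- sub("^C", "", x)

# Step 2: drop everything from the first underscore to the end
ids <- sub("_.*$", "", no_prefix)
ids
```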
3

An alternate approach using qdap::genXtract that grabs strings between a left and right boundary. Here I use C and _ for the left and right bounds:

## Your data in a better form for sharing
dat <- structure(list(Col1 = c("1", "2", "3", "4", "5", "6"), Col2 = c("C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation", "C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation"), Col3 = c("A", 
    "B", "C", "D", "E", "F")), .Names = c("Col1", "Col2", "Col3"), row.names = c(NA, 
    -6L), class = "data.frame")

library(qdap)
dat[[2]] <- unlist(genXtract(dat[[2]], "C", "_"))
dat

##   Col1   Col2 Col3
## 1    1 607989    A
## 2    2 607989    B
## 3    3 607989    C
## 4    4 607989    D
## 5    5 607989    E
## 6    6 607989    F


3

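Or, you could use a regex lookbehind:

library(stringr)
# Older stringr releases needed perl() around the pattern for lookbehind;
# releases from 1.0.0 on use ICU regular expressions and take the pattern directly.
str_extract(dat$Col2, "(?<=[A-Z])\\d+")
# [1] "607989" "607989" "607989" "607989" "607989" "607989"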

(?<=[A-Z]) is a lookbehind: it matches only if the searched substring is immediately preceded by a capital letter. The lookbehind pattern has a fixed length, which in this case is 1.

\\d+ is the pattern to be extracted: one or more digits.

In these strings, this occurs only at C607989, so only the digits that follow the capital letter are extracted.

Suppose the number appears later in the string, like this:

# perl() dropped here as well; current stringr takes the pattern directly
v1 <- c(dat$Col2, "booboobear_D600078_Nation")
str_extract(v1, "(?<=[A-Z])\\d+")
# [1] "607989" "607989" "607989" "607989" "607989" "607989" "600078"

The number is still extracted.
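For readers without stringr, the same lookbehind can be sketched in base R with regexpr() and regmatches(). This is an alternative under stated assumptions, not part of the original answer: the vector x is hypothetical, and perl = TRUE is needed because base R's default regex engine does not support lookbehind.

```r
# Hypothetical sample data: the number follows a capital letter in both strings
x <- c("C607989_booboobear_Nation", "booboobear_D600078_Nation")

# Locate the first run of digits preceded by a capital letter
m <- regexpr("(?<=[A-Z])[0-9]+", x, perl = TRUE)

# Pull out the matched text
nums <- regmatches(x, m)
nums
```

One design caveat: regmatches() silently drops elements with no match, so the result can be shorter than the input when some strings contain no such number, whereas str_extract() returns NA for them.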
