Remove part of a string in dataframe column (R)

Question

I have a dataframe (df) with a column (Col2) like this:

Col1                 Col2                   Col3
  1   C607989_booboobear_Nation               A
  2   C607989_booboobear_Nation               B
  3   C607989_booboobear_Nation               C
  4   C607989_booboobear_Nation               D
  5   C607989_booboobear_Nation               E
  6   C607989_booboobear_Nation               F

I want to extract just the number in Col2:

Col1              Col2                    Col3
  1              607989                     A
  2              607989                     B
  3              607989                     C
  4              607989                     D
  5              607989                     E
  6              607989                     F

I have tried things like:

gsub("^.*?_","_",df$Col2)

but it's not working.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-08-13 02:35:34Z


If your string is not too fancy/complex, it might be easiest to do something like:

gsub("C([0-9]+)_.*", "\\1", df$Col2)
# [1] "607989" "607989" "607989" "607989" "607989" "607989"

The pattern matches a "C", followed by one or more digits, followed by an underscore and then anything else. The digits are captured with (), and the replacement is set to that capture group (\\1), so the whole string is replaced by the number alone.
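To make the difference from the attempt in the question concrete, here is a minimal sketch in base R. The vector x is just illustrative input copied from the question, and the anchored pattern is my own variation that assumes every value starts with "C" followed by digits and an underscore.

```r
# Illustrative input, copied from the question
x <- rep("C607989_booboobear_Nation", 6)

# The attempt from the question: the matched text (everything up to the
# first "_") is what gets replaced, so the digits are discarded, not kept
attempt <- gsub("^.*?_", "_", x)

# Anchored variant of the capture-group pattern; sub() is enough here
# because each string can match at most once
digits <- sub("^C([0-9]+)_.*$", "\\1", x)

# The result is still character; convert if a numeric column is wanted
numbers <- as.numeric(digits)
```

Note that sub() and gsub() return a string unchanged when the pattern does not match, so values in a different format would pass through silently.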


lawyeR Over a year ago

@Ananda Mahto: could be a bit easier for someone to decipher if the call were dat2 <- gsub("_.*$", "", gsub("^C", "", dat$Col2)). Remove from underscore on to end, then remove C at the start.
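Written out step by step, the two-pass idea from this comment might look like the sketch below. The small data frame is a stand-in for the one in the question, and no_prefix is a hypothetical intermediate name used only for readability.

```r
# Stand-in for the data frame in the question (three rows for brevity)
dat <- data.frame(
  Col1 = 1:3,
  Col2 = rep("C607989_booboobear_Nation", 3),
  Col3 = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Step 1: drop the leading "C"
no_prefix <- sub("^C", "", dat$Col2)

# Step 2: drop everything from the first underscore to the end
dat$Col2 <- sub("_.*$", "", no_prefix)
```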

Tyler Rinker · Accepted Answer · 2014-08-13 03:11:19Z

An alternative approach uses qdap::genXtract, which grabs the text between a left and a right boundary. Here I use C and _ as the left and right bounds:

## Your data in a better form for sharing
dat <- structure(list(Col1 = c("1", "2", "3", "4", "5", "6"), Col2 = c("C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation", "C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation"), Col3 = c("A", 
    "B", "C", "D", "E", "F")), .Names = c("Col1", "Col2", "Col3"), row.names = c(NA, 
    -6L), class = "data.frame")

library(qdap)
dat[[2]] <- unlist(genXtract(dat[[2]], "C", "_"))
dat

##   Col1   Col2 Col3
## 1    1 607989    A
## 2    2 607989    B
## 3    3 607989    C
## 4    4 607989    D
## 5    5 607989    E
## 6    6 607989    F

akrun · Accepted Answer · 2014-08-13 08:11:46Z

Or you could use a regex lookbehind:

library(stringr)
 str_extract(dat$Col2, perl('(?<=[A-Z])\\d+'))
 #[1] "607989" "607989" "607989" "607989" "607989" "607989"

(?<=[A-Z]) is a lookbehind: it matches only where the text that follows is immediately preceded by a capital letter. The lookbehind pattern must have a fixed length; in this case it is 1.

\\d+ is the pattern to be extracted: one or more digits.

In these strings, the only place where a capital letter is directly followed by digits is the C in C607989_booboobear_Nation, so only those digits are extracted.

Suppose one of the strings has the number in the middle instead:

 v1 <- c(dat$Col2, "booboobear_D600078_Nation")
 str_extract(v1, perl('(?<=[A-Z])\\d+'))
 #[1] "607989" "607989" "607989" "607989" "607989" "607989" "600078"

The lookbehind still gets the number.
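If you would rather not depend on stringr (newer releases have, as far as I know, dropped the perl() wrapper used above), the same lookbehind can be sketched in base R with regexpr() and regmatches(). The third element of v1 is an assumption of mine, added to show what happens when nothing matches.

```r
# Two strings from the answer above plus a hypothetical non-matching one
v1 <- c("C607989_booboobear_Nation",
        "booboobear_D600078_Nation",
        "no_digits_here")

# perl = TRUE is required for the lookbehind; regexpr() returns -1
# for elements with no match
m <- regexpr("(?<=[A-Z])[0-9]+", v1, perl = TRUE)
found <- as.integer(m) != -1L

# regmatches() silently drops non-matches, so fill a vector of NAs
# to keep the result aligned with the input
out <- rep(NA_character_, length(v1))
out[found] <- regmatches(v1, m)
out
```

Pre-filling with NA matters when the result goes back into a data frame column, because the output has to stay the same length as the input.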
