0

I have a dataframe to which I want to create a new column based on the values from another column but struggling to be able to match properly.

df1
name            code
Player 3        NA
Player 14       NA
Player 16       NA
Player 22       NA
Player 43       NA
Player 45       NA

Now I wish to fill the code column in df1 from the code column in df2 my matching on name

df2
name            id      nationality
Player 1        1       UK
Player 2        2       UK
Player 3        3       UK
Player 4        4       UK
Player 5        5       UK
Player 14       14      UK
Player 16       16      UK
Player 22       22      UK
Player 29       29      UK
Player 30       30      UK
Player 32       32      UK
Player 39       39      UK
Player 43       43      UK
Player 45       45      UK

I dont want to use merge here as df2 will be much bigger than df2 and completely separate, it would be something like; (but I cant get it correct)

df1$code = df2[match(df1$name, df2$name), 'id')
3
  • 1
    look at merge's all.x = TRUE argument. Commented Jul 6, 2023 at 13:51
  • 1
    match works for this because you are only matching on one column. merge works for this too, but generalizes up to matching on multiple columns. It doesn't matter that df2 is bigger than df1, merge() will still work just fine. Commented Jul 6, 2023 at 13:52
  • 1
    I'm marking this one as a duplicate too - please don't take that as an indicator that you should delete it. Duplicates serve as helpful pointers to the originals where thorough, updates answers live. In addition to the FAQ on joining (which works fine here), I've added the FAQ on matching. Commented Jul 6, 2023 at 14:06

3 Answers 3

2

I think this is what you need. match returns the index in the second argument of matches from the first.

df1$code = df2$id[match(df1$name, df2$name)]
Sign up to request clarification or add additional context in comments.

Comments

2

match works for this because you are only matching on one column. merge works for this too, but generalizes up to matching on multiple columns.

It doesn't matter that df2 is bigger than df1, merge() will still work just fine as long as you don't override the default and set all = TRUE - if you do that, then you will get all the rows from df2. The default is all = FALSE and you will only get rows that appear in both data frames. Here, I set all.x = TRUE to make sure you keep all rows in df1 even if they don't have matches in df2.

Because merge is more general (working for multiple columns, letting you specify whether you want to keep only rows that occur in df1 or only rows that occur in df2 or both or all), I think it is a better solution when working with data frames. match is a great function when one (or both) of your inputs are plain vectors, not data frames.

## with dplyr
library(dplyr)
df1 |> select(-code) |>
  left_join(select(df2, -id), by = "name")
#        name nationality
# 1  Player 3          UK
# 2 Player 14          UK
# 3 Player 16          UK
# 4 Player 22          UK
# 5 Player 43          UK
# 6 Player 45          UK

## with base R
df1[["code"]] = NULL
merge(df1, df2[c("name", "nationality")], all.x = TRUE)
#        name nationality
# 1 Player 14          UK
# 2 Player 16          UK
# 3 Player 22          UK
# 4  Player 3          UK
# 5 Player 43          UK
# 6 Player 45          UK

Unfortunately merge doesn't keep the row order, but you can easily re-order after.


Using this sample data:

df1 = read.table(text = 'name|code
Player 3|NA
Player 14|NA
Player 16|NA
Player 22|NA
Player 43|NA
Player 45|NA', header = T, sep = "|")

df2 = read.table(text = 'name|id|nationality
Player 1|1|UK
Player 2|2|UK
Player 3|3|UK
Player 4|4|UK
Player 5|5|UK
Player 14|14|UK
Player 16|16|UK
Player 22|22|UK
Player 29|29|UK
Player 30|30|UK
Player 32|32|UK
Player 39|39|UK
Player 43|43|UK
Player 45|45|UK
', header = TRUE, sep = "|")

Comments

-1

my answer: df1$code = df2[df2$name %in% df1$name,2]

I guess %in% here is almost the same with match. you can also try which(df2$name == df1$name) to return the rows you want

2 Comments

%in% is defined as match(x, table, nomatch = 0L) > 0L. So yes it uses match
This is risky - %in% doesn't care about order so it will only give the right answer if rows in df2 happen to be in the same order as the rows in df1 I would not recommend this as a good solution for that reason.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.