
I have a general question. I am trying to do string matching between data frames in R. My strings have the format below:

"COOL FOODS LTD 222 HIGH ST LONDON ABC123"  

I would like to iterate over other data frames and have my code find matches between the above string and strings like the ones below:

"222 HIGH ST LONDON ABC123 COOL FOODS LTD " 
"HIGH LTD ST 222 LONDON COOL ABC123 FOODS "
"COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123"

I tried adist, but the similarity scores I get using that method are not very good when parts of the string are rearranged or when the inserted part is long (as per the examples).

I thought about splitting my strings by white spaces, but I'm not sure how to then do the matching and comparing efficiently with many data frames.

I would be grateful for any suggestions!

Cheers!

  • You can try the stringdist package. Maybe the function amatch contains a suitable method. If you'd like to split your string by whitespaces, you could then use something like mean(wordsOfString1 %in% wordsOfString2) Commented Aug 27, 2018 at 10:44
  • This question may help you. Commented Aug 27, 2018 at 10:56
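
To make the whitespace-splitting idea from the first comment concrete, here is a minimal base R sketch. The helper name token_overlap is made up for illustration, and the score is simply the share of words in the first string that also occur in the second, so treat it as one possible reading of the suggestion rather than a finished solution.

```r
# Hypothetical helper: share of the words in `a` that also occur in `b`.
token_overlap <- function(a, b) {
  words_a <- strsplit(trimws(a), "\\s+")[[1]]
  words_b <- strsplit(trimws(b), "\\s+")[[1]]
  mean(words_a %in% words_b)
}

# Strings taken from the question.
x <- "COOL FOODS LTD 222 HIGH ST LONDON ABC123"
y <- c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
       "HIGH LTD ST 222 LONDON COOL ABC123 FOODS ",
       "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123")

# Score x against every candidate; word order is ignored entirely.
scores <- vapply(y, function(.y) token_overlap(x, .y), numeric(1),
                 USE.NAMES = FALSE)
scores

# The score is not symmetric: extra words only count against the
# string passed as the first argument.
token_overlap(y[3], x)
```

Because every word of x appears in each candidate, all three get the same score here, so the inserted "UNITED KINGDOM" is only visible when the arguments are swapped. Taking the smaller of the two directions would be one way to penalise extra words.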

1 Answer

Using the stringdist package you can write a helper function that compares a string to each target string in a vector.
The function below first splits every string on spaces with strsplit and sorts the resulting words with sort, so that word order no longer matters. It then calls stringsim to compute a similarity score between 0 and 1.

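library(stringdist)

# Query string and candidate strings from the question.
x <- "COOL FOODS LTD 222 HIGH ST LONDON ABC123"
y <- c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
       "HIGH LTD ST 222 LONDON COOL ABC123 FOODS ",
       "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123")

funSimilarity <- function(x, y, method = "osa"){
    # Sort the words of x so that word order no longer matters.
    x <- strsplit(x, " ")[[1]]
    x <- paste(sort(x), collapse = " ")
    # Do the same for every target string in y.
    y_list <- strsplit(y, " ")
    y_list <- lapply(y_list, function(.y) paste(sort(.y), collapse = " "))
    stringsim(x, unlist(y_list), method = method)
}

funSimilarity(x, y)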
#[1] 1.0000000 1.0000000 0.7272727

met <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex")

sapply(met, function(m) funSimilarity(x, y, method = m))
#           osa        lv        dl hamming       lcs     qgram    cosine
#[1,] 1.0000000 1.0000000 1.0000000       1 1.0000000 1.0000000 1.0000000
#[2,] 1.0000000 1.0000000 1.0000000       1 1.0000000 1.0000000 1.0000000
#[3,] 0.7272727 0.7272727 0.7272727       0 0.8421053 0.8421053 0.9689541
#       jaccard        jw soundex
#[1,] 1.0000000 1.0000000       1
#[2,] 1.0000000 1.0000000       1
#[3,] 0.8095238 0.8632576       1
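
Since the question asks about matching against several data frames, here is a rough sketch of how the same sort-then-compare idea could be looped over them. It assumes the stringdist package is installed; the helper sort_tokens, the data frames df1 and df2, the column name address and the second row of df1 are all invented for illustration.

```r
library(stringdist)  # assumed to be installed

# Hypothetical helper: order-insensitive key built from the sorted words.
sort_tokens <- function(s) {
  vapply(strsplit(trimws(s), "\\s+"),
         function(w) paste(sort(w), collapse = " "),
         character(1))
}

# Hypothetical data frames; `address` is an assumed column name.
frames <- list(
  df1 = data.frame(address = c("222 HIGH ST LONDON ABC123 COOL FOODS LTD ",
                               "1 LOW RD LEEDS XYZ789 WARM DRINKS LTD"),
                   stringsAsFactors = FALSE),
  df2 = data.frame(address = "COOL FOODS LTD 222 HIGH ST LONDON UNITED KINGDOM ABC123",
                   stringsAsFactors = FALSE)
)

x <- "COOL FOODS LTD 222 HIGH ST LONDON ABC123"
key_x <- sort_tokens(x)

# For each data frame, keep the row whose sorted key is most similar to x.
best <- do.call(rbind, lapply(names(frames), function(nm) {
  sims <- stringsim(key_x, sort_tokens(frames[[nm]]$address), method = "osa")
  i <- which.max(sims)
  data.frame(frame = nm, row = i, match = frames[[nm]]$address[i],
             similarity = sims[i], stringsAsFactors = FALSE)
}))
best
```

Computing the sorted key once per string, rather than inside every comparison, keeps the loop cheap. A cut-off on the similarity column (the right value depends on your data) can then decide which best matches to accept.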