3

My question is What's the fastest (quality is important too, but a little less important) way to compare two strings?

I'm looking for the most efficient way to compare two strings. Some of the strings I'm comparing can be over 5000 characters long. I'm comparing a list of about 80 strings with another list of about 200 strings. It takes forever, even when I'm threading it. I'm using the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method follows. Is there a better way to do this?

private void compareMe() {
  List<String> compareStrings = MainController.getInstance().getCompareStrings();
  for (String compare : compareStrings) {
    int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
    if (bestScore > levenshteinDistance
          && levenshteinDistance > -1) {
      bestScore = levenshteinDistance; //global variable
      bestString = compare; //global variable
    }
  }
}

Here's a sample of two strings which should have a good score:

String 1:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')

String 2:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')

You'll notice the only difference is the '${request.corp_vendor_id};' at the end of the string. This would cause it to have a score of 26 from the LevenshteinDistance method.

3
  • 2
    Define what you mean by "compare". "Compare" usually means ==/!= or >/==/<, but since you're using a distance function you obviously don't want a binary comparison. Commented May 24, 2012 at 18:55
  • 1
    Without any knowledge about the content of the strings, there is no real optimization possible (other to avoid to compare A B AND B A) Commented May 24, 2012 at 18:58
  • About all you can do is get the source for your compare method and see if you can "tighten it up" somehow. Any "distance" compare is going to be expensive. If you only need a "go/no-go" result, though (instead of the actual score in all cases), you could probably use some pre-conditioning tests, based on the specifics of your data. Commented May 24, 2012 at 19:21

1 Answer 1

3

You should think about possible shortcuts in your comparison logic to avoid some calculations at all. So if you want to minimize the Levensthein distance globally, you do not even need to calculate it, if the difference of the string sizes is higher than your current best Levenshtein distance.

E.g. if your current best Levenshtein distance is 50, then you can avoid comparing two strings of size 100 and 180, because their Levenshtein distance is at least 80.

Sign up to request clarification or add additional context in comments.

1 Comment

Just so you know. This about quadrupled the speed of my method. Thanks a ton!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.