14

I have a list of words I want to ignore, like this one:

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North", I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

7
  • 1
    But it's not one line @htw. You don't get any geek points if it's not one line. Commented Sep 14, 2010 at 19:56
  • 8
    Don't let this program run in Charlotte, NC. Prominent road names happen to be East Blvd, South Blvd, West Blvd. Those are the names of the roads, not a differentiation as in "now you're on West 1st Street". On that note, there are other scenarios where your directions aren't really directions, but key parts of the identifier: Northampton, Northlake (mall/area in Charlotte), North Carolina, North Dakota, etc. Commented Sep 14, 2010 at 19:57
  • @Anthony: This is true, I will be careful with what I put in my dictionary. However, I match on postal code (zip) first, which must match exactly for the function to even consider the addresses. From there, I'd rather get false positives than miss results. Commented Sep 14, 2010 at 20:06
  • Then you will be pleased to know that East, West, and South Blvds all intersect! They will share a zip! I'm convinced if you can get your program to run in Charlotte, you can get it to run anywhere. Commented Sep 14, 2010 at 20:13
  • 1
    And Canada is totally free of North/South streets/boulevards? I think Anthony's comment was a lot more generic than your problem statement. Commented Sep 14, 2010 at 20:37

11 Answers

14

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or for .NET 3.5:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.
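
The comments on this answer raise case-insensitivity and punctuation. As a rough sketch only (the class name, the separator set, and the use of a case-insensitive HashSet are assumptions, not part of the original answer), the same split-and-filter idea could be adapted like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressFilter
{
    // Case-insensitive set, so "north" and "NORTH" are ignored as well.
    private static readonly HashSet<string> IgnoreSet =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "North", "South", "East", "West"
        };

    // Assumed word separators; adjust to whatever the matching algorithm needs.
    private static readonly char[] Separators = { ' ', ',', '-' };

    public static string RemoveIgnoredWords(string text)
    {
        // Split into whole words, drop the ignored ones, and rejoin with single spaces.
        var kept = text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !IgnoreSet.Contains(word));
        return string.Join(" ", kept);
    }
}
```

Because the words are rejoined with single spaces, the result has no trailing space, which differs slightly from the "14th Avenue " shown in the question. The period is deliberately left out of the separators here so that entries such as "St." could still be matched as whole words.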

8 Comments

This is a great solution, both shorter and clearer than the regex versions.
You might as well split by the words - text.Split(ignoreList.ToArray(), StringSplitOptions.None). That said, it is easier to adapt your approach to ignore case.
What about punctuation before or after words?
Kobi: text.Split(ignoreList.ToArray()) doesn't work for the same reason all the string.Replace methods don't work.
Mark: Presumably he would want to consider punctuation to be word-breakers. It's up to him, but I'd guess he'd want text.Split(new[]{' ','.',',','-'}) but he can tweak it to support whatever algorithm he has.
6
Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);

4 Comments

If there are special characters, you should escape the stuff in ignoreList: string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray())
Since odds are the list will contain words like "St.", escaping is advised. And you have to look only for whole words.
@Frank Correct . . . though it isn't really specified where the list comes from. It would probably be easiest to just write the correct regular expression in the first place rather than to convert it from a list, unless the list is really necessary.
Yeah, building a Regex dynamically is only really worthwhile if the list contents might change. Using a Regex in general is only useful if this function is used a lot, as it's potentially faster than N string replacements.
5

Something like this should work:

string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
  return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}

4 Comments

I might have swapped around the parameters to the second lambda, but this will definitely work. Aggregate is an incredibly powerful method; it's lame people don't use it very often.
It should be noted that I doubt calling Replace multiple times is the most performant way of doing this. Probably something where you build the contents of the list into a static RegEx and use that to replace would be faster, but I suspect the difference won't matter in this case.
This is not correct because it uses string.Replace which can't match only on a word boundary. If you're going to use a RegEx, though, it should use a single compiled one.
Good point @Gabe the example is more about the usage of Aggregate than of Replace.
3

What's wrong with a simple for loop?

string street = "14th Avenue North";
foreach (string word in ignoreList)
{
    street = street.Replace(word, string.Empty);
}

2

If you know that the list of words contains only characters that do not need escaping inside a regular expression, then you can do this:

string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
                        string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");

Result:

14th Avenue 

If there are special characters you will need to fix two things:

  • Use Regex.Escape on each element of ignore list.
  • The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.

Here's how to fix these two problems:

Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
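
To make the lookaround version concrete, here is a minimal sketch. The class name, the sample entries "St." and "Blvd.", and the extra step that collapses leftover spaces are all assumptions added for illustration; only the pattern itself comes from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexAddressFilter
{
    // Hypothetical ignore list that includes entries needing escaping.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "St.", "Blvd." };

    // Built once and reused: each word is escaped, then wrapped in
    // lookarounds that require a space or the start/end of the string.
    private static readonly Regex IgnoreRegex = new Regex(
        string.Format(@"(?<= |^)({0})(?= |$)",
            string.Join("|", IgnoreList.Select(s => Regex.Escape(s)).ToArray())),
        RegexOptions.Compiled);

    public static string RemoveIgnoredWords(string text)
    {
        string stripped = IgnoreRegex.Replace(text, string.Empty);
        // Assumed cleanup: collapse the runs of spaces left behind, then trim.
        return Regex.Replace(stripped, " {2,}", " ").Trim();
    }
}
```

The cleanup step is a design choice rather than a requirement; without it, removing a word in the middle of the string leaves two adjacent spaces, which may or may not matter for the similarity comparison.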

3 Comments

It's a pretty good bet that his words will need escaping, because they'll be like "St.", "Blvd.", "Rd."
That's a great way to handle the space problem raised in another comment.
This is very clever and it seems like it would work on all the words. I will write some tests for it and try it out properly.
1

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:

address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));

If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

1 Comment

Thanks a lot for that. My ignore list will obviously be much longer than what I posted here, but not sure if it will be long enough to use your method. I will profile it and see though.
1

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.

List<string> ignoreList = new List<string>()
{
    "North",
    "South",
    "East",
    "West"
};    

string s = "123 West 5th St"
        .Split(' ')  // Separate the words to an array
        .ToList()    // Convert the array to a List<string>
        .Except(ignoreList) // Remove ignored keywords
        .Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string
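
One caveat worth checking before relying on Except: it is a set operation, so it also collapses repeated words in the input. The sketch below (class and method names are made up for illustration) contrasts it with a Where filter, which keeps duplicates:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptVersusWhere
{
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Except yields each remaining word only once.
    public static string WithExcept(string text)
    {
        return string.Join(" ", text.Split(' ').Except(IgnoreList));
    }

    // Where keeps every occurrence of a word that is not ignored.
    public static string WithWhere(string text)
    {
        return string.Join(" ", text.Split(' ').Where(w => !IgnoreList.Contains(w)));
    }
}
```

For an input such as "12 Main 12 West", the Except version drops the second "12" while the Where version keeps it. Separately, Aggregate without a seed throws on an empty sequence, so string.Join is the safer way to rebuild the string if every word might be filtered out.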

1 Comment

The .ToList() is unnecessary.
0

Why not just keep it simple?

public static string Trim(string text)
{
   var rv = text.Trim();
   foreach (var ignore in ignoreList)
   {
      // Only strip the word when the address ends with it
      if (rv.EndsWith(ignore))
      {
         rv = rv.Replace(ignore, string.Empty);
      }
   }
   return rv;
}

0

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:

string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "

0
public static string Trim(string text)
{
   var rv = text;
   foreach (var ignore in ignoreList)
      rv = rv.Replace(ignore, "");
   return rv;
}

Updated For Gabe


public static string Trim(string text)
{
   var rv = "";
   var words = text.Split(' ');
   foreach (var word in words)
   {
      var present = false;
      foreach (var ignore in ignoreList)
         if (word == ignore)
            present = true;
      if (!present)
         rv += word;
   }
   return rv;
}

3 Comments

No LINQ, no RegExp, yet it's correct. Only thing I'd change is the use of an empty string literal.
No, not correct. This will turn "123 Northampton" into "123 ampton".
Close...now you need to make sure that you put back the space between words.
0

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.

Here's a start:

(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)

If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.

3 Comments

I guess we could. Do we really want to, though?
This is a good start. Now make it so that it only matches whole words.
We used this approach to flag a huge list of customers as business or residential based on RegEx keywords generated from looking at the data.
