string replace using a List<string>

Question

I have a List of words I want to ignore like this one :

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

But it's not one line, @htw. You don't get any geek points if it's not one line.
– George Mauer, Commented Sep 14, 2010 at 19:56
Don't let this program run in Charlotte, NC. Prominent road names happen to be East Blvd, South Blvd, West Blvd. Those are the names of the roads, not directional qualifiers as in West 1st Street. On that note, there are other scenarios where your directions aren't really directions, but key parts of the identifier: Northampton, Northlake (mall/area in Charlotte), North Carolina, North Dakota, etc.
– Anthony Pegram, Commented Sep 14, 2010 at 19:57
@Anthony: This is true, I will be careful with what I put in my dictionary. However, I match on postal code (zip) first, which must match exactly for the function to even consider the addresses. From there, I'd rather get false positives than miss results.
– Hugo Migneron, Commented Sep 14, 2010 at 20:06
Then you will be pleased to know that East, West, and South Blvds all intersect! They will share a zip! I'm convinced if you can get your program to run in Charlotte, you can get it to run anywhere.
– Anthony Pegram, Commented Sep 14, 2010 at 20:13
And Canada is totally free of North/South streets/boulevards? I think Anthony's comment was a lot more generic than your problem statement.
– Henk Holterman, Commented Sep 14, 2010 at 20:37

Gabe · Accepted Answer · 2010-09-14 20:00:06Z

14

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or for .NET 3.5:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.
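To make the whole-word point concrete, here is a minimal sketch that puts the split-and-filter approach next to a plain substring replacement. The class and method names (WholeWordFilter, RemoveIgnoredWords, RemoveBySubstring) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WholeWordFilter
{
    // Same ignore list as in the question.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Split on whitespace, drop words found in the list, re-join with spaces.
    public static string RemoveIgnoredWords(string text)
    {
        return string.Join(" ",
            text.Split().Where(w => !IgnoreList.Contains(w)).ToArray());
    }

    // Naive substring replacement, shown only for comparison.
    public static string RemoveBySubstring(string text)
    {
        return IgnoreList.Aggregate(text, (s, w) => s.Replace(w, ""));
    }
}
```

With these definitions, RemoveIgnoredWords("Northampton Way #123") leaves the string untouched, while RemoveBySubstring("Northampton Way #123") returns "ampton Way #123".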

edited Sep 14, 2010 at 20:00

answered Sep 14, 2010 at 19:54

Gabe


8 Comments

AHM Over a year ago

This is a great solution, both shorter and clearer than the regex versions.

Kobi Over a year ago

You might as well split by the words - text.Split(ignoreList.ToArray(), StringSplitOptions.None). That said, it is easier to adapt your approach to ignore case.

Mark Byers Over a year ago

What about punctuation before or after words?

Gabe Over a year ago

Kobi: text.Split(ignoreList.ToArray()) doesn't work for the same reason all the string.Replace methods don't work.

Gabe Over a year ago

Mark: Presumably he would want to consider punctuation to be word-breakers. It's up to him, but I'd guess he'd want text.Split(new[]{' ','.',',','-'}) but he can tweak it to support whatever algorithm he has.
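The two tweaks raised in these comments (ignoring case, and treating punctuation as word breaks) could be combined roughly as follows. This is only a sketch: the TokenFilter class, the extra "St" entry, and the choice of separators are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenFilter
{
    // Hypothetical ignore set; "St" is added here only for illustration.
    // The comparer makes lookups case-insensitive.
    private static readonly HashSet<string> Ignore = new HashSet<string>(
        new[] { "North", "South", "East", "West", "St" },
        StringComparer.OrdinalIgnoreCase);

    // Characters treated as word breaks, as suggested in the comment above.
    private static readonly char[] Separators = { ' ', '.', ',', '-' };

    public static string RemoveIgnoredWords(string text)
    {
        var words = text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !Ignore.Contains(w));
        return string.Join(" ", words.ToArray());
    }
}
```

Note that the separators themselves are discarded, so "Northampton-Way" comes back as "Northampton Way". That is probably acceptable when the result only feeds a similarity measure, but it is a design choice to check against your data.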


Bob · Accepted Answer · 2010-09-14 20:12:11Z

6

Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);

edited Sep 14, 2010 at 20:12

answered Sep 14, 2010 at 19:50

Bob


4 Comments

Frank Schwieterman Over a year ago

If there are special characters, you should escape the stuff in ignoreList: string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray())

Gabe Over a year ago

Since odds are the list will contain words like "St.", escaping is advised. And you have to look only for whole words.

Bob Over a year ago

@Frank Correct . . . though it isn't really specified where the list comes from. It would probably be easiest to just write the correct regular expression in the first place rather than to convert it from a list, unless the list is really necessary.

Frank Schwieterman Over a year ago

Yeah, building a Regex dynamically is only really worthwhile if the list contents might change. Using a Regex in general is only useful if this function is used a lot, as it's potentially faster than N string replacements.

George Mauer · Accepted Answer · 2010-09-14 19:47:46Z

5

Something like this should work:

string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
  return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}

answered Sep 14, 2010 at 19:47

George Mauer


4 Comments

George Mauer Over a year ago

I might have swapped around the parameters to the second lambda, but this will definitely work. Aggregate is an incredibly powerful method; it's lame people don't use it very often.

George Mauer Over a year ago

It should be noted that I doubt calling Replace multiple times is the most performant way of doing this. Building the contents of the list into a static Regex and using that to replace would probably be faster, but I suspect the difference won't matter in this case.

Gabe Over a year ago

This is not correct because it uses string.Replace which can't match only on a word boundary. If you're going to use a RegEx, though, it should use a single compiled one.

George Mauer Over a year ago

Good point @Gabe the example is more about the usage of Aggregate than of Replace.

Albin Sunnanbo · Accepted Answer · 2010-09-14 19:48:22Z

3

What's wrong with a simple for loop?

string street = "14th Avenue North";
foreach (string word in ignoreList)
{
    street = street.Replace(word, string.Empty);
}

answered Sep 14, 2010 at 19:48

Albin Sunnanbo


Comments

Mark Byers · Accepted Answer · 2010-09-14 20:59:39Z

If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:

string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
                        string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");

Result:

14th Avenue

If there are special characters you will need to fix two things:

Use Regex.Escape on each element of ignore list.
The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.

Here's how to fix these two problems:

Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
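As a rough, self-contained sketch of the escaped lookaround version: the LookaroundFilter class and the extra "St." entry are assumptions added here to show why escaping matters, and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundFilter
{
    // "St." is a hypothetical extra entry; unescaped, its dot would match any character.
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West", "St." };

    // Built once and reused. Matches a listed word only when it is preceded by
    // a space or the start of the string and followed by a space or the end.
    private static readonly Regex IgnoreRegex = new Regex(
        string.Format(@"(?<= |^)({0})(?= |$)",
            string.Join("|", IgnoreList.Select(x => Regex.Escape(x)).ToArray())));

    public static string RemoveIgnoredWords(string text)
    {
        return IgnoreRegex.Replace(text, "");
    }
}
```

Because the lookarounds do not consume the surrounding spaces, "14th Avenue North" becomes "14th Avenue " with the trailing space kept. A Trim() or a whitespace-collapsing step afterwards may be wanted, depending on how the result is compared.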

It's a pretty good bet that his words will need escaping, because they'll be like "St.", "Blvd.", "Rd."
That's a great way to handle the space problem raised in another comment.
This is very clever and it seems like it would work on all the words. I will write some tests for it and try it out properly.

Community · Accepted Answer · 2017-05-23 12:01:29Z

1

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:

address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));

If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

edited May 23, 2017 at 12:01


answered Sep 14, 2010 at 19:53

Guffa


1 Comment

Hugo Migneron Over a year ago

Thanks a lot for that. My ignore list will obviously be much longer than what I posted here, but not sure if it will be long enough to use your method. I will profile it and see though.

Phil Gilmore · Accepted Answer · 2010-09-14 21:30:38Z

1

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.

List<string> ignoreList = new List<string>()
{
    "North",
    "South",
    "East",
    "West"
};    

string s = "123 West 5th St"
        .Split(' ')  // Separate the words to an array
        .ToList()    // Convert array to List<string>
        .Except(ignoreList) // Remove ignored keywords
        .Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string
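One thing to watch with this pipeline: Except is a set operation, so it also drops repeated words from the input, and Aggregate without a seed throws on an empty sequence (for example when every word is ignored). The sketch below contrasts it with a Where-based filter; the ExceptVsWhere class and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptVsWhere
{
    private static readonly List<string> IgnoreList =
        new List<string> { "North", "South", "East", "West" };

    // Set difference: ignored words are removed, but so are duplicate words.
    public static string WithExcept(string text)
    {
        return string.Join(" ", text.Split(' ').Except(IgnoreList).ToArray());
    }

    // Plain filter: every occurrence of a non-ignored word is kept.
    public static string WithWhere(string text)
    {
        return string.Join(" ",
            text.Split(' ').Where(w => !IgnoreList.Contains(w)).ToArray());
    }
}
```

For an address such as "100 West 100 South", WithExcept returns "100" while WithWhere returns "100 100". Both use string.Join rather than Aggregate, so an input made up entirely of ignored words yields an empty string instead of an exception.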

answered Sep 14, 2010 at 21:30

Phil Gilmore


1 Comment

Gabe Over a year ago

The .ToList() is unnecessary.

Damian Leszczyński - Vash · Accepted Answer · 2010-09-14 19:52:35Z

0

Why not just Keep It Simple?

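public static string Trim(string text)
{
   // Assumes ignoreList is accessible from a static context
   var rv = text.Trim();
   foreach (var ignore in ignoreList)
   {
      // Only act when the string ends with an ignored word;
      // note that Replace then removes every occurrence of it
      if (rv.EndsWith(ignore))
      {
         rv = rv.Replace(ignore, string.Empty);
      }
   }
   return rv;
}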

answered Sep 14, 2010 at 19:52

Damian Leszczyński - Vash


Comments

Øyvind Bråthen · Accepted Answer · 2010-09-14 19:58:20Z

0

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:

string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
// result is "14th Avenue "

answered Sep 14, 2010 at 19:58

Øyvind Bråthen


Comments

Gabe · Accepted Answer · 2010-09-14 22:27:38Z

0

public static string Trim(string text)
{
   var rv = text;
   foreach (var ignore in ignoreList)
      rv = rv.Replace(ignore, "");
   return rv;
}

Updated For Gabe

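public static string Trim(string text)
{
   var rv = "";
   // Split on a single space character to get the individual words
   var words = text.Split(' ');
   foreach (var word in words)
   {
      // Check whether this word is in the ignore list
      var present = false;
      foreach (var ignore in ignoreList)
         if (word == ignore)
            present = true;
      if (!present)
         rv += word; // note: no separator is re-inserted between words
   }
   return rv;
}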

edited Sep 14, 2010 at 22:27

Gabe


answered Sep 14, 2010 at 19:47

Umair A.


3 Comments

Steven Sudit Over a year ago

No LINQ, no RegExp, yet it's correct. Only thing I'd change is the use of an empty string literal.

Gabe Over a year ago

No, not correct. This will turn "123 Northampton" into "123 ampton".

Gabe Over a year ago

Close...now you need to make sure that you put back the space between words.

Brad · Accepted Answer · 2010-09-15 16:28:13Z

0

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.

Here's a start:

(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)

If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.
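A minimal sketch of how the pattern above might be applied in a single pass. The DirectionRegexFilter class is made up for illustration, and replacing each match with one space (then trimming) is an assumption made here so that neighbouring words do not run together.

```csharp
using System.Text.RegularExpressions;

public static class DirectionRegexFilter
{
    // The pattern from this answer, built once and reused for every phrase.
    private static readonly Regex Directions = new Regex(
        @"(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)");

    public static string RemoveDirections(string text)
    {
        // The match includes the surrounding whitespace, so put one space
        // back in its place and trim whatever is left at the ends.
        return Directions.Replace(text, " ").Trim();
    }
}
```

One caveat with this sketch: because each match consumes the whitespace after it, two ignored words in a row (such as "North West St") are not both removed in one pass; the second one no longer has leading whitespace left to match.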

edited Sep 15, 2010 at 16:28

answered Sep 14, 2010 at 19:48

Brad


3 Comments

Steven Sudit Over a year ago

I guess we could. Do we really want to, though?

Gabe Over a year ago

This is a good start. Now make it so that it only matches whole words.

Brad Over a year ago

We used this approach to flag a huge list of customers as business or residential based on RegEx keywords generated from looking at the data.
