I'm fairly new to regular expressions, so this is mostly a request for help with a specific pattern, and a little with how to implement it in C#.

I have a large Excel file full of, among other things, repeated addresses that are written in different styles. Most of the differences are abbreviations of words like Avenue.

For the simple ones I looked up the string.Replace() method:

address = address.Replace("Av ", "Av. "); // Replace returns a new string, so assign the result

It does the trick there and for some others. But if I want to replace the word "Ave", I run into the possibility of it being part of another word (some addresses are in Spanish, so this is likely to happen). I thought about including whitespace before and after (" ave "), but would that work if it's the first word in the string? Or should I use a pattern like this (it might be wrong too):

^[0-9a-zA-Z_#' ](Ave)\w //the word is **not** preceded by any character other than a whitespace and is followed by a whitespace

For Expressions such as those, I should use something along this pattern, right?

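using System.Text.RegularExpressions;

string replacement = "Av.";
// The pattern has to be a string; a verbatim string (@"...") avoids doubling the backslash.
Regex rgx = new Regex(@"^[0-9a-zA-Z_#' ](Ave)\w");
string result = rgx.Replace(input, replacement);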

Thanks

2 Answers

3

Regular expressions have a nifty tool for this: the \b anchor, which matches at word boundaries. So Ave\b only matches Ave when it is followed by a space, a dot, or anything else that is not a word character, or when it sits at the very end of the string.

Read all about the word boundary class here: http://www.regular-expressions.info/wordboundaries.html

BTW, that site is THE place to go to learn about regular expressions.

Also, if you were to do it the way you tried, it could be something like this: [^\w]Ave\s

That literally is: not a word character (a-z, A-Z, 0-9 or _), then Ave, then a whitespace character (tab, space, linebreak etc.).

You could also use the shorthand for [^\w], which is \W (just as \S is shorthand for [^\s]), so it would then become \WAve\s

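But the \b way is better: it also matches when Ave is the first or last word in the string, where there is no character before or after it for the other pattern to consume.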
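To make the word-boundary idea concrete, here is a minimal C# sketch. The class and method names (WordBoundaryDemo, ReplaceAve) and the sample addresses are made up for illustration; only Regex.Replace and the \b anchor come from the discussion above. It puts \b on both sides of Ave, so the word is replaced when it stands alone, including at the start of the string, and left alone inside a longer word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // \b on both sides: "Ave" must stand alone as a whole word.
    private static readonly Regex AveWord = new Regex(@"\bAve\b");

    // Hypothetical helper: replaces every standalone "Ave" with "Av.".
    public static string ReplaceAve(string input)
    {
        return AveWord.Replace(input, "Av.");
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceAve("Ave Reforma 100"));   // Av. Reforma 100
        Console.WriteLine(ReplaceAve("123 Ave Juarez"));    // 123 Av. Juarez
        Console.WriteLine(ReplaceAve("Avenida Central 5")); // Avenida Central 5 (unchanged)
    }
}
```

Note that an input that already reads "Ave." would come out as "Av..", because the dot is not part of the match; the pattern would need an optional \.? after the second \b to absorb it.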


3 Comments

I found that, but it seems to check for ending boundaries. Could I use it for boundaries on the beginning of the word? That is to say to have it check if the word starts with "Ave" and has no preceding characters?
One more question: I'm getting some trailing periods, and I tried to get rid of them using rgx = new Regex(@"\s\.\b|\b\."); but it doesn't work. What should I do? (E.g. trying to get rid of the extra . in "Av. . ")
OK, so if I understand correctly, you want to replace all instances of Av, Ave, Avenue, Av., Ave. and Avenue. with Av.? In that case \bAv(e|enue)?\b\.? should cover all the cases (the second \b keeps it from matching the start of longer words such as Avenida). If you want to go more general to cover spelling errors etc., you could use \bAv\w*\.? instead, although that would also match Aviation. If that doesn't work exactly and you simply want to get rid of the erroneous dots afterwards, why not replace \s*\.\s*\. with a single dot? That gets rid of every "..", " . ." and ". .". (Or, if you only have that one problem, you can replace \.\s\. with a single dot.)
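A rough C# sketch of the two-step cleanup suggested in the last comment. The names (AvenueNormalizer, Normalize) and the sample inputs are hypothetical. The first pattern is an assumption on my part: it is the comment's Av/Ave/Avenue pattern with a \b after the optional group so that longer words such as Avenida are skipped, plus RegexOptions.IgnoreCase to catch lower-case spellings. The second pattern is the doubled-dot cleanup from the comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AvenueNormalizer
{
    // Av, Ave or Avenue as a whole word, optionally followed by one dot.
    private static readonly Regex Avenue =
        new Regex(@"\bAv(e|enue)?\b\.?", RegexOptions.IgnoreCase);

    // Two dots with optional whitespace before and between them.
    private static readonly Regex DoubledDot = new Regex(@"\s*\.\s*\.");

    // Hypothetical helper: normalizes the abbreviation, then collapses stray dots.
    public static string Normalize(string address)
    {
        string result = Avenue.Replace(address, "Av.");
        return DoubledDot.Replace(result, ".");
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("Ave Juarez 12"));     // Av. Juarez 12
        Console.WriteLine(Normalize("Avenue. Reforma 3")); // Av. Reforma 3
        Console.WriteLine(Normalize("Av. . Centro"));      // Av. Centro
        Console.WriteLine(Normalize("Avenida Central 5")); // Avenida Central 5 (unchanged)
    }
}
```

The doubled-dot pattern is applied to the whole address, so it would also collapse a legitimate ".." or "..." elsewhere in the text; whether that is acceptable depends on the data.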
1

Add the word boundary anchor (\b) to both sides of your regex:

Regex.Match(content, @"\b(Ave)\b");
