I am trying to parse a file of street names for a project, and need to remove modifiers (Upper / Lower / Old / New / North / East / South / West ...) and endings (street / road / way / lane ...), but I am having no luck with a regular expression.

The way it is set up at the moment is that the program will parse the file one line (ie. street) at a time, and check it

I think the problem is word boundaries - what I need, for example, are the following transformations...
Old Harrow Way -> Harrow (ie. remove 'Old' prefix and 'Way' ending)
Chittock Mead -> Chittock (Remove the ending 'Mead')
- But to leave these alone when in a word:
Gold Lane -> Gold (just remove ending)
Eastley Avenue -> Eastley (just remove ending)
Upper Western Avenue -> Western (remove prefix and ending)

Obviously, things like "South Street" would remove both - This is ok, because I can discard an empty string.

Can anyone give me an idea of how to do this - I've been reading up on regular expressions and trying things for hours!

5
  • What kind of format is the file? Is it CSV? Tab delimited, or simply no such format at all? Do you have reliable delimiters for the different fields? Is the file fixed space? Commented Feb 22, 2011 at 21:39
  • Ah, reminds me of the old adage: You have a problem that you decide to solve with regular expressions. Now you have two problems. :) I'm sorry I don't have a solution for you and I can only add smart-aleck comments. Good luck. Commented Feb 22, 2011 at 21:40
  • @David - I believe this was coined by Jamie Zawinski Commented Feb 22, 2011 at 21:41
  • @Oded, thanks! I never knew that. And he's a Pittsburgh guy, like me. Commented Feb 22, 2011 at 21:43
  • Regular expressions are a swine, and I haven't done them for a year, so I do agree: one problem has become two. :) Although there are a lot of good regex people here who I bet can do this in no time. Commented Feb 22, 2011 at 21:43

3 Answers

2

I would use a List or Array to store those values, and then possibly a foreach loop to check the address against each item in it. You would then use .remove to remove each instance of the list or array item. There is more to this, but that is the general idea.

1 Comment

@Oded - The file is just one street per line:
abigail close
abingdon road
acorn close
etc.
2

I'd use string.split(" ") to split the address into an array of words. Then take the first word and see if it exists in a list of prefixes (i.e. a List or Array). Do the same for the last word and the endings.

Running through two lists of regular expressions for each input address will be time consuming. Splitting into words and looking them up should be a good deal faster, especially if the lists are sorted and binary-searched.

If the address data is a bit dirty (i.e. punctuation, double spaces, etc.), you may want to do some cleanup first, as an input string with leading or doubled spaces around "Main St" will split into more 'words' than are really there (hint: Trim(), plus a RegEx.Replace() that turns each double space into a single space).
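
As a rough illustration of the split-and-look-up idea, here is a minimal sketch in Python (not the .NET calls named above). The function name `strip_street` and the two word sets are assumptions made for the example; the real lists would need to hold whatever prefixes and endings occur in the file.

```python
# Hypothetical word lists; extend them to match the real data.
PREFIXES = {"upper", "lower", "old", "new", "north", "east", "south", "west"}
ENDINGS = {"street", "road", "way", "lane", "avenue", "mead", "close"}


def strip_street(name: str) -> str:
    """Drop a known first-word prefix and a known last-word ending."""
    # split() with no argument trims the ends and collapses runs of
    # whitespace, so stray or doubled spaces do not create empty words.
    words = name.split()
    if words and words[0].lower() in PREFIXES:
        words = words[1:]
    if words and words[-1].lower() in ENDINGS:
        words = words[:-1]
    # An empty result (e.g. for "South Street") can be discarded by the caller.
    return " ".join(words)
```

Because whole words are compared, "Gold" and "Eastley" are never mistaken for "Old" or "East". Set membership is used here in place of a sorted list with binary search; both avoid scanning every entry.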

3 Comments

Ok, using the list method you suggested - It works like a dream! One more quickie - How would I match the 'St' at the start of a name (ie. "St. Mary's"), where it could be in the format "St. Marys's", "St Marys", and may or may not have a space after the "St[.]"? Thanks very much for your help.
Ok, got all the info I needed. Thanks again for all of your help!
I usually replace all punctuation with a space before splitting the address.
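
For the "St" follow-up and the punctuation tip, one possible sketch in Python is shown here. The names `normalize` and `starts_with_saint` are made up for the example, and the pattern is only one way to read the requirement: "St" followed either by a period (with or without a space) or by whitespace.

```python
import re

# "St" at the start, then either "." plus optional spaces, or at least one
# space. Requiring one of the two keeps names like "Stanley" from matching.
SAINT = re.compile(r"st(?:\.\s*|\s+)", re.IGNORECASE)


def starts_with_saint(name: str) -> bool:
    """True if the name begins with 'St', 'St.' or 'St.' with no space."""
    return SAINT.match(name.strip()) is not None


def normalize(name: str) -> str:
    """Replace every punctuation character with a space, then tidy spacing."""
    cleaned = re.sub(r"[^\w\s]", " ", name)
    return " ".join(cleaned.split())
```

Note that replacing all punctuation also splits on apostrophes, so a possessive "s" becomes a separate word; whether that matters depends on what is done with the result.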
1

This question or this question will help you. Ensure that you use the Regex.Replace() method to do the pattern matching and replacement.
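
If a pattern-based replacement is preferred, the word-boundary problem from the question can be handled roughly as follows. This is a sketch in Python, with `re.sub` standing in for Regex.Replace(); the alternation lists and the name `strip_with_regex` are assumptions for illustration only.

```python
import re

# \b after the prefix stops "East" from matching inside "Eastley";
# ^ anchors the prefix to the start of the line.
PREFIX_RE = re.compile(
    r"^\s*(?:upper|lower|old|new|north|east|south|west)\b\s*",
    re.IGNORECASE,
)
# \b before the ending stops "way" from matching inside "Broadway";
# $ anchors the ending to the end of the line.
ENDING_RE = re.compile(
    r"\s*\b(?:street|road|way|lane|avenue|mead)\s*$",
    re.IGNORECASE,
)


def strip_with_regex(name: str) -> str:
    """Remove one leading modifier and one trailing ending, if present."""
    name = PREFIX_RE.sub("", name)
    name = ENDING_RE.sub("", name)
    return name.strip()
```

The anchors matter as much as the boundaries: without ^ and $, a modifier or ending in the middle of a name would be removed as well.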
