I am trying to use RegEx to extract a particular part of some URLs that come in different variations. Here is the generic format:

http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters

Sometimes the "mip" part doesn't exist and the URL looks like this:

http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters

I started writing the following RE:

re.compile(r"blackpages\.com/.*")

The .* matches any characters. Now, how do I stop when I encounter a "/" and extract everything that follows, up to the next "/"? That would give me the part I want to extract.

  • Rakesh, any more concerns? Please feel free to drop a line below my answer. Commented Apr 25, 2017 at 6:40

1 Answer

You need to use a negated character class:

re.compile(r"blackpages\.com/([^/]*)")
                              ^^^^^

The [^/]* will match 0+ chars other than /, as many as possible (greedily).

If you expect at least one char after /, use the + quantifier (1 or more occurrences) instead of *.
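To make the difference between the two quantifiers concrete, here is a small sketch. The URL with two slashes in a row is a made-up edge case (not one of the URLs from the question), chosen only because its first path segment is empty:

```python
import re

star = re.compile(r"blackpages\.com/([^/]*)")  # zero or more non-slash chars
plus = re.compile(r"blackpages\.com/([^/]+)")  # one or more non-slash chars

# Hypothetical URL with an empty first path segment (illustration only).
url = "http://www.blackpages.com//mip/part-I-want-to-extract"

star_match = star.search(url)
plus_match = plus.search(url)

print(repr(star_match.group(1)))  # '' because * accepts an empty segment
print(plus_match)                 # None because + needs at least one char
```

So * still reports a match with an empty capture, while + reports no match at all. Which behaviour is preferable depends on how you want to treat malformed URLs.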


Python code:

import re
rx = r"blackpages\.com/([^/]*)"
ss = ["http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
"http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters"]
for s in ss:
    m = re.search(rx, s)
    if m:
        print(m.group(1))

Output:

cityName-StateName
cityName-StateName
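
Note that the pattern above captures the first path segment (cityName-StateName), not the segment the question asks for. One possible way to reach the wanted segment, assuming the URLs always follow one of the two layouts shown in the question, is to skip the first segment and an optional "mip/" before capturing. This is only a sketch; the names rx_part, urls and extracted are introduced here for illustration:

```python
import re

# Skip the first path segment, then an optional "mip/" segment,
# and capture the segment that follows.
# Assumption: URLs look like one of the two variants from the question.
rx_part = re.compile(r"blackpages\.com/[^/]+/(?:mip/)?([^/]+)")

urls = [
    "http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
    "http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters",
]

extracted = []
for url in urls:
    m = rx_part.search(url)
    if m:
        extracted.append(m.group(1))

print(extracted)
```

Both URLs should give part-I-want-to-extract. The (?:mip/)? part is a non-capturing optional group, so group(1) is the wanted segment whether or not "mip" is present. A URL that ends right after "mip" with no further segment would capture "mip" itself, so check for that if such URLs can occur.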

1 Comment

Shouldn't you be using capturing groups with that to extract only that part?
