Only output matching regex pattern

Question

I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:

this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row:   http://yetanotherdomain.net
this is a row:   https://hereisadomain.org | some_text

I'm currently accessing the data in this column this way:

import re

for row in csv_reader:
    the_url = row[3]

    # this regex is used to find the hrefs
    href_regex = re.findall(r'(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)

Output from the print statement:

http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text

How do I obtain only the URLs?

http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org

Paolo · Accepted Answer · 2018-08-04 20:18:40Z

2

Just change your pattern to:

\b(?:http|ftp)s?://\S+

Instead of matching any character with .*, match only non-whitespace characters with \S+, so the match stops at the first whitespace after the URL. You might also want to add a word boundary (\b) before your non-capturing group.
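
As a rough sketch of how this pattern behaves on the sample rows from the question (the `rows` list and the `URL_PATTERN` name are illustrative stand-ins, not part of the original code, which reads `row[3]` from a CSV reader):

```python
import re

# Pattern from this answer: \S+ consumes non-whitespace only,
# so each match ends at the first whitespace after the URL.
URL_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S+')

# Sample column values copied from the question (assumed stand-in for row[3]).
rows = [
    "this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row:   http://yetanotherdomain.net",
    "this is a row:   https://hereisadomain.org | some_text",
]

# findall returns the whole match here because the group is non-capturing.
urls = [link for text in rows for link in URL_PATTERN.findall(text)]

for link in urls:
    print(link)
```

With this input, the loop prints the four URLs on separate lines, without the trailing `| some_text` parts.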

edited Aug 4, 2018 at 20:18

answered Aug 4, 2018 at 20:13

1 Comment

Life is complex Over a year ago

I tried to accept the answer right away, but I was prompted to wait 10 minutes.

CertainPerformance · Accepted Answer · 2018-08-04 20:12:57Z

1

Instead of repeating any character at the end

'(?:http|ftp)s?://.*'
                  ^

repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:

'(?:http|ftp)s?://[^ ]*'
                  ^^^^

answered Aug 4, 2018 at 20:12

2 Comments

Paolo Over a year ago

Actually this is not correct: you are only negating the space character, not newlines. See here. Plus a word boundary would be nice, too.

CertainPerformance Over a year ago

@UnbearableLightness OP's code has text before each URL - if that was a problem, he would have seen his http://yetanotherdomain.net mashed together with his https://hereisadomain.org. I'm doubtful about a word boundary, because URLs can contain trailing non-word characters that are still an important part of the URL
