
I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:

this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row:   http://yetanotherdomain.net
this is a row:   https://hereisadomain.org | some_text

I'm currently accessing the data in this column this way:

import re

for row in csv_reader:
    the_url = row[3]

    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)

Output from the print statement:

http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text

How do I obtain only the URLs?

http://somedomain.com
http://someanotherdomain.com 
http://yetanotherdomain.net
https://hereisadomain.org

2 Answers

2

Just change your pattern to:

\b(?:http|ftp)s?://\S+

Instead of matching any character with .*, match only non-whitespace characters with \S+, so that each match stops at the first whitespace after a URL. You might want to add a word boundary before your non-capturing group, too.
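As a minimal sketch, this is how the pattern could be applied to the sample rows from the question. The `rows` list and the `extract_urls` helper are illustrative names, not part of the original code:

```python
import re

# Pattern from this answer: scheme, "://", then a run of non-whitespace characters.
URL_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S+')

def extract_urls(text):
    """Return every URL-like substring found in text (hypothetical helper)."""
    return URL_PATTERN.findall(text)

# Sample column values copied from the question.
rows = [
    "this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row:   http://yetanotherdomain.net",
    "this is a row:   https://hereisadomain.org | some_text",
]

for row in rows:
    for link in extract_urls(row):
        print(link)
```

Each match ends at the space before the `|` separator, so the surrounding text is left out.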



1 Comment

I tried to accept the answer right away, but I was prompted to wait 10 minutes.
1

Instead of repeating any character at the end

'(?:http|ftp)s?://.*'
                  ^

repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:

'(?:http|ftp)s?://[^ ]*'
                  ^^^^

2 Comments

Actually this is not correct: you are only excluding the space character, not newlines or other whitespace. Plus a word boundary would be nice, too.
@UnbearableLightness OP's code has text before each URL - if that was a problem, he would have seen his http://yetanotherdomain.net mashed together with his https://hereisadomain.org. I'm doubtful about a word boundary, because URLs can contain trailing non-word characters that are still an important part of the URL
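
To make the difference discussed in these comments concrete, here is a small sketch comparing the two character classes. The multi-line `text` value is a made-up input; the rows in the question each hold a single line, so both patterns behave the same there:

```python
import re

# Hypothetical input with two URLs separated by a newline instead of a space.
text = "http://yetanotherdomain.net\nhttps://hereisadomain.org"

# [^ ] excludes only the space character, so the newline is consumed
# and both URLs end up in a single match.
space_only = re.findall(r'(?:http|ftp)s?://[^ ]*', text)

# \S excludes every whitespace character, so the match stops at the newline.
non_whitespace = re.findall(r'(?:http|ftp)s?://\S+', text)

print(len(space_only))      # 1
print(len(non_whitespace))  # 2
```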
