Regex in Python?

Question

I have a string:

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

I want to get this result:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

I tried this:

match = re.findall("(^[a-z]+[^://](^[a-z]+\d))", line)

I'm a beginner in Python. If there is somebody who can explain, it would be very nice :D

First, split the string on the commas with line.split(','), then apply the regex to each piece. Any better?
– Mikael, Commented Apr 2, 2017 at 17:38
So you want tuples of (method, hostname, port) from the comma-separated list of URLs, right?
– Keith, Commented Apr 2, 2017 at 17:43
Is this backslash inside the input real? It will break some of the suggested answers ;-) Also, the percent sign in the last URL (without any URL encoding in sight) "smells" like low-quality input data ...
– Dilettant, Commented Apr 2, 2017 at 17:47
Doesn't look like real input, since \f is a single formfeed control character.
– Mark Tolonen, Commented Apr 2, 2017 at 17:55
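
The formfeed remark can be checked directly. A minimal sketch (the variable names `plain` and `raw` are made up for illustration): in an ordinary string literal `\f` collapses into one control character, while a raw string keeps the backslash and the letter as two characters.

```python
# "\f" in an ordinary string literal is the form feed escape (one character).
plain = ",\file://localhost:8585/zipit"
# The r prefix keeps the backslash and the "f" as two separate characters.
raw = r",\file://localhost:8585/zipit"

# The raw version is one character longer and still contains the word "file".
print(len(raw) - len(plain))
print("file" in plain, "file" in raw)
print(repr(plain[1]))
```

This is why a pattern that looks for the literal text `file` finds nothing in the non-raw version of the input.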

DRC · Accepted Answer · 2017-04-02 17:39:36Z

4

I suggest using the urllib.parse library, which has everything you need, instead of a regex.

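from urllib.parse import urlparse

def getparts(url):
    # url is a parsed result; port is an int, or None when the URL has no port
    return (url.scheme, url.hostname, url.port)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# strip() drops the space after each comma so every piece starts with its scheme
urls = [getparts(urlparse(url.strip())) for url in line.split(',')]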

answered Apr 2, 2017 at 17:39

DRC


4 Comments

ShadowRanger Over a year ago

Thank you for this. The solution to this sort of problem is the correct parsing library, not another hacky regex.

DRC Over a year ago

@Jan I count 4 small regexes (compile) in that file, and 14 contributors; not every regex is a hacky regex.

Jan Over a year ago

@DRC: Of course not and I was surely overreacting. However, correctly used, one small regex solution will likely be faster than importing the whole parser (just timed it: my solution [see above] is 4 times faster than your "standard" solution). But I guess, time does not matter here, really. See the comparison here

DRC Over a year ago

@Jan thanks for being polite, really appreciated. Often the trade-off is necessary because, for example, none of the regexes proposed here addresses the case of having a user:password in the hostname part, and given that I don't want to study a whole standard again to cover all cases, I'm importing a library.
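
To illustrate the user:password case mentioned in the last comment, here is a small sketch. The URL with credentials is invented for the example and is not part of the original input.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials; not part of the question's data.
url = "ftp://user:secret@bth.com:32/files/im.jpeg"
parts = urlparse(url)

# hostname and port leave out the credentials; netloc keeps the whole authority.
print(parts.scheme, parts.hostname, parts.port)
print(parts.username, parts.password)
print(parts.netloc)
```

A regex that takes everything between `://` and the first `:` as the host would return `user` here, which is the kind of edge case the library already covers.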

Neil · Accepted Answer · 2017-04-02 18:42:33Z

3

You can use the following regex:

([fh]t*ps?|file):[\\/]*(.*?)(?::(\d+))?(?=[\\/])

Tested on Regex101:

https://regex101.com/r/hCprgS/3

Try this code:

import re

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,\file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# raw string, so the backslashes reach the regex engine unchanged;
# the port is an optional ":digits" group, so digits inside a hostname are not mistaken for a port
match = re.findall(r"([fh]t*ps?|file):[\\/]*(.*?)(?::(\d+))?(?=[\\/])", line)

print(match)

Results:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('http', 'v2-dbwebb.se', '')]

edited Apr 2, 2017 at 18:42

answered Apr 2, 2017 at 17:44

Neil


7 Comments

Bill Bell Over a year ago

Should you take into account the possibility of a backward stroke before 'file'?

Neil Over a year ago

Updated to support backward slash. Thanks for the input. I also added case insensitive flag as well.

Mark Tolonen Over a year ago

That \f is a single formfeed character.

Neil Over a year ago

Okay? How does that apply?

Neil Over a year ago

Updated my answer. It works properly. Please remember to mark as solution, it helps a lot.


BarocliniCplusplus · Accepted Answer · 2017-04-02 17:44:01Z

1

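Instead of using regex, try using line.split(','). Then iterate through the list, like

myList = []
for l in line.split(','):
    # 'ftp://bth.com:32/files' splits into ['ftp:', '', 'bth.com:32', 'files']
    parts = l.strip().split('/')
    myList.append((parts[0], parts[2]))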

It isn't pretty, but it gets around the problem of regex. It doesn't get into the specifics of the URL and FTP, but you can eliminate those systematically.
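
One way the remaining pieces might be handled systematically, as a sketch using only string methods. The helper name `split_url` is made up for this example, and it assumes every entry has the form scheme://host[:port]/..., with no credentials or IPv6 literals.

```python
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

def split_url(url):
    # "ftp://bth.com:32/files" -> scheme "ftp", rest "bth.com:32/files"
    scheme, _, rest = url.strip().partition("://")
    # Everything before the first "/" is host[:port].
    netloc = rest.split("/", 1)[0]
    # partition leaves the port as "" when there is no ":".
    host, _, port = netloc.partition(":")
    return (scheme, host, port)

result = [split_url(u) for u in line.split(",")]
print(result)
```

Because `partition` returns an empty string when the separator is missing, URLs without a port come out with `''` in the third position, as the question asks.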

edited Apr 2, 2017 at 17:44

answered Apr 2, 2017 at 17:36

BarocliniCplusplus


kozel · Accepted Answer · 2017-04-02 18:32:22Z

Python's urlparse module (urllib.parse in Python 3) does all of the work: its urlparse function parses a URL and returns an object whose attributes expose the interesting parts. Here is the code:


# Python 3: the Python 2 urlparse module is now urllib.parse
from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# you want the port as a string so adjust it here
def port2str(port):
    # a missing port comes back as None
    if port: return str(port)
    else: return ''


urls = [x.strip() for x in line.split(',')]
result = map(lambda u: (u.scheme, u.hostname, port2str(u.port)), map(lambda url: urlparse(url), urls))
# map() is lazy in Python 3, so build a list before printing
print(list(result))

The code first splits your input into a list of strings; note that they need to be cleaned up (stripped), as some have leading spaces which would break the parser. This list is then converted to a sequence of parsed URL objects, which is in turn converted to the tuples you want. The reason this is done in two steps is that, unfortunately, Python's lambda is very restrictive: it can contain only a single expression, not statements. (I assumed the \file was a typo.)
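
For comparison, the same two steps can be written with list comprehensions instead of map and lambda. This is a sketch assuming Python 3; the names `parsed` and `result` are chosen for the example, and the port test uses `is not None` as a small variation on port2str.

```python
from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# Step 1: strip each piece and parse it.
parsed = [urlparse(url.strip()) for url in line.split(",")]

# Step 2: build the tuples, turning a missing port (None) into ''.
result = [
    (u.scheme, u.hostname, str(u.port) if u.port is not None else "")
    for u in parsed
]
print(result)
```

A conditional expression is allowed inside a comprehension, so the helper function is optional here.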

Jan · Accepted Answer · 2017-04-02 21:26:35Z

To provide yet another druidic and hacky regular expression approach:

import re

rx = re.compile(r"""
            (?P<protocol>[^:]+)://  # protocol
            (?P<domain>[^/:]+)      # domain part
            (?::(?P<port>\d+))?     # port, optional
            """, re.VERBOSE)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

matches = [match.groups() 
           for part in line.split(" ") 
           for match in [rx.match(part)]]
print(matches)
# [('https', 'dbwebb.se', None), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', None)]

See a demo on ideone.com. Otherwise, have a look at @DRC's answer for a very good non-regex way to tackle the problem.
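
A variation on the pattern above, not the original: it scans the whole line with finditer instead of splitting on spaces, and maps a missing port to an empty string so the output matches what the question asks for. The tightened character classes for the protocol and domain are assumptions of this sketch.

```python
import re

# Scheme, "://", host up to the next "/", ":", comma or whitespace, then an optional port.
rx = re.compile(
    r"(?P<protocol>[a-z][a-z0-9+.-]*)://(?P<domain>[^/:\s,]+)(?::(?P<port>\d+))?"
)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# An unmatched named group yields None; "or ''" converts that to an empty string.
result = [(m["protocol"], m["domain"], m["port"] or "") for m in rx.finditer(line)]
print(result)
```

Indexing a match object by group name requires Python 3.6 or later; `m.group("port")` is the equivalent on older versions.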
