2

I have a string:

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

I want to get this result:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

I tried this:

match = re.findall("(^[a-z]+[^://](^[a-z]+\d))", line)

I'm a beginner in Python. If there is somebody who can explain, it would be very nice :D

4
  • First, split the string on the commas with line.split(','). Then apply the regex. Any better? Commented Apr 2, 2017 at 17:38
  • So you want tuples of (method, hostname, port) of the comma separated list of URLs. Right? Commented Apr 2, 2017 at 17:43
  • 1
    Is this backslash inside the input real? It will break some of the suggested answers ;-) Also, the percentage sign in the last URL (without any URL encoding in sight) "smells" like low-quality input data ... Commented Apr 2, 2017 at 17:47
  • Doesn't look like real input, since \f is a single formfeed control character. Commented Apr 2, 2017 at 17:55

5 Answers

4

I suggest using urlparse from urllib.parse, which has everything you need, instead of a regex.

from urllib.parse import urlparse

def getparts(url):
    # url is the result of urlparse(); port is an int, or None when absent
    return (url.scheme, url.hostname, url.port)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# strip the space left after each comma before parsing
urls = [getparts(urlparse(url.strip())) for url in line.split(',')]

4 Comments

Thank you for this. The solution to this sort of problem is the correct parsing library, not another hacky regex.
@Jan I count 4 small regexes (compile) in that file, and 14 contributors; not every regex is a hacky regex.
@DRC: Of course not and I was surely overreacting. However, correctly used, one small regex solution will likely be faster than importing the whole parser (just timed it: my solution [see above] is 4 times faster than your "standard" solution). But I guess, time does not matter here, really. See the comparison here
@thanks for being polite, really appreciated. Often the trade off is necessary because for example none of the regex proposed here addresses the case of having a user:password in the hostname part, and given I don't want to study a whole standard again to cover all cases I'm importing a library.
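The user:password case mentioned in the last comment can be checked with a short sketch. The URL and credentials below are made up for illustration and are not part of the question's input; the point is only that urlparse keeps the credentials out of hostname and port.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials, not taken from the question's input.
url = "ftp://alice:secret@bth.com:32/files/im.jpeg"
parsed = urlparse(url)

# netloc keeps the raw authority part; hostname and port are split out of it.
parts = (parsed.scheme, parsed.hostname, parsed.port)
print(parsed.netloc)    # alice:secret@bth.com:32
print(parts)            # ('ftp', 'bth.com', 32)
print(parsed.username)  # alice
```

A regex that captures everything between "://" and the first ":" or "/" would return "alice" as the host for this URL.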
3

You can use the following regex:

(https?|ftp|file)://([^/:]+)(?::(\d+))?

An earlier version of this pattern was tested on Regex101:

https://regex101.com/r/hCprgS/3

Try this code:

import re

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
# raw string, so \d reaches the regex engine unchanged
match = re.findall(r"(https?|ftp|file)://([^/:]+)(?::(\d+))?", line)

print(match)

Results:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

7 Comments

Should you take into account the possibility of a backward stroke before 'file'?
Updated to support backward slash. Thanks for the input. I also added case insensitive flag as well.
That \f is a single formfeed character.
Okay? How does that apply?
Updated my answer. It works properly. Please remember to mark as solution, it helps a lot.
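The formfeed remark in the comments above can be seen directly in the interpreter. This is a small sketch comparing a normal string literal with a raw one; the variable names are only for illustration.

```python
plain = "\file"   # "\f" is an escape sequence: a single form feed character
raw = r"\file"    # raw string: the backslash and the "f" stay two characters

print(len(plain), repr(plain))  # 4 '\x0cile'
print(len(raw), repr(raw))      # 5 '\\file'

# The word "file" is no longer present in the non-raw version,
# so a pattern looking for the "file" scheme cannot find it there.
print("file" in plain)  # False
print("file" in raw)    # True
```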
1

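Instead of using a regex, try line.split(','), then iterate through the list, like this:

myList = []
for l in line.split(','):
    # 'scheme://host/path'.split('/') gives ['scheme:', '', 'host', 'path']
    parts = l.strip().split('/')
    myList.append((parts[0], parts[2]))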

It isn't pretty, but it gets around the problem of regex. It doesn't get into the specifics of the URL and FTP, but you can eliminate those systematically.
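One way to do that clean-up with string methods only is sketched below. The helper name split_url is made up for this example, and the sketch assumes every entry has the form scheme://host[:port]/..., with no credentials and no IPv6 literals.

```python
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

def split_url(url):
    """Split one URL into (scheme, host, port) using only str methods."""
    scheme, _, rest = url.strip().partition("://")
    authority = rest.split("/", 1)[0]         # drop the path
    host, _, port = authority.partition(":")  # port is '' when absent
    return (scheme, host, port)

result = [split_url(url) for url in line.split(",")]
print(result)
# [('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
```

str.partition always returns three parts, so a missing port comes back as an empty string without any extra check.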


0

Python's urlparse module (urllib.parse in Python 3) does all of the work: its urlparse function parses a URL, and the interesting parts can then be read from the result as attributes. Here is the code, written for Python 3:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg,file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# you want the port as a string, so adjust it here
def port2str(port):
    if port: return str(port)
    else: return ''


urls = [x.strip() for x in line.split(',')]
result = list(map(lambda u: (u.scheme, u.hostname, port2str(u.port)), map(urlparse, urls)))
print(result)

The code first breaks your input into a list of strings; note that they need to be cleaned up (stripped), as some have leading spaces which would break the parser. This list is then converted to parsed URL objects, which are in turn converted to the tuples you want. The reason this is done in two steps here is that unfortunately the Python lambda is very restrictive -- it cannot contain statements. (I assumed the \file was a typo)
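Although a lambda cannot hold an if statement, a conditional expression is still an expression, so the port conversion can also be written inline. The following is a sketch of that variant for Python 3, using a list comprehension in place of the nested map calls; it is an alternative formulation, not the code from the answer.

```python
from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# "X if cond else Y" is an expression, so it is allowed inside a lambda
# or, as here, inside a comprehension.
result = [
    (u.scheme, u.hostname, str(u.port) if u.port is not None else "")
    for u in (urlparse(url.strip()) for url in line.split(","))
]
print(result)
# [('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
```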


0

To provide yet another druidic and hacky regular expression approach:

import re

rx = re.compile(r"""
            (?P<protocol>[^:]+)://  # protocol
            (?P<domain>[^/:]+)      # domain part
            (?::(?P<port>\d+))?     # port, optional
            """, re.VERBOSE)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

matches = [match.groups() 
           for part in line.split(" ") 
           for match in [rx.match(part)]]
print(matches)
# [('https', 'dbwebb.se', None), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', None)]

See a demo on ideone.com. Otherwise, have a look at @DRC's answer for a very good non-regex way to tackle the problem.
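The output above has None where the question asks for an empty string, and rx.match returns None for any entry that is not a URL, which would make match.groups() raise. A possible adjustment is sketched below; the extra "not-a-url" entry is a made-up bad value added only to show the guard.

```python
import re

rx = re.compile(r"(?P<protocol>[^:]+)://(?P<domain>[^/:]+)(?::(?P<port>\d+))?")

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

# Hypothetical malformed entry appended to exercise the None check.
parts = [p.strip() for p in line.split(",")] + ["not-a-url"]

matches = []
for part in parts:
    m = rx.match(part)
    if m is None:
        continue  # skip entries the pattern does not recognise
    # An optional group that did not take part in the match is None.
    matches.append((m.group("protocol"), m.group("domain"), m.group("port") or ""))

print(matches)
# [('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
```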
