How do I truncate the URLs below right after the domain ("com") using Python, i.e. keep youtube.com only?

    youtube.com/video/AiL6nL
    yahoo.com/video/Hhj9B2
    youtube.com/video/MpVHQ
    google.com/video/PGuTN
    youtube.com/video/VU34MI

Is it possible to truncate like this?

6 Answers

6

Check out Python's urlparse module (urllib.parse in Python 3). It is part of the standard library, so nothing else needs to be installed.

So you could do the following:

import re
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

# Sample input taken from the question.
url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2']

def check_and_add_http(url):
    # Prepend 'http://' when no scheme is present, so urlsplit() can find the host.
    if re.match(r'^https?://', url):
        # 'http://' or 'https://' is already present
        return url
    return 'http://' + url

for url in url_list:
    url = check_and_add_http(url)
    print(urlsplit(url)[1])  # index 1 is the network location, e.g. 'youtube.com'

You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
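As a rough illustration of those other parts, the result of urlsplit() also exposes each index as a named attribute. This is a sketch for Python 3; the query string and fragment on the sample URL are made up here just to fill every field.

```python
from urllib.parse import urlsplit

# Sample URL; the '?t=10#top' suffix is an assumption added for illustration.
parts = urlsplit("http://youtube.com/video/AiL6nL?t=10#top")

print(parts.scheme)    # index 0
print(parts.netloc)    # index 1, the part the question asks for
print(parts.path)      # index 2
print(parts.query)     # index 3
print(parts.fragment)  # index 4
```

The named form (parts.netloc) and the indexed form (parts[1]) return the same value.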

4 Comments

Does it really work even without scheme part? I get empty strings.
from urlparse import urlparse; url = urlparse('youtube.com/video/wpmkqYRfVkk'); print "url = " + str(url)
@alecxe: indeed, urlsplit() doesn't work in this case (because http:// part is missing in the input): urlsplit("youtube.com/video/AiL6nL") -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL', query='', fragment='')
Updated to check for the scheme and add http:// if it is not present, to make parsing easier.
4

You can use split():

myUrl.split(r"/")[0]

to get "youtube.com"

and:

myUrl.split(r"/", 1)[1]

to get everything else
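One thing to watch for: if a line has no "/" at all, split("/", 1)[1] raises an IndexError. A minimal sketch of a variant built on str.partition(), which returns an empty remainder instead; split_host is a hypothetical helper name, not something from the question.

```python
def split_host(url):
    # partition() always returns three parts, even when '/' is absent.
    host, _sep, rest = url.partition("/")
    return host, rest

for line in ["youtube.com/video/AiL6nL", "youtube.com"]:
    print(split_host(line))
```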

1 Comment

you could use .partition('/')[0]
1

I'd use the function urlsplit from the standard library:

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1]  # returns 'docs.python.org'

1

For your particular input, you could use str.partition() or str.split():

print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com

Note: the urlparse module (which you could use to parse URLs in general) doesn't work in this case, because the scheme is missing:

import urlparse

urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
#                query='', fragment='')

In general, it is safe to use a regex here if you know that every line starts with a hostname and is otherwise a well-formed URI (text below is the whole input as one multi-line string):

import re

print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))

Output

youtube.com
yahoo.com
youtube.com
google.com
youtube.com

Note: it doesn't remove the optional port part -- host:port.
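If the port has to go as well, one possible follow-up step is to strip a trailing ":digits" from each captured host. This is only a sketch: strip_port is a hypothetical helper, the sample hosts are made up, and a bare (unbracketed) IPv6 address would not be handled correctly.

```python
import re

def strip_port(host):
    # Remove a trailing ':<digits>' if present; leave everything else alone.
    return re.sub(r":\d+$", "", host)

for host in ["youtube.com:8080", "youtube.com"]:
    print(strip_port(host))
```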

0

No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.

>>> from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit
>>> url = 'youtube.com/bla/foo'
>>> urlsplit('//' + url)[1]
'youtube.com'
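If the input may mix lines with and without a scheme, the same idea can be wrapped in a small function that only prepends // when needed. A sketch for Python 3; get_netloc is a hypothetical name, and the scheme check is a simple regex assumed to be good enough for input like the question's.

```python
import re
from urllib.parse import urlsplit

def get_netloc(url):
    # Treat the string as scheme-relative unless it already starts with
    # 'scheme://' or '//'.
    if not re.match(r"^([a-zA-Z][a-zA-Z0-9+.-]*:)?//", url):
        url = "//" + url
    return urlsplit(url).netloc

for line in ["youtube.com/bla/foo", "https://yahoo.com/video/Hhj9B2"]:
    print(get_netloc(line))
```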

0

Just a crazy alternative solution using tldextract:

>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
