How do I truncate a URL using Python? [duplicate]

Question

How do I truncate the URLs below right after the domain ("com") using Python, i.e. keep youtube.com only?

    youtube.com/video/AiL6nL
    yahoo.com/video/Hhj9B2
    youtube.com/video/MpVHQ
    google.com/video/PGuTN
    youtube.com/video/VU34MI

Is it possible to truncate like this?

Ewan · Accepted Answer · 2013-06-09 06:36:58Z

6

Check out Python's urlparse module (urllib.parse in Python 3). It is part of the standard library, so nothing else needs to be installed.

So you could do the following:

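try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

def check_and_add_http(url):
    # Checks whether 'http://' or 'https://' starts the URL and adds 'http://' if not.
    http_regex = re.compile(r'^https?://')
    if http_regex.match(url):
        # 'http://' or 'https://' is present
        return url
    else:
        # Add 'http://' so that urlsplit() can find the host.
        return 'http://' + url

# Sample input taken from the question.
url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2']

for url in url_list:
    url = check_and_add_http(url)
    print(urlsplit(url)[1])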

You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
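As a rough illustration of those other parts, here is a minimal Python 3 sketch (the module is urllib.parse there rather than urlparse). The sample URL and the helper name `describe` are made up for this example; the result of urlsplit() can be read either by index or by attribute name.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return the five components of a URL as a plain dict (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # same as parts[0]
        "netloc": parts.netloc,      # same as parts[1]
        "path": parts.path,          # same as parts[2]
        "query": parts.query,        # same as parts[3]
        "fragment": parts.fragment,  # same as parts[4]
    }

print(describe("http://youtube.com/video/AiL6nL?t=10#top"))
```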

edited Jun 9, 2013 at 6:36

answered Jun 7, 2013 at 11:53

Ewan


4 Comments

alecxe Over a year ago

Does it really work even without scheme part? I get empty strings.

Brisi Over a year ago

from urlparse import urlparse
url = urlparse('youtube.com/video/wpmkqYRfVkk')
print "url = " + str(url)

jfs Over a year ago

@alecxe: indeed, urlsplit() doesn't work in this case (because http:// part is missing in the input): urlsplit("youtube.com/video/AiL6nL") -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL', query='', fragment='')

Ewan Over a year ago

Updated to check for a scheme and add http:// if it is not present, so that parsing works.

mishik · Accepted Answer · 2013-06-07 11:51:40Z

4

You can use split():

myUrl.split(r"/")[0]

to get "youtube.com"

and:

myUrl.split(r"/", 1)[1]

to get everything else
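One difference is worth keeping in mind: if the string contains no "/" at all, split("/", 1)[1] raises IndexError, whereas str.partition() always returns three parts. A small sketch of the partition-based variant, with a helper name (`split_host`) invented for this example:

```python
def split_host(my_url):
    """Split 'host/rest' into (host, rest); rest is '' when there is no slash."""
    host, _sep, rest = my_url.partition("/")
    return host, rest

print(split_host("youtube.com/video/AiL6nL"))
print(split_host("youtube.com"))
```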

answered Jun 7, 2013 at 11:51

mishik


1 Comment

jfs Over a year ago

you could use .partition('/')[0]

ojdo · Accepted Answer · 2013-06-07 12:09:40Z

1

I'd use the function urlsplit from the standard library:

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'

answered Jun 7, 2013 at 12:09

ojdo


Community · Accepted Answer · 2021-10-07 06:16:59Z

For your particular input, you could use str.partition() or str.split():

print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com

Note: the urlparse module (which you could use to parse a URL in general) doesn't work in this case, because the input has no scheme:

import urlparse

urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
#                query='', fragment='')

In general, it is safe to use a regex here if you know that all lines start with a hostname and otherwise each line contains a well-formed uri:

import re

# `text` holds the input, one URL per line, as in the question.
print("\n".join(re.findall(r"(?m)^\s*([^/?#]*)", text)))

Output

youtube.com
yahoo.com
youtube.com
google.com
youtube.com

Note: this does not remove an optional port; a host:port pair is returned as is.
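If the port should go as well, one option is to cut at the first ":" after extracting the host. This is only a sketch: it assumes plain hostnames or IPv4 addresses (a bracketed IPv6 literal such as [::1]:8080 would need different handling), and the names `text` and `hosts_without_port` as well as the port in the sample data are made up here. The pattern is a slight variation on the one above: it also excludes whitespace, so a match cannot run across lines.

```python
import re

# Hypothetical sample input; the second line carries a port for illustration.
text = """youtube.com/video/AiL6nL
yahoo.com:8080/video/Hhj9B2
google.com/video/PGuTN"""

def hosts_without_port(text):
    """Return the host of each line, dropping any ':port' suffix."""
    hosts = re.findall(r"(?m)^\s*([^/?#\s]+)", text)
    return [host.partition(":")[0] for host in hosts]

print(hosts_without_port(text))
```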

kirelagin · Accepted Answer · 2013-06-07 12:01:51Z

0

No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.

>>> import urlparse  # Python 2; in Python 3: from urllib import parse as urlparse
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
'youtube.com'
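A sketch of how this idea might be wrapped up for mixed input, where some lines already carry a scheme and others do not. It is written for Python 3 (urllib.parse); the function name `host_of` is invented for this example, and the check for "//" is a simplifying assumption (it would misfire on a scheme-less URL whose path contains "//"), not a full URL validator.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the network location of url, treating scheme-less input as absolute."""
    if "//" not in url:
        # No authority marker present: prepend '//' so urlsplit() sees a netloc.
        url = "//" + url
    return urlsplit(url).netloc

for u in ["youtube.com/video/AiL6nL", "http://yahoo.com/video/Hhj9B2"]:
    print(host_of(u))
```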

answered Jun 7, 2013 at 12:01

kirelagin


alecxe · Accepted Answer · 2013-06-07 12:04:15Z

0

Just a crazy alternative solution using tldextract (a third-party package, so it has to be installed separately):

>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join([ext.domain, ext.suffix])  # attribute access instead of slicing the result
'youtube.com'

answered Jun 7, 2013 at 12:04

alecxe
