Extract domain from URL in python [duplicate]

Question

I have a URL like:
http://abc.hostname.com/somethings/anything/

I want to get:
hostname.com

What module can I use to accomplish this?
I want to use the same module and method in python2.

url.split('/')[2] will give you 'abc.hostname.com'; from there you can extract the domain using split, re, or any other method.
– Gahan, Commented May 22, 2017 at 12:58

Philipp Claßen · Accepted Answer · 2019-06-06 11:18:26Z

155

For parsing the domain of a URL in Python 3, you can use:

from urllib.parse import urlparse

domain = urlparse('http://www.example.test/foo/bar').netloc
print(domain) # --> www.example.test

However, to reliably extract the registered domain (example.test in this example), you need a specialized library such as tldextract, which is a third-party package and must be installed separately.
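
One detail worth knowing about the snippet above: netloc includes any credentials and port that appear in the URL, whereas the hostname attribute of the same result holds only the host. A minimal sketch, where the URL with a user name and port is a made-up example:

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials and a port, for illustration only.
parsed = urlparse('http://user@www.example.test:8080/foo/bar')

print(parsed.netloc)    # user@www.example.test:8080
print(parsed.hostname)  # www.example.test
print(parsed.port)      # 8080
```

If only the host name is needed, hostname is usually the safer attribute to read.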

answered Jun 6, 2019 at 11:18

Philipp Claßen

Herbert · Accepted Answer · 2022-01-12 09:56:45Z

76

Instead of regex or hand-written solutions, you can use Python's urlparse:

from urllib.parse import urlparse

print(urlparse('http://abc.hostname.com/somethings/anything/'))
>> ParseResult(scheme='http', netloc='abc.hostname.com', path='/somethings/anything/', params='', query='', fragment='')

print(urlparse('http://abc.hostname.com/somethings/anything/').netloc)
>> abc.hostname.com

To get the domain without the subdomain:

t = urlparse('http://abc.hostname.com/somethings/anything/').netloc
print('.'.join(t.split('.')[-2:]))
>> hostname.com
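
Keeping the last two labels only works when the public suffix is a single label such as com. The sketch below shows the failure on a .co.uk host and one possible workaround based on a tiny hand-written suffix set. The names KNOWN_SUFFIXES, last_two_labels, and registered_domain are hypothetical and exist only for illustration; a real program would rely on the full Public Suffix List (for example through tldextract) rather than this set.

```python
from urllib.parse import urlparse

# Illustrative assumption: a tiny, hand-written set of multi-label suffixes.
# It is NOT the real Public Suffix List.
KNOWN_SUFFIXES = {'co.uk', 'ac.uk'}

def last_two_labels(url):
    """Naive approach: keep the last two dot-separated labels of the host."""
    host = urlparse(url).hostname or ''
    return '.'.join(host.split('.')[-2:])

def registered_domain(url):
    """Keep three labels when the last two form a known multi-label suffix."""
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    if '.'.join(labels[-2:]) in KNOWN_SUFFIXES:
        return '.'.join(labels[-3:])
    return '.'.join(labels[-2:])

print(last_two_labels('http://www.example.co.uk/page'))    # co.uk
print(registered_domain('http://www.example.co.uk/page'))  # example.co.uk
print(registered_domain('http://abc.hostname.com/somethings/anything/'))  # hostname.com
```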

edited Jan 12, 2022 at 9:56

Herbert

answered May 22, 2017 at 13:14

philshem

7 Comments

AIpeter Over a year ago

In Python3 the lib urlparse was renamed to urllib.parse.

qasimzee Over a year ago

will it work with something like test.mytest.example.com ?

mommi84 Over a year ago

It will fail with *.co.uk or *.ac.uk domains.

mommi84 Over a year ago

t.split('.')[-2:] literally keeps only the last two substrings, so I am afraid it will just return co.uk and ac.uk, whether you prepend that or not.

user9608133 Over a year ago

This (wrong due to the mentioned reasons) answer has so many up-votes and then we wonder why different software and websites have so many bugs...

ifly6 · Accepted Answer · 2021-06-30 21:17:26Z

38

You can use tldextract.

Example code:

from tldextract import extract

# For this URL: subdomain is 'abc', domain is 'hostname', suffix is 'com'
result = extract("http://abc.hostname.com/somethings/anything/")
url = result.domain + '.' + result.suffix
print(url)  # prints hostname.com

edited Jun 30, 2021 at 21:17

ifly6

answered May 22, 2017 at 13:41

Deivanai Subramanian

2 Comments

t.m.adam Over a year ago

tldextract is not a standard library (at least not in Python 2.7); I think you should mention that. Still +1

D09r Over a year ago

Works well! But I am getting: No handlers could be found for logger "tldextract". How should I handle this?

Henry · Accepted Answer · 2017-05-22 12:58:02Z

5

Assuming the URL is in an accessible string, and that you want to handle hosts with any number of subdomain levels, you could do:

token = my_string.split('http://')[1].split('/')[0]
top_level = token.split('.')[-2] + '.' + token.split('.')[-1]

We first split on http:// to remove the scheme from the string. Then we split on / to drop the path (all directory and sub-directory parts). Finally, [-2] takes the second-to-last token after splitting on ., and we append the last token to it, which gives the last two labels of the host name (hostname.com here).

There are probably more graceful and robust ways to do this; for example, it will break if your URL is http://.com, but it's a start :)

answered May 22, 2017 at 12:58

Henry

4 Comments

Gahan Over a year ago

Your code can be simplified to token = my_string.split('/')[2], which will also work for ftp:// and https:// URLs.

Henry Over a year ago

That is valid feedback :)

Ed_ Dec 1, 2024 at 15:48

@Gahan That's better, but it doesn't work on file: URLs, which usually start with file:///. Try token = url.split(':')[1].lstrip('/').split('/')[0]. At least that grabs the hostname portion. As a bonus, it also removes the port number if present, which these answers don't. It still has issues with parsing .co.uk domains.

Gahan Dec 9, 2024 at 6:14

@Ed_ file:/// is for local files. In that case the use case and implementation should be handled carefully, since local files have no domain to extract.

Herbert · Accepted Answer · 2022-01-12 09:58:01Z

-5

Try:

from urlparse import urlparse  # Python 2 module; renamed to urllib.parse in Python 3

parsed = urlparse('http://abc.hostname.com/somethings/anything/')
domain = parsed.netloc.split(".")[-2:]
host = ".".join(domain)
print host  # prints hostname.com (Python 2 print statement)
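
Since the question asks for something that also runs on Python 2, one common pattern is to try the Python 3 import first and fall back to the Python 2 module name. A minimal sketch of that idea; it keeps the same last-two-labels logic and therefore the same limitation with suffixes such as .co.uk:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

parsed = urlparse('http://abc.hostname.com/somethings/anything/')
# Keep only the last two dot-separated labels of the network location.
host = '.'.join(parsed.netloc.split('.')[-2:])
print(host)  # hostname.com
```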

edited Jan 12, 2022 at 9:58

Herbert

answered May 22, 2017 at 13:17

Sathish Kumar VG

1 Comment

Quentin Over a year ago

won't work with .co.uk
