218

In my Django app, I need to get the host name from the referrer in request.META.get('HTTP_REFERER') along with its protocol so that from URLs like:

  • https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1
  • https://stackoverflow.com/questions/1234567/blah-blah-blah-blah
  • http://www.example.com
  • https://www.other-domain.example/whatever/blah/blah/?v1=0&v2=blah+blah

I should get:

  • https://docs.google.com/
  • https://stackoverflow.com/
  • http://www.example.com
  • https://www.other-domain.example/

I looked over other related questions and found out about urlparse, but that didn't do the trick, since it returns only the host name without the protocol:

>>> urlparse(request.META.get('HTTP_REFERER')).hostname
'docs.google.com'

16 Answers

378

You should be able to do it with urlparse (docs: python2, python3):

from urllib.parse import urlparse
# from urlparse import urlparse  # Python 2
parsed_uri = urlparse('http://stackoverflow.com/questions/1234567/blah-blah-blah-blah')
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
print(result)

# prints
# http://stackoverflow.com/
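
Applied to the referrer case from the question, the same approach might look like the sketch below. The helper name referrer_root is made up for illustration, and it assumes that a missing or scheme-less Referer should simply yield None. Note that it always appends the trailing slash.

```python
from urllib.parse import urlparse


def referrer_root(referrer):
    """Return 'scheme://netloc/' for a referrer URL, or None if it is unusable."""
    if not referrer:
        # The Referer header is optional, so META.get() may return None.
        return None
    parsed = urlparse(referrer)
    if not parsed.scheme or not parsed.netloc:
        # Not an absolute URL, so there is no protocol/host pair to report.
        return None
    return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed)


urls = [
    'https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1',
    'https://stackoverflow.com/questions/1234567/blah-blah-blah-blah',
    'http://www.example.com',
    'https://www.other-domain.example/whatever/blah/blah/?v1=0&v2=blah+blah',
]
roots = [referrer_root(u) for u in urls]
for root in roots:
    print(root)
```

In a Django view this would be called as referrer_root(request.META.get('HTTP_REFERER')).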

6 Comments

this answer adds a / to the third example http://www.domain.com, but I think this might be a shortcoming of the question, not of the answer.
@TokenMacGuy: ya, my bad... didn't notice the missing /
I don't think this is a good solution, as netloc is not domain: try urlparse.urlparse('http://user:pass@example.com:8080') and find it gives parts like 'user:pass@' and ':8080'
The urlparse module is renamed to urllib.parse in Python 3. So, from urllib.parse import urlparse
This answers what the author meant to ask, but not what was actually stated. For those looking for domain name and not hostname (as this solution provides) I suggest looking at dm03514's answer that is currently below. Python's urlparse cannot give you domain names. Something that seems an oversight.
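
To illustrate the point about netloc raised in these comments, here is a small sketch comparing netloc with hostname and port. The URL and the helper netloc_without_credentials are assumptions for illustration only, not part of any answer.

```python
from urllib.parse import urlparse

parsed = urlparse('http://user:pass@example.com:8080/path')

print(parsed.netloc)    # includes the credentials and the port
print(parsed.hostname)  # host only
print(parsed.port)      # port as an int, or None when absent


def netloc_without_credentials(netloc):
    # Hypothetical helper: keep only what follows the last '@', if any.
    return netloc.rpartition('@')[2]


safe_root = '{}://{}/'.format(
    parsed.scheme, netloc_without_credentials(parsed.netloc)
)
print(safe_root)
```

Dropping the credentials before storing or displaying a referrer root is probably what most callers want, but that is a design choice rather than something the question requires.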
97

https://github.com/john-kurkowski/tldextract

This is a more verbose version of urlparse. It detects domains and subdomains for you.

From their documentation:

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> '.'.join(ext[:2]) # rejoin subdomain and domain
'forums.bbc'

1 Comment

This is the correct answer for the question as written, how to get the DOMAIN name. The chosen solution provides the HOSTNAME, which I believe is what the author wanted in the first place.
52

Python3 using urlsplit:

from urllib.parse import urlsplit
url = "http://stackoverflow.com/questions/9626535/get-domain-name-from-url"
base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(url))
print(base_url)
# http://stackoverflow.com/

35
>>> import urlparse
>>> url = 'http://stackoverflow.com/questions/1234567/blah-blah-blah-blah'
>>> urlparse.urljoin(url, '/')
'http://stackoverflow.com/'

2 Comments

For Python 3, urljoin lives in urllib.parse, so the import is from urllib.parse import urljoin.
The argument doesn't seem intuitive, but it works great as a very simple native solution
26

Pure string operations :):

>>> url = "http://stackoverflow.com/questions/9626535/get-domain-name-from-url"
>>> url.split("//")[-1].split("/")[0].split('?')[0]
'stackoverflow.com'
>>> url = "stackoverflow.com/questions/9626535/get-domain-name-from-url"
>>> url.split("//")[-1].split("/")[0].split('?')[0]
'stackoverflow.com'
>>> url = "http://foo.bar?haha/whatever"
>>> url.split("//")[-1].split("/")[0].split('?')[0]
'foo.bar'

That's all, folks.

2 Comments

Good and simple option, but fails in some cases, e.g. foo.bar?haha
@SimonSteinberger :-) How'bout this : url.split("//")[-1].split("/")[0].split('?')[0] :-))
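
These one-liners look only at slashes, so they can be misled by credentials, ports, or a second // later in the URL. Here is a rough comparison with urlsplit, using made-up URLs and a hypothetical wrapper named host_by_splitting:

```python
from urllib.parse import urlsplit


def host_by_splitting(url):
    # The one-liner from the answer above, wrapped in a function.
    return url.split("//")[-1].split("/")[0].split('?')[0]


with_credentials = "http://user:pass@stackoverflow.com:8080/questions"
nested_url = "http://stackoverflow.com/search?next=http://example.com/"

# Credentials and port are left attached to the host.
print(host_by_splitting(with_credentials))
print(urlsplit(with_credentials).hostname)

# split("//")[-1] picks the text after the LAST "//", which here is
# inside the query string, so the wrong host comes back.
print(host_by_splitting(nested_url))
print(urlsplit(nested_url).hostname)
```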
18

The standard library function urllib.parse.urlsplit() is all you need. Here is an example for Python3:

>>> import urllib.parse
>>> o = urllib.parse.urlsplit('https://user:pass@www.example.com:8080/dir/page.html?q1=test&q2=a2#anchor1')
>>> o.scheme
'https'
>>> o.netloc
'user:pass@www.example.com:8080'
>>> o.hostname
'www.example.com'
>>> o.port
8080
>>> o.path
'/dir/page.html'
>>> o.query
'q1=test&q2=a2'
>>> o.fragment
'anchor1'
>>> o.username
'user'
>>> o.password
'pass'
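
If the goal is protocol plus host, those attributes can be recombined. The following is only a sketch: origin is a hypothetical helper, and it assumes credentials should be dropped and an explicit port kept.

```python
from urllib.parse import urlsplit


def origin(url):
    """Rebuild 'scheme://host[:port]' from urlsplit() attributes."""
    parts = urlsplit(url)
    if not parts.scheme or parts.hostname is None:
        raise ValueError('not an absolute URL: {!r}'.format(url))
    host = parts.hostname
    if ':' in host:
        # hostname strips the brackets of IPv6 literals; put them back.
        host = '[{}]'.format(host)
    if parts.port is not None:
        host = '{}:{}'.format(host, parts.port)
    return '{}://{}'.format(parts.scheme, host)


print(origin('https://user:pass@www.example.com:8080/dir/page.html?q1=test'))
print(origin('http://[::1]:8000/admin'))
```

Because hostname is always lowercased, the result is normalized as a side effect; use netloc instead if the original casing matters.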

9

If you think your URL is valid, then this will work all the time:

domain = "http://google.com".split("://")[1].split("/")[0]

3 Comments

The last split is wrong, there are no more forward slashes to split.
It won't be a problem: if there are no more slashes, split returns a list with one element, so it works whether or not there is a slash.
I edited your answer to be able to remove the down-vote. Nice explanation. Thanks.
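
Regarding the "valid URL" caveat: indexing [1] raises IndexError when the string has no ://. A possible variant uses str.partition, which never raises. The name domain_of is hypothetical.

```python
def domain_of(url):
    # partition() always returns three parts, even without a separator.
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: treat the whole string as host + path.
        rest = url
    return rest.split("/")[0]


print(domain_of("http://google.com"))
print(domain_of("google.com/search"))

# The original expression fails on scheme-less input:
try:
    "google.com".split("://")[1]
except IndexError:
    print("no scheme: index 1 does not exist")
```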
6

Here is a slightly improved version:

urls = [
    "http://stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "Stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "http://stackoverflow.com/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "https://StackOverflow.com:8080?test=/questions/9626535/get-domain-name-from-url",
    "stackoverflow.com?test=questions&v=get-domain-name-from-url"]
for url in urls:
    parts = url.split("://")
    i = 1 if len(parts) > 1 else 0  # skip the scheme when one is present
    dm = parts[i].split("?")[0].split('/')[0].split(':')[0].lower()
    print(dm)

Output

stackoverflow.com
stackoverflow.com
stackoverflow.com
stackoverflow.com
stackoverflow.com

Fiddle: https://pyfiddle.io/fiddle/23e4976e-88d2-4757-993e-532aa41b7bf0/?i=true

3 Comments

IMHO the best solution, because simple and it considers all sorts of rare cases. Thanks!
neither simple nor improved
This is not a solution for the question because you do not provide protocol (https:// or http://)
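
For comparison, roughly the same result can be had from the standard library while also keeping the protocol when there is one. This is a sketch under the assumption that scheme-less input should be read as starting with the host. The name scheme_and_host is a hypothetical helper.

```python
from urllib.parse import urlsplit


def scheme_and_host(url):
    """Return (scheme, hostname); scheme is '' when the input has none."""
    # urlsplit() only recognizes a network location after '//', so add it
    # when the part before the query string has no '://'.
    if "://" not in url.split("?", 1)[0]:
        url = "//" + url
    parts = urlsplit(url)
    return parts.scheme, parts.hostname


urls = [
    "http://stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "Stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
    "https://StackOverflow.com:8080?test=/questions/9626535/get-domain-name-from-url",
    "stackoverflow.com?test=questions&v=get-domain-name-from-url",
]
results = [scheme_and_host(u) for u in urls]
for result in results:
    print(result)
```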
5

Is there anything wrong with pure string operations?

url = 'http://stackoverflow.com/questions/9626535/get-domain-name-from-url'
parts = url.split('//', 1)
print(parts[0] + '//' + parts[1].split('/', 1)[0])
# http://stackoverflow.com

If you prefer having a trailing slash appended, extend this script a bit like so:

parts = url.split('//', 1)
base = parts[0] + '//' + parts[1].split('/', 1)[0]
# Append '/' only if the original URL has one right after the host.
print(base + ('/' if url[len(base):len(base) + 1] == '/' else ''))

That can probably be optimized a bit ...

1 Comment

it's not wrong but we got a tool that already does the work, let's not reinvent the wheel ;)
3

I know it's an old question, but I too encountered it today. Solved this with a one-liner:

import re
result = re.sub(r'(.*://)?([^/?]+).*', r'\g<1>\g<2>', url)
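
One caveat with this pattern: .* is greedy, so (.*://)? latches onto the last :// in the string, which matters when a query parameter contains another URL. Below is a sketch with a made-up URL, plus an anchored variant that is an assumption on my part and not something from the answer above.

```python
import re

greedy = r'(.*://)?([^/?]+).*'
# Assumed alternative: anchor at the start and only accept a scheme-like prefix.
anchored = r'^([A-Za-z][A-Za-z0-9+.-]*://)?([^/?#]+).*'

url = 'http://a.example/login?next=http://b.example/home'

greedy_result = re.sub(greedy, r'\g<1>\g<2>', url)
anchored_result = re.sub(anchored, r'\g<1>\g<2>', url)

print(greedy_result)    # keeps everything up to the second host
print(anchored_result)  # stops after the first host
```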

2

This is a bit obtuse, but uses urlparse in both directions:

from urllib import parse as urlparse  # Python 3; on Python 2 use: import urlparse
def uri2schemehostname(uri):
    return urlparse.urlunparse(urlparse.urlparse(uri)[:2] + ("",) * 4)

That odd ("",) * 4 bit is needed because urlunparse expects a sequence of exactly len(urlparse.ParseResult._fields) = 6 items.

2

It could be solved by re.search()

import re
url = 'https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1'
result = re.search(r'^https?://[\w.-]+', url).group()
print(result)

# result
# https://docs.google.com

1 Comment

Does not include port
2

You can simply use urljoin with relative root '/' as second argument:

import urllib.parse


url = 'https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url'
root_url = urllib.parse.urljoin(url, '/')
print(root_url)
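
A quick sketch of what urljoin(url, '/') returns for a few inputs (the sample URLs are illustrative). It replaces the path, query, and fragment but keeps everything in the network location, including a port or credentials:

```python
from urllib.parse import urljoin

samples = [
    'https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1',
    'http://www.example.com',
    'https://www.example.com:8080/dir/page.html?q1=test',
    'http://user:pass@example.com/private',
]
roots = [urljoin(u, '/') for u in samples]
for root in roots:
    print(root)
```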

2

This is the simple way to get the root URL of any domain.

from urllib.parse import urlparse

url = urlparse('https://stackoverflow.com/questions/9626535/')
root_url = url.scheme + '://' + url.hostname
print(root_url) # https://stackoverflow.com

-1

If the URL contains fewer than three slashes, it is already just the protocol and host. Otherwise, the host is the part between "://" and the next slash:

import re

link = 'http://forum.unisoftdev.com/something'

slash_count = len(re.findall("/", link))
print(slash_count)  # output: 3

if slash_count > 2:
    regex = r'://(.*?)/'
    pattern = re.compile(regex)
    hosts = re.findall(pattern, link)

    print(hosts)  # output: ['forum.unisoftdev.com']

-1

To get the domain/hostname and Origin*:

url = 'https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url'
hostname = url.split('/')[2] # stackoverflow.com
origin = '/'.join(url.split('/')[:3]) # https://stackoverflow.com

*Origin is used in XMLHttpRequest headers
