Removing HTTP and WWW from URL python

Question

url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'

How can I strip http(s) and www from a URL in Python?

Everything after the ? is the query string (the arguments). urlparse can return it as a separate component.
– furas, Commented Nov 17, 2016 at 8:40

Tomerikoo · Accepted Answer · 2021-11-26 18:06:05Z

31

You can use the string method replace:

url = 'http://www.google.com/images'
url = url.replace("http://www.","")

or you can use regular expressions:

import re

url = re.compile(r"https?://(www\.)?")
url = url.sub('', 'http://www.google.com/images').strip().strip('/')
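As a hedged sketch of the regex approach, the pattern can be anchored with ^ so that only a leading scheme and a leading www. are removed, and nothing further into the URL is touched. The helper name strip_scheme_and_www is an assumption made for illustration and is not part of the original answer; the sample URLs are the ones from the question.

```python
import re

# Anchored pattern: an optional http:// or https://, then an optional "www.",
# matched only at the very start of the string.
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)


def strip_scheme_and_www(url):
    """Hypothetical helper: drop a leading scheme and "www." from url."""
    return _PREFIX.sub("", url.strip(), count=1)


urls = [
    "www.google.com",
    "http://www.google.com",
    "http://google.com",
    "www.google",
    "http://www.google.com/images",
    "https://www.youtube.com/watch?v=6RB89BOxaYY",
]
for u in urls:
    print(strip_scheme_and_www(u))
```

Because both groups are optional, a URL without a scheme or without www. passes through unchanged, and the path and query string are kept.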

edited Nov 26, 2021 at 18:06

Tomerikoo


answered Nov 17, 2016 at 8:41

Januka samaranyake


2 Comments

guri Over a year ago

What if I also want to remove https?

stephentgrammer Over a year ago

NB: both of these replacements will remove occurrences of the substring anywhere they are found in the url, which might not be intended. Safer to explicitly specify that these substrings have to be found at the beginning of the url.

WJA · Accepted Answer · 2020-04-06 08:12:14Z

9

A more elegant solution would be using urlparse:

from urllib.parse import urlparse

def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)

With uri_type='both', the result includes the scheme (http or https, depending on the link). With uri_type='netloc_only', it returns only the netloc, which drops the scheme but keeps any leading www.
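A minimal sketch of how the urlparse approach could be extended to also drop www. and to cope with inputs that have no scheme. The helper name bare_host is hypothetical, not from the answer above. Note that this variant returns only the host, so any path or query string is discarded.

```python
from urllib.parse import urlparse


def bare_host(url):
    """Hypothetical helper: return the host of url without scheme or "www."."""
    parsed = urlparse(url)
    # Without a scheme or a leading "//", urlparse puts everything in .path
    # and leaves .netloc empty, so re-parse with an explicit "//" marker.
    if not parsed.netloc:
        parsed = urlparse("//" + url)
    host = parsed.hostname or ""
    # Remove a single leading "www." if present.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(bare_host("www.google.com"))
print(bare_host("http://www.google.com/images"))
print(bare_host("https://www.youtube.com/watch?v=6RB89BOxaYY"))
```

The .hostname attribute is used instead of .netloc because it leaves out any port or user information and is lowercased.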

answered Apr 6, 2020 at 8:12

WJA


1 Comment

user9608133 Over a year ago

The question is about removing "http" ("https") and "www". Your code removes only a scheme.

Tristan Bodding-Long · Accepted Answer · 2016-11-17 09:11:22Z

1

Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?

import re
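new_url = re.sub(r'.*?w\.', '', url, count=1)

count=1 limits the substitution to the first match, so a site name ending in w later in the URL is not harmed. The non-greedy .*? is needed for that, since a greedy .* would run through to the last "w." in the string.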

edit after clarification

I'd do two steps:

if url.startswith('http'):
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    url = re.sub(r'^www\.', '', url)

edited Nov 17, 2016 at 9:11

answered Nov 17, 2016 at 8:45

Tristan Bodding-Long


3 Comments

guri Over a year ago

http will always be there; www may or may not be. Sometimes it could also be https. In that case, how should I modify the code above to remove https as well?

guri Over a year ago

You mean like this? if url.startswith('http'): new_url = re.sub(r'.*w\.', '', url, 1)

All Іѕ Vаиітy Over a year ago

It should be 'https?://'

Limbail · Accepted Answer · 2022-04-04 18:46:16Z

-1

This removes http:// or https:// if present, and then www. if present:

url=url.replace('http://','')
url=url.replace('https://','')
url=url.replace('www.','')
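For comparison, here is a sketch of a prefix-only variant, assuming Python 3.9 or newer for str.removeprefix. Unlike replace, it only touches the start of the string. The helper name strip_prefixes and the example.com URL are illustrative assumptions, not part of the answer.

```python
def strip_prefixes(url):
    """Hypothetical helper: remove a leading scheme, then a leading "www."."""
    # str.removeprefix (Python 3.9+) returns the string unchanged
    # when it does not start with the given prefix.
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url.removeprefix(scheme)
            break
    return url.removeprefix("www.")


print(strip_prefixes("http://www.google.com/images"))  # google.com/images
print(strip_prefixes("http://example.com/www.page"))   # example.com/www.page
```

With the chained replace calls, the second URL would lose the www. inside its path as well.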

answered Apr 4, 2022 at 18:46

Limbail


1 Comment

stephentgrammer Over a year ago

This does triple the work necessary (compared to a regex) and would also replace occurrences of those substrings elsewhere in the URL, which would not be intended.
