19
url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'

How can I strip http(s) and www from a URL in Python?

4
  • "http://www.google.com/images"[11:] Commented Nov 17, 2016 at 8:38
  • So what do you want as output? Commented Nov 17, 2016 at 8:39
  • The elements after ? are arguments (also called the query); urlparse can keep them in separate variables. Commented Nov 17, 2016 at 8:40
  • 1
    I want the output without http(s) and www. Commented Nov 17, 2016 at 9:47
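
To illustrate the comment about the query: a minimal sketch of how urlparse from the standard library splits url6 into separate parts. The variable name parts is only illustrative.

```python
from urllib.parse import urlparse, parse_qs

# Split the YouTube URL from the question into its components.
parts = urlparse('https://www.youtube.com/watch?v=6RB89BOxaYY')

print(parts.scheme)           # https
print(parts.netloc)           # www.youtube.com
print(parts.path)             # /watch
print(parts.query)            # v=6RB89BOxaYY
print(parse_qs(parts.query))  # {'v': ['6RB89BOxaYY']}
```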

4 Answers

31

You can use the string method replace:

url = 'http://www.google.com/images'
url = url.replace("http://www.","")

or you can use regular expressions:

import re

url = re.compile(r"https?://(www\.)?")
url = url.sub('', 'http://www.google.com/images').strip().strip('/')

2 Comments

What if I also want to remove https?
NB: both of these replacements will remove occurrences of the substring anywhere they are found in the url, which might not be intended. Safer to explicitly specify that these substrings have to be found at the beginning of the url.
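
The anchoring suggested in the comment above could look roughly like this. It is a sketch, not the answerer's code: the names PREFIX_RE and strip_prefix are made up for illustration, and it assumes that only a leading http://, https:// and/or www. should be removed.

```python
import re

# ^ anchors the match to the start of the string, so the same substrings
# appearing later in the URL are left alone. Both groups are optional.
PREFIX_RE = re.compile(r"^(https?://)?(www\.)?")

def strip_prefix(url):
    """Remove a leading http(s):// and/or www. (hypothetical helper)."""
    return PREFIX_RE.sub("", url.strip(), count=1)

for u in ('www.google.com', 'http://www.google.com/images',
          'https://www.youtube.com/watch?v=6RB89BOxaYY'):
    print(strip_prefix(u))
```

Because of the anchor, a URL such as http://example.com/www.html keeps the www. in its path.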
9

A more elegant solution would be using urlparse:

from urllib.parse import urlparse

def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)

With uri_type='both' the result keeps the scheme (http or https, depending on the link); with uri_type='netloc_only' it returns only the netloc, which is the part you were looking for.

1 Comment

The question is about removing "http" ("https") and "www". Your code removes only a scheme.
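
One way to cover the point in the comment above is to drop the www. from the netloc after parsing. This is only a sketch under some assumptions: strip_scheme_and_www is a hypothetical name, inputs without a scheme (like url1) are treated as starting with the host, and any #fragment is discarded.

```python
from urllib.parse import urlparse

def strip_scheme_and_www(url):
    """Return host (without www.) plus path and query (hypothetical helper)."""
    # urlparse only fills in netloc when the URL has a scheme or starts
    # with '//', so prefix scheme-less input such as 'www.google.com'.
    if not url.startswith(('http://', 'https://')):
        url = '//' + url
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    rest = parsed.path
    if parsed.query:
        rest += '?' + parsed.query
    return host + rest

print(strip_scheme_and_www('http://www.google.com/images'))  # google.com/images
```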
1

Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?

import re
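new_url = re.sub(r'.*w\.', '', url, count=1)

The count of 1 limits this to a single substitution, which is meant to avoid harming websites ending with a w. Note that .* is greedy, so the match still runs up to the last "w." in the string.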

edit after clarification

I'd do two steps:

# Strip the scheme first, then a leading "www."; ^ anchors each match to the start.
if url.startswith('http'):
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    url = re.sub(r'^www\.', '', url)

3 Comments

http will always be there; www may or may not be. Sometimes it could also be https. In that case, how should I modify the code above to remove https as well?
Do you mean like this: if url.startswith('http'): new_url = re.sub('.*w\.', '', url, 1)?
It should be 'https?://'.
-1

This removes http:// or https:// if present, and finally www. if present:

url = url.replace('http://', '')
url = url.replace('https://', '')
url = url.replace('www.', '')

1 Comment

This does triple the work necessary (compared to a regex) and would also replace occurrences of those substrings elsewhere in the URL, which would not be intended.
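
A regex-free variant that avoids the second problem mentioned in the comment could use str.removeprefix, which only touches the start of the string. This is a sketch that assumes Python 3.9 or later (where removeprefix was added); strip_prefixes is a hypothetical name.

```python
def strip_prefixes(url):
    """Drop a leading scheme, then a leading "www." (hypothetical helper)."""
    # removeprefix returns the string unchanged when the prefix is absent,
    # and never touches occurrences further into the URL.
    for scheme in ("http://", "https://"):
        url = url.removeprefix(scheme)
    return url.removeprefix("www.")

print(strip_prefixes('http://www.google.com/images'))  # google.com/images
```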
