url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'
How to strip http(s) and www from url in Python?
You can use the string method replace:
url = 'http://www.google.com/images'
url = url.replace("http://www.","")
or you can use regular expressions:
import re
# Match the scheme (http or https) and an optional "www." prefix
pattern = re.compile(r"https?://(www\.)?")
url = pattern.sub('', 'http://www.google.com/images').strip().strip('/')
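The regex above only fires when a scheme is present, so it leaves 'www.google.com' untouched. A minimal sketch that covers all six sample URLs from the question, assuming the prefix should be removed only at the start of the string (the name strip_prefix is made up for illustration):

```python
import re

# Anchored at the start: optional scheme, then optional "www."
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?")

def strip_prefix(url):
    """Remove a leading http(s):// and www. (hypothetical helper)."""
    return _PREFIX.sub("", url.strip())

samples = [
    "www.google.com",
    "http://www.google.com",
    "http://google.com",
    "www.google",
    "http://www.google.com/images",
    "https://www.youtube.com/watch?v=6RB89BOxaYY",
]
for s in samples:
    print(s, "->", strip_prefix(s))
```

Because the pattern is anchored with ^, a "www." that appears later in the path or query is left alone.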
A more elegant solution would be using urlparse:
from urllib.parse import urlparse
def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)
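The first option keeps the scheme (http or https, depending on the link). The second option, netloc_only, returns just the network location, which is the part you were looking for; note that it still includes www. if the URL has it.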
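One caveat with urlparse: it only fills netloc when the URL has a scheme or starts with "//", so 'www.google.com' parses with an empty netloc. A rough sketch that works around this and also drops a leading www. (bare_host is a hypothetical name, and the "://" check is a simplifying assumption about the input):

```python
from urllib.parse import urlparse

def bare_host(url):
    """Return the host without scheme or leading www. (hypothetical helper)."""
    # Without a scheme, urlparse puts everything in .path, so prepend "//"
    # to make it treat the first component as the network location.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(bare_host("www.google.com"))
print(bare_host("https://www.youtube.com/watch?v=6RB89BOxaYY"))
```

This returns only the host, so the path and query are dropped; keep parsed.path and parsed.query if you need them.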
Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?
import re
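new_url = re.sub(r'.*?w\.', '', url, count=1)
The non-greedy .*? together with count=1 removes only up to the first "w.", so a site whose name ends with a w (say www.snow.com) keeps the rest of its host. This still assumes www. is actually present.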
edit after clarification
I'd do two steps:
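if url.startswith('http'):
    # Strip the scheme; the separator is "//", not a backslash
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    # Escape the dot and anchor the pattern so only the leading www. goes
    url = re.sub(r'^www\.', '', url)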
This removes http:// or https:// when present, and finally www. if it exists:
url=url.replace('http://','')
url=url.replace('https://','')
url=url.replace('www.','')
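Keep in mind that str.replace removes every occurrence, not just the one at the front, so a "www." later in the path would also disappear. A small sketch that only strips prefixes, assuming that is the intent (strip_prefixes and the example.com URL are made up for illustration):

```python
def strip_prefixes(url):
    """Strip a leading scheme and www. only at the start (hypothetical helper)."""
    for prefix in ("http://", "https://", "www."):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url

example = "http://example.com/www.html"  # illustrative URL, not from the question
print(strip_prefixes(example))
print(example.replace("http://", "").replace("https://", "").replace("www.", ""))
```

The first call keeps the file name intact, while the chained replace turns it into "example.com/html".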
"http://www.google.com/images"[11:]?are arguments (also calledquery) - uparse can keep it in separeted variables.