Python Domain Name Regular Expression Pattern

Question

I would like to be able to match a domain by following the below rules:

The domain name may contain only a-z, A-Z, 0-9 and the hyphen (-)
The domain name should be between 1 and 63 characters long
The last TLD must be at least two characters and at most 6 characters long
The domain name should not start or end with a hyphen (-) (e.g. -google.com or google-.com)
The domain name can be a subdomain (e.g. mkyong.blogspot.com)

I already have the Java-flavored regex; I just need the Python-flavored equivalent:

^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$

I couldn't find any Python regex for this, as everyone suggests using urlparse. I don't need to split a URL into domain, port, TLDs and so on; I only need to do a simple domain replace, so a regex should be the solution for me.

What I have done:

expectedstring = re.sub(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$" , "XXX" , string)

Example strings:

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."

expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

List of valid domain names

www.google.com
google.com
mkyong123.com
mkyong-info.com
sub.mkyong.com
sub.mkyong-info.com
mkyong.com.au
g.co
mkyong.t.t.co

List of invalid domain names, and why.

mkyong.t.t.c - Tld must be between 2 and 6 characters long
mkyong,com - Comma is not allowed
mkyong - No Tld
mkyong.123 - Tld must not contain digits
.com - Must start with [A-Za-z0-9]
mkyong.com/users - No Tld
-mkyong.com - Cannot begin with a hyphen -
mkyong-.com - Cannot end with a hyphen -
sub.-mkyong.com - Cannot begin with a hyphen -
sub.mkyong-.com - Cannot end with a hyphen -

What happened when you just tried this "java flavored regex" in Python? Looks like perfectly normal standard regex syntax to me.
– tobias_k, Commented Feb 18, 2016 at 15:18
I'm doing: string = re.sub(r"^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$" , "XXX" , string) and nothing changes
– faceoff, Commented Feb 18, 2016 at 15:24
Well, that is a different regex than in your question. Also, what is string?
– tobias_k, Commented Feb 18, 2016 at 15:28
I messed up, I have updated my question to match the correct regex and am using
– faceoff, Commented Feb 18, 2016 at 15:37

Quinn · Accepted Answer · 2016-02-19 22:19:25Z

I ran a test based on the given lists of domain names (Python 2.7.x):

import re
valid_domains = """
www.google.com
google.com
mkyong123.com
mkyong-info.com
sub.mkyong.com
sub.mkyong-info.com
mkyong.com.au
g.co
mkyong.t.t.co
"""

invalid_domains = """
mkyong.t.t.c
mkyong,com
mkyong
mkyong.123
.com
mkyong.com/users
-mkyong.com
mkyong-.com
sub.-mkyong.com
sub.mkyong-.com
"""

valid_names = valid_domains.split()
invalid_names = invalid_domains.split()

# each label is either a single character or 2-63 characters that neither
# start nor end with a hyphen; a raw string keeps the backslashes literal
pattern = r'^([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}$'

print 'checking valid domain names ============'
for name in valid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

print '\nchecking invalid domain names ============'
for name in invalid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

Output:

checking valid domain names ============
www.google.com                                      True
google.com                                          True
mkyong123.com                                       True
mkyong-info.com                                     True
sub.mkyong.com                                      True
sub.mkyong-info.com                                 True
mkyong.com.au                                       True
g.co                                                True
mkyong.t.t.co                                       True

checking invalid domain names ============
mkyong.t.t.c                                       False
mkyong,com                                         False
mkyong                                             False
mkyong.123                                         False
.com                                               False
mkyong.com/users                                   False
-mkyong.com                                        False
mkyong-.com                                        False
sub.-mkyong.com                                    False
sub.mkyong-.com                                    False
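
For comparison, the Java pattern from the question can also be used from Python more or less unchanged. The sketch below assumes that the only change needed is the escaping: a Java string literal writes the escaped dot as "\\.", while a Python raw string needs just r"\.". The names DOMAIN_RE, is_valid_domain, VALID and INVALID are illustrative and not part of the original post.

```python
import re

# Port of the question's Java pattern. In a Python raw string, r"\\." would
# mean "a literal backslash followed by any character", so the dot is escaped
# with a single backslash here.
DOMAIN_RE = re.compile(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$")


def is_valid_domain(name):
    """Return True if the whole string looks like a domain name."""
    return DOMAIN_RE.match(name) is not None


VALID = """
www.google.com google.com mkyong123.com mkyong-info.com sub.mkyong.com
sub.mkyong-info.com mkyong.com.au g.co mkyong.t.t.co
""".split()

INVALID = """
mkyong.t.t.c mkyong,com mkyong mkyong.123 .com mkyong.com/users
-mkyong.com mkyong-.com sub.-mkyong.com sub.mkyong-.com
""".split()

print(all(is_valid_domain(name) for name in VALID))    # True
print(any(is_valid_domain(name) for name in INVALID))  # False
```

Unlike the {1,3} repetition used above, this version puts no limit on the number of labels, which may or may not be what you want.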

[Edit] To achieve the same result as the expectedstring provided, I came up with the following approach (without checking for "http(s)"):

import re

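# group 1 captures what precedes the domain ("//", whitespace or start of string);
# each label is a single character or a longer name with no leading/trailing hyphen
pattern = r'(//|\s+|^)(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}'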

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."
expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

# \g<1> puts the captured prefix back in front of the replacement
resultstring = re.sub(pattern, r"\g<1>XXX", string)

print 'resultstring: \n', resultstring
print '\nare they equal? ', expectedstring == resultstring

Output is:

resultstring: 
This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know.

are they equal?  True
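
The reason the anchored pattern never replaces anything inside a sentence is that ^ and $ tie the match to the whole string. A minimal sketch of the difference follows, assuming the lookaround-based labels from the question are combined with the prefix-capturing group used above. ANCHORED and EMBEDDED are illustrative names, and the trailing (?![A-Za-z0-9-]) lookahead is an addition of this sketch, not part of the answer.

```python
import re

text = (
    "This is why this domain example.com will never be the same after some "
    "years, it might just be example.co.uk but will never get to "
    "example.-com. Documents could be located in this specific location "
    "http://en.example.com/documents/print.doc as you probably already know."
)
expected = (
    "This is why this domain XXX will never be the same after some "
    "years, it might just be XXX but will never get to "
    "example.-com. Documents could be located in this specific location "
    "http://XXX/documents/print.doc as you probably already know."
)

# Anchored: only matches when the entire string is a domain name.
ANCHORED = r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$"

# Unanchored: group 1 captures "//", one whitespace character or the start of
# the string, and the final lookahead stops the match from ending mid-word.
EMBEDDED = (
    r"(//|\s|^)"
    r"(?:(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}"
    r"(?![A-Za-z0-9-])"
)

unchanged = re.sub(ANCHORED, "XXX", text)
replaced = re.sub(EMBEDDED, r"\g<1>XXX", text)

print(unchanged == text)     # True: the sentence as a whole is not a domain
print(replaced == expected)  # True
```

Requiring "//" or whitespace before the domain is what keeps print.doc in the URL path from being replaced.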

Tried your regex against my string string = re.sub(r'^([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}$' , "XXX" , string) and still doesn't make any replacement. I even tested your regex here: regexr.com/3cr2h and still no match
For the online tool at regexr.com, try a single-line string (e.g., www.demo.com) and you will find a match.
@faceoff: Just updated with my approach to get the string expected.
Why did you split the string by "http"? What about this: string = re.sub(r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]" , "XXX" , string) - does the same job and even if I'm a python newbie, looks simpler.
@faceoff: I tried your regex on regexr.com, see: i.sstatic.net/o8QKp.jpg. The matches are mkyong.123, -mkyong.com, sub.-mkyong.com, sub.mkyong-.com, 3.141, [email protected], mkyong.t.t.t.co, but cannot match www.GOOGLE.com. That's totally wrong. Please try your re.sub to see if you can solve your own question. I know my regex is far from the simplest, but it can do the job of matching or replacing a domain name with "XXX", right?
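
The behaviour described in the last comment can be checked directly. This is a minimal sketch, assuming re.search is a fair stand-in for how regexr.com scans text (unanchored, reporting the first match). COMMENT_PATTERN and found are illustrative names.

```python
import re

# Pattern proposed in the comment above: lowercase only, no anchors, and the
# last label may contain digits.
COMMENT_PATTERN = (
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+"
    r"[a-z0-9][a-z0-9-]{0,61}[a-z0-9]"
)


def found(text, flags=0):
    """Return the first substring matched by COMMENT_PATTERN, or None."""
    match = re.search(COMMENT_PATTERN, text, flags)
    return match.group(0) if match else None


print(found("3.141"))                          # 3.141
print(found("mkyong.123"))                     # mkyong.123
print(found("-mkyong.com"))                    # mkyong.com
print(found("www.GOOGLE.com"))                 # None
print(found("www.GOOGLE.com", re.IGNORECASE))  # www.GOOGLE.com
```

Without anchors or boundary checks, the pattern simply skips the leading hyphen and matches the rest, which is why the invalid names still show up as matches. Passing re.IGNORECASE is one way to handle the uppercase case.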
