Let's say that I have a list given by:

a = [
    'www.google.com',
    'google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it',
    'www.google.com'
]

I want to remove the duplicates and the substrings to keep a list like:

b = [
    'www.google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it'
]

Do you know an efficient way to do that?

The goal is to keep the string that is longer, that's why www.google.com is preferred over google.com.

  • wouldn't google.it be removed as well, as it is a substring? How do you choose which item is a substring of which element? Every element with 'google' in it is a substring of one another. Commented Dec 7, 2022 at 14:33
  • Why www.google.com and not google.com? Commented Dec 7, 2022 at 14:44
  • set('.'.join(x.split('.')[-2:]) for x in a) gives {'tvi.pt', 'google.com', 'google.it', 'ubs.ch'}. Close enough? Commented Dec 7, 2022 at 14:44
  • @JohnnyMopp, this doesn't cover TLDs that have 3 levels, for example. Commented Dec 7, 2022 at 14:46
  • IMHO, substring is a misused word here. As far as I can see, what the OP really means is top-level domains. Commented Dec 7, 2022 at 14:54

4 Answers


This solution can be adapted to your needs: edit get_domain to change the grouping condition* and choose_item to change how the best item of each group is selected.

from itertools import groupby

a = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']

def get_domain(url):
    # Grouping key. Example: 'www.google.com' -> 'google.com'
    return '.'.join(url.split('.')[-2:])

def choose_item(iterable):
    # Ex. input: ['www.google.com', 'google.com', 'www.google.com']
    # Ex. output: 'www.google.com' (the longest string)
    return max(iterable, key=len)

results = []
# groupby only groups consecutive items, so sort by the same key first
for domain, grp in groupby(sorted(a, key=get_domain), key=get_domain):
    results.append(choose_item(grp))

print(results)

Output:

['www.google.com', 'google.it', 'tvi.pt', 'ubs.ch']

*Another answer suggests the tld library.




If what you are looking for is a list of unique first level domains, given an arbitrary list of URLs, take a look at the tld module. It will make things easier for you.

Based on the documentation, here is a snippet that you can adapt for your needs:

from tld import get_fld

urls = [
    'www.google.com',
    'google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it',
    'www.google.com'
]

unique_domains = list({
    get_fld(url, fix_protocol=True) for url in urls
})

The code above sets unique_domains to:

['ubs.ch', 'google.it', 'tvi.pt', 'google.com']



You can remove the duplicates as follows:

f = list(dict.fromkeys(a))

this will filter out the duplicate 'www.google.com', but not the substrings. That part needs more clarification, as Captain Caveman wrote in his comment.
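For the list from the question, this keeps the first occurrence of each string, in order (a minimal sketch of the idea):

```python
a = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']

# dict keys are unique and preserve insertion order (Python 3.7+),
# so this removes duplicates while keeping the original order
f = list(dict.fromkeys(a))
print(f)  # ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it']
```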


def remove_duplicates_and_substrings(input_list):
    output = []
    for i in input_list:
        if i not in output:
            if not any(i in s for s in output):
                output.append(i)
    return output

It may not be the best approach, but it does exactly what you want it to do. It works by first checking whether the string from the input list is not already in the output list. Then it checks whether that string is already contained in one of the output strings. If neither is the case, it adds the string to the output list.
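For example, run against the list from the question (the definition is repeated here so the snippet runs on its own), this produces the desired result. Note that it keeps the longest string only because 'www.google.com' appears before 'google.com' in the input; if the shorter substring came first, both would be kept:

```python
def remove_duplicates_and_substrings(input_list):
    output = []
    for i in input_list:
        # skip exact duplicates and strings contained in an earlier result
        if i not in output and not any(i in s for s in output):
            output.append(i)
    return output

urls = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']
print(remove_duplicates_and_substrings(urls))
# ['www.google.com', 'tvi.pt', 'ubs.ch', 'google.it']
```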

