I have a pretty simple problem. A large set of unique strings that are to various degrees not clean, and in reality they in fact represent the same underlying reality. They are not person names, but they could be. So for example Dr. John Holmes and Dr. Jon Holms are more than likely the same person. From the list of unqiue strings I want to cluster together those that possibly match the same underlying reality. There is no restrictions on the size of the cluster (i.e. I am not expecting only one match per string).
I calculate the distance between strings using the Jaro-Distance with jellyfish in python. But then I wondered how I cluster them together? I was unable to figure out how to do this using scipy clusters as I am working with strings, not floats, so I decided to work according to the following algorithm.
- Make the first object the centroid for the first cluster
- For the next object, calculate the similarity with each existing cluster centroid If the highest calculated similarity is over some threshold value then add the object to the relevant cluster and re-determine the centroid of that cluster, otherwise use the object to initiate a new cluster.
- If any object remains to be clustered then return to step
Also, I decided to strip out generic components of the strings to improve the matches. In the name example above, this would give Dr. John Holmes, and J. Holmes a better chance of matching if 'Dr' is stripped out as a generic component.
I created a Stripped Class that has an attribute that is the original unique string, and then attribute that is a stripped version (stripped of the generic components). This is becasue once stripped many of the strings could have exactly the same values, and I need to be able to identify which original string they identify.
The code is posted below. It works well, except that the size/number of clusters is not independent of the order in which the strings are evaluated. So the results do differ.
In the final function, the SLink fucntion is called inside another function
I would specifically like to know if anyone knows of a better way to implement this type of clustering? If anyone has comments on how I could improve my code, and if anyone has any information about clustering strings in scipy.
In an IPython notebook I go through in more detail how I could refine this system, including using another function to verify the clusters using alternate data points relating to each string (such as age). If anyone would like to view that and offer guidance, then please see the following link:
http://nbviewer.ipython.org/gist/MajorGressingham/7876252
I think as the functions were somewhat trial and error and unplanned, they may have got a bit out of control. Any guidance appreciated. I am a researcher living in Bangladesh, a total programming data tripper sadly, but trying to get better every day.
class Stripped:
    'Common base class for all stripped stings'
    def __init__(self, original, GenericAll, GenericWhite = None, DelWhite = False):
        # Class attribute that is the string in its original format
        self.original = original
        StrVal = original.lower()
        StrVal = re.sub('[^A-Za-z0-9 ' ']+', ' ', StrVal)
        #strip out all occurences of sub-strings from GenericAll list that appear anywhere in the string
        for word in GenericAll:
            RegEx1 = re.compile('' + word)
            StrVal = re.sub(RegEx1, '', StrVal)
        # If provided as argument strip out all occurences of sub-string when that sub string is surrounded by 
        # whitespace (i.e. is not part of another substring-sequence)
        if not GenericWhite == None:
            for word in GenericWhite:
                RegEx2 = re.compile(r'\b' + word + r'\b')
                StrVal = re.sub(RegEx2, '', StrVal)
        # Removes special characters, removes all whitespace
        if DelWhite == True:
            StrVal = StrVal.replace(' ', '')
        # Class attribute that is the stipped string
        self.stripped = StrVal
GenericAll is a list of substrings to be stripped from each string wherever the sub string appears in the string. GenericWhite is a list of substrings that are stripped only if they are surrounded by whitespace.
And the function that implements the clustering
def SlinkSC(ClassList, Threshold):
#1
    random.shuffle(ClassList)
    Clusters = []
    ClustersStripped = []
    Centroid = []
    Scores = []
    for StrippedClass in ClassList:     
        SPScores = []
        Matched = 0
        if len(Clusters) == 0:
            Clusters.append([StrippedClass.original])
            ClustersStripped.append([StrippedClass.stripped])
            Centroid.append([StrippedClass.stripped, StrippedClass.original])
            Scores.append([])
            continue
        for ClustNum in xrange(len(Clusters)):
            Dist = jf.jaro_distance(StrippedClass.stripped, Centroid[ClustNum][0])
            SPScores.append(Dist)
        MaxVal = max(SPScores)
        MaxInd = SPScores.index(max(SPScores))
        if MaxVal >= Threshold:
            Clusters[MaxInd].append(StrippedClass.original)
            ClustersStripped[MaxInd].append(StrippedClass.stripped)
            if len(Scores[MaxInd]) == 0:
                Scores[MaxInd].append(MaxVal)               
            else:
                if MaxVal > Scores[MaxInd][0]:
                    Scores[MaxInd][0] = MaxVal
                    Centroid[MaxInd][0] = StrippedClass.stripped
                    Centroid[MaxInd][1] = StrippedClass.original
            Matched = 1
        if Matched ==0:       
            Clusters.append([StrippedClass.original])
            ClustersStripped.append([StrippedClass.stripped])
            Centroid.append([StrippedClass.stripped, StrippedClass.original])
            Scores.append([])
    return [Clusters, ClustersStripped, Centroid]
ClassList is a list of Stripped Class obejcts created from the unique strings that are to be clustered. Threshold is the threshold above which strings will be clustered when distance between the centorid and the string being evaluated is calcuated as the jaro-distance.
Lastly I wrote a function that sort of combines all of the above, but also clusters the clusters, and calls the clustering function recursively until no more clusters are found. It also references the data frame from which the strings are pulled. In theory, if two non-indentical strings represent the same undelying reality, this third data point should be equal for the non-identical stings that are above the Jaro-Distance threshold. This function directly modifies the data frame from which both the list of unqiue strings and the verifying data point are drawn, until no more modifications are possible.
That looks like this:
 def ClustersRecursive2(Threshold, df, col1, col2, GenericAll, GenericWhite = None, DelWhite = False):
    Styles = df[col1].unique()
    ClassList = [Stripped(style, GenericAll, GenericWhite, DelWhite) for style in Styles]
    ClustersDict = {}
    Clusters = SlinkSC(ClassList, Threshold)
    ClustersOriginal = Clusters[0]
    ClustersCentroid = Clusters[2]
    IndList = [x for x in xrange(len(ClustersOriginal)) if len(ClustersOriginal[x]) > 1]   
    MultiClusters = [ClustersOriginal[Ind] for Ind in IndList]
    MultiCentroid = [ClustersCentroid[Ind] for Ind in IndList]
   if len(MultiClusters) == 0:
        return 
    else:
        Counter = 0
        for cluster in xrange(len(MultiClusters)):
            MultiSMV = list(itertools.chain(*[df[df[col1] == elem][col2].unique() for elem in MultiClusters[cluster]]))
            if len(set(MultiSMV)) == 1:
                ClustersDict[MultiCentroid[cluster][1]] = MultiClusters[cluster]
                Counter +=1
            else:
                if len(MultiSMV) == len(MultiClusters[cluster]):
                    for smv in list(set(MultiSMV)):
                        if MultiSMV.count(smv) >= 2:
                            BoolList = [True if elem == smv else False for elem in MultiSMV]
                            StrList = [MultiClusters[cluster][x] for x in xrange(len(BoolList)) if BoolList[x] == True]
                            StrList.sort(lambda x, y: cmp(len(x), len(y)))
                            ClustersDict[StrList[0]] = StrList
                            Counter +=1
    #Cluster the Clusters
    ClusClusList = [Stripped(style, GenericAll, GenericWhite, DelWhite) for style in ClustersDict.keys()]
    ClusClusters = SlinkSC(ClusClusList, Threshold)
    ClusClusOrig = ClusClusters[0]
    ClusClusCentroid = ClusClusters[0]
    IndList2 = [x for x in xrange(len(ClusClusOrig)) if len(ClusClusOrig[x]) > 1]
    MultiClusClus = [ClusClusOrig[Ind] for Ind in IndList2]
    if len(MultiClusClus) > 0:
        for CCluster in xrange(len(MultiClusClus)):
            MultiSMVCC = list(itertools.chain(*[df[df[col1] == elem][col2].unique() for elem in MultiClusClus[CCluster]]))
            if len(set(MultiSMVCC)) == 1:
                NewList = []
                for x in xrange(1, len(MultiClusClus[CCluster])):
                    NewList.extend(ClustersDict[MultiClusClus[CCluster][x]])
                    del ClustersDict[MultiClusClus[CCluster][x]]
                ClustersDict[MultiClusClus[CCluster][0]].extend(NewList)
    StringMap = {}
    for key, value in ClustersDict.iteritems():
        for elem in value:
            StringMap[elem] = key
    if Counter == 0:
        return 
    else:
        for row in DF.index:
            try:
                df[col1][row] = StringMap[df[col1][row]]
            except KeyError:
                pass
        ClustersRecursive(Threshold, df, col1, col2, GenericAll, GenericWhite = None, DelWhite = False) 



scipy.cluster.hierarchy? \$\endgroup\$