Clustering using k-medoids

Question

This is the program function code for clustering using k-medoids

def kMedoids(D, k, tmax=100):
    # determine dimensions of distance matrix D
    m, n = D.shape
    # randomly initialize an array of k medoid indices
    M = np.sort(np.random.choice(n, k)
    # create a copy of the array of medoid indices
    Mnew = np.copy(M)
    # initialize a dictionary to represent clusters
    C = {}
    for t in range(tmax):
    # determine clusters, i.e. arrays of data indices
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]
        # update cluster medoids
        for kappa in range(k):
            J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
            j = np.argmin(J)
            Mnew[kappa] = C[kappa][j]
            np.sort(Mnew)
            # check for convergence
            if np.array_equal(M, Mnew):
                break
                M = np.copy(Mnew)
            else:
                # final update of cluster memberships
                J = np.argmin(D[:,M], axis=1)
                for kappa in range(k):
                    C[kappa] = np.where(J==kappa)[0]
                    # return results
            return M, C

and the I will call The function KMedoids with this program, I think my program run slowly in line D = Pairwise_distances(arraydata, metric='euclidean')

D = pairwise_distances(arraydata,metric='euclidean')

# split into 2 clusters
M, C = kMedoids(D, 2)

print('medoids:')
for point_idx in M:
    print(arraydata[point_idx] )

print('')
# array for get label
temp = []
indeks = []
print('clustering result:')
for label in C:
    for point_idx in C[label]:
        print('label {0}:　{1}'.format(label, arraydata[point_idx]))
        temp.append(label)
        indeks.append(point_idx)

This is the result from this program

clustering result:
label 0:　[0.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+00

Why my result of my program is slow for large data and almost have a result "Memory Error"? I hope someone can help me to review this code to improve its performance to get the result and process large amounts of data.

This is still on-topic since the error is due to slow performance (on large sets of data). — dustytrash
– dustytrash, Commented Jun 24, 2019 at 16:48
return M, C looks misindented. Please doublecheck your indentation. — vnp
– vnp, Commented Jun 24, 2019 at 17:27

Community · Accepted Answer · 2020-06-10 13:24:26Z

This is exactly the implementation found in NumPy and SciPy Recipes for Data Science on k-Medoids Clustering but with some indentation mistakes (probably due to the copy & paste process). To understand the indentation being used in the algorithm is might be worth to read the answers on the for...else syntax:

https://stackoverflow.com/questions/9979970/why-does-python-use-else-after-for-and-while-loops#9980752

A working script with the correct indentation and some improvement is found at https://github.com/letiantian/kmedoids/blob/master/kmedoids.py :

import numpy as np
import random

def kMedoids(D, k, tmax=100):
   # determine dimensions of distance matrix D
   m, n = D.shape

   if k > n:
       raise Exception('too many medoids')

   # find a set of valid initial cluster medoid indices since we
   # can't seed different clusters with two points at the same location
   valid_medoid_inds = set(range(n))
   invalid_medoid_inds = set([])
   rs,cs = np.where(D==0)
   # the rows, cols must be shuffled because we will keep the first duplicate below
   index_shuf = list(range(len(rs)))
   np.random.shuffle(index_shuf)
   rs = rs[index_shuf]
   cs = cs[index_shuf]
   for r,c in zip(rs,cs):
       # if there are two points with a distance of 0...
       # keep the first one for cluster init
       if r < c and r not in invalid_medoid_inds:
           invalid_medoid_inds.add(c)
   valid_medoid_inds = list(valid_medoid_inds - invalid_medoid_inds)

   if k > len(valid_medoid_inds):
       raise Exception('too many medoids (after removing {} duplicate points)'.format(
        len(invalid_medoid_inds)))

   # randomly initialize an array of k medoid indices
   M = np.array(valid_medoid_inds)
   np.random.shuffle(M)
   M = np.sort(M[:k])

   # create a copy of the array of medoid indices
   Mnew = np.copy(M)

   # initialize a dictionary to represent clusters
   C = {}
   for t in xrange(tmax):
       # determine clusters, i. e. arrays of data indices
       J = np.argmin(D[:,M], axis=1)
       for kappa in range(k):
           C[kappa] = np.where(J==kappa)[0]
       # update cluster medoids
       for kappa in range(k):
           J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
           j = np.argmin(J)
           Mnew[kappa] = C[kappa][j]
       np.sort(Mnew)
       # check for convergence
       if np.array_equal(M, Mnew):
           break
       M = np.copy(Mnew)
   else:
       # final update of cluster memberships
       J = np.argmin(D[:,M], axis=1)
       for kappa in range(k):
           C[kappa] = np.where(J==kappa)[0]

   # return results
   return M, C

Note that the else as well as the return are at the same indentation as the for-loop.

However, I recommend to change the second inner loop

# update cluster medoids
for kappa in range(k):
       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
       j = np.argmin(J)
       Mnew[kappa] = C[kappa][j]

to

# update cluster medoids
for kappa in range(k):
       J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
       # Fix for the low-idx bias by J.Nagele (2019):
       shuffled_idx = np.arange(len(J))
       np.random.shuffle(shuffled_idx) 
       j = shuffled_idx[np.argmin(J[shuffled_idx])] 
       Mnew[kappa] = C[kappa][j]

as otherwise the resulting medoids depend on the order of the input. That is because the argmin command returns always the lowest index in case multiple potential medoids have equal distances to the cluster members.

It might be a better answer if you put the part about the indentation first, remember since this is code review the answer should be mostly about the users code. — pacmaninbw
– pacmaninbw ♦, Commented Dec 17, 2019 at 14:11
Please quote the relevant parts of your links so readers don't have to visit other sites and so that your answer will still make sense even if the link(s) no longer work. — Null
– Null, Commented Dec 17, 2019 at 14:23

Stack Exchange Network

Clustering using k-medoids

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Clustering using k-medoids

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions