I work with a large amount of data, and the execution time of this piece of code is very important. The results in each iteration are interdependent, so it's hard to run it in parallel. It would be awesome if there were a faster way to implement some parts of this code, like:
- finding the max element in the matrix and its indices
- changing the values in a row/column with the max from another row/column
- removing a specific row and column
Filling the weights matrix is pretty fast.
The code does the following:
- it contains a list of lists of words, word_list, with count elements in it; at the beginning each word is a separate list
- it contains a two-dimensional list (count x count) of float values, weights (a lower triangular matrix; the values for which i <= j are zeros)
- in each iteration it does the following:
  - it finds the two words with the most similar value (the max element in the matrix and its indices)
  - it merges their row and column, saving the larger value from the two in each cell
  - it merges the corresponding word lists in word_list: it saves both lists in the one with the smaller index (max_j) and removes the one with the larger index (max_i)
  - it stops if the largest value is less than a given THRESHOLD
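For example, with made-up numbers: if word_list = [['a'], ['b'], ['c']] and the only non-zero weights are weights[1][0] = 0.9, weights[2][0] = 0.3 and weights[2][1] = 0.7, then the first iteration merges words 0 and 1 (the max is 0.9 at max_i = 1, max_j = 0), leaving word_list = [['a', 'b'], ['c']] and a 2x2 matrix whose only non-zero entry is max(0.3, 0.7) = 0.7, the similarity between the merged cluster and 'c'.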
I might think of a different algorithm to do this task, but I have no ideas for now, and it would be great if there were at least a small performance improvement.
I tried using NumPy but it performed worse.
weights = fill_matrix(count, N, word_list)
while True:
    # find the max element in the matrix and its indices
    max_element = 0
    for i in range(count):
        max_e = max(weights[i])
        if max_e > max_element:
            max_element = max_e
            max_i = i
            max_j = weights[i].index(max_e)
    if max_element < THRESHOLD:
        break
    # reset the value of the max element
    weights[max_i][max_j] = 0
    # here it is important that max_j is always less than max_i (since it's a lower triangular matrix)
    for j in range(count):
        weights[max_j][j] = max(weights[max_i][j], weights[max_j][j])
    for i in range(count):
        weights[i][max_j] = max(weights[i][max_j], weights[i][max_i])
    # compare the symmetrical elements, set the ones above the diagonal to 0
    for i in range(count):
        for j in range(count):
            if i <= j:
                if weights[i][j] > weights[j][i]:
                    weights[j][i] = weights[i][j]
                    weights[i][j] = 0
    # remove the max_i-th column
    for i in range(len(weights)):
        weights[i].pop(max_i)
    # remove the max_i-th row
    weights.pop(max_i)
    # merge the two word lists into the one with the smaller index
    word_list[max_j] += word_list[max_i]
    # remove the list that was just merged into the cluster
    word_list.pop(max_i)
    count -= 1
When you tried NumPy, did you just store weights as a numpy matrix and leave the code the same, or did you use numpy functions (which are often quite quick)? For example, your first loop could be max_idx = numpy.argmax(weights); max_i, max_j = numpy.unravel_index(max_idx, weights.shape). Similarly, your first j in range(count) loop could become weights[max_j,:] = numpy.maximum(weights[max_i,:], weights[max_j,:]). If you're careful to use built-in functions and vectorised operations (working on the whole array at a time) you could probably gain.

Could you provide example word_list and weights (just enough so that your algorithm actually gives meaningful results)? I'm fairly sure it can be optimized greatly with numpy.
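To make that concrete, here is a minimal, untested sketch of the whole loop using those vectorised operations, wrapped in a hypothetical cluster() function. It assumes weights starts as a square lower-triangular float array and THRESHOLD > 0. The tril/triu folding replaces your symmetrical-elements double loop (and clears the upper triangle completely, which your comparison doesn't always do), and np.delete copies the array on every iteration, so this would still need profiling against the list version:

import numpy as np

def cluster(weights, word_list, threshold):
    # weights: square lower-triangular float array (zeros on and above the diagonal)
    # word_list: list of lists of words, one per row/column of weights
    weights = np.asarray(weights, dtype=float)
    while True:
        # the max element and its indices, without a Python-level loop
        max_i, max_j = np.unravel_index(np.argmax(weights), weights.shape)
        if weights[max_i, max_j] < threshold:
            break
        weights[max_i, max_j] = 0.0
        # merge row/column max_i into row/column max_j, keeping the larger value
        weights[max_j, :] = np.maximum(weights[max_i, :], weights[max_j, :])
        weights[:, max_j] = np.maximum(weights[:, max_i], weights[:, max_j])
        # fold anything that ended up above the diagonal back below it
        weights = np.maximum(np.tril(weights, -1), np.triu(weights, 1).T)
        # drop row and column max_i (note: np.delete copies the array)
        weights = np.delete(np.delete(weights, max_i, axis=0), max_i, axis=1)
        # merge the word lists, keeping the one with the smaller index (max_j < max_i)
        word_list[max_j] += word_list[max_i]
        word_list.pop(max_i)
    return weights, word_list

If the matrix is very large, the repeated np.delete copies may dominate; an alternative is to keep the array at full size and mask out merged rows/columns (e.g. by zeroing them) instead of shrinking the matrix each iteration.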