I am again trying to improve the execution time of this piece of code. Since the computation is really time-consuming, I believe that parallelizing the code would be the best solution.
I first worked with maps, as described in this question, but then I tried a simpler approach, hoping to find a better solution. I could not come up with anything, and since this is a different problem, I decided to post it as a new question.
I work on a Windows platform using Python 3.4.
Here is the code:
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]
for i in range(0, word_count):
    for j in range(0, word_count):
        if i > j:
            similarity = calculate_similarity(t_matrix[i], t_matrix[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
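To make it concrete, this is roughly the kind of parallelization I have in mind (an untested sketch using multiprocessing.Pool; row_similarities is just an illustrative helper name, and it assumes t_matrix, word_count and calculate_similarity are module-level names that the worker processes can import):

import multiprocessing as mp

def row_similarities(i):
    # Similarities of word i against all words j < i (the i > j case of the loop above).
    return i, [calculate_similarity(t_matrix[i], t_matrix[j]) for j in range(i)]

if __name__ == '__main__':
    with mp.Pool() as pool:
        for i, row in pool.map(row_similarities, range(word_count)):
            for j, similarity in enumerate(row):
                similarity_matrix[i][j] = similarity
                similarity_matrix[j][i] = similarity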
This is the calculate_similarity function:
def calculate_similarity(array_word1, array_word2):
    denominator = sum([array_word1[i] + array_word2[i] for i in range(word_count)])
    if denominator == 0:
        return 0
    numerator = sum([2 * min(array_word1[i], array_word2[i]) for i in range(word_count)])
    return numerator / denominator
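To show what the function returns, here is a tiny made-up example with word_count = 3 (the numbers are only illustrative):

word_count = 3
row_a = [1.0, 0.0, 2.0]
row_b = [0.5, 1.0, 2.0]
# numerator   = 2 * (0.5 + 0.0 + 2.0) = 5.0
# denominator = 1.5 + 1.0 + 4.0       = 6.5
print(calculate_similarity(row_a, row_b))  # 0.7692...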
And the code explanation:
- word_count is the total number of unique words stored in the list
- t_matrix is a matrix containing a value for each word pair (a small made-up example follows this list)
- the output should be similarity_matrix, whose size is word_count x word_count; it also contains a similarity value for each pair of words
- I normally store both matrices in memory
- after these calculations, I can easily find the most similar word for each word (or the top three similar words, as may be required)
- calculate_similarity accepts two lists of floats, each for a separate word (each row in t_matrix)
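For a concrete picture of the structures involved, here is a made-up example with word_count = 3:

word_count = 3
# t_matrix: word_count rows, one per word, each a list of word_count floats
t_matrix = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 0.5],
            [2.0, 0.5, 0.0]]
# similarity_matrix: also word_count x word_count, filled symmetrically by the loop above
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]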
I work with a list of 13k words, and if I calculated the runtime on my system correctly, it would take a few days. So anything that gets the work done in one day would be wonderful!
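Back-of-envelope, the amount of work looks roughly like this (the numbers are only there to show the scale):

word_count = 13000
pairs = word_count * (word_count - 1) // 2   # ~84.5 million calls to calculate_similarity
element_ops = pairs * 2 * word_count         # each call scans both rows of length word_count
print(pairs, element_ops)                    # about 8.4e7 calls and 2.2e12 element operations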
Perhaps just simplifying the calculation of the numerator and the denominator in calculate_similarity would already bring a significant improvement.
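For instance, a vectorized version of the two sums could look like this (an untested sketch that assumes t_matrix can be converted to a 2-D NumPy float array; calculate_similarity_np is just an illustrative name):

import numpy as np

t = np.asarray(t_matrix, dtype=float)  # one row of floats per word

def calculate_similarity_np(row_i, row_j):
    # Same formula as calculate_similarity, but with the per-element
    # list comprehensions replaced by array operations.
    denominator = (row_i + row_j).sum()
    if denominator == 0:
        return 0.0
    numerator = 2.0 * np.minimum(row_i, row_j).sum()
    return numerator / denominator

# usage: similarity = calculate_similarity_np(t[i], t[j])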