This is related to my previous question: How to identify substrings in the order of the string?
Given a list of sentences and a list of selected_concepts, I want to identify, for each sentence, the selected_concepts it contains, in the order in which they appear in that sentence.
The code below works correctly for me.
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
However, in my real dataset I have 13242627 selected_concepts and 1234952 sentences. Therefore, I would like to know if there is any way to optimise this code so that it runs in less time. As I understand it, this is roughly O(n^2), because every sentence is searched for every concept. Therefore, I am concerned about the time complexity (space complexity is not a problem for me).
A sample is given below.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use', 'data mining is the analysis step of the knowledge discovery in databases process or kdd']
selected_concepts = ['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'interdisciplinary subfield', 'knowledge discovery', 'databases process', 'information', 'process']
output = [['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'knowledge discovery', 'databases process']]
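One direction that might avoid searching every sentence for every concept is sketched below. It is only a sketch under assumptions that are not stated in the question: concepts are matched as whole, space-separated words (not as arbitrary substrings, which is what str.find does), each concept is reported once per sentence at its first occurrence, and when concepts overlap the longest one wins. The name find_concepts and the shortened sample data are made up for illustration.

```python
def find_concepts(sentences, selected_concepts):
    """Return, per sentence, the concepts found in order of first appearance.

    Assumption: concepts match whole space-separated words only, and the
    longest concept starting at a given word wins over shorter ones.
    """
    # Store each concept as a tuple of words so membership tests are hash lookups.
    concept_set = {tuple(c.split()) for c in selected_concepts}
    max_len = max((len(c) for c in concept_set), default=0)

    output = []
    for sentence in sentences:
        words = sentence.split()
        found = []
        seen = set()
        i = 0
        while i < len(words):
            step = 1
            # Try the longest candidate phrase starting at word i first.
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = tuple(words[i:i + n])
                if candidate in concept_set:
                    if candidate not in seen:
                        seen.add(candidate)
                        found.append(" ".join(candidate))
                    step = n  # skip past the matched phrase
                    break
            i += step
        output.append(found)
    return output


# Shortened, hypothetical sample data based on the question.
sample_sentences = [
    "data mining is the process of discovering patterns in large data sets",
    "data mining is the analysis step of the knowledge discovery in databases process or kdd",
]
sample_concepts = ["patterns", "data mining", "knowledge discovery",
                   "databases process", "process"]

result = find_concepts(sample_sentences, sample_concepts)
print(result)
```

With this approach the work per sentence depends on the number of words in the sentence and on the length of the longest concept, not on how many concepts there are. The trade-off is that it no longer finds a concept embedded inside a longer word, so it is not a drop-in replacement for the str.find version.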
key=lambda x: x[0] is unnecessary. Tuples are compared in lexicographical order anyway.
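A small check of that point, using a shortened sentence and concept list made up for illustration: sorting the (index, item) tuples with and without the key gives the same order here.

```python
sentence = "data mining is the process of discovering patterns"
selected_concepts = ["patterns", "process", "data mining"]

# Collect (position, concept) pairs exactly as the question's loop does.
matches = []
for item in selected_concepts:
    index = sentence.find(item)
    if index >= 0:
        matches.append((index, item))

# Sort by position only (explicit key) versus plain tuple comparison.
with_key = [item for _, item in sorted(matches, key=lambda x: x[0])]
without_key = [item for _, item in sorted(matches)]

print(with_key)
print(without_key)
```

The two can differ only when two concepts are found at the same index, which happens when one concept is a prefix of another: with the key, the stable sort keeps them in their original list order, while without it the tie is broken by comparing the strings.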