1

This is related to my previous question in: How to identify substrings in the order of the string?

For a given set of sentences and a set of selected_concepts I want to identify selected_concepts in the order of the sentences.

I am doing it fine with the code provided below.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
             sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

However, in my real dataset I have 13242627 selected_conceptsand 1234952 sentences. Therefore, I would like to know if there is any way to optimise this code to perform in lesser time. As I understand this is O(n^2). Therefore, I am concerned about the time complexity (space complexity is not a problem for me).

A sample is mentioned below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]
6
  • 1
    Minor point, but key=lambda x: x[0] is unnecessary. Tuples are compared in lexicographixal order anyway. Commented Jan 6, 2019 at 7:56
  • Can you provide a sample input? And output? The optimal structures and algorithms depend on your data. Commented Jan 6, 2019 at 7:58
  • E.g., I have no idea what a "selected concept" is. Commented Jan 6, 2019 at 7:59
  • @MadPhysicist Those are just possible substrings I guess? Commented Jan 6, 2019 at 8:01
  • 1
    @Delgan. I would approach this very differently if they were space separated words vs if they were arbitrary substrings. Commented Jan 6, 2019 at 8:02

1 Answer 1

1

What about using pre-compiled ReGEx?

Here is an example:

import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # spelling error: “knowledge”
    'databases process',
    'information',
    'process']

re_concepts = [re.escape(t) for t in selected_concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]

You get:

[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]
Sign up to request clarification or add additional context in comments.

5 Comments

Thanks a lot for the answer. I am wondering why knowledge discovery does not appear in last sentence''s output. Any idea to fix it?
Just because it doesn’t appear in any sentence… There is a small error in the RegEx
Thanks. It soved the issue. What do you mean by 'There is a small error in RegEx'? I did not get it.
See my edit, there is a spelling error in selected_concepts.
Thank you. I will use your code for real dataset and let you know if I could reduce the time :) Thanks once again.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.