How to efficiently identify substrings in the order of the string in python

Question

This is related to my previous question in: How to identify substrings in the order of the string?

For a given set of sentences and a set of selected_concepts I want to identify selected_concepts in the order of the sentences.

I am doing it fine with the code provided below.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
             sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

However, in my real dataset I have 13242627 selected_conceptsand 1234952 sentences. Therefore, I would like to know if there is any way to optimise this code to perform in lesser time. As I understand this is O(n^2). Therefore, I am concerned about the time complexity (space complexity is not a problem for me).

A sample is mentioned below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

Minor point, but key=lambda x: x[0] is unnecessary. Tuples are compared in lexicographixal order anyway. — Mad Physicist
– Mad Physicist, Commented Jan 6, 2019 at 7:56
Can you provide a sample input? And output? The optimal structures and algorithms depend on your data. — Mad Physicist
– Mad Physicist, Commented Jan 6, 2019 at 7:58
@Delgan. I would approach this very differently if they were space separated words vs if they were arbitrary substrings. — Mad Physicist
– Mad Physicist, Commented Jan 6, 2019 at 8:02

Laurent LAPORTE · Accepted Answer · 2019-01-06 08:37:06Z

What about using pre-compiled ReGEx?

Here is an example:

import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # spelling error: “knowledge”
    'databases process',
    'information',
    'process']

re_concepts = [re.escape(t) for t in selected_concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]

You get:

[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]

Thanks a lot for the answer. I am wondering why knowledge discovery does not appear in last sentence''s output. Any idea to fix it?
Just because it doesn’t appear in any sentence… There is a small error in the RegEx
Thanks. It soved the issue. What do you mean by 'There is a small error in RegEx'? I did not get it.
See my edit, there is a spelling error in selected_concepts.
Thank you. I will use your code for real dataset and let you know if I could reduce the time :) Thanks once again.

Collectives™ on Stack Overflow

How to efficiently identify substrings in the order of the string in python

1 Answer 1

5 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Linked

Related