I am trying to insert only certain values from a string into a table (i.e. excluding common words) after tokenization in a Python script.

The incoming string might look like "this is a string I want to parse because it mentions IOT". Of those individual tokens, I want to exclude common words like "this", "is", "a", "I", and "want", but keep less common tokens like "string" and "parse".

Currently, I plan to have a table of common words I can reference.

While I could do something like INSERT $term$ WHERE NOT IN(SELECT * FROM excludedterm), it seems like there should be a simpler method than building a query per term (and, therefore, a separate check to the db on every term).

Is there a Pythonic equivalent of SQL's NOT IN()? Maybe reading the excludes table into a list, then comparing tokens against it in some kind of NOT IN($list$) form?

1 Answer

You can preprocess the data with the Python snippet below.

At the start, read all the words from the common-words table once and use them to populate the ignored list below.

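inp = "this is a string I want to parse because it mentions IOT"
# Words to exclude; in practice, load these once from the common-words table.
ignored = ['this', 'is', 'are', 'a', 'to', 'it', 'from']
# Keep only the tokens that are not in the ignored list (the match is case-sensitive).
result = [item for item in inp.split() if item not in ignored]
print(result)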

Add every term you want to exclude to ignored. Here, a list comprehension computes result; a plain for loop would achieve the same thing.
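
For comparison, here is a minimal sketch of the for-loop version mentioned above. It reuses the same sample input and ignore list as the snippet and should produce the same list.

```python
inp = "this is a string I want to parse because it mentions IOT"
ignored = ['this', 'is', 'are', 'a', 'to', 'it', 'from']

# Explicit loop equivalent of the list comprehension.
result = []
for item in inp.split():
    if item not in ignored:
        result.append(item)

print(result)
```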

The result is a list of the remaining tokens. Iterate over it to insert them into your database.
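
As a rough end-to-end sketch, the whole round trip could look like the following, using the standard-library sqlite3 module with an in-memory database. Only the excludedterm table name comes from the question; its term column, the keptterm target table, and the sample excluded words are assumptions made for illustration. It also assumes you want case-insensitive matching, so both sides are lowercased before comparing.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
# `term` and `keptterm` are hypothetical names; adapt them to your schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE excludedterm (term TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE keptterm (term TEXT)")
conn.executemany(
    "INSERT INTO excludedterm (term) VALUES (?)",
    [(w,) for w in ("this", "is", "a", "i", "want", "to", "it", "because")],
)

# One query loads every excluded word; a set gives fast membership tests.
ignored = {row[0].lower() for row in conn.execute("SELECT term FROM excludedterm")}

inp = "this is a string I want to parse because it mentions IOT"
# Compare in lowercase, but keep each token's original casing.
result = [item for item in inp.split() if item.lower() not in ignored]

# One batched statement inserts all surviving tokens.
conn.executemany("INSERT INTO keptterm (term) VALUES (?)", [(t,) for t in result])
conn.commit()

stored = [row[0] for row in conn.execute("SELECT term FROM keptterm ORDER BY rowid")]
print(stored)
```

This way the database is queried once for the exclusions and written to once for the batch, rather than once per token.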
