0

I am trying to select rows where the "story" column contains any of the strings in my list "selected_words".

I've tried several options including isin and str.contains but I generally only get errors or else an empty dataframe.

df4=pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")
df4["story"] = df4["story"].astype(str) 
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts',\
'trusts', 'believes', 'acceptance', 'trusted', 'trusting', 'accepting',\ 'believes', 'believing', 'believed', 'normal', 'normalize', ' normalized',\ 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']
#At this point I am lost as to what to do next

I get either an empty dataframe, or an error message depending on what I try to do.

2
  • can you show us what you've tried and share the results (either empty df or the traceback?) Commented Nov 4, 2019 at 21:27
  • Part of the issue may be this part of the code: df4=pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing") While that URL displays CSV data in a browser, it doesn't return CSV data when called with an HTTP client. Commented Nov 4, 2019 at 21:36

3 Answers 3

1

Try this. I'm unable to load your DF.

df4[df4["story"].isin(selected_words)]
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the answer, I tried it and got an empty dataframe, as I had gotten before. Let me see about trying to link the csv differently
0

Here you can see a solution https://stackoverflow.com/a/26577689/12322720

Basically str.contains supports Regular Expressions so you can concatenate with an or pipe

df4[df4.story.str.contains('|'.join(selected_words))]

1 Comment

Thanks for the suggestion, this gave me a list of booleen returns, so I slightly modified it to this: words = df4[df4.story.str.contains('|'.join(selected_words))] and got the necessary result. Thanks so much!!!!!
0

I'm currently learning more Pandas myself so I wanted to contribute an answer I just learned from a book.

It's possible to create a "mask" using a Pandas Series and use that to filter the Dataframe.

import pandas as pd

# This URL doesn't return CSV.
CSV_URL = "https://drive.google.com/open?id=1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW"
# Data file saved from within a browser to help with question.

# I stored the BitcoinData.csv data on my Minio server.
df = pd.read_csv("https://minio.apps.selfip.com/mymedia/csv/BitcoinData.csv")


selected_words = [
    "accept",
    "believe",
    "trust",
    "accepted",
    "accepts",
    "trusts",
    "believes",
    "acceptance",
    "trusted",
    "trusting",
    "accepting",
    "believes",
    "believing",
    "believed",
    "normal",
    "normalize",
    " normalized",
    "routine",
    "belief",
    "faith",
    "confidence",
    "adoption",
    "adopt",
    "adopted",
    "embrace",
    "approve",
    "approval",
    "approved",
    "approves",
]

# %%timeit run in Jupyter notebook

mask = pd.Series(any(word in item for word in selected_words) for item in df["story"])

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# %%timeit run in Jupyter notebook

df[mask]

# results: 955 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# %%timeit run in Jupyter notebook

df[df.story.str.contains('|'.join(selected_words))]

# results 129 ms ± 738 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# True for all
df[mask] == df[df.story.str.contains('|'.join(selected_words))]

# It is possible to calculate the mask inside of the index operation though of course a time penalty is taken rather than using the calculated and stored mask.

# %%timeit run in Jupyter notebook

df[[any(word in item for word in selected_words) for item in df["story"]]]

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# This is still faster than using the alternative `df.story.str.contains`

#

The mask way of searching is significantly faster.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.