I am trying to modify each string in my Series object tweet_text, but when I print the Series after changing the tweets in my for loop, I get the same strings I had before the loop. How can I fix this?

import pandas as pd
import re
import string

df = pd.read_csv('sample-tweets.csv',
                 names=['Tweet_Date', 'User_ID', 'Tweet_Text', 'Favorites', 'Retweets', 'Tweet_ID'])

sum_df = df[['User_ID', 'Tweet_ID', 'Tweet_Text']].copy()
sum_df.set_index(['User_ID'])
# print sum_df

tweet_text = df.ix[:, 2]
print type(tweet_text)

# efficiency could be improved by using the translate method
# regex = re.compile('[%s]' % re.escape(string.punctuation))

for tweet in tweet_text:
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub('#[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub(r'\$[a-zA-Z0-9]*', '', tweet)  # escape $ so it matches a literal dollar sign, not end of string
    tweet = ''.join(i for i in tweet if not i.isdigit())
    tweet = tweet.replace('"', '')
    tweet = re.sub(r'[\(\[].*?[\)\]]', '', tweet)  # takes out everything between parentheses also, fix this

    # gets rid of all punctuation and emojis
    tweet = "".join(l for l in tweet if l not in string.punctuation)
    tweet = re.sub(r'[^\x00-\x7F]+',' ', tweet)

    # lowercases and gets rid of all extra spacing
    tweet = tweet.lower()
    tweet = tweet.strip()
    tweet = " ".join(tweet.split())

    count = count + 1
    # print tweet

print tweet_text
Because you store each tweet in the loop variable, make some changes to it, and then the next iteration starts. You never assign the changed data back to the series. Commented Jul 6, 2017 at 19:10

2 Answers

This happens because the loop never writes anything back: each iteration only rebinds the local name tweet, and tweet_text (the column selected with df.ix[:, 2]) is left untouched. Also, a for loop is not the idiomatic pandas way to transform a Series; use apply() instead.

To update your code, move everything inside the loop body into a function:

def parse_tweet(tweet):
    ## everything from loop goes here
    return tweet

Then, instead of:

tweet_text = df.ix[:, 2]

do:

df.iloc[:, 2] = df.iloc[:, 2].apply(parse_tweet)
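
A minimal runnable sketch of this approach. The sample rows are made up to stand in for sample-tweets.csv, and parse_tweet here applies only a subset of the cleaning steps from the question, so treat both as assumptions for illustration. The column is selected by name rather than by position, which is equivalent for this frame:

```python
import re
import string

import pandas as pd

# Hypothetical sample data standing in for sample-tweets.csv.
df = pd.DataFrame({
    "User_ID": [1, 2],
    "Tweet_ID": [10, 20],
    "Tweet_Text": [
        "Hello @bob check https://t.co/abc123 #fun",
        "  Great   DAY, everyone!  ",
    ],
})


def parse_tweet(tweet):
    """Return a cleaned copy of one tweet (subset of the question's rules)."""
    tweet = re.sub(r"https://t\.co/[a-zA-Z0-9]*", "", tweet)  # strip links
    tweet = re.sub(r"[@#][a-zA-Z0-9]*", "", tweet)  # strip mentions and hashtags
    tweet = "".join(c for c in tweet if c not in string.punctuation)
    return " ".join(tweet.lower().split())  # lowercase, collapse whitespace


# apply() returns a new Series; assigning it back updates the column.
df["Tweet_Text"] = df["Tweet_Text"].apply(parse_tweet)
print(df["Tweet_Text"].tolist())
```

The important part is the assignment on the left-hand side: apply() alone leaves df unchanged, just like the original loop.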

BTW, do not use the ix indexer: it was deprecated in pandas 0.20 and removed in pandas 1.0. Use iloc (by position) or loc (by label) instead.

1 Comment

In regards to your most recent pandas answer. People can't up-vote without 15 rep. People asking the questions are your most certain up-vote. If you answer a question of someone without the required rep to up-vote you... do them a favor and up-vote their question to help get them over the line.

Python strings are immutable, so re.sub and the string methods return new strings. Your loop only rebinds the variable tweet to those new strings; it never actually updates the dataframe.
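
A small sketch of that behaviour, using a made-up two-element Series (the data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical two-element Series used only for illustration.
tweet_text = pd.Series(["Hello WORLD", "Good Morning"])

for tweet in tweet_text:
    # lower() builds a new string and rebinds the local name `tweet`;
    # the element stored in the Series is not touched.
    tweet = tweet.lower()

unchanged = tweet_text.tolist()
print(unchanged)
```

After the loop, tweet holds only the last lowercased string, while the Series still contains the original values.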

You just have to write the updated value back into your dataframe. Example of a simple fix:

for i, tweet in enumerate(tweet_text):
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)

    # ...

    # update dataframe; enumerate() yields positions, so write back with iloc
    # (ix was removed in pandas 1.0)
    df.iloc[i, 2] = tweet
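
A self-contained version of this write-back loop, as a sketch under a few assumptions: the DataFrame is made up, only two of the substitutions are shown, and iloc is used for the positional write because ix is not available in current pandas:

```python
import re

import pandas as pd

# Hypothetical data; Tweet_Text sits at column position 2, as in the question.
df = pd.DataFrame({
    "User_ID": [1, 2],
    "Tweet_ID": [10, 20],
    "Tweet_Text": ["hi @alice https://t.co/xyz", "no changes here"],
})

text_col = df.columns.get_loc("Tweet_Text")  # positional index of the column

# Iterate over a plain list so the frame is not read while being modified.
for i, tweet in enumerate(df["Tweet_Text"].tolist()):
    tweet = re.sub(r"https://t\.co/[a-zA-Z0-9]*", "", tweet)
    tweet = re.sub(r"@[a-zA-Z0-9]*", "", tweet)
    # enumerate() yields positions, so write back by position with iloc.
    df.iloc[i, text_col] = tweet.strip()

print(df["Tweet_Text"].tolist())
```

This keeps the explicit loop from the question; for larger frames the apply() approach is usually the more idiomatic choice.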

1 Comment

Thank you! I kept trying to see if dataframes were immutable, but forgot to check if strings are immutable (I would've expected otherwise in python haha)
