
I want to run a sentiment analysis on German headlines and would like to use textblob-de for it. After installing it as described here: https://textblob-de.readthedocs.io/en/latest/readme.html#installing-upgrading, I run into the error below with a very simple script:

from textblob_de import TextBlobDE as TextBlob

text = "drama vor lampedusa : mehr als 1000 flüchtlinge im mittelmeer vor italien gerettet"

blob = TextBlob(text)

print(blob.sentiment)

The error states:

Exception has occurred: ModuleNotFoundError No module named 'textblob.translate'

The weird thing is that I only just installed the package, and textblob.translate is actually deprecated, as stated in the changelog: https://textblob.readthedocs.io/en/dev/changelog.html

Regardless, textblob-de seems to be using it. Do you see a way around this, or is this module no longer feasible for German text?

  • The textblob.translate module has been deprecated, and they recommend using the Google Cloud Translation API instead. There are other Python NLP libraries, but you would also have to use external translation services. Besides Google, there are also Microsoft Translator, Amazon Translate, and DeepL.

2 Answers


I had the same problem after a system upgrade and fixed it by patching textblob_de. My pull request with the changes is here, if you need a quick fix as well: https://github.com/markuskiller/textblob-de/pull/25

Edit for content: my patch removes the dependency on translate (which OP doesn't seem to use anyway) and on another deprecated module (compat).
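If you cannot patch the installed package, a stopgap that may work is registering a stub for the removed module before importing textblob_de. This is an untested sketch: it assumes Translator is the only name textblob_de pulls from textblob.translate, and that plain sentiment analysis never actually calls it. If a similar error comes up for textblob.compat, stub that module the same way.

import sys
import types

# Stub out the removed textblob.translate module before textblob_de imports it.
# Assumption: Translator is the only name textblob_de needs from that module.
stub = types.ModuleType("textblob.translate")

class Translator:
    def translate(self, *args, **kwargs):
        raise NotImplementedError("textblob.translate was removed from textblob")

    def detect(self, *args, **kwargs):
        raise NotImplementedError("textblob.translate was removed from textblob")

stub.Translator = Translator
sys.modules["textblob.translate"] = stub

# Now the import should succeed, as long as translation is never invoked
from textblob_de import TextBlobDE as TextBlob
print(TextBlob("Das ist ein wunderbarer Tag.").sentiment)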


5 Comments

This is clearly a useful answer, but it risks being deleted as a link-only answer. Can you add a summary of what your patch does here, so this answer can stand by itself?
Thank you, edited. Do you think that suffices?
Yeah, it's minimal but I think it should do. Now let's hope your patch gets merged.
Indeed, it got merged. Let’s hope for a new release so people using pip actually receive it. 🙂
Just installed textblob-de with pip install -U textblob-de and got the same error message, so there is no new release yet. According to the docs, the sentiment scores are a work in progress. If one is not familiar with patching as described above (like myself), then BERT-based sentiment or a lexicon is a more promising approach. Uninstalling and reinstalling the package did not fix the problem.

I know the question was about textblob-de, but if one needs to compute sentiment for German texts and doesn't want to apply the above solution, here is the code for two approaches: lexical and Stanza. I wrote it yesterday.

Lexical

Download the lexicon from the website of the University of Stuttgart; the link has been active for more than 9 years. Otherwise, here is the citation of the paper: Maximilian Köper, Sabine Schulte im Walde (2016), Automatically Generated Norms of Abstractness, Arousal, Imageability and Valence for 350 000 German Lemmas, in: Proceedings of the 10th Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.

import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer

### read in lexicon
lexicon = pd.read_csv(r'_your_path_to_file\valence arousal concreteness German.txt', sep="\t", decimal=".", header=0)
lexicon = lexicon.set_index("Word")  # use the "Word" column as the index for lookups
lexicon.head()  # have a look at the ratings (displays in a notebook)

### define function
def Lexical(df, text_col, lexicon_col):
    # Lists to store the mean and median score for each row
    mean_scores = []
    median_scores = []
    tokenizer = RegexpTokenizer(r'\w+')

    # Iterate through each row in the specified column of the DataFrame
    for i in df[text_col]:
        # Lexicon values of the words in the current row
        score = []
        # make sure it is a string
        text = str(i)

        # Tokenize the text into words
        words = tokenizer.tokenize(text)

        # Look up each word in the lexicon (uses the global lexicon defined above)
        for word in words:
            try:
                # Get the value from the specified column of the lexicon
                val = lexicon.loc[word].iloc[lexicon_col]
                score.append(val)
            except KeyError:
                # If the word is not found in the lexicon, skip it
                pass

        # Calculate mean and median scores for the current row
        # (np.mean/np.median return NaN if no word was found)
        mean_scores.append(np.mean(score))
        median_scores.append(np.median(score))

    # Create a DataFrame with the mean and median scores for each row
    scores_df = pd.DataFrame({'mean': mean_scores, 'median': median_scores})

    return scores_df
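A hypothetical call, to illustrate the expected inputs; the column name and the lexicon column index are made up, so check the file's header to pick the rating you want:

# Hypothetical usage: column index 0 is assumed to hold the valence rating.
# Words not found in the lexicon's exact spelling are skipped, so the result
# may be NaN if nothing matches.
df = pd.DataFrame({"headline": ["drama vor lampedusa : mehr als 1000 flüchtlinge im mittelmeer vor italien gerettet"]})
print(Lexical(df, "headline", 0))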

Stanza from StanfordNLP

As I have larger texts, I wanted to compute the sentiment per sentence and then take the mean or median. Otherwise one runs the risk of getting a neutral sentiment for every text unit, which is not very useful from an analytical point of view.

import stanza
from nltk.tokenize import sent_tokenize  # requires the NLTK 'punkt' data (nltk.download('punkt'))

# run stanza.download('de') once to fetch the German models;
# tokenize_no_ssplit=True treats each input as a single sentence, since the
# texts are split into sentences with NLTK beforehand
nlp = stanza.Pipeline(lang='de', processors='tokenize,sentiment', tokenize_no_ssplit=True)

def senti_stanza(df):
    # df: an iterable of texts, e.g. a DataFrame column (Series)
    meanRating = []    # overall mean score per text
    medianRating = []  # overall median score per text
    for i in df:
        # scores of the individual sentences of the current text
        scores = []
        i = str(i)  # convert to string
        sentences = sent_tokenize(i, language='german')  # tokenize into sentences

        for s in sentences:
            doc = nlp(s)  # run each sentence through the Stanza pipeline
            for sentence in doc.sentences:
                scores.append(sentence.sentiment)

        meanRating.append(np.mean(scores))
        medianRating.append(np.median(scores))

    scores_df = pd.DataFrame({'mean': meanRating, 'median': medianRating})

    return scores_df
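For example (hypothetical usage; the column name is made up, and Stanza scores each sentence as 0 = negative, 1 = neutral, 2 = positive):

# Pass a column of texts (a pandas Series); each text is split into
# sentences, scored, and aggregated into a mean and a median.
df = pd.DataFrame({"text": ["Das Konzert war großartig. Die Anreise war allerdings eine Katastrophe."]})
print(senti_stanza(df["text"]))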

Stanza computes the sentiment per input unit, i.e. the whole text. Splitting the pipeline into one tokenizer pipeline and one sentiment pipeline returned an error, since the sentiment processor has tokenize as a prerequisite. So the text needs to be tokenized into sentences before it is fed into the Stanza model; either use the NLTK or spaCy tokenizer, or build your own with regular expressions.

