I know the question was about TextBlob-DE, but if you need to compute sentiment for German texts and don't want to use the solution above, here is the code for two approaches: a lexical one and Stanza. I wrote it yesterday.
Lexical
Download the lexicon from the website of the University of Stuttgart. The link has been active for more than 9 years. Otherwise, here is the citation of the paper: Maximilian Köper, Sabine Schulte im Walde (2016), Automatically Generated Norms of Abstractness, Arousal, Imageability and Valence for 350 000 German Lemmas, in: Proceedings of the 10th Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
### read in lexicon
lexicon = pd.read_csv(r'_your_path_to_file\valence arousal concreteness German.txt', sep="\t", decimal=".", header=0)
lexicon = lexicon.set_index("Word")  # use the "Word" column as the index for fast lookups
lexicon  # have a look at the ratings
### define function
def Lexical(df, text_col, lexicon_col):
    # Lists to store the mean and median score for each row
    mean_scores = []
    median_scores = []
    tokenizer = RegexpTokenizer(r'\w+')
    # Iterate through each row in the specified column of the DataFrame
    for i in df[text_col]:
        # Scores of all lexicon words found in the current text
        score = []
        # Make sure it is a string
        text = str(i)
        # Tokenize the text into words
        words = tokenizer.tokenize(text)
        # Iterate through each word in the tokenized text
        for word in words:
            try:
                # Look up the word in the lexicon and get the value
                # from the specified column
                val = lexicon.loc[word].iloc[lexicon_col]
                score.append(val)
            except KeyError:
                # If the word is not in the lexicon, skip it
                pass
        # Mean and median scores for the current row
        mean_scores.append(np.mean(score))
        median_scores.append(np.median(score))
    # DataFrame with the mean and median scores for each row
    scores_df = pd.DataFrame({'mean': mean_scores, 'median': median_scores})
    return scores_df
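A minimal usage sketch (the DataFrame and the column index are hypothetical; pass the positional index of the rating column you want, e.g. valence, and check lexicon.columns to find the right one for your file):

# Hypothetical example data: one German text per row
df = pd.DataFrame({'text': ['Das war ein wunderbarer Tag.',
                            'Der Film war leider furchtbar.']})

# 0 is an assumed positional index of the rating column (e.g. valence)
scores = Lexical(df, 'text', 0)
print(scores)

Note that the lexicon contains lemmas, so inflected word forms that are not listed are silently skipped by the KeyError branch; a text in which no word is found yields NaN.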
Stanza from StanfordNLP
As I work with longer texts, I wanted to compute the sentiment per sentence and then take the mean or median. Otherwise one runs the risk of getting a neutral sentiment for every text unit, which is not very useful from an analytical point of view.
import stanza
from nltk.tokenize import sent_tokenize  # used below to split texts into sentences

nlp = stanza.Pipeline(lang='de', processors='tokenize,sentiment', tokenize_no_ssplit=True)
def senti_stanza(df):
    meanRating = []    # overall mean score per text
    medianRating = []  # overall median score per text
    for i in df:
        # Initialize list to store the score for each sentence in a text
        scores = []
        i = str(i)  # convert to string
        sentences = sent_tokenize(i, language='german')  # tokenize into sentences
        for s in sentences:
            stanza_sentences = nlp(s)  # convert each sentence into a Stanza document
            for sentence in stanza_sentences.sentences:
                scores.append(sentence.sentiment)
        meanRating.append(np.mean(scores))
        medianRating.append(np.median(scores))
    scores_df = pd.DataFrame({'mean': meanRating, 'median': medianRating})
    return scores_df
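Usage is analogous; senti_stanza expects an iterable of texts, e.g. a DataFrame column (hypothetical example below). Stanza's sentiment is an integer class per sentence (0 = negative, 1 = neutral, 2 = positive), so the mean and median live on that 0–2 scale:

texts = pd.Series(['Das Essen war hervorragend. Der Service war schrecklich.',
                   'Ein ganz normaler Tag.'])
scores = senti_stanza(texts)
print(scores)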
Stanza computes the sentiment per input unit, i.e. for the whole text. Splitting the pipeline into a separate tokenizer and a separate sentiment pipeline returned an error: the sentiment processor has tokenize as a prerequisite. So the text needs to be split into sentences before it is fed into the Stanza model. Either use the NLTK or spaCy tokenizer, or build your own with regular expressions.
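If you prefer spaCy over NLTK for the sentence splitting, here is a minimal sketch (it assumes the de_core_news_sm model has been installed via python -m spacy download de_core_news_sm):

import spacy

nlp_de = spacy.load('de_core_news_sm')

def split_sentences(text):
    # spaCy derives sentence boundaries from its German pipeline
    return [sent.text for sent in nlp_de(str(text)).sents]

# drop-in replacement for the sent_tokenize call above:
# sentences = split_sentences(i)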
textblob.translate()module has been deprecated and they recommend to use the Google Cloud Translation API. There are other Python NLP libraries but you also have to use external translation services. Besides Google there is also Microsoft Translator, Amazon Translate and DeepL.
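For completeness, the translate-then-analyze route could look roughly like the sketch below. This is untested and assumes you have installed google-cloud-translate, set up Google Cloud credentials, and that the v2 client is acceptable for your use case:

from google.cloud import translate_v2 as translate
from textblob import TextBlob

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def sentiment_via_translation(text):
    # Translate the German text to English, then score it with TextBlob
    result = client.translate(text, source_language='de', target_language='en')
    return TextBlob(result['translatedText']).sentiment.polarity

# sentiment_via_translation('Das war ein wunderbarer Tag.')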