I already use the movie_reviews corpus for sentiment analysis. I replaced its existing text files with Arabic-language text files, but now I can't read or print them; I get an encoding error.

My code:

import nltk
from nltk.corpus import movie_reviews

documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])

print(documents[0])

I have this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
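
A byte 0xef at position 0 is consistent with a UTF-8 byte-order mark (EF BB BF) at the start of the file being decoded as ASCII. The sketch below reproduces the same message with only the standard library; the sample text is made up for illustration, and whether the real files carry a BOM is an assumption.

```python
import codecs

# Hypothetical sample: Arabic text saved as UTF-8 with a leading BOM.
sample_text = "فيلم رائع"
raw = codecs.BOM_UTF8 + sample_text.encode("utf-8")

# The first byte of the BOM is the byte named in the error.
first_byte = hex(raw[0])
print(first_byte)

# Decoding as ASCII fails on that very first byte.
ascii_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    ascii_error = str(err)
print(ascii_error)

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
text = raw.decode("utf-8-sig")
print(text == sample_text)
```

So the files themselves are probably fine; what needs to change is the encoding used to read them.
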
  • Possible duplicate of Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) Commented Apr 30, 2017 at 22:57
  • I can solve the problem for a single text file by specifying its path and changing the encoding to UTF-8, but I couldn't do the same with the corpus. Could you give me suggestions? Commented Apr 30, 2017 at 23:00
  • Is this an NLTK thing? Can you post the full stack trace? That looks like a Microsoft byte-order mark (BOM), which suggests that it's a problem where a file is opened. Commented Apr 30, 2017 at 23:25
  • Yes, I import movie_reviews as a corpus from nltk. Commented Apr 30, 2017 at 23:42
  • NO Answers :(((((( Commented May 1, 2017 at 13:11
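
One way to apply the "change the encoding" idea from the comments to a whole corpus is to build your own reader rather than reuse the bundled movie_reviews loader, which (as far as I can tell, so check your installed NLTK) is declared with an ASCII encoding. This is a minimal sketch, not a confirmed fix: it assumes the Arabic files are UTF-8 and keep the movie_reviews layout of pos/ and neg/ subfolders. `load_documents` and the path in the usage comment are hypothetical names.

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def load_documents(corpus_root):
    """Return [words, category] pairs for every .txt file under corpus_root.

    Assumes the movie_reviews layout: corpus_root/pos/*.txt and
    corpus_root/neg/*.txt, with all files encoded as UTF-8.
    """
    reader = CategorizedPlaintextCorpusReader(
        corpus_root,
        r"(?!\.).*\.txt",             # every .txt file, skipping dotfiles
        cat_pattern=r"(neg|pos)/.*",  # category is the top-level folder
        encoding="utf-8",             # decode as UTF-8 instead of ASCII
    )
    documents = []
    for category in reader.categories():
        for fileid in reader.fileids(category):
            # list() forces the lazy corpus view to be read and decoded
            documents.append([list(reader.words(fileid)), category])
    return documents


# Hypothetical usage; replace the path with your own corpus folder:
# documents = load_documents("/path/to/arabic_reviews")
# print(documents[0])
```

The default word tokenizer here splits on word characters and punctuation, so it needs no extra NLTK data downloads; a tokenizer tuned for Arabic may give better features for sentiment analysis.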
