I need to test if a certain string (for example 'võiks') equals the name of any of the files contained in a directory.

>>> from os import listdir
>>> from os.path import isfile, join
>>> words = [f.replace('.html', '') for f in listdir('lemma_pages/test') if isfile(join('lemma_pages/test', f))]

>>> words
['võibolla', 'võid', 'võiks', 'võimalik', 'võin', 'võta', 'võtan', 'võtta']

>>> 'võiks' in words
False

But when I test for it, I get False when I expected otherwise. I am opening the file containing the words in this way:

open('et_500.txt', 'rt', encoding="utf-8")

Any idea what I am doing wrong?

  • What platform are you on? If this is on Mac, see UTF-8 and os.listdir() Commented May 30, 2015 at 14:25
  • What's the result of sys.getdefaultencoding() in your terminal? Commented May 30, 2015 at 14:26

1 Answer


The data may not be normalized. Before comparing the strings, normalize with:

import unicodedata
data = unicodedata.normalize('NFC', data)

To provide some more details, õ could be U+00F5 (LATIN SMALL LETTER O WITH TILDE) or it could be U+006F (LATIN SMALL LETTER O) followed by U+0303 (COMBINING TILDE). The two render identically but are different code point sequences, so == and in treat them as different strings. Normalizing both sides is necessary so that no matter which form you get, they will compare as equal.
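Here is a minimal sketch of the mismatch and the fix. It assumes the directory listing returns decomposed names, which I can't confirm from the question; the nfc helper and the sample words list are hypothetical stand-ins for your own data, and the strings are written with escapes so the two forms are unambiguous.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize('NFC', text)


# The same visible word, "võiks", in its two possible encodings.
composed = 'v\u00f5iks'      # õ as one code point, U+00F5
decomposed = 'vo\u0303iks'   # o (U+006F) followed by U+0303 COMBINING TILDE

# Without normalization the two strings are different.
print(composed == decomposed)

# After normalizing both sides they compare as equal.
print(nfc(composed) == nfc(decomposed))

# Hypothetical stand-in for the names built from listdir(),
# assumed here to arrive in decomposed form.
words = [decomposed, 'vo\u0303id']

# Normalize every name once, then normalize the search term too.
normalized_words = {nfc(w) for w in words}
print(nfc(composed) in normalized_words)
```

In your case that would mean wrapping both the names from listdir() and the words read from et_500.txt in the same normalize call before comparing them. NFD would work just as well as NFC, as long as both sides use the same form.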
