Python will check the first or second line for an Emacs/Vim-style encoding declaration. More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as the encoding name. If the encoding is unknown to Python, an error is raised during compilation. (Source: PEP 263.)
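As a quick illustration (a sketch, not Python's actual startup code), the quoted regular expression can be applied to a typical declaration line:

```python
import re

# The regular expression quoted above from PEP 263.
coding_re = re.compile(r"coding[:=]\s*([-\w.]+)")

# A typical first line of a source file:
match = coding_re.search("# -*- coding: utf-8 -*-")
print(match.group(1))  # -> utf-8
```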
(A BOM would also make Python interpret the source as UTF-8.)

I would recommend using this declaration over .decode('utf8'):
# -*- coding: utf-8 -*-
special_char_string = u"äöüáèô"
In any case, special_char_string will then contain a unicode object, no longer a str.
As you can see, the two are semantically equivalent:
>>> u"äöüáèô" == "äöüáèô".decode('utf8')
True
And the reverse:
>>> u"äöüáèô".encode('utf8')
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
>>> "äöüáèô"
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
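The round trip above can be checked with a short, self-contained snippet (written here in Python 3 syntax, where the encoded result is an explicit bytes object rather than a str):

```python
# Encoding the text yields the UTF-8 byte sequence shown above,
# and decoding those bytes recovers the original string.
special = "äöüáèô"
encoded = special.encode("utf8")
print(encoded)  # -> b'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
assert encoded.decode("utf8") == special
```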
There is a technical difference, however: u"something" tells the parser directly that it is a unicode literal, so it should be a bit faster than decoding at runtime.