I do not have any setup to replicate yours, but if your case is the same as @jlarson's then the resulting file should be correct.
This answer became somewhat long (fun topic, you say?), but it discusses various aspects around the question: what is (likely) happening, and how to actually check what is going on in various ways.
###TL;DR:
The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force the application to read the file as UTF-8 by using import or other means.
PS: The UniSearcher is a nice tool to have available on this journey.
#The long way around
The "easiest" way to be 100% sure what we are looking at is to use a hex editor on the result. Alternatively, use hexdump, xxd or the like from the command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.
As an example, if we take the script of jlarson, it takes the data array:
Code-point Glyph UTF-8
----------------------------
U+0500 Ԁ d4 80
U+05E1 ס d7 a1
U+0E01 ก e0 b8 81
U+1054 ၔ e1 81 94
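The table above can be reproduced without a hex editor. The following is a minimal Python sketch, not part of the original script; the helper name `utf8_hex` is made up for illustration, and `bytes.hex` with a separator assumes Python 3.8 or newer.

```python
# Code points taken from the table above.
code_points = [0x0500, 0x05E1, 0x0E01, 0x1054]


def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as space-separated hex."""
    return chr(code_point).encode("utf-8").hex(" ")


for cp in code_points:
    # One line per code point: U+XXXX, the glyph, and its UTF-8 bytes.
    print(f"U+{cp:04X}  {chr(cp)}  {utf8_hex(cp)}")
```

The same idea works on a file on disk: read it in binary mode and call `.hex(" ")` on the bytes to get a view comparable to what hexdump or xxd shows.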
#By sample provided â€”, â€, â€œ
We can also have a look at the sample provided in the question. It is reasonable to assume that the text is rendered in Excel / TextEdit using code page 1252.
To quote Wikipedia on Windows-1252:
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by
default in the legacy components of Microsoft Windows in English and some other
Western languages. It is one version within the group of Windows code pages.
In LaTeX packages, it is referred to as "ansinew".
##Retrieving the original bytes
To translate it back into its original form we can look at the code page layout, from which we get:
Character: <â> <€> <”> <,> < > <â> <€> < > <,> < > <â> <€> <œ>
U.Hex : e2 20ac 201d 2c 20 e2 20ac 9d 2c 20 e2 20ac 153
T.Hex : e2 80 94 2c 20 e2 80 9d* 2c 20 e2 80 9c
U is short for Unicode
T is short for Translated
For example:
â => Unicode 0xe2 => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80
Special cases like 9d do not have a corresponding code point in CP-1252; these we simply copy directly.
Note: To inspect the mangled string by copying the text to a file and doing a hex dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:
set fenc=utf-16
# Or
set fenc=ucs-2
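The U.Hex and T.Hex rows can also be computed rather than looked up. This is a rough Python sketch under the assumption that the mangled text came from reading UTF-8 bytes as CP-1252; the function name `mangled_to_bytes` is hypothetical. Python's `cp1252` codec refuses the undefined byte 0x9d, so the sketch copies such values directly, as described above.

```python
def mangled_to_bytes(text: str) -> bytes:
    """Map each character back to the CP-1252 byte it was decoded from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # No CP-1252 character for this value (e.g. 0x9d): copy it directly.
            if ord(ch) > 0xFF:
                raise
            out.append(ord(ch))
    return bytes(out)


# The mangled sample, written with escapes so the invisible U+009D survives.
mangled = "\u00e2\u20ac\u201d, \u00e2\u20ac\u009d, \u00e2\u20ac\u0153"

# U.Hex row: the Unicode value of every character.
u_hex = " ".join(format(ord(ch), "x") for ch in mangled)
print("U.Hex:", u_hex)

# T.Hex row: the bytes after translating back through CP-1252.
t_hex = mangled_to_bytes(mangled).hex(" ")
print("T.Hex:", t_hex)
```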
##Bytes to UTF-8
We then combine the result, the T.Hex line, into UTF-8. In UTF-8, each sequence starts with a leading byte that tells us how many bytes make up the glyph. For example, if a byte has the binary value 110x xxxx, we know that this byte and the next represent one code point: a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set, so any byte matching 0xxx xxxx stands alone: a total of one byte.
0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash) —
0x2c = 0010 1100bin => 1 byte => 0x2c (comma) ,
0x20 = 0010 0000bin => 1 byte => 0x20 (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte => 0x2c (comma) ,
0x20 = 0010 0000bin => 1 byte => 0x20 (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq) “
Conclusion: the original UTF-8 string was:
—, ”, “
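The leading-byte rule used in this section can be sketched in Python as follows. This is an illustrative helper, not a full UTF-8 validator: the names `sequence_length` and `split_sequences` are made up here, and continuation bytes are not checked.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its leading byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxx xxxx: ASCII, standalone
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110x xxxx: two bytes
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110 xxxx: three bytes
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 1111 0xxx: four bytes
        return 4
    raise ValueError(f"0x{lead:02x} is not a valid leading byte")


def split_sequences(data: bytes) -> list:
    """Split a byte string into its individual UTF-8 sequences."""
    sequences = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        sequences.append(data[i:i + n])
        i += n
    return sequences


# The T.Hex line from the previous section.
data = bytes.fromhex("e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c")
for seq in split_sequences(data):
    print(f"0x{seq.hex()}  {len(seq)} byte(s)  {seq.decode('utf-8')!r}")
```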
##Mangling it back
We can also do the reverse. The original string as bytes:
UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c
Corresponding values in cp-1252:
e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...
and so on, result:
â€”, â€, â€œ
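The reverse direction can be sketched the same way in Python. The function name `mangle` is hypothetical, and the fallback for bytes that CP-1252 leaves undefined (such as 0x9d) is an assumption that mirrors the "copy directly" rule used earlier.

```python
def mangle(text: str) -> str:
    """Encode text as UTF-8, then misread every byte as CP-1252."""
    out = []
    for byte in text.encode("utf-8"):
        try:
            out.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in CP-1252 (e.g. 0x9d): keep the raw value as a code point.
            out.append(chr(byte))
    return "".join(out)


# Em dash, right double quote, left double quote, separated by ", ".
original = "\u2014, \u201d, \u201c"
print(mangle(original))
```

Note that the middle group ends in the invisible control character U+009D, which is why it appears to be one character shorter than the others.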
#Importing to MS Excel
From post the
Do not save the file with an extension recognized by the application, like .csv or .txt; omit it completely or make something up.
As an example, save the file as "testfile", with no extension. Then open the file in Excel, confirm that we actually want to open this file, and voilà, we get served the encoding option. Select UTF-8, and the file should be read correctly.
Select encoding and proceed.
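For a quick test file to try this with, something like the following Python sketch could be used. The location (the system temp directory) and the one-glyph-per-line layout are assumptions made for illustration; only the file name "testfile" and the sample code points come from this answer.

```python
import tempfile
from pathlib import Path

# Sample glyphs from the code point table earlier in this answer.
rows = ["\u0500", "\u05e1", "\u0e01", "\u1054"]

# No extension, so the application does not pick an import path on its own.
path = Path(tempfile.gettempdir()) / "testfile"

# Write the bytes explicitly so no platform newline translation gets involved.
path.write_bytes(("\n".join(rows) + "\n").encode("utf-8"))

# Confirm what actually landed on disk.
print(path, path.read_bytes().hex(" "))
```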
###Check that Excel and the selected font actually support the glyph
If support for the code points exists, the text should render fine.
#Why it works (or should)
As we quickly see, the escape sequences are equal to the ones in the hex dump above:
or, testing a 4-byte code:
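A minimal Python sketch of that comparison, assuming the escape sequences in question are percent-encoded UTF-8 bytes (the script's actual output is not shown here). The 4-byte example, U+1F600, is an arbitrary pick for illustration and not taken from the question.

```python
from urllib.parse import quote


def escape_sequence(text: str) -> str:
    """Percent-encode the UTF-8 bytes of text, leaving no character unescaped."""
    return quote(text, safe="", encoding="utf-8")


# Two-byte sequence from the table: matches "d4 80" in the hex dump.
print(escape_sequence("\u0500"))       # %D4%80

# A code point outside the BMP needs four bytes in UTF-8.
print(escape_sequence("\U0001F600"))   # %F0%9F%98%80
```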
#If this does not comply