
Since @jlarson updated the question with the information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least in 2011 and earlier, rather poor support for reading Unicode formats when importing files.

Support for UTF-8 seems to be close to non-existent; I have read a few comments about it working, whilst the majority say it does not. Unfortunately I do not have a Mac to test on. So again: the files themselves should be OK as UTF-8, but the import halts the process.

I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16, little- and big-endian, with and without a BOM, etc.

The code should probably be refactored, but it should be OK for testing. It might work better than UTF-8. Of course this usually also means bigger data transfers, as each glyph is two or four bytes.

You can find a fiddle here:

Unicode export sample Fiddle

Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL having UTF-8 or UTF-16 big/little endian, with or without a BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution even if it works.


Typical use:

// Initiate
var encoder = new DataEnc({
    mime   : 'text/csv',
    charset: 'UTF-16BE',
    bom    : true
});

// Convert data to percent escaped text
encoder.enc(data);

// Get result
var result = encoder.pay();

There are two result properties on the object:

1.) encoder.lead

This is the MIME type, charset, etc. for the data URL. It is built from the options passed to the initializer; one can also say .config({ ... new conf ... }).intro() to re-build it.

data:[<MIME-type>][;charset=<encoding>][;base64]

You can specify base64, but there is no base64 conversion (at least not so far).

2.) encoder.buf

This is a string with the percent escaped data.

The .pay() function simply returns 1.) and 2.) concatenated as one string.


### Main code:


function DataEnc(a) {
    this.config(a);
    this.intro();
}
/*
* http://www.iana.org/assignments/character-sets/character-sets.xhtml
* */
DataEnc._enctype = {
        u8    : ['u8', 'utf8'],
        // RFC-2781, Big endian should be presumed if none given
        u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
        u16le : ['u16le', 'utf16le', 'ucs2le']
};
DataEnc._BOM = {
        'none'     : '',
        'UTF-8'    : '%ef%bb%bf', // Discouraged
        'UTF-16BE' : '%fe%ff',
        'UTF-16LE' : '%ff%fe'
};
DataEnc.prototype = {
    // Basic setup
    config : function(a) {
        var opt = {
            charset: 'u8',
            mime   : 'text/csv',
            base64 : 0,
            bom    : 0
        };
        a = a || {};
        this.charset = typeof a.charset !== 'undefined' ?
                        a.charset : opt.charset;
        this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
        this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
        this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;

        this.enc = this.utf8;
        this.buf = '';
        this.lead = '';
        return this;
    },
    // Create lead based on config
    // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
    intro : function() {
        var
            g = [],
            c = this.charset || '',
            b = 'none'
        ;
        if (this.mime && this.mime !== '')
            g.push(this.mime);
        if (c !== '') {
            c = c.replace(/[-\s]/g, '').toLowerCase();
            if (DataEnc._enctype.u8.indexOf(c) > -1) {
                c = 'UTF-8';
                if (this.bom)
                    b = c;
                this.enc = this.utf8;
            } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                c = 'UTF-16BE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16be;
            } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                c = 'UTF-16LE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16le;
            } else {
                if (c === 'copy')
                    c = '';
                this.enc = this.copy;
            }
        }
        if (c !== '')
            g.push('charset=' + c);
        if (this.base64)
            g.push('base64');
        this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
        return this;
    },
    // Deliver
    pay : function() {
        return this.lead + this.buf;
    },
    // UTF-16BE
    utf16be : function(t) { // U+0500 => %05%00
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            } else {
                buf.push('00');
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
            }
        }
        if (buf.length)
            this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-16LE
    utf16le : function(t) { // U+0500 => %00%05
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
            } else {
                buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                buf.push('00');
            }
        }
        if (buf.length)
            this.buf += '%' + buf.join('%');
        // Note the hex array is returned, not string with '%'
        // Might be useful if one want to loop over the data.
        return buf;
    },
    // UTF-8
    utf8 : function(t) {
        this.buf += encodeURIComponent(t);
        return this;
    },
    // Direct copy
    copy : function(t) {
        this.buf += t;
        return this;
    }
};
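
The UTF-16BE branch can be exercised in isolation. This standalone sketch (the function name is my own, purely for illustration) repeats the same bit operations as the class above: high byte first, low byte second, each zero-padded to two hex digits:

```javascript
// Percent-escape a string as UTF-16BE, one code unit at a time,
// using the same shifting/masking as DataEnc.prototype.utf16be.
function toUtf16bePercent(text) {
    var out = [], c, i;
    for (i = 0; i < text.length; ++i) {
        c = text.charCodeAt(i);
        out.push(('00' + (c >> 8).toString(16)).substr(-2));   // high byte
        out.push(('00' + (c & 0xff).toString(16)).substr(-2)); // low byte
    }
    return '%' + out.join('%');
}

console.log(toUtf16bePercent('\u0500')); // → %05%00
console.log(toUtf16bePercent('A,'));     // → %00%41%00%2c
```

As the second line shows, plain ASCII doubles in size under UTF-16, which is the transfer-size cost mentioned earlier.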

### Previous answer:




I do not have any setup to replicate yours, but if your case is the same as @jlarson's, then the resulting file should be correct.

This answer became somewhat long (fun topic, you say?), but it discusses various aspects of the question: what is (likely) happening, and how to actually check what is going on in various ways.

### TL;DR

The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force the application to read the file as UTF-8 by using its import function or other means.


PS: The UniSearcher is a nice tool to have available on this journey.

# The long way around

The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.

As an example, if we take jlarson's script, it takes the data array:

Code-point  Glyph      UTF-8
----------------------------
    U+0500    Ԁ        d4 80
    U+05E1    ס        d7 a1
    U+0E01    ก        e0 b8 81
    U+1054    ၔ        e1 81 94
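
These byte sequences can be double-checked in Node.js (a sketch of mine assuming `Buffer` is available; it is not part of the original script):

```javascript
// Print the UTF-8 bytes for each sample code point, matching the table above.
['\u0500', '\u05e1', '\u0e01', '\u1054'].forEach(function (ch) {
    var hex = Buffer.from(ch, 'utf8').toString('hex').match(/../g).join(' ');
    var cp = ('000' + ch.charCodeAt(0).toString(16).toUpperCase()).substr(-4);
    console.log('U+' + cp + '  ' + hex);
});
// → U+0500  d4 80
//   U+05E1  d7 a1
//   U+0E01  e0 b8 81
//   U+1054  e1 81 94
```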

# By the sample provided: —, â€, “

We can also have a look at the sample provided in the question. It is reasonable to assume that the text is represented in Excel / TextEdit by code-page 1252.

To quote Wikipedia on Windows-1252:

> Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as "ansinew".

## Retrieving the original bytes

To translate it back into its original form, we can look at the code page layout, from which we get:

Character:   <â>  <€>  <”>  <,>  < >  <â>  <€>  < >  <,>  < >  <â>  <€>  <œ>
U.Hex    :    e2 20ac 201d   2c   20   e2 20ac   9d   2c   20   e2 20ac  153
T.Hex    :    e2   80   94   2c   20   e2   80   9d*  2c   20   e2   80   9c
  • U is short for Unicode
  • T is short for Translated

For example:

â => Unicode 0xe2   => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80

Special cases like 0x9d do not have a corresponding code-point in CP-1252; these we simply copy directly.

Note: if one looks at the mangled string by copying the text to a file and doing a hex dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:

set fenc=utf-16
" or
set fenc=ucs-2

## Bytes to UTF-8

We then combine the result, the T.Hex line, into UTF-8. In UTF-8, multi-byte sequences start with a leading byte telling us how many subsequent bytes make up the glyph. For example, if a byte has the binary value 110x xxxx, we know that this byte and the next represent one code-point, a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set; as such, any byte matching 0xxx xxxx is a standalone, a total of one byte.
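
The leading-byte rule can be written out as a small helper (my own sketch, not a full UTF-8 validator):

```javascript
// Sequence length of a UTF-8 glyph, determined from its leading byte.
function utf8SeqLength(byte) {
    if ((byte & 0x80) === 0x00) return 1; // 0xxx xxxx: plain ASCII
    if ((byte & 0xe0) === 0xc0) return 2; // 110x xxxx: two bytes
    if ((byte & 0xf0) === 0xe0) return 3; // 1110 xxxx: three bytes
    if ((byte & 0xf8) === 0xf0) return 4; // 1111 0xxx: four bytes
    return 0;                             // continuation or invalid byte
}

console.log(utf8SeqLength(0xe2)); // → 3
console.log(utf8SeqLength(0x2c)); // → 1
```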

0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash)  —
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq)  “

Conclusion: the original UTF-8 string was:

—, ”, “
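
We can verify this conclusion in Node.js by decoding the T.Hex bytes as UTF-8 (again a sketch assuming `Buffer` is available):

```javascript
// The translated CP-1252 bytes, decoded as UTF-8, give back the original text.
var bytes = [0xe2, 0x80, 0x94, 0x2c, 0x20,   // em-dash, comma, space
             0xe2, 0x80, 0x9d, 0x2c, 0x20,   // right-dq, comma, space
             0xe2, 0x80, 0x9c];              // left-dq
var original = Buffer.from(bytes).toString('utf8');
console.log(original); // → —, ”, “
```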

## Mangling it back

We can also do the reverse. The original string as bytes:

UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c

Corresponding values in cp-1252:

e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...

and so on, result:

—, â€, “
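
This round trip can also be sketched in JavaScript with a hand-written partial CP-1252 table (my own mapping, covering only the bytes occurring in this sample; 0x9d is copied through unmapped, as described above):

```javascript
// Map each UTF-8 byte to the character CP-1252 displays for it.
var cp1252 = {
    0xe2: '\u00e2', // â
    0x80: '\u20ac', // €
    0x94: '\u201d', // ”
    0x9c: '\u0153', // œ
    0x9d: '\u009d', // no CP-1252 assignment, copied directly
    0x2c: ',',
    0x20: ' '
};
var utf8Bytes = [0xe2, 0x80, 0x94, 0x2c, 0x20,
                 0xe2, 0x80, 0x9d, 0x2c, 0x20,
                 0xe2, 0x80, 0x9c];
var mangled = utf8Bytes.map(function (b) { return cp1252[b]; }).join('');
console.log(mangled); // → —, â€, “  (the 0x9d is invisible)
```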

# Importing to MS Excel

Do not save the file with an extension recognized by the application, like .csv or .txt; omit it completely or make something up.

As an example, save the file as "testfile", with no extension. Then open the file in Excel, confirm that we actually want to open this file, and voilà, we get served with the encoding option. Select UTF-8, and the file should be read correctly.


### Check that Excel and the selected font actually support the glyphs

If support for the code points exists, the text should render fine.

# Why it works (or should)

As we can quickly verify, the escape sequences produced are equal to the ones in the hex dump above, and the same holds when testing a 4-byte code.
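
For example, in any browser console or Node.js (U+1F600 is my pick of a 4-byte code point, not from the original post):

```javascript
// encodeURIComponent percent-escapes the UTF-8 bytes of its input.
console.log(encodeURIComponent('\u0500'));       // → %D4%80       (2 bytes)
console.log(encodeURIComponent('\uD83D\uDE00')); // → %F0%9F%98%80 (4 bytes, U+1F600)
```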

# If this does not comply
