Looking at a particular line of a text file (the 1123rd, see below), there seems to be a non-breaking space, but I am not sure:
$ cat myfile.csv | sed -n 1123p | cut -f2
Lisztes feher
$ cat myfile.csv | sed -n 1123p | cut -f2 | od -An -c -b
L i s z t e s 302 240 f e h e r \n
114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
However, the character code tables I have seen list the non-breaking space as octal 240. So what does the 302 correspond to? Is it something particular to this file?
I am asking the question in order to understand. I already know how to use sed to fix my problem, following this answer:
$ cat myfile.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
L i s z t e s f e h e r \n
114 151 163 172 164 145 163 040 146 145 150 145 162 012
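The `\xC2\xA0` escape in the sed command above is, as far as I know, a GNU sed extension. A minimal sketch of a more portable variant follows, assuming a POSIX shell whose printf accepts octal escapes; the function names replace_nbsp and count_nbsp_lines are hypothetical helpers introduced here for illustration, not part of any tool.

```shell
# Hypothetical helper: replace every UTF-8 no-break space (bytes 302 240 in
# octal) on stdin with a plain space. The two bytes are generated by printf,
# so sed itself does not need to understand \x escapes.
replace_nbsp() {
    nbsp=$(printf '\302\240')
    sed "s/$nbsp/ /g"
}

# Hypothetical helper: count the lines on stdin that contain a no-break space.
# grep -c prints the count even when it is 0 (its exit status is then 1).
count_nbsp_lines() {
    nbsp=$(printf '\302\240')
    grep -c "$nbsp"
}
```

For example, `cut -f2 myfile.csv | count_nbsp_lines` would report how many values in the second column are affected, and `replace_nbsp < myfile.csv > fixed.csv` would write a cleaned copy (fixed.csv is just a placeholder name).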
For information, the original file is in the .xlsx (Excel) format. As my computer runs Xubuntu, I opened it with LibreOffice Calc (v5.1). Then, I saved it as "Text CSV" with "Character set = Unicode (UTF-8)" and tab as field separator:
$ file myfile.csv
myfile.csv: UTF-8 Unicode text
od -x, which is often advised on the Internet but works in hexadecimal 2-byte units. But after your comment, re-reading the man page of od more thoroughly, I realized that using od -c -t x1 was indeed probably better.

UTF-8 multi-byte encodings? For a sizable portion of code points, simply concatenate the 2nd and 3rd octal digits of each byte, and that is usually the octal code of the underlying code point itself. Here, 302 240 gives 02 and 40, i.e. octal 240. It is far easier to read than a bunch of hex, especially for UTF-8. The 1st octal digit is also extremely informative: a 3 mostly marks the leading byte of a multi-byte sequence, a 2 is a continuation byte (0x80-0xBF), and a 1 or 0 is good old ASCII, with all the letters, both upper and lower case, in the 1xx zone.
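The octal-digit trick from the comment above can be checked with a small sketch. It assumes a POSIX shell with arithmetic expansion and handles only 2-byte UTF-8 sequences (lead byte 110xxxxx, continuation byte 10yyyyyy); the function name utf8_2byte_codepoint is a hypothetical helper, not an existing command.

```shell
# Dump the two bytes of a UTF-8 no-break space as one-byte octal units and
# as one-byte hex units (the od -t x1 form mentioned in the comment).
printf '\302\240' | od -An -t o1
printf '\302\240' | od -An -t x1

# Hypothetical helper: decode a 2-byte UTF-8 sequence given as two octal
# byte values, e.g. "302 240". The lead byte carries 5 payload bits and the
# continuation byte carries 6, which is why the 2nd and 3rd octal digits of
# each byte can simply be concatenated.
utf8_2byte_codepoint() {
    lead=$((0$1))   # a leading 0 makes the shell read the value as octal
    cont=$((0$2))
    cp=$(( ((lead & 0x1F) << 6) | (cont & 0x3F) ))
    printf 'U+%04X (octal %o)\n' "$cp" "$cp"
}

utf8_2byte_codepoint 302 240
```

The last line should report U+00A0, i.e. octal 240, the code point of the non-breaking space. For 3-byte sequences the lead byte has a marker bit inside its 2nd octal digit, so the plain concatenation no longer holds without adjustment.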