Octals 302 240 together seem to correspond to non-breaking space

Question

Looking at a particular line of a text file (say, the 1123rd; see below), there seems to be a non-breaking space, but I am not sure:

$ cat myfile.csv | sed -n 1123p | cut -f2
Lisztes feher

$ cat myfile.csv | sed -n 1123p | cut -f2 | od -An -c -b
   L   i   s   z   t   e   s 302 240   f   e   h   e   r  \n
 114 151 163 172 164 145 163 302 240 146 145 150 145 162 012

However, the ASCII code in octal indicates that a non-breaking space is 240. So what does the 302 correspond to? Is it something particular to this given file?

I am asking the question in order to understand. I already know how to use sed to fix my problem, following this answer:

$ cat myfile.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
   L   i   s   z   t   e   s       f   e   h   e   r  \n
 114 151 163 172 164 145 163 040 146 145 150 145 162 012
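
For readers whose sed does not understand `\xC2\xA0` (as far as I know, hex escapes in sed patterns are a GNU extension), here is a minimal sketch of the same substitution that builds the two bytes with POSIX `printf` octal escapes instead. The helper name `nbsp_to_space` is made up for this example and does not come from the original post.

```shell
# Hypothetical helper (not from the original post): replace every UTF-8
# no-break space (bytes 302 240 in octal) read on stdin with a plain space.
nbsp_to_space() {
    nbsp=$(printf '\302\240')   # octal escapes in the printf format are POSIX
    sed "s/$nbsp/ /g"
}

# Usage: feed it a line containing the two bytes and inspect the result.
printf 'Lisztes\302\240feher\n' | nbsp_to_space | od -An -c
```

On the real file this would presumably be used as `cut -f2 myfile.csv | nbsp_to_space`.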

For information, the original file is in the .xlsx (Excel) format. As my computer runs Xubuntu, I opened it with LibreOffice Calc (v5.1). Then, I saved it as "Text CSV" with "Character set = Unicode (UTF-8)" and tab as field separator:

$ file myfile.csv
myfile.csv: UTF-8 Unicode text

Did you really have to use octal? Hex is many times easier to work with. I haven't used octal since 1979.
– user207421, Commented Mar 26, 2016 at 1:41
@EJP I used octal because I initially had a hard time understanding the output of od -x, which is often recommended on the Internet but prints hexadecimal 2-byte units. After your comment, I re-read the od man page more thoroughly and realized that od -c -t x1 would indeed probably have been better.
– tflutre, Commented Mar 29, 2016 at 17:27
@user207421: have you ever looked at the UTF-8 multi-byte encodings? For a sizable portion of code points, simply concatenate the 2nd and 3rd octal digits of each byte, and that is usually the octal code of the underlying code point itself. It's far easier to read than a bunch of hex, especially for UTF-8. The 1st octal digit is also extremely informative: most 3s mark the leading byte of a multi-byte sequence, 2 is a continuation byte (0x80-0xBF), and either 1 or 0 is good ole ASCII, with all the letters, both upper case and lower case, in the 1xx zone.
– RARE Kpop Manifesto, Commented Jul 28, 2022 at 1:28
That's so much cleaner than 7F 80 BF C2 F5. I use hex for the code points, but octal for the bytes themselves, to make a clean distinction in my workflows.
– RARE Kpop Manifesto, Commented Jul 28, 2022 at 1:28
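
The octal shortcut from the last comments can be tried out with a small sketch like the one below. It only covers two-byte sequences (U+0080 to U+07FF), where the lead byte is 110xxxxx and the continuation byte is 10xxxxxx, so the last two octal digits of each byte are exactly the payload bits. The function name `octal_shortcut` is an assumption made for this example.

```shell
# Hypothetical helper: apply the "drop the first octal digit" shortcut to a
# two-byte UTF-8 sequence given as two three-digit octal strings.
# Only valid for two-byte sequences (U+0080..U+07FF).
octal_shortcut() {
    # ${1#?} strips the first character; the leading 0 marks an octal constant.
    code=$(( 0${1#?}${2#?} ))   # e.g. 302 240 -> 00240 (octal) -> 160
    printf 'U+%04X\n' "$code"
}

octal_shortcut 302 240   # the byte pair from the question; prints U+00A0
```

For three- and four-byte sequences the lead byte carries fewer payload bits, so the plain concatenation no longer gives the code point directly.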

Stéphane Chazelas · Accepted Answer · 2016-03-25 17:39:14Z

It's the UTF-8 encoding of the U+00A0 Unicode character:

$ unicode U+00A0
U+00A0 NO-BREAK SPACE
UTF-8: c2 a0 UTF-16BE: 00a0 Decimal: &#160; Octal: \0240
 
Category: Zs (Separator, Space)
Bidi: CS (Common Number Separator)
Decomposition: <noBreak> 0020

$ locale charmap
UTF-8
$ printf '\ua0' | od -to1
0000000 302 240
0000002

UTF-8 is an encoding of Unicode that uses a variable number of bytes per character. Unicode as a charset is a superset of iso8859-1 (aka latin1), which is itself a superset of ASCII.

In iso8859-1, the non-breaking space character (code point 0xa0, as in Unicode) is expressed as a single 0xa0 byte. In UTF-8, only code points 0 to 127 are expressed as one byte (which makes UTF-8 a superset of ASCII; in other words, ASCII files are also UTF-8 files).

Code points above 127 are encoded with two or more bytes per character. See Wikipedia for details of the UTF-8 encoding algorithm.
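
To make the two-byte case concrete, here is a rough sketch of the bit arithmetic in plain shell. It handles only code points U+0080 to U+07FF (which includes U+00A0) and does no range checking; the function name `utf8_two_bytes` is made up for illustration.

```shell
# Hypothetical helper: encode a code point in the range U+0080..U+07FF as
# two UTF-8 bytes and print them in octal, like od -b does.
utf8_two_bytes() {
    cp=$(( $1 ))                      # accepts decimal or 0x-prefixed hex
    lead=$(( 0xC0 | (cp >> 6) ))      # 110xxxxx: the top 5 payload bits
    cont=$(( 0x80 | (cp & 0x3F) ))    # 10xxxxxx: the low 6 payload bits
    printf '%03o %03o\n' "$lead" "$cont"
}

utf8_two_bytes 0xA0   # NO-BREAK SPACE; prints 302 240
```

So the 302 is simply the lead byte that says "two-byte sequence" and carries the high bits of the code point, and the 240 is the continuation byte carrying the low six bits.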

mla · Answer · 2021-04-26 08:50Z (edited by Archemar, 2021-04-27 14:55:10Z)

302 240 is what the Alt-Gr + space key combination produces.

On a French keyboard, when you want to type a space after a |, it is easy to type Alt-Gr + | followed by Alt-Gr + space when you meant Alt-Gr + | followed by a plain space, and then you get an error.
