datatoinfinity
ASCII in NLP - NLP

ASCII (American Standard Code for Information Interchange) is a character encoding standard that represents 128 characters using 7 bits. These 128 characters include uppercase and lowercase letters, numbers, punctuation marks, and control characters.

While this is the technical definition (source: Google), let’s understand why ASCII is important in Natural Language Processing (NLP).

The Problem

Think of two situations:

  • Converting a number to binary
  • Converting text to binary

Converting numbers to binary is pretty straightforward:

5 in binary = 101
100 in binary = 1100100
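
A minimal sketch of the same two conversions in Python, using the built-in format() and bin() functions. The variable names here are only for illustration.

```python
# Convert integers to binary strings with Python built-ins.
numbers = [5, 100]  # the two examples from the text

# format(n, "b") returns the binary digits without any prefix
binary_strings = [format(n, "b") for n in numbers]

for n, b in zip(numbers, binary_strings):
    print(f"{n} in binary = {b}")

# bin() returns the same digits, but with a "0b" prefix
prefixed = bin(5)
print(prefixed)
```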

But converting text to binary involves an extra step:

  • First, convert each character to a number (using encoding like ASCII)
  • Then, convert that number to binary
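
The two steps above can be sketched as follows. The helper name text_to_binary is hypothetical, and padding each value to 8 bits is an assumption made here for readability (ASCII itself only needs 7 bits).

```python
def text_to_binary(text):
    """Return one 8-bit binary string per character (assumes ASCII text)."""
    # Step 1: convert each character to a number
    codes = [ord(ch) for ch in text]
    # Step 2: convert each number to binary, zero-padded to 8 bits
    return [format(code, "08b") for code in codes]


bits = text_to_binary("Hi")
print(bits)  # ['01001000', '01101001']
```

Here 'H' maps to 72 and 'i' maps to 105 before they are turned into binary.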

This is exactly what happens in NLP and programming. Instead of assigning numbers to characters ourselves, we use ASCII, which is a standardized encoding for characters.

Python Code Example:

print(ord('A'))  # uppercase letter
print(ord('a'))  # lowercase letter
print(ord('1'))  # digit character
print(ord(' '))  # space
print(chr(65))   # number back to character

Output:
65
97
49
32
A

Here:

ord() gives the number for a character (its Unicode code point, which is the same as the ASCII value for ASCII characters)
chr() gives the character for a given number
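
One detail worth checking for yourself: in Python 3, ord() works on any Unicode character, and only codes below 128 are ASCII. A small sketch, where the sample characters are chosen only for illustration and str.isascii() needs Python 3.7 or later:

```python
samples = ["A", "é", "€"]  # illustrative characters, not from the original post

# Map each character to (code, is_ascii)
report = {ch: (ord(ch), ch.isascii()) for ch in samples}

for ch, (code, is_ascii) in report.items():
    print(ch, code, is_ascii)
```

Only "A" falls inside the 128-character ASCII range; the other two have larger codes.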

Now try to explain what's happening here:

name = "John"
ascii_values = [ord(char) for char in name]
print(ascii_values)
Output:
[74, 111, 104, 110]
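
Going the other way is a useful check. This short sketch rebuilds the string from the numbers above with chr(); the variable name decoded is just illustrative.

```python
ascii_values = [74, 111, 104, 110]

# Turn each number back into its character and join them into one string
decoded = "".join(chr(value) for value in ascii_values)
print(decoded)  # John
```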