bits
Sep 1, 2019
6 minutes read

Understanding character encoding requires a firm grasp of bits and bytes. In this post I will try to make it as clear as possible how ASCII and UTF-8 work by doing the encoding by hand.

It can be thought of as a follow-up to the excellent What every programmer absolutely, positively needs to know about encodings and character sets to work with text.

Let’s play around with the letter “a”:

printf "a" > a.txt

In this case $LANG is set to UTF-8, so the bytes being written to the file will follow the rules of UTF-8. In other words, “a” will be encoded to bytes by following the UTF-8 standard. When we use cat, we will see the bytes again interpreted as UTF-8:

cat a.txt
a

So that is just a UTF-8 interpretation of the file. But which bytes does the file really contain? With xxd we can make a binary dump:

xxd -b a.txt
00000000: 01100001                                               a
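
The same bits can be cross-checked in Python. This is only a minimal sketch; the variable names are made up for illustration:

```python
# Encode the letter "a" using UTF-8 and look at the resulting byte(s).
encoded = "a".encode("utf-8")

# Format each byte as 8 binary digits, similar to what xxd -b prints.
bits = " ".join(format(byte, "08b") for byte in encoded)

print(len(encoded))  # number of bytes
print(bits)
```

It should report a single byte with the same bit pattern as the dump above.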

In UTF-8, “a” is 8 bits (1 byte). Let’s try another kind of dump - the hexadecimal dump - or hexdump:

xxd a.txt
00000000: 61                                       a

Now you are seeing 61 - which is the hexadecimal representation of 01100001.

You may not know what a hexdump is, or how to interpret hexadecimal numbers or count with them, but the simplest facts are:

  • Hex means 6 (think hexagon)
  • Decimal means 10 (think decilitre)

In the context of hexadecimal, decimal means we have the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. These are 10 symbols. Hexa tells us we have 6 other symbols - A, B, C, D, E and F. Hexadecimal is also called “base 16”. Hexadecimal is a very nice way of counting computer data. I will not explain how it works, because LiveOverflow and David J. Malan have great videos on it already.
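
For a quick hands-on check of the numbers used in this post, Python's built-in int(), hex() and format() can move between the bases. This is a sketch for experimenting, not a replacement for the videos; the variable names are arbitrary:

```python
# Parse the hex string "61" as a base-16 number.
value = int("61", 16)
print(value)                 # 97, because 6 * 16 + 1 = 97

# The same number written in other bases.
print(hex(value))            # 0x61
print(format(value, "08b"))  # 01100001

# One hex digit covers exactly four bits, so a byte is always two hex digits.
high, low = value >> 4, value & 0x0F
print(format(high, "04b"), format(low, "04b"))  # 0110 0001
```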

Let’s try to write a hexdump manually to create a file that will show an “a” if interpreted as UTF-8:

echo "0: 61" | xxd -r
a

Let’s feed our hand-written hexdump back through xxd to get an xxd-formatted hexdump:

echo "0: 61" | xxd -r | xxd
00000000: 61                                       a

We can convert it to binary like so:

echo "0: 61" | xxd -r | xxd -b
00000000: 01100001                                               a

Can you write your name using this method? You can use man ascii to figure out what hexadecimals you have to use. Hey, isn’t it fun being 4 years old again?

printf '41 6e 64 65 72 73' | xxd -revert -plain
Anders
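
If looking up every letter in man ascii gets tedious, Python can produce the hex digits for you. A small sketch, assuming Python 3.8 or newer for the separator argument of bytes.hex():

```python
name = "Anders"

# Encode to bytes, then show each byte as two hex digits separated by spaces.
hex_digits = name.encode("ascii").hex(" ")
print(hex_digits)
```

The printed digits can be pasted straight into the printf command above.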

Let’s try some slicing and dicing with offsets and the dd utility. First, make a new file:

printf "cat dog giraffe lion monkey bird" > animals.txt

How does it look?

xxd animals.txt
00000000: 6361 7420 646f 6720 6769 7261 6666 6520  cat dog giraffe
00000010: 6c69 6f6e 206d 6f6e 6b65 7920 6269 7264  lion monkey bird

How can we get lion from this? The “l” is the first byte of the second row, so it sits at offset 00000010. Let’s use dd with a block size of 1 byte and skip the first 16 bytes (10 in hexadecimal is 16 in decimal).

dd if=animals.txt bs=1 skip=16
lion monkey bird

Wow! With ASCII, each letter is 1 byte, so we need 4 bytes to catch the lion:

dd if=animals.txt bs=1 skip=16 count=4
lion

Unstoppable!

Figure 1: Lions are pretty cool.

We could also use xxd in a roundabout way:

xxd -seek 16 -len 4 -plain animals.txt | xxd -revert -plain
lion

If you don’t understand what I am doing here, you should remove parts of the pipeline to reveal the data.
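
If counting offsets in the hexdump by eye feels error-prone, the offset can also be computed. A sketch in Python, working on the same text held in memory rather than reading animals.txt:

```python
data = b"cat dog giraffe lion monkey bird"

# Find the byte offset where "lion" starts.
offset = data.find(b"lion")
print(offset, hex(offset))

# Slicing plays the role of dd's skip= and count=.
print(data[offset:offset + 4])
```

The offset it prints should match the 00000010 row in the xxd output.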

Now that you know how to write bits by hand, I recommend opening a file and activating hexl-mode in Emacs.

Fun fact: Some people use xxd to get a poor man’s hex editor inside Vim by dumping the whole buffer with %!xxd and reverting it with %!xxd -r.

How about Windows vs. Unix newlines? Those things are annoying. Could you convert them by hand, instead of using dos2unix and unix2dos? I’ll leave it up to you.
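
As a hint, here is one possible way to do it, sketched in Python on in-memory bytes. The sample text is made up. Windows ends a line with the two bytes 0d 0a (CR LF), Unix with a single 0a (LF):

```python
# Hypothetical sample with Windows line endings (CR LF = 0d 0a).
windows_text = b"cat\r\ndog\r\n"
print(windows_text.hex(" "))

# Unix uses a lone LF (0a), so drop the CR in front of every LF.
unix_text = windows_text.replace(b"\r\n", b"\n")
print(unix_text.hex(" "))

# Going back: every LF becomes CR LF. unix_text has no CR left,
# so nothing gets doubled.
back_to_windows = unix_text.replace(b"\n", b"\r\n")
print(back_to_windows == windows_text)
```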

Python bonus round

Writing your name:

print(bytes.fromhex('61 6e 64 65 72 73').decode("UTF-8"))
anders

With the \x escape sequence:

print(b"\x61\x6e\x64\x65\x72\x73".decode("UTF-8"))
anders

Catching the lion:

with open("animals.txt", "rb") as binary_file:
    binary_file.seek(16)
    lion = binary_file.read(4)
    print(lion.decode("UTF-8"))
lion

The Python bytes type can be created from ASCII characters or hex escape sequences, so all of these are the same:

a = b"\x61\x6e\x64\x65\x72\x73"
a2 = b"anders"
a3 = b"a\x6e\x64\x65\x72\x73"
print(a == a2 == a3)
True

The bytes type is immutable, so we need to use the bytearray class to modify sequences of bytes.

a = bytearray(b"anders")

A bytearray is a sequence of integers (0-255), so the bytearray above looks like this:

a[0] a[1] a[2] a[3] a[4] a[5]
97 110 100 101 114 115
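
Both points can be checked in a few lines: bytes rejects item assignment, and a bytearray really is a sequence of integers. A minimal sketch with made-up variable names:

```python
name = b"anders"

# Assigning to an index fails because bytes is immutable.
try:
    name[0] = 65
    assignment_worked = True
except TypeError:
    assignment_worked = False
print(assignment_worked)

# A bytearray made from the same data is a mutable sequence of integers.
editable = bytearray(name)
print(list(editable))
```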

To modify a single element in the bytearray we have to pass an integer. We can use ord() to convert “A” to its integer value:

a = bytearray(b"anders")
a[0] = ord(b"A")
print(a)
bytearray(b'Anders')

To go from decimal to hex:

print(hex(65))
0x41

When slicing, we get a bytearray back:

a = bytearray(b'ANDers')

print(a[0:3])
bytearray(b'AND')

So to replace a slice we assign bytes rather than integers:

a = bytearray(b"anders")
a[0:3] = b"AND"
print(a)
bytearray(b'ANDers')
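
Slice assignment does not have to preserve the length either. A small sketch showing a bytearray shrink and grow:

```python
a = bytearray(b"anders")

# Replace three bytes with one: the bytearray shrinks.
a[0:3] = b"X"
shrunk = bytes(a)
print(a)  # bytearray(b'Xers')

# Replace one byte with three: it grows again.
a[0:1] = b"AND"
print(a)  # bytearray(b'ANDers')
```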

How do we write these things to files?

with open("anders.txt", "wb") as f:
    # "and" is written as ASCII characters, the rest as hex escape sequences
    f.write(b"and\x65\x72\x73")

Let’s read the file again:

with open("anders.txt", "rb") as f:
    print(f.read())
b'anders'

When we open a file in binary mode (wb/rb), we read and write bytes (b"anders") rather than strings.

It’s possible to create a file-like object in memory by using io.BytesIO. You might want to do this when you have some library that wants to write binary data to a file. You could for example generate a plot with matplotlib and add it to a PDF.

To “emulate” the f variable from above, we could do this:

import io

f_in_memory = io.BytesIO()

f_in_memory.write(b"and\x65\x72\x73")
f_in_memory.seek(0)

print(f_in_memory.read())

f_in_memory.close()
b'anders'

If you don’t seek to the beginning (0) first, read() returns an empty bytes object (b'') here, because the position is still at the end of what was just written.
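
As a sketch of the "library that wants a file" case, the standard library's zipfile module can write an archive into a BytesIO instead of onto disk. The member name and contents are made up for illustration. Note that getvalue() returns the whole buffer regardless of the current position, so no seek is needed:

```python
import io
import zipfile

buffer = io.BytesIO()

# zipfile accepts any file-like object opened for binary writing.
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("anders.txt", b"and\x65\x72\x73")  # hypothetical member

# getvalue() ignores the current position and returns everything written.
zip_bytes = buffer.getvalue()
print(zip_bytes[:4])  # ZIP data starts with the bytes 50 4b 03 04

# Read the archive back from memory to confirm the round trip.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    content = archive.read("anders.txt")
print(content)
```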

Another cool alternative is tempfile.SpooledTemporaryFile, which keeps the data in an in-memory buffer until it grows past max_size and then moves it to a real temporary file on disk:

import tempfile

with tempfile.SpooledTemporaryFile(max_size=100, mode="w+t", encoding="utf-8") as temp:
    print("temp: {!r}".format(temp))
    for i in range(3):
        temp.write("This line is repeated over and over.\n")
        # _rolled and _file are private attributes, used here only to peek inside
        print(temp._rolled, temp._file)
temp: <tempfile.SpooledTemporaryFile object at 0x1065837f0>
False <_io.TextIOWrapper encoding='utf-8'>
False <_io.TextIOWrapper encoding='utf-8'>
True <_io.TextIOWrapper name=3 mode='w+t' encoding='utf-8'>
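
The example above peeks at private attributes. A sketch that sticks to the public API, here in binary mode, can force the move to disk with rollover() and then read everything back (the written bytes are made up):

```python
import tempfile

with tempfile.SpooledTemporaryFile(max_size=100, mode="w+b") as temp:
    temp.write(b"anders")
    # rollover() forces the switch to a real temporary file,
    # even though max_size has not been reached yet.
    temp.rollover()
    temp.write(b" was here")
    # Rewind before reading, just like with BytesIO.
    temp.seek(0)
    contents = temp.read()

print(contents)
```

The data written before and after the rollover comes back as one continuous byte string, so code using the file does not need to care where it is stored.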
