Perl pack / unpack bits when the width is not a multiple of 8

Question

I have the following problem: I need to decode a Base64 string and then read some bit ranges as integers, such as:

first 6 bits (0 to 5) are the "version"
next 36 bits (6 to 41) are the "created epoch time"
etc

Fields are stored in big-endian format. Bits are numbered left to right.

After trying several combinations, I managed to build a sequence of octets and use bitwise operations to find what I need, as in the example below.

For the version part, it is easy: since I want the first 6 bits, I can shift right with >> 2. The next 36 bits, however, are laid out as [000000xx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xx000000] starting at the first byte, which is much harder to extract.

I'd like to know if there is a better way to do it. I managed to create an array of bits using unpack and to take a slice with @bits[start..end], but I don't know how to continue from there.

My specific problem is this: I can port the bitwise operations from another implementation that works on an array of octets/bytes, but I need to add EXTRA operations to get the correct answer, and there are a LOT of fields.

I have never had to work with pack/unpack before, and I want to avoid XS and do everything in pure Perl (today we have a version that binds the Go library, so I need to use CGO, and the build process is really complex).

For instance, I found a Python version that uses a module called bitarray, which simplifies the work a lot, but I did not find an equivalent in Perl.

My proof of concept:

use strict;
use warnings;
use feature 'say';

use MIME::Base64;

my $data = "COyiILmOyiILmADACHENAPCAAAAAAAAAAAAAE5QBgALgAqgD8AQACSwEygJyAAAAAA";

my @octets = unpack "C*", decode_base64($data);

my $version = $octets[0] >> 2;
say "version: $version";

my $deciseconds = unpack "Q>", pack "C*", (
    0x0, 
    0x0, 
    0x0, 
    (($octets[0] & 0x3) << 2 | $octets[1] >> 6) & 0xFF,
    ($octets[1]<<2 | $octets[2]>>6) & 0xFF,
    ($octets[2]<<2 | $octets[3]>>6) & 0xFF,
    ($octets[3]<<2 | $octets[4]>>6) & 0xFF,
    ($octets[4]<<2 | $octets[5]>>6) & 0xFF,
);
say "deciseconds: $deciseconds";

It prints, as expected:

version: 2
deciseconds: 15880192742

For comparison, this is the Go equivalent of the deciseconds decoding:

    var data []byte
    // decode base64 ...
    deciseconds := int64(binary.BigEndian.Uint64([]byte{
        0x0,
        0x0,
        0x0,
        (data[0]&0x3)<<2 | data[1]>>6,
        data[1]<<2 | data[2]>>6,
        data[2]<<2 | data[3]>>6,
        data[3]<<2 | data[4]>>6,
        data[4]<<2 | data[5]>>6,
    }))

Thanks

I think the Perl equivalent of that Python bitarray class (I didn't look too deeply at it) would be vec and the bitwise string operators, FWIW.
– Shawn, Commented Dec 1, 2023 at 14:54
Perhaps there is a way to use it, but I note that vec only supports powers of two as the number of bits, which drives me crazy.
– Tiago Peczenyj, Commented Dec 1, 2023 at 15:04
The vec function provides similar functionality to a Python bitarray.
– ikegami, Commented Dec 1, 2023 at 16:43

Shawn · Accepted Answer · 2023-12-01 16:07:39Z

One way using just pack and unpack and avoiding all the shifting is to use the B format to convert to and from a string of 0's and 1's, adding padding as needed:

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use MIME::Base64;

my $data = "COyiILmOyiILmADACHENAPCAAAAAAAAAAAAAE5QBgALgAqgD8AQACSwEygJyAAAAAA";

# Get the data as a string of 0's and 1's.
my ($bits) = unpack("B*", decode_base64($data));

# Add leading padding 0's to get 8 and 64 bit fields packed back into binary
# and then extract as numbers
my ($version, $deciseconds) =
  unpack("CQ>", pack("B8B64",
                     "00" . substr($bits, 0, 6),
                     ("0" x 28) . substr($bits, 6, 36)));


say "version: $version";
say "deciseconds: $deciseconds";
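
The padding arithmetic can be wrapped in a small helper so that each field needs only an offset and a width. This is a minimal sketch and not part of the answer above: the name bits_to_int is hypothetical, and it assumes a perl built with 64-bit integers and fields no wider than 64 bits.

```perl
use strict;
use warnings;
use feature 'say';
use MIME::Base64;

# Hypothetical helper: read $size bits starting at bit $offset of a
# string of 0's and 1's and return them as an unsigned integer.
# Assumes 64-bit integer support and $size <= 64.
sub bits_to_int {
    my ($bits, $offset, $size) = @_;
    die "field wider than 64 bits" if $size > 64;
    die "field runs past the end of the data"
        if $offset + $size > length $bits;
    my $field = substr($bits, $offset, $size);
    # Left-pad with 0's to exactly 64 bits, pack into 8 bytes, then
    # read those bytes back as one big-endian 64-bit number.
    return unpack "Q>", pack "B64", ("0" x (64 - $size)) . $field;
}

my $data = "COyiILmOyiILmADACHENAPCAAAAAAAAAAAAAE5QBgALgAqgD8AQACSwEygJyAAAAAA";
my ($bits) = unpack "B*", decode_base64($data);

say "version: ",     bits_to_int($bits, 0, 6);
say "deciseconds: ", bits_to_int($bits, 6, 36);
```

With a helper like this, a long list of fields reduces to a table of (offset, size) pairs, at the cost of building one bit string that is eight times the size of the decoded data.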

Amazing and simple. Of course, some wrappers can be written to calculate the right number of leading padding 0's ... I will try it.
FYI, I decided to follow this approach for now; the result is in the GDPR::IAB::TCFv2 distribution on CPAN.

ikegami · Accepted Answer · 2023-12-01 20:51:07Z

First, it helps to visualize the layout:

Octets:
  0   1   2   3   4   5   6   7   8
+---+---+---+---+---+---+---+---+---+
|V/D| D | D | D | D |D/E| E | E | E |
+---+---+---+---+---+---+---+---+---+

|---|                                 C
|-------------------------------|     Q>
|-------------------|---------------| x5 L>

Bits of octet 0:
+---+---+---+---+---+---+---+---+
| V | V | V | V | V | V | D | D |
+---+---+---+---+---+---+---+---+

Bits of octet 5:
+---+---+---+---+---+---+---+---+
| D | D | E | E | E | E | E | E |
+---+---+---+---+---+---+---+---+

You're doing way more operations than needed.

# $data must hold the decoded bytes (the result of decode_base64),
# not the Base64 text.
my @octets = unpack "C*", $data;

my $version = $octets[0] >> 2;

my $deciseconds =
   ( ( $octets[0] & 0x03 ) << ( 8*5-6 )
   | ( $octets[1]        ) << ( 8*4-6 )
   | ( $octets[2]        ) << ( 8*3-6 )
   | ( $octets[3]        ) << ( 8*2-6 )
   | ( $octets[4]        ) << ( 8*1-6 )
   | ( $octets[5]        ) << ( 8*0-6 )
   );

my $epoch =
   ( ( $octets[5] & 0x3F ) << ( 8*3-0 )
   | ( $octets[6]        ) << ( 8*2-0 )
   | ( $octets[7]        ) << ( 8*1-0 )
   | ( $octets[8]        ) << ( 8*0-0 )
   );

A right shift by some amount can be written as a left shift by the negative of that amount. I took advantage of this for consistency.
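
To illustrate that equivalence, here is a small sketch with an arbitrary example octet. As far as I know, Perl only defines shifts by a negative count from version 5.24 onward (older perls passed the count straight to C, where the result is undefined), so treat the second line as an assumption about your perl version.

```perl
use strict;
use warnings;
use feature 'say';

# An arbitrary example octet; only its top two bits are of interest.
my $octet = 0x8E;              # 0b10001110

my $right    = $octet >> 6;    # explicit right shift by 6
my $negative = $octet << -6;   # left shift by -6 (assumes Perl 5.24+)

say "right shift:         $right";
say "negative left shift: $negative";
```

On older perls, writing the last term with an explicit >> 6 avoids relying on this behavior.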

That's actually a more general solution than needed here.

For each value:

- Unpack a block that includes the value of interest.
  - We can use unpack "C" on a substring of the packed data for values that are found entirely within 1 byte.
  - We can use unpack "L>" on a substring of the packed data for values that are spread across no more than 4 bytes.
  - We can use unpack "Q>" on a substring of the packed data for values that are spread across no more than 8 bytes.
  - The x unpack format can be used in lieu of substr.
- Correct the value with a shift and a mask.
  - The shift can be calculated from counting the trailing bits, or by subtracting the leading bits and the relevant bits from the total bits.
  - The mask is equal to ( 1 << $size ) - 1.
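
The steps above can also be folded into one generic routine. This is only a sketch under stated assumptions: the name extract_bits is hypothetical, it always reads an 8-byte window with Q> instead of choosing between C, L> and Q>, and it assumes a perl with 64-bit integers and a field that fits in that window.

```perl
use strict;
use warnings;
use feature 'say';
use MIME::Base64;

# Hypothetical helper: return $size bits starting at bit $offset
# (bit 0 is the most significant bit of the first byte).
# Assumes 64-bit integers and ($offset % 8) + $size <= 64.
sub extract_bits {
    my ($data, $offset, $size) = @_;
    my $lead = $offset % 8;    # bits to skip inside the first byte
    die "field does not fit in 8 bytes" if $lead + $size > 64;

    # Unpack a block that includes the value: up to 8 bytes starting
    # at the byte holding the first bit, zero-padded at the end.
    my $block = substr($data, int($offset / 8), 8);
    $block .= "\0" x (8 - length $block);
    my ($n) = unpack "Q>", $block;

    # Correct the value: drop the trailing bits, then mask.
    my $shift = 64 - $lead - $size;
    my $mask  = $size == 64 ? ~0 : (1 << $size) - 1;
    return ($n >> $shift) & $mask;
}

my $raw = decode_base64(
    "COyiILmOyiILmADACHENAPCAAAAAAAAAAAAAE5QBgALgAqgD8AQACSwEygJyAAAAAA");

say "version: ",     extract_bits($raw, 0, 6);
say "deciseconds: ", extract_bits($raw, 6, 36);
say "epoch: ",       extract_bits($raw, 42, 30);  # 30-bit field as drawn above
```

Reading a fixed 8-byte window costs a little more per field than picking the smallest format, but it keeps the shift calculation identical for every field.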

So,

The version value is found entirely within 1 byte starting at offset 0, so we can use C. We shift by 2, and the mask is 0x3F but unneeded.
The deciseconds value is spread across no more than 8 bytes starting at offset 0, so we can use Q>. We shift by 6+8+8 = 64-(6+36) = 22. We mask by ( 1 << 36 ) - 1 = 0xF_FFFF_FFFF.
The epoch value is spread across no more than 4 bytes starting at offset 5, so we can use x5 L>. We shift by 0 = (8*5+32)-(6+36+30) = 0. We mask by ( 1 << 30 ) - 1 = 0x3FFF_FFFF.

# As before, $data holds the decoded bytes, not the Base64 text.
my ( $version ) = unpack "C", $data;
$version >>= 2;

my ( $deciseconds ) = unpack "Q>", $data;
$deciseconds = ( $deciseconds >> 22 ) & 0xF_FFFF_FFFF;

my ( $epoch ) = unpack "x5 L>", $data;
$epoch &= 0x3FFF_FFFF;

Simplified my own version and provided a much shorter version.
I did not understand the >> 22 in the second version. I am also thinking about how I can add another 30-bit “epoch” field in sequence…
You have the 6 from octet 5, the 8 from octet 6, and the 8 from octet 7. /// I don't know what you mean by the second sentence.
