Getting rid of binary strings, part 1

Every programmer is used to manipulating data at the byte level, but most rarely, if ever, have to directly manipulate bits. When such a situation arises, the “obvious” way to do it, especially for beginners, is to use binary strings, that is, strings of ones and zeroes. The manipulations are done using the usual string concatenation, substr and other basic string manipulation tools. The string is then finally parsed 8 characters at a time into bytes (using the language’s equivalent of parseInt()). In many languages, including Java and Javascript, strings are immutable, so every concatenation copies the whole string, every time.

Any half-decent programmer would feel “dirty” writing such code, but my feeling is that most would do it anyway, justifying themselves with something like “it’s fast enough”, “it’s more readable” or “I’ll optimize it later if it’s too slow, but that’ll do for now”. Until relatively recently, I was one of them; hopefully, after reading this, you will know better than to use binary strings. I’ll be using Javascript, but this article is language-agnostic.

Partial bytes are often used in compression and stream algorithms.

Appending bits
By far the most common operation performed on binary strings is appending more binary data, such as:

var arr = [],
    bin = "";

bin += "101110";
bin += "0101";
bin += "0000";
bin += "1100110011";

//Pad the last byte if needed
if (bin.length % 8 !== 0){
    bin += Array(8 - bin.length%8 +1).join("0");
}

//Parse the string
for (var i=0, len=Math.ceil(bin.length/8);i<len;i++){
    arr.push(parseInt(bin.substring(i*8, (i+1)*8), 2));
}

console.log(arr); //Prints [185, 67, 51]


 
Let’s replicate that example using bit manipulations. I’ll need some kind of buffer to hold incomplete bytes. A long would be ideal in most languages, being 64 bits wide, but I’ll be using Javascript’s 32-bit integers (Javascript’s bitwise operators work on 32-bit values).

var arr = [], //List to store the full bytes
    buffer = 0; //Buffer to store incomplete bytes

//Make space for the new bits
buffer = buffer << 6;
//Add them
buffer = buffer | 46; //101110 = 46

//0101 = 5
buffer = (buffer << 4) | 5;
//There's now 10 bits in the buffer
//Put the 8 left-most bits in the list
arr.push(buffer >>> 2);
//Only keep the last 2 bits in the buffer
//This builds a mask to drop everything except the last 2 bits
//    1011100101
//    BINARY AND
//    0000000011
// => 0000000001
buffer &= (1 << 2) - 1;

//0000 = 0
buffer = (buffer << 4) | 0;

//1100110011 = 819
buffer = (buffer << 10) | 819;
//There's now 16 bits in the buffer
//Load the left-most 8 bits
arr.push(buffer >>> 8);
//Remove them
buffer &= (1 << 8) - 1;
//Load the last 8
arr.push(buffer);

console.log(arr);
//Prints [185, 67, 51]

Here’s how the buffer int and the arr list look throughout the example above:

[]
00000000 00000000 00000000 00000000

[]
00000000 00000000 00000000 00101110

[]
00000000 00000000 00000010 11100101

[10111001]
00000000 00000000 00000000 00000001

[10111001]
00000000 00000000 00000000 00010000

[10111001]
00000000 00000000 01000011 00110011

[10111001, 01000011, 00110011]
=> [185, 67, 51]

As soon as the buffer contains more than 7 bits, I empty as many full bytes as possible from it, starting from the left. An obvious limitation of this method is that I can’t append more than 25 bits at a time (32 - 7, or 57 with 64-bit integers). However, that is more than enough for the vast majority of applications, and the function can be called multiple times anyway.

Let’s abstract that algorithm into a function.

var arr = [],
    buffer = 0,
    nbBits = 0;
function appendBits(val, nb){
    buffer = (buffer << nb) | val;
    nbBits += nb;
    while (nbBits >= 8){
        arr.push(buffer >>> (nbBits - 8));
        buffer &= (1 << (nbBits - 8)) - 1;
        nbBits -= 8;
    }
}
appendBits(46, 6); //101110
appendBits(5, 4); //0101
appendBits(0, 4); //0000
appendBits(819, 10); //1100110011

console.log(arr); //[185, 67, 51]
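One detail worth noting: the string version padded the final partial byte with zeroes before parsing, but appendBits as written leaves any leftover bits sitting in the buffer. Here’s a minimal sketch of a flush helper that does the same right-padding (flushBits is a name I’m introducing for illustration, not part of the original code):

```javascript
var arr = [],
    buffer = 0,
    nbBits = 0;
function appendBits(val, nb){
    buffer = (buffer << nb) | val;
    nbBits += nb;
    while (nbBits >= 8){
        arr.push(buffer >>> (nbBits - 8));
        buffer &= (1 << (nbBits - 8)) - 1;
        nbBits -= 8;
    }
}
//Pad the leftover bits with zeroes on the right and
//emit them as one final byte, like the string version did
function flushBits(){
    if (nbBits > 0){
        arr.push(buffer << (8 - nbBits));
        buffer = 0;
        nbBits = 0;
    }
}

appendBits(46, 6); //101110
appendBits(5, 4);  //0101 -> 185 is pushed, 2 bits (01) remain buffered
flushBits();       //01 is padded to 01000000 = 64

console.log(arr); //[185, 64]
```

Call flushBits() once, when you’re done appending, so the trailing bits aren’t silently dropped.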

 
How much faster is it?

for (var i=0;i<1000000;i++){
    bin += "101110";
    bin += "0101";
    bin += "0000";
    bin += "1100110011";
}
for (var i=0, len=Math.ceil(bin.length/8);i<len;i++){
    arr.push(parseInt(bin.substring(i*8, (i+1)*8), 2));
}

VS

for (var i=0;i<1000000;i++){
    appendBits(46, 6); //101110
    appendBits(5, 4); //0101
    appendBits(0, 4); //0000
    appendBits(819, 10); //1100110011
}

On my i7-3770K with Node v0.10.4, the binary-string version takes 600 ms to concatenate and parse 4 million short binary strings (3,000,000 bytes in total). The same test takes 135 ms with the function. The binary-string version runs out of memory if I try 40 million concatenations; the function simply takes 10 times longer, as expected.
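As for the 25-bit limit mentioned earlier, it’s easy to work around by splitting wide values before feeding them to appendBits. A sketch, assuming values fit in 32 bits (appendBitsWide is a hypothetical name, not part of the original code):

```javascript
var arr = [],
    buffer = 0,
    nbBits = 0;
function appendBits(val, nb){
    buffer = (buffer << nb) | val;
    nbBits += nb;
    while (nbBits >= 8){
        arr.push(buffer >>> (nbBits - 8));
        buffer &= (1 << (nbBits - 8)) - 1;
        nbBits -= 8;
    }
}
//Append up to 32 bits by splitting into 16-bit chunks,
//most significant chunk first. Safe, because the buffer
//never holds more than 7 bits between calls: 7 + 16 < 25.
function appendBitsWide(val, nb){
    while (nb > 16){
        nb -= 16;
        appendBits((val >>> nb) & 0xFFFF, 16);
    }
    appendBits(val & ((1 << nb) - 1), nb);
}

appendBitsWide(0xDEADBEEF, 32);
console.log(arr); //[222, 173, 190, 239]
```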

In part 2, I’ll be prepending bits; it’s quite a bit (pun intended) trickier!

4 thoughts on “Getting rid of binary strings, part 1”

  1. Hi, I’m new to programming.. can you explain the situations where these “binary strings” are useful?

    I don’t really see where all this can be useful!

    • Beginners often prefer binary strings because they can use the string manipulation tools like substring, split, concatenation, etc. that they are used to. It’s better to use your language’s specific binary manipulation tools like bitfields if they exist. Otherwise, learn the binary manipulations and do it the proper way, never use binary strings.

      In case you are asking why it could be useful to manipulate raw bits like that, it’s because several compression and hashing algorithms require working on raw ones and zeroes.

    • Endianness applies to multibyte data. Those lists only contain single bytes (0-255) so it’s not a problem. If you’re using Node.JS, you can then load them into a buffer (var buf = new Buffer(arr);) and write the data to the disk.
