[OpenSSL] Decoding Base 64 with OpenSSL

Base 64 is an encoding that converts each group of 24 bits into four ASCII characters. The encoding is described in detail on the Base 64 (Wikipedia) page, and the information there should be enough for you to write your own base 64 decoder. Alternatively, we can use OpenSSL to do it for us.

The OpenSSL documentation on BIO_f_base64 provides an example of how to use the library to perform the decoding. The example on that page:

 
    BIO *bio, *b64, *bio_out;  
    char inbuf[512];  
    int inlen;  

    b64 = BIO_new(BIO_f_base64());  
    bio = BIO_new_fp(stdin, BIO_NOCLOSE);  
    bio_out = BIO_new_fp(stdout, BIO_NOCLOSE);  
    bio = BIO_push(b64, bio);  

    while((inlen = BIO_read(bio, inbuf, 512)) > 0)         
        BIO_write(bio_out, inbuf, inlen);

In the example, three BIOs are created: one for decoding, one for reading the encoded data from standard input and one for writing to standard output. The call to BIO_push creates a BIO chain (it appends the input BIO to the base 64 BIO), and then we loop until all of the decoded data has been read. In each iteration, the decoded data is first read into the character array inbuf and then written to standard output through the BIO bio_out. The example writes each decoded chunk to standard output, but it is likely that you would want to store the decoded data in memory for some sort of processing. The most obvious way of doing this is to allocate memory to hold the entire decoded data and append each decoded chunk during each iteration of the loop.

If you are using C++, you can simply hold the decoded data in a std::string and append to it on each pass through the loop. If you are using just C, one option is to allocate some initial memory and then use realloc to grow it on each iteration so that it can hold more and more data. But this seems rather inefficient: the allocation has to be resized repeatedly, and realloc may also have to move the existing data to a new location. We could avoid resizing altogether by allocating the memory before reading back the decoded data. Another inefficient method is to run the loop twice, once to determine the length of the decoded data and once more to store it. It is more efficient to allocate all of the memory up front and then run through the loop once, but how do you know the length of the decoded data?
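Before working out the length, here is a minimal sketch of the realloc approach described above, in case the length cannot be known in advance. The byte_buf type and both function names are hypothetical, not part of OpenSSL. As one possible way to reduce the resizing cost, the sketch doubles the capacity instead of growing by one chunk at a time, so the number of realloc calls grows only logarithmically with the amount of data.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: a byte buffer that grows as chunks are appended. */
typedef struct {
    unsigned char *data;  /* heap storage, NULL until the first append */
    size_t len;           /* number of bytes in use */
    size_t cap;           /* number of bytes allocated */
} byte_buf;

/* Append n bytes from chunk to the buffer.
 * Returns 0 on success, or -1 if memory could not be allocated, in which
 * case the buffer is left unchanged. */
int byte_buf_append(byte_buf *b, const void *chunk, size_t n)
{
    if (n == 0)
        return 0;

    if (n > b->cap - b->len) {
        /* Start at 512 bytes, then double until the chunk fits. */
        size_t new_cap = b->cap ? b->cap : 512;
        while (new_cap - b->len < n) {
            if (new_cap > (size_t)-1 / 2)
                return -1;  /* doubling would overflow size_t */
            new_cap *= 2;
        }

        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }

    memcpy(b->data + b->len, chunk, n);
    b->len += n;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void byte_buf_free(byte_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

Inside the BIO_read loop from the OpenSSL example, a zero-initialised byte_buf (for instance `byte_buf out = {0};`) could be filled with `byte_buf_append(&out, inbuf, (size_t)inlen)` in place of the BIO_write call.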

To determine exactly how long the decoded data should be, you have to go back to how base 64 encoding works: a group of 24 bits is mapped to a group of four ASCII characters. Note that valid, padded base 64 encoded data will always have a length that is divisible by four (not counting any line breaks). If the length of the base 64 encoded data is not divisible by four, there is something wrong with the encoded data. Given encoded data with x ASCII characters, the number of bits, y, in the decoded data is found as follows:

        24 bits = 4 ASCII characters 
        y bits = x ASCII characters 
        y = (24 * x) / 4 = (6 * x) bits

Next, we have to account for padding. Padding is present when the last group of bits in the data that was encoded did not fully occupy 24 bits. The base 64 standard states that there are three possibilities for padding: there will be zero, one or two padding characters (see the bottom of page 5 in RFC 4648). Also, note that the standard uses the equal sign ("=") as the padding character. For each instance of this padding character at the end of the encoded data, subtract eight bits from y to get the final expected decoded length. For example, the eight characters "WU9ZTyE=" give 6 * 8 = 48 bits, less 8 bits for the single padding character, which leaves 40 bits, or five bytes.

Now that the calculation has been explained, we can turn this into code:

     #include <stdio.h>
     #include <string.h>

     #define PADDING_CHAR '='

     int main(void)
     {
         const char *encoded = "WU9ZTyEA";
         size_t result = 0;
         size_t padding = 0;
         size_t strLength = strlen(encoded);

         // Check that the string is not empty and that the length is a multiple of four.
         if ((strLength > 0) && ((strLength % 4) == 0))
         {
             // First, we check if the last character is padding.
             if (encoded[strLength - 1] == PADDING_CHAR)
             {
                 padding++;

                 // The second last character could also be padding!
                 if (encoded[strLength - 2] == PADDING_CHAR)
                 {
                     padding++;
                 }
             }

             // Now that we know the amount of padding, we can calculate the expected
             // length in bytes. If groups of 24 bits (3 bytes) get encoded into
             // 32 bits (4 characters) ...
             result = (3 * strLength) / 4;

             // Accounting for the padding: each padding character removes one byte.
             result = result - padding;

             printf("Expected decoded length: %zu bytes\n", result);
         }
         else
         {
             printf("Either there is no data to decode or its length is incorrect.\n");
         }

         return 0;
     }

Note that one important assumption was made in this code: one char occupies 8 bits!
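If you would rather have the compiler check that assumption than rely on it silently, one option is a preprocessor guard on CHAR_BIT, the standard macro from <limits.h> that gives the number of bits in a char. This is only a sketch of the idea, and the error message is my own wording.

```c
#include <limits.h>

/* Stop the build on any platform where a char is not exactly 8 bits wide,
 * since the length calculation treats one char as one 8-bit byte. */
#if CHAR_BIT != 8
#error "The base 64 length calculation assumes 8-bit chars"
#endif
```

On the vast majority of current platforms the guard does nothing, but it documents the assumption where the code depends on it.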

There is also one thing that I have come across when using OpenSSL to perform base 64 decoding. If the encoded data is more than 64 characters long, I have had to insert a newline character after every 64 characters. This could be because OpenSSL mainly deals with the PEM format, which uses base 64 encoding with line lengths of 64 characters (see RFC 4648, section 3.1). I have also had to make sure that the last line of the encoded data ends with a newline character.
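The newline handling can be done with a small helper before the data is handed to OpenSSL. The following is a minimal sketch; wrap_base64 and LINE_LENGTH are hypothetical names, and the helper assumes the input is a single unbroken line of base 64 text with no newlines of its own.

```c
#include <stdlib.h>
#include <string.h>

#define LINE_LENGTH 64

/* Hypothetical helper: returns a newly allocated copy of encoded with a
 * newline after every 64 characters and after the last character.
 * The caller must free the result. Returns NULL if allocation fails. */
char *wrap_base64(const char *encoded)
{
    size_t len = strlen(encoded);
    /* One newline per full or partial line, plus the terminating NUL. */
    size_t lines = (len + LINE_LENGTH - 1) / LINE_LENGTH;
    char *out = malloc(len + lines + 1);
    size_t i, j = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        out[j++] = encoded[i];
        /* Break after every 64th character and after the final one. */
        if ((i + 1) % LINE_LENGTH == 0 || i + 1 == len)
            out[j++] = '\n';
    }
    out[j] = '\0';
    return out;
}
```

Any length calculation should be done on the original string rather than the wrapped one, since the added newlines are not part of the encoded data. The BIO_f_base64 documentation also describes a BIO_FLAGS_BASE64_NO_NL flag, set with BIO_set_flags, for data that is all on one line; I have not measured it against the wrapping approach here, so treat it as something to try rather than a guaranteed fix.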