Technical Information

 


Methodology

Traditionally, compression algorithms have used a dictionary approach: the data is scanned to identify repeated byte sequences, a dictionary is built that assigns short codes to the most frequently repeated sequences, and compression follows. The more repetition a file contains, the more compression it can achieve. The decoding information is then stored in either the header (beginning of the file) or the trailer (end of the file), allowing the data to be decompressed at a later date. Another approach is based on character frequencies, as in Huffman coding, where more frequently used characters are encoded with fewer bits and vice versa, as opposed to ASCII coding, in which all characters are represented by the same number of bits.
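For background only, here is a minimal Python sketch of frequency-based coding in the Huffman style (an illustration of the traditional approach just described, not of the technology presented below); it shows how frequent characters receive shorter codes than the fixed 8 bits per character of ASCII:

import heapq
from collections import Counter

def huffman_code_lengths(text):
    # Build a Huffman tree bottom-up; heap entries are (weight, tie_breaker, {char: depth}).
    freqs = Counter(text)
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in (a | b).items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]                       # final depths are the code lengths

text = "aaaaabbbc"
lengths = huffman_code_lengths(text)        # e.g. {'a': 1, 'b': 2, 'c': 2}
print(sum(lengths[ch] for ch in text))      # 13 bits with Huffman coding
print(8 * len(text))                        # 72 bits at a fixed 8 bits per character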

What we have succeeded in doing is looking at data in a completely new format.

Signals are created using the binary data already available within the file, allowing each couplet (2 bits) to be separated into two subcategories, namely the control section and the subordinate section, and minimizing the overhead needed to encode the data. Once the data is split, the signals we have created can represent a variety of data that shows us what the full couplets should be. The data is classified into five modes, each characterized by its particular bit pattern, and different signals are created to indicate which mode a particular bit string in the subordinate section belongs to. A file compressed in this way can be fully recreated when it is decompressed, achieving lossless compression (the ability to completely restore a file without any loss of quality). The compression can also be applied multiple times to the same file, allowing even greater compression: recursive compression!
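As a purely illustrative Python sketch of the couplet split described here (viewing the data as a bit string; the detailed base-0 worked example appears later in this document):

data = "000100"         # three couplets: 00 01 00
msbs = data[0::2]       # the MSBs, raw material for the control section: "000"
lsbs = data[1::2]       # the LSBs, raw material for the subordinate section: "010"
print(msbs, lsbs)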

There are several ways to implement the technology. For instance, a piece of data can be processed from both ends: say, the control section from the left and the subordinate section from the right, making it unnecessary to create a dividing barrier between the two sections. Alternatively, a signal can be used as the dividing barrier, and the data can be processed in one direction, say from left to right, alternating between the two sections: a signal is read in the control section, then the Least Significant Bits (LSBs; refer to Achieving Recursive Compression of Random Data below for details) governed by that signal are read in the subordinate section. Furthermore, two files can be created so that the control section and the subordinate section are processed separately and merged at the decoding end. With skillful buffering, the last two methods can achieve real-time transmission. Theoretically, even here, the data can go through multiple passes by placing corresponding filters at both the transmitting and the receiving ends.
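For illustration only, here is a minimal Python sketch of the first, two-ended layout, under the assumption that the decoder already knows the number of couplets (for example from a length field); pack and unpack are hypothetical names, not part of the product:

def pack(control, subordinate):
    # control bits fill the output from the left, subordinate bits from the right
    return control + subordinate[::-1]

def unpack(packed, n_couplets):
    # in the base-0 split shown later, the subordinate section holds one bit per couplet
    subordinate = packed[len(packed) - n_couplets:][::-1]
    control = packed[:len(packed) - n_couplets]
    return control, subordinate

print(unpack(pack("0", "000000000001"), 12))   # prints: ('0', '000000000001')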


Mathematical Theory supporting the Technology


The apparent impossibility of compressing random data:

If a piece of random data is divided into sections of equal length, say one byte each, and each section is to be compressed to 7 bits, then the compressed sections can only represent half the values of the uncompressed sections. This seems the only sensible way to compress random data because, in decoding, each group of seven bits could be decoded back to a byte, if it worked. Unfortunately it cannot work, as half the information cannot be represented with seven bits. In other words, only 128 of the 256 values represented by one byte can be represented by seven bits; the other 128 values have to be left as they are, that is, as eight bits. However, here is the rub: in decoding, it cannot be determined which seven-bit sections were compressed from eight bits and which eight-bit sections were left uncompressed, unless a marker is placed between every two sections at the compressed level. Alas, the markers would cost more than whatever compression was achieved. So the argument goes.
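As a rough numerical check of this argument (assuming uniformly random bytes, so that each of the two cases occurs half the time), the arithmetic can be spelled out in a few lines of Python:

values_8bit = 2 ** 8                      # 256 distinct one-byte sections
values_7bit = 2 ** 7                      # 128 distinct 7-bit codes: only half the bytes fit
print(values_7bit / values_8bit)          # prints: 0.5
# If a 1-bit marker were added to tell compressed from uncompressed sections,
# the average cost per original byte would be:
print(0.5 * (7 + 1) + 0.5 * (8 + 1))      # prints: 8.5, worse than no compression at all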


In mathematical terms, the coding of random data cannot be shorter than its entropy:

L >= H(X), where H(X) = - SUM p(x) * log2 p(x) is the entropy of the source and L is the expected number of bits per encoded symbol

In other words, if a random variable is uniformly distributed over 32 values, there are 32, or 2^5, equally likely possibilities, and its entropy is log2(32) = 5 bits. Hence the entropy of a uniformly distributed 8-bit value is:

H = log2(2^8) = log2(256) = 8 bits

Thus, what stands in the way of compressing random data is this entropy factor. If a piece of data is divided into words of 8 bits, then there will be 256 different words in the dictionary, or 256 letters in the alphabet, each with an equal probability of occurring, and therefore 8 bits are needed to encode all the words, as the formulas above indicate. Therefore, if each of the 256 different 8-bit words is to be translated into a unique 7-bit word so as to achieve compression, only half of the 8-bit words can be translated, as there can only be 128 different 7-bit words. It is like trying to translate a dictionary of 1,000 words into one of 500 words on a one-to-one basis. And here is the point of departure: the method deployed in this compression technology does not depend on compressing each word of a fixed length to a word of a shorter fixed length, as explained below.
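As a worked check of these figures, and assuming a uniform distribution over all 256 one-byte words, a few lines of Python confirm the 8-bit entropy:

from math import log2

p = 1 / 256                                # every 8-bit word is equally likely
H = -sum(p * log2(p) for _ in range(256))  # Shannon entropy in bits per word
print(H)                                   # prints: 8.0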

If some eight-bit sections of random data can be converted to seven bits, some to six bits, some to five bits, and so on, then the total number of values representable by all these unequal sections falls only two short of the number of values represented by the pre-compression binary data. For instance, in an eight-bits-to-seven-bits compression process, 00000001 can only be converted to 0000001. In comparison, in the process of converting eight bits to unequal sections, the same 00000001 could be converted to any one of seven different sections: 0000001, 000001, 00001, 0001, 001, 01 and 1, making the six extra, otherwise unused sections available to represent values that could not be represented by uniform-length sections shorter than eight bits. It can also remain unchanged as 00000001. Thus seven more values are created from one single value, and in this way all values represented by one byte can be represented by sections of equal or fewer bits. In mathematical terms, the relationship can be expressed as a finite geometric progression:

2^1 + 2^2 + ... + 2^(y-1) = 2^y - 2

where y = number of bits and k = exponent of 2 (the terms of the progression are 2^k for k = 1 to y - 1)
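As a quick illustrative check in Python, with y = 8 the progression sums to 254, two less than the 256 values representable by one byte:

y = 8
total = sum(2 ** k for k in range(1, y))   # sections of 1 to y-1 bits: 2 + 4 + ... + 128
print(total, 2 ** y - 2)                   # prints: 254 254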

This all sounds plausible enough; but in decoding, how could the computer know how many bits to read each time in order to restore the original eight? The only apparent way around this difficulty is to place a marker between every two compressed sections, and the problem with that measure has already been pointed out above.


Theoretical solutions provided by the application of the present technology:

To put it in a nutshell, what this technology does is simply to convert longer sections, not necessarily of equal length, into shorter sections, not necessarily of equal length, which can be decoded back to the original sections without the need for markers. How can this be done? It is done by splitting the random data into two sections, the control and the subordinate. Signals in the control section determine how many bits in the subordinate section, in conjunction with the signals themselves, are to be decoded back to the original section of the random data. Thus, unlike the markers discussed in The apparent impossibility of compressing random data above, the signals in the control section do not cause growth, as they are themselves derived from, and part of, the uncompressed random data. As a matter of fact, the signals themselves can be replaced with shorter signals, resulting in even greater compression.

While the technology does not process the random data in fixed-length sections, the result can be evaluated by comparing sections of the same length, say eight bits, with the lengths of their corresponding sections at the compressed level. It then becomes clear that each section of that length has effectively been converted to an unequal, shorter section at the compressed level. As the compressed data is still random binary, the process can be repeated again and again.


Achieving Recursive Compression of Random Data

Our ability to compress Random Binary allows us to also achieve recursive compression, the ability to do it again and again. How do we do this?

First, we look at binary information at a closer range: instead of one byte or one nibble (4 bits) at a time, we look at one couplet (2 bits) at a time. We then split each couplet into two single bits, the Most Significant Bit (MSB) on the left and the Least Significant Bit (LSB) on the right. The former is allocated to the control section, where it forms signals with the MSBs of other couplets; the latter goes to the subordinate section. As already explained in Methodology above, all patterns of binary data can be classified into five modes, which are indicated by signals in the control section. We shall give an example of how our technology works with the simplest mode, namely the base-0 mode, in which every couplet has an MSB of 0.

Say we have 3 bytes of information and, for argument's sake, it is as follows.



00000000 00000000 00000001


This represents 24 bits of data.

We can represent that as follows: break the bytes down into couplets, that is, two bits of information for one couplet. The left side of the couplet is the Most Significant Bit (MSB) and the right side is the Least Significant Bit (LSB).


00 00 00 00 00 00 00 00 00 00 00 01


MSB > 00 < LSB


Once the couplets are broken down into MSB and LSB, we can split the whole of the data, adding a second dimension: we no longer look at the binary in a purely linear fashion.

In splitting the above data we use the following rule: for a couplet with 0 as its MSB and 0 as its LSB, the LSB only needs to be recorded in the subordinate section; whereas for a couplet with 0 as its MSB and 1 as its LSB, the MSB has to be recorded in the control section (Section 1) and the LSB in the subordinate section (Section 2).

Applying the above rule to our sample data, we achieve the following.


Example:

Section 1 | Section 2

0 | 000000000001


Compression achieved:

We started with 24 bits of data (3 bytes); by splitting the data and looking at it from a second dimension, those 24 bits have now become 13 bits of data (1 bit in Section 1 plus 12 bits in Section 2).

If you apply the same rule to the data above then you can compress again: recursive compression.

A further example shows that another 24-bit piece of data, say 00 01 00 00 00 01 00 01 00 00 00 01, could, by means of the same two-dimensional method as above, be compressed to:


Section 1 | Section 2

0000 | 010001010001


The number of bits has now been reduced to 16. Thus, one 24-bit section of data was compressed to 13 bits in the first example, whereas another section of the same length was compressed to 16 bits in the second. In other words, sections of the same length in the raw data can be compressed to different lengths in the compressed data, and can still be decoded without the need for artificial markers (extra bits) between compressed sections of unequal lengths. These concrete examples demonstrate the theoretical feasibility advanced in Mathematical Theory supporting the Technology above.

To decode the above examples, simply restore the missing 0s in the subordinate section by adding a 0 before each bit in that section.
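For illustration, here is a minimal Python sketch of the base-0 mode as described in this section; the function names are ours, not the product's, and the sketch assumes the input consists only of 00 and 01 couplets, as in the two examples above:

def encode_base0(bits):
    # Split a string of 00/01 couplets into Section 1 (control) and Section 2 (subordinate).
    control, subordinate = [], []
    for i in range(0, len(bits), 2):
        msb, lsb = bits[i], bits[i + 1]
        assert msb == "0", "the base-0 mode assumes every couplet has an MSB of 0"
        if lsb == "1":
            control.append(msb)        # a 01 couplet records its MSB in the control section
        subordinate.append(lsb)        # every couplet records its LSB in the subordinate section
    return "".join(control), "".join(subordinate)

def decode_base0(control, subordinate):
    # Restore the original data by adding a 0 before each bit of the subordinate section.
    return "".join("0" + lsb for lsb in subordinate)

example1 = "00" * 11 + "01"                    # 00000000 00000000 00000001
example2 = "000100000001000100000001"          # 00 01 00 00 00 01 00 01 00 00 00 01
for data in (example1, example2):
    control, subordinate = encode_base0(data)
    print(len(control) + len(subordinate))     # prints: 13, then 16
    assert decode_base0(control, subordinate) == data   # lossless round trip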

The purpose of the above examples is to show that sections of the original data of the same length can be compressed to different lengths and decoded back to their original length without the need for markers between the compressed sections. This is a high-level example of the foundations we use to achieve RBC; we have overcome numerous obstacles to achieve a similar outcome for any Random Binary.