Standard data correction methods takes some blocks of the data and enlarge them to protect against some specific set of errors. We loose whole block if we get out of this set.

Look at two well known data correction methods: Hamming 4+3 (to store 4 bits we use additional 3 bits) and tripling each bit (1+2).

Assuming the simplest error distribution model - for each bit the probability that it is switched is constant, let say e=1/100, we will have that

- the probability that in 7 bit block we have at least 2 errors, is 1 - (1-e)^7 - 7e(1-e)^6 =~ 0.2%

- for 3 bit block it's about 0.03%

So for each kilobyte of data we irreversibly loose: 4*4=16 bits in Hamming, 2.4 bits for tripling bits.

We see that even for looking to be well protected methods, we loose a lot of data because of pessimistic cases.

To be able to correct errors we are adding usually constant density of redundancy, but the density of errors usually isn't constant - can fluctuate - sometimes is above average, sometimes beyond.

When is above - we need a lot of redundancy to cope with these pessimistic cases.

When is beyond - we've used there more redundancy than it was really needed - we waste some capacity.

The idea is to somehow transfer these unused surpluses of redundancy to help with difficult cases.

Thanks of it instead of adding redundancy according to pessimistic error density, we could use for a bit above average density, which is usually a few orders of magnitude smaller.

To do this we cannot place data in independent blocks (like 7 bits in Hamming)- their redundancy is also independent.

We should use a coding which allow to hide redundancy in a way that each piece of it affects large part of the message.

I'll show now an example how it can be done.

Unfortunately it has various latency - simple errors can be quickly corrected, but more complicated may take long times... maybe there are better ways?

The idea is to use a very precise coding - such that any error would make that the following decoded sequence should be completely random bit sequence.

For example a block ciphers, which uses previous block to calculate the following one ... but there is much better coding for it I will say later about.

Now - add to the message some easily recognizable redundancy - for example insert '1' between each digit (each such bit is hidden in all succeeding bits of encoded message).

If while decoding it occurs that there is '0' in one of these places - that means we had some error before and most probably it's nearby.

Knowing statistical characteristics of expected errors, we can make list of most possible errors in such cases, sort them by their probability - on the top of this list there should be 'switched previous bit',... after a while there can appear 'switched two bits:...'. This list can be very large.

Now if we know that there (nearby) appeared some error, we take this list position by position, correct as it was really this case (switch some bits) and try to decode further fixed number of bits (a few dozens).

If everything is ok - we get only '1' on selected positions - we can assume that it was this error.

If not - try the next one from the list.

This list can be generated online - using large amount of time we could repair even badly damaged transmission.

While creating the list, we have to remember that errors can appear also in the succeeding bits.

Using block ciphers is a bit nasty - they are slow, we have large blocks to search for errors ...

There is new coding just ideal for above purpose - Asymmetric Numeral Systems (ANS)- new entropy coder (http://www.c10n.info/archives/720), which has very nice properties for cryptography ... and data correction - it's much faster than block ciphers and uses small blocks of various length.

Here for example is demonstration about it:

http://demonstrations.wolfram.com/DataCompressionUsingAsymmetricNumeralS...

We can use ANS entropy coding property to make above process quicker and distribute redundancy really uniformly:

to create easily recognizable pattern, instead of inserting '1' symbol regularly, we can add a new symbol - the forbidden one. If it occurs, we know that something was wrong, the nearer the more probable.

Let say we use symbols with some probability distribution (p_i), so we at average need H = -sum_i p_i lg p_i bits/symbol.

For example if we want just to encode bytes without compression, we can threat it as 256 symbols with p_i=1/256 (H = 8 bits/symbol).

Our new symbol will have some chosen probability q. The nearer to 1 it is, the larger redundancy density we add, the easier to correct errors.

We have to rescale the rest of probabilities: p_i ->(1-q) p_i.

In this way, the size of the file will increase r = (H - lg (1-q) )/H times.

Now if while decoding we get the forbidden symbol, we know that,

- with probability q, the first uncorrected yet error has occurred in some of bits used to decode last symbol,

- with probability (1-q)q it occurred in bits used while decoding the previous symbol,

- with probability (1-q)^2 q ...

The probability of succeeding cases drops exponentially, especially if (1-q) is near 0.

But the number of required tries also grows exponentially.

But observe that for example all possible distributions of 5 errors in 50 bits is only about 2 millions - it should be checked in a moment.

Using redundancy like in Hamming(4+3) or tripling (1+2):

4+3 case (r=7/4) - we add forbidden symbol with probability q=1-1/2^3=7/8, and each of 2^4=16 symbols has probability 1/16*1/8=1/128.

In practice ANS works best if lg(p_i) aren't natural numbers, so q should (not necessary) be not exactly 7/8 but something around.

Now if the forbidden symbol occurs, with probability about 7/8 we only have to try to switch one of (about) 7 bits used to decode this symbol.

With 8 times smaller probability we have to switch 7 bits from the previous one... with much smaller probability, depending on the error density model, we should try to switch some two bits ... and even extremely pessimistic cases looks to take reasonable time to correct them.

For 1+2 case (r=3), the forbidden symbol has about 3/4, and 0,1 has 1/8 each.

With probability 3/4 we have only to correct one of 3 bits ... with probability 255/256 one of 12 bits ...

## Recent comments