I'm working with an embedded project that uses a 4 kB buffer. Every once in a while something happens which introduces 10-15% bit errors, scattered throughout the buffer. Can someone please recommend some error correction techniques that could solve this? My microcontroller is clocked at 60 MHz and it's acceptable if the encoding/decoding takes hundreds of milliseconds, but not seconds. Ideally it should be a solution where there is free source code available.
- You'd need to work on the non-corrupted data first. How do you know when it corrupts before applying error correction? Sounds like you are better off solving the problem than working around it. – Justme, Mar 25, 2024 at 22:19
- Yes, the data will have to be encoded and I need a larger buffer than 4 kBytes for that. I may or may not be notified in the source code right before the corruption occurs, I'm still waiting for an answer from support. The highest level of support has confirmed that this problem is unsolvable. – arnold_w, Mar 25, 2024 at 22:38
- I don't understand how this data corruption occurs: is it in RAM, in flash, during transmission? The simplest technique would be, if you receive a packet with lots of errors, to ask the sender to retransmit it... – bobflux, Mar 25, 2024 at 22:42
- "it temporarily loses power" ... can you not do something to address the issue, rather than trying to place a bandage on a largely unqualified and uncontrolled process (i.e. RAM losing its contents)? You may want to look into bulk capacitance, or battery-backed / non-volatile RAM. – Attie, Mar 25, 2024 at 22:49
- You mentioned that above, but I presume the "problem" that is unsolvable is "my RAM becomes corrupt after power failure", not "the RAM in this part is unstable when operating within the recommended parameters". Resolving the power loss seems like the only sensible solution here, otherwise you will lose data at some point, even if it only appears to be presenting ~15% errors at present. For example: how long do you lose power for, to what voltage does the supply sink, how controlled are these effects, and does extending either of them affect that 15% figure? – Attie, Mar 25, 2024 at 22:51
1 Answer
"every once in a while something happens which introduces 10-15% bit errors, scattered throughout the buffer"
"it temporarily loses power"
Unfortunately, it's quite unlikely you'll be able to correct so much corruption without storing multiple whole copies and using them to "vote" on the correct value. Even then, given how the errors appear (randomly, in unpowered RAM), you would be incredibly lucky to get valid data out.
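To put a rough number on the voting idea, the sketch below takes a bitwise majority vote across an odd number of copies and estimates how many bits would still be wrong afterwards. It assumes each bit in each copy flips independently with probability `p`, which may well not hold for RAM that has lost power. The function names are illustrative, and Python is used only for brevity; on the target the vote itself would be a few lines of C.

```python
from math import comb


def majority_vote(copies):
    """Bitwise majority vote across an odd number of equal-length byte buffers."""
    n = len(copies)
    out = bytearray(len(copies[0]))
    for i in range(len(out)):
        for bit in range(8):
            # Count how many copies have this bit set.
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if ones * 2 > n:
                out[i] |= 1 << bit
    return bytes(out)


def residual_bit_error_rate(p, n):
    """Probability that more than half of n independent copies of a bit are wrong.

    ASSUMPTION: every copy of every bit flips independently with probability p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for n in (3, 5):
        ber = residual_bit_error_rate(0.15, n)
        print(f"{n} copies: residual bit error rate {ber:.4f}, "
              f"about {ber * 4096 * 8:.0f} wrong bits in 4 KiB")
```

Under that independence assumption, three copies at p = 0.15 still leave about 6% of the bits wrong and five copies about 2.7%, so voting on its own does not get close to clean data.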
To give you some context, 10-15% is a very high error rate. Commonly used schemes can detect a few bit errors per code word and correct even fewer; a typical example is "SECDED" (single error correction, double error detection). See the excellent 3Blue1Brown videos on this topic.
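As a concrete picture of what SECDED can and cannot do, here is a minimal sketch of an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits and 1 overall parity bit. The bit layout and the names `encode` and `decode` are choices made for this illustration, not taken from any particular library or part.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word (list of bits)."""
    code = [0] * 8  # index 0: overall parity, 1..7: Hamming positions
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1            # makes the total parity even
    return code


def decode(received):
    """Return (nibble, status); nibble is None when a double error is detected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) & 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips, assumed to be one: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: cannot be repaired.
        return None, "double error detected"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                 # one flipped bit
    print(decode(word))          # the flip is corrected
    word[2] ^= 1                 # a second flipped bit
    print(decode(word))          # detected, but not correctable
```

One flipped bit per 8-bit code word is corrected and two are detected; three or more can be silently "corrected" into the wrong value. At a 15% bit error rate an 8-bit code word sees 1.2 flipped bits on average, so a code like this is overwhelmed most of the time.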
You're working with ~4 KiB of data, which is 32,768 bits. At a ~15% error rate that is ~4,915 bit errors, way beyond what a simple scheme could handle. Even for a small 64-byte block (512 bits), the figure is still ~76.8 errors.
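The arithmetic can be checked in a few lines. The second function goes one step further and computes the Shannon capacity of a binary symmetric channel, 1 - H(p), which is an upper bound on the code rate of any error-correcting scheme. It assumes independent bit flips, which is an assumption rather than something established about this hardware; the function names are illustrative.

```python
from math import log2


def expected_bit_errors(num_bytes, bit_error_rate):
    """Expected number of flipped bits in a buffer of num_bytes bytes."""
    return num_bytes * 8 * bit_error_rate


def bsc_capacity(p):
    """Capacity (useful bits per stored bit) of a binary symmetric channel.

    ASSUMPTION: each bit flips independently with probability p.
    """
    if p in (0.0, 1.0):
        return 1.0
    entropy = -p * log2(p) - (1 - p) * log2(1 - p)
    return 1.0 - entropy


if __name__ == "__main__":
    for ber in (0.10, 0.15):
        print(f"BER {ber:.0%}: "
              f"{expected_bit_errors(4096, ber):.1f} errors in 4 KiB, "
              f"{expected_bit_errors(64, ber):.1f} errors in 64 bytes, "
              f"capacity {bsc_capacity(ber):.3f}")
```

At p = 0.15 the capacity is about 0.39, so even an ideal code would need more than roughly 2.5 times the original storage (over 10 kB for a 4 kB buffer), and practical codes need more than that.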
Error detection and correction is a hot topic at the moment, with our storage technologies being pushed harder and harder - DDR5 has on-die ECC, and high capacity flash storage is no longer a reliable medium, but rather a game of statistics / probabilities. The algorithms that go into these devices are closely guarded secrets, and I can almost guarantee that they're nowhere near your 15% ask either.
I know it's not what you want to hear, but you should focus efforts on resolving the power issues you have, not searching for a magical (and free!) solution. It may also be sensible for you to investigate non-volatile storage like FRAM or battery-backed RAM (many parts have an amount of this built-in!)
It's also worth reviewing your software, to ensure these errors are indeed caused by power issues, and not one or more bugs.
"The highest level of support has confirmed that this problem is unsolvable."
I presume "the problem" that is unsolvable is "my RAM becomes corrupt after power failure", not "the RAM in this part is unstable when operating within the recommended parameters".
- ChatGPT says that Turbocode with code rate 1/2 (so I'd need an 8 kB buffer) and constraint length 7 or 9 should be able to handle 15% bit error rate. I don't know if it's hallucinating or if there's any truth to it. Is it feasible to have a Turbocode encoder and decoder in a microcontroller or is it too computationally heavy? What about LDPC? – arnold_w, Mar 26, 2024 at 2:24
- Well, you'd have to re-run the encoder each time you write to the RAM. How often does that occur? And you'd still be vulnerable while the encoder is running. What's the probability of the corruption occurring during that interval? It should be clear that even with a huge investment of resources, your overall reliability would still be abysmal. – Dave Tweed, Mar 26, 2024 at 3:22
- "ChatGPT says ..." put it in the bin (I think I need a policy of entirely disengaging from anyone who responds "ChatGPT says ..."). Do your own research, and understand the problem and the options available. Then try again. – Attie, Mar 26, 2024 at 12:42