Wednesday, June 3, 2026

"Hello Wrld" ?!

The Case of the Missing Character: How a 1.5V Mistake Sent Me Chasing Ghosts

Some bugs are honest. They fail loudly, point at the cause, and let you fix them in an afternoon. This was not one of those bugs. This is the story of a single dropped UART character that dragged me through bitstream verification, power delivery networks, ferrite beads, regulator feedback capacitors, and block RAM readback theory before revealing that the real culprit had been sitting in the board's history the whole time.

If you are debugging something intermittent on a Lattice Certus-NX, maybe this saves you a few days.

THE SYMPTOM

I was running the stock Lattice Propel RISC-V demo, completely unchanged. The classic hello-world over UART. And every so often, one random character would go missing from the message. Not garbled. Not corrupted into a different byte. Just absent, as if it had never been sent.

One detail mattered a lot, though I did not appreciate it at first: the large RISC-V core dropped characters, while the same demo built with the small Nano core ran perfectly. Same UART driver, same firmware, same board.

THE FIRST WRONG TURN: BITSTREAM VERIFY

The dropped characters made me suspicious that the configuration was not solid, so I went to verify the bitstream against the device. And that is where things got strange.

An erase-program-verify pass in one shot completed cleanly. But if I ran a second, standalone verify afterward, it failed with a single-bit readback error, always at the same address. SRAM verify, repeatable, same bit.

I burned real time here. I compared two boards. One verified clean, one did not. I noticed the failing board lacked a ferrite bead on VCCINT that the good board had. I looked at bulk capacitance. I added 47uF. No help. I looked at decoupling values, found a 470nF where Lattice recommends 100nF, and talked myself into and back out of that being relevant. I found a 100pF feedforward capacitor across the regulator's upper feedback resistor on the failing board and removed it. No help.

Eventually the decisive test: hold the design in reset, and the second verify passes. That confirmed the verify "failure" was never a hardware fault at all. It was simply readback catching block RAM that the running design had written after configuration. The configuration loads RAM with its initial values, the first verify sees those, the design runs and modifies the RAM, and the second verify sees the modified contents and reports a mismatch. Completely expected behavior for SRAM readback on a running design with writable memory.
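A toy model makes the mechanism concrete. This is only a sketch, not the Lattice programming tools: the function names, the 16-word memory, and the single flipped bit at address 5 are all made up for illustration.

```python
# Toy model of SRAM readback verify. Every name and value here is hypothetical.

def configure(golden):
    """Configuration loads block RAM with the initial values from the bitstream."""
    return list(golden)

def verify(golden, device):
    """Return the addresses where readback differs from the golden image."""
    return [addr for addr, (g, d) in enumerate(zip(golden, device)) if g != d]

def run_design(device, held_in_reset):
    """A running design writes to block RAM; a design held in reset does not."""
    if not held_in_reset:
        device[5] ^= 0x01  # the design changes one bit at one fixed address

golden = [0x00] * 16

# Case 1: the design runs between the two verify passes.
device = configure(golden)
first_verify = verify(golden, device)        # right after programming
run_design(device, held_in_reset=False)
second_verify = verify(golden, device)       # standalone verify, later

# Case 2: the design is held in reset before the second verify.
device_in_reset = configure(golden)
run_design(device_in_reset, held_in_reset=True)
second_verify_in_reset = verify(golden, device_in_reset)

print(first_verify, second_verify, second_verify_in_reset)
```

The first verify and the held-in-reset verify report no mismatches, while the standalone verify on the running design reports the same address every time, which is the pattern described above.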

In other words: the entire verify investigation was a side quest. It had nothing to do with the dropped characters. The bead, the bulk caps, the feedforward capacitor were all real differences between the boards, but none of them were causing anything.

BACK TO THE REAL PROBLEM

A bitstream readback artifact does not drop UART characters. So I returned to the actual symptom and reconsidered it properly.

A single missing character, not a corrupted one, usually means a transmit overrun: firmware writing a byte before the UART is ready, with no proper TX-ready check. But the firmware was the unmodified Propel demo, and the same driver worked fine on the Nano core. If the driver had a handshake bug, both builds would fail. So the variable was the core, not the software.
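To show why a missing (rather than corrupted) character points at a transmit overrun, here is a small simulation. It is a sketch under stated assumptions, not the Propel UART driver: ToyUart, its busy_ticks parameter, and the choice to silently discard a write while busy are all hypothetical. Real UARTs may overwrite the pending byte instead, but either way one character never reaches the wire.

```python
# Toy model of a UART with a one-byte transmit holding register.
# ToyUart and all timing numbers are hypothetical, for illustration only.

class ToyUart:
    def __init__(self, busy_ticks):
        self.busy_ticks = busy_ticks  # ticks needed to shift one byte out
        self.remaining = 0            # ticks left on the byte in flight
        self.sent = []                # bytes that actually reached the wire

    def tx_ready(self):
        return self.remaining == 0

    def write(self, byte):
        if self.tx_ready():
            self.sent.append(byte)
            self.remaining = self.busy_ticks
        # Assumption: a write while busy is silently discarded.

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1

def send(message, uart, ticks_between_writes, wait_for_ready):
    for ch in message:
        if wait_for_ready:
            while not uart.tx_ready():  # the TX-ready check
                uart.tick()
        uart.write(ch)
        for _ in range(ticks_between_writes):
            uart.tick()
    return "".join(uart.sent)

message = "Hello World"
unchecked = send(message, ToyUart(busy_ticks=3), ticks_between_writes=2,
                 wait_for_ready=False)
checked = send(message, ToyUart(busy_ticks=3), ticks_between_writes=2,
               wait_for_ready=True)
print(repr(unchecked), repr(checked))
```

Without the ready check, the output is shorter than the message, and every character that does arrive is correct and in order. With the check, the whole message arrives.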

I had already dropped the system clock from 108 MHz to 50 MHz, and the characters still went missing. That was an important clue I almost misread. A setup-timing problem should improve dramatically at less than half the clock rate. It did not. So this was probably not classic setup slack. The surviving suspects were a hold violation (clock-speed independent), a clock-domain-crossing issue, or core supply marginality that only the large core exercised.
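The arithmetic behind that reasoning can be sketched with the textbook slack equations. The path delays, setup and hold times, and skew below are invented numbers, not values from a Certus-NX timing report; only the two clock frequencies come from the investigation.

```python
# Simplified setup and hold slack. All delay values are hypothetical.

def period_ns(freq_mhz):
    return 1000.0 / freq_mhz

def setup_slack_ns(freq_mhz, data_path_max_ns, t_setup_ns):
    """Setup slack grows with the clock period (clock skew ignored here)."""
    return period_ns(freq_mhz) - data_path_max_ns - t_setup_ns

def hold_slack_ns(data_path_min_ns, t_hold_ns, clock_skew_ns):
    """Hold slack has no clock-period term at all."""
    return data_path_min_ns - t_hold_ns - clock_skew_ns

# A made-up slow path that fails setup at 108 MHz.
setup_108 = setup_slack_ns(108, data_path_max_ns=9.5, t_setup_ns=0.3)
setup_50 = setup_slack_ns(50, data_path_max_ns=9.5, t_setup_ns=0.3)

# A made-up fast path that fails hold. Note there is no frequency argument.
hold = hold_slack_ns(data_path_min_ns=0.2, t_hold_ns=0.1, clock_skew_ns=0.15)

print(f"setup slack @108 MHz: {setup_108:+.2f} ns")
print(f"setup slack @ 50 MHz: {setup_50:+.2f} ns")
print(f"hold slack (any clock): {hold:+.2f} ns")
```

In this sketch the setup failure turns into more than 10 ns of positive slack at 50 MHz, while the hold violation is unchanged. A fault that survives the slower clock is therefore unlikely to be a plain setup problem.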

The large-core-fails / small-core-passes split kept pointing at the large build being marginal in some way the Nano core never stressed.

THE ACTUAL ROOT CAUSE

Then the piece of board history that explained everything:

The original development board, the one dropping characters, had at some point seen 1.5V on its VCCINT rail. The Certus-NX core nominal is around 1.0V. That is roughly 50% over-voltage. That FPGA had been electrically over-stressed.

Over-voltage damage on an FPGA rarely produces a cleanly dead part. It produces a device with degraded margins. It works for light, easy workloads and glitches under heavier dynamic stress. Which is exactly what I was seeing:

- The Nano core, small and undemanding, ran fine.

- The large RISC-V core, with far more switching activity and tighter margins, occasionally missed a UART register load and dropped a character.

- Lowering the clock to 50 MHz did not help, because degraded silicon margin is not a setup-timing problem you can clock your way out of.

Everything finally fit. The dropped character was a marginal, over-stressed device failing under load. The verify mismatch was an unrelated RAM-readback artifact I had chased down a rabbit hole. The board-to-board PDN differences were real but incidental.

The two boards, for the record:

- CR00103: the good board, never over-volted, ran clean.

- TEL0025: the bad board, the one that had seen 1.5V on VCCINT, the source of the dropped characters.

LESSONS

1. Know your board's history. The single most important fact in this entire investigation, that TEL0025 had seen 1.5V on VCCINT, was not an electrical measurement I took during debugging. It was board history. Had I known it on day one, I would have skipped everything else.

2. An over-stressed FPGA fails on the margins, not all at once. "Works with the small design, fails with the big one" is a classic signature of degraded silicon rather than a logic or firmware bug. Light workloads hide the damage.

3. A clue that does NOT change behavior is as informative as one that does. Dropping 108 MHz to 50 MHz and still failing told me it was not setup timing. That negative result redirected the whole investigation.

4. Beware the seductive side quest. The bitstream verify mismatch looked like a smoking gun and ate a lot of time. It was a real, explainable phenomenon (SRAM readback sees runtime-modified block RAM) but completely unrelated to the symptom I actually cared about. When you find a second mystery mid-investigation, confirm it is connected to the first before chasing it.

5. n=2 board comparisons mislead. Comparing a "good" and "bad" board surfaced several differences (ferrite bead, feedforward cap, decoupling values), each of which looked causal and none of which were. With only two boards and uncontrolled differences between them, correlation is cheap and worthless.

6. Protect the replacement. Before dropping a new device onto TEL0025, the VCCINT regulator setpoint and feedback network need checking so it cannot over-volt again. A 1.5V event that killed one FPGA will happily kill its replacement.
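For the setpoint check in lesson 6, a quick calculation helps. This sketch assumes the common adjustable-regulator topology where the output is set by a two-resistor feedback divider against an internal reference. The 0.6V reference, the resistor values, and the 5% tolerance are hypothetical and are not taken from TEL0025 or any datasheet; use the real regulator and FPGA datasheet numbers instead.

```python
# Feedback-divider setpoint check. All component values are hypothetical.

def regulator_vout(v_ref, r_top, r_bottom):
    """Vout = Vref * (1 + Rtop / Rbottom) for a divider-programmed regulator."""
    return v_ref * (1.0 + r_top / r_bottom)

def within_tolerance(v_out, v_nominal, tolerance):
    """True if v_out is within +/- tolerance (a fraction) of v_nominal."""
    return abs(v_out - v_nominal) <= v_nominal * tolerance

V_REF = 0.6        # assumed regulator reference voltage, in volts
V_NOMINAL = 1.0    # core nominal of around 1.0V, as stated in the post
TOLERANCE = 0.05   # assumed allowed deviation; check the FPGA datasheet

# An example divider that lands on the nominal rail.
v_good = regulator_vout(V_REF, r_top=10_000, r_bottom=15_000)

# An example divider that lands on 1.5V.
v_bad = regulator_vout(V_REF, r_top=15_000, r_bottom=10_000)

for label, v in (("good divider", v_good), ("bad divider", v_bad)):
    status = "OK" if within_tolerance(v, V_NOMINAL, TOLERANCE) else "OUT OF RANGE"
    print(f"{label}: {v:.3f} V  {status}")
```

Running the same calculation on the fitted resistor values, and then confirming it with a meter before the new device goes on, is a cheap guard against a repeat.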

WHERE IT STANDS

TEL0025's FPGA is treated as damaged and is no longer trusted for characterization. Development moved to another board, one that never saw the over-voltage, to confirm that the unchanged large-core Propel demo runs clean at full speed. The design and the PDN were fine all along. The problem was a device that had been pushed to 1.5V on a 1.0V rail, and a long detour through everything except the board's own history.

