A team I worked with last year had a vision inspection model that hit 96.4% accuracy on their validation set, achieved 42 fps on the development workstation, and passed every benchmark they threw at it. The first production unit was installed on a busbar manufacturing line outside Lyon. Within nine days the line operator was overriding the model on more than half of its flagged defects, throughput on the inspection station had dropped by 18%, and the engineering lead was on the phone asking whether we could "just retrain the model with more data."
The model was fine. The retraining did not happen. Instead we instrumented the deployed system end to end, and within four hours had isolated four separate problems: the model had been quantised with a different toolchain from the one used for the benchmark; the enclosure was hitting 71 °C, so the accelerator was throttling to 60% of its rated throughput; the camera mount had shifted by 2.4 mm after a vibration event the week before; and the lighting was off-spec because the line had switched to a different SKU whose polished surface reflected differently. Four independent failure modes, none of which was a bug in the model itself, and any one of which would have looked from upstairs like "the AI is not working."
This is the pattern I see in nearly every Physical AI deployment that struggles in the first three months of production. The model gets blamed because the model is the visible component. The actual failures are in the four corners of the system that the development workstation never had to handle. Below are the four production failure modes that account for the majority of incidents I see in industrial, automotive, and energy deployments — what each one looks like, why it shows up only in production, and the mitigations that actually work.
Failure mode 1 — Quantisation-induced accuracy regression
The most common failure mode and the one teams underestimate most often. A model is trained at FP32, validated at FP32, signed off at FP32. When it is time to deploy on an edge accelerator with finite memory and a power budget that does not accommodate FP32 inference, the model is quantised — to FP16, INT8, or in the more aggressive cases INT4. The accuracy on the benchmark set drops by a few tenths of a percent, the team waves the regression through, and the system ships.
Then the failures start. They are not uniform. The aggregate accuracy on the benchmark is still acceptable. But the per-class accuracy on the rare classes — the ones that matter — has collapsed. The model that was 96% accurate at FP32 is still 95% accurate at INT8 on the average input, but on the long-tail defects that drive the line stops it is now 71%. The benchmark set never had enough of those classes to surface the regression. Production has them every shift.
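A minimal sketch of why the aggregate figure hides this, using made-up counts rather than data from any real line. The class names `ok` and `crack` and the helper `per_class_accuracy` are illustrative assumptions.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {class_label: accuracy} so tail classes are visible on their own."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Illustrative data only: 960 common "ok" parts and 40 rare "crack" defects.
y_true = ["ok"] * 960 + ["crack"] * 40
# 950 of the "ok" parts and 28 of the "crack" parts are classified correctly.
y_pred = ["ok"] * 950 + ["crack"] * 10 + ["crack"] * 28 + ["ok"] * 12

aggregate = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)

print(f"aggregate accuracy: {aggregate:.3f}")
for label, acc in sorted(by_class.items()):
    print(f"{label:>6}: {acc:.3f}")
```

With only 40 rare samples in 1,000, the rare class can lose almost a third of its accuracy while the headline number barely moves, which is why the per-class breakdown belongs in the sign-off report.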
The mechanism is straightforward. Post-training quantisation compresses the dynamic range of weights and activations into 256 buckets (INT8) or 16 buckets (INT4). For the heavily populated regions of the input distribution, the bucketing is fine; the loss is genuinely small. For the tail of the distribution (the inputs that activate the rarely-used filters in the network), the bucketing destroys discriminative information that the FP32 model relied on. The model has not become worse on average; it has become worse precisely on the rare inputs that the benchmark set barely contained and that production delivers every shift.
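To make the bucketing concrete, here is a toy uniform (affine) quantiser. It is a sketch under simple assumptions: one scale for the whole tensor and a hypothetical activation range of 0 to 6. Real toolchains typically use per-channel scales and calibrated ranges, so treat the numbers as illustrative.

```python
def bucket_index(x, bits, lo, hi):
    """Map x in [lo, hi] to one of 2**bits uniform buckets."""
    levels = 2 ** bits - 1          # highest bucket index: 255 for INT8, 15 for INT4
    scale = (hi - lo) / levels      # width of one bucket in real units
    q = round((x - lo) / scale)
    return max(0, min(levels, q))   # clamp values outside the calibrated range


def dequantise(q, bits, lo, hi):
    """Map a bucket index back to the real value it represents."""
    scale = (hi - lo) / (2 ** bits - 1)
    return lo + q * scale


LO, HI = 0.0, 6.0        # assumed activation range
a, b = 0.45, 0.58        # two nearby activations the FP32 model can tell apart

int8_pair = (bucket_index(a, 8, LO, HI), bucket_index(b, 8, LO, HI))
int4_pair = (bucket_index(a, 4, LO, HI), bucket_index(b, 4, LO, HI))

print("INT8 buckets:", int8_pair)
print("INT4 buckets:", int4_pair)
```

At INT8 the two activations land in different buckets; at INT4 they collapse into the same one, and whatever the network was using to separate them is gone.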
Three mitigations work, in roughly increasing order of cost. The cheap one is post-training quantisation with a representative calibration set — not the benchmark set, but a calibration set sampled from production-like inputs with deliberate over-representation of the long-tail classes. Done right, this recovers most of the lost accuracy at no training cost. The middle option is quantisation-aware training, where the forward pass simulates quantisation during the last few epochs of training so the optimiser learns weights that quantise cleanly. The expensive option is mixed-precision deployment, where the sensitive layers stay at FP16 and the rest of the network runs at INT8 — heavier on memory and slightly slower, but accuracy-preserving on tasks where the cheap and middle options are not enough.
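The cheap option depends on how the calibration set is drawn. One way to get deliberate over-representation of the long tail is an equal per-class quota, sketched below. The function name, the quota of 16, and the frame identifiers are assumptions for illustration; the right quota depends on the task.

```python
import random
from collections import defaultdict


def build_calibration_set(samples, per_class, seed=0):
    """Draw up to `per_class` items from every class, regardless of frequency.

    `samples` is an iterable of (item, label) pairs. Equal per-class quotas
    over-represent rare classes relative to their share of production traffic.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the set can be rebuilt exactly
    calibration = []
    for label in sorted(by_label):
        items = by_label[label]
        chosen = rng.sample(items, min(per_class, len(items)))
        calibration.extend((item, label) for item in chosen)
    return calibration


# Illustrative pool: 1000 common frames and 20 rare-defect frames.
pool = [(f"frame_{i}", "ok") for i in range(1000)]
pool += [(f"frame_{i}", "crack") for i in range(1000, 1020)]

calib = build_calibration_set(pool, per_class=16)
counts = defaultdict(int)
for _, label in calib:
    counts[label] += 1
print(dict(counts))
```

In the pool the rare class is about 2% of the frames; in the calibration set it is half. The fixed seed keeps the set reproducible, which matters once the set is versioned alongside the model.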
At Auralink the perception models that recommend maintenance actions for chargers run at INT8 on the edge accelerator with a calibration set that is regenerated quarterly from the production telemetry. The regeneration cadence is not a nice-to-have — without it the calibration set drifts away from the live distribution and the long-tail accuracy walks down month by month. The model registry tracks the calibration-set hash alongside the model weights for exactly this reason: a model deployed without its matching calibration set is a different model, and the evaluation that was signed off no longer applies.
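Tracking the calibration-set hash next to the weights can be as simple as the sketch below. This is not Auralink's registry; the function names and the dictionary layout are hypothetical, and the only real API used is Python's `hashlib`.

```python
import hashlib


def calibration_set_hash(items):
    """SHA-256 over the sorted per-item digests, so item order does not matter."""
    digests = sorted(hashlib.sha256(data).hexdigest() for data in items)
    outer = hashlib.sha256()
    for digest in digests:
        outer.update(digest.encode("ascii"))
    return outer.hexdigest()


def registry_entry(model_bytes, calibration_items):
    """Record the model and the calibration set it was quantised against."""
    return {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "calibration_sha256": calibration_set_hash(calibration_items),
    }


def matches(entry, model_bytes, calibration_items):
    """True only if both the weights and the calibration set are unchanged."""
    return entry == registry_entry(model_bytes, calibration_items)


entry = registry_entry(b"weights-v1", [b"sample-a", b"sample-b"])
same_set_reordered = matches(entry, b"weights-v1", [b"sample-b", b"sample-a"])
different_set = matches(entry, b"weights-v1", [b"sample-a", b"sample-c"])
print(same_set_reordered, different_set)
```

A deployment gate can then refuse any artefact whose calibration hash does not match the one recorded at sign-off.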
The lesson generalises. If your team is comparing FP32 benchmark numbers to make a deployment decision, you are validating a model that is not the one going into production. Validate the quantised model on a production-like distribution. If you cannot construct that distribution, that itself is the finding — and the deployment is not ready.
Failure mode 2 — Thermal throttling on real silicon
The second failure mode is the one that most cleanly distinguishes lab work from production. A vision model running on a Jetson AGX Orin in an open lab, on a 22 °C bench, with the development kit's reference cooler, will hit its rated throughput at INT8 and achieve the published frame rate. The same module in a sealed IP65 enclosure on a factory floor in a 38 °C summer ambient, with fanless cooling because the customer specified no moving parts, will achieve 35% to 60% of that figure. The model has not changed. The thermal envelope has.
The mechanism is governor-level. Every modern edge accelerator — Jetson, Hailo, Qualcomm RB, Coral, anything with a meaningful TDP — has on-die temperature sensors and a frequency governor that scales clocks down when junction temperature crosses a threshold. The thresholds are conservative by design; the chip will throttle long before it actually fails, because thermal damage is permanent and clock throttling is recoverable. The result is a throughput curve that looks like a healthy 60 fps for the first six minutes of operation, then a step-down to 40 fps, then a slow decline to a steady-state floor that depends on the enclosure design. The team that benchmarked the system in the lab for ten minutes saw the peak. The team that runs it in production for eight hours sees the floor.
This matters because the model's nominal performance is rarely the binding constraint. The binding constraint is the worst-case sustained inference rate under the worst-case operating envelope. A vision inspection station that misses three frames per minute because the silicon is throttling will miss three defect candidates per minute, and the line operator will conclude that the model is unreliable. The fact that the model correctly classifies every frame it actually processes is not visible to the operator. Throughput failures look like accuracy failures from upstairs.
The mitigations sit at three layers of the stack. At the silicon layer: sustained-power benchmarking, which means running the inference workload for at least 30 minutes inside the production enclosure at the worst-case ambient and measuring the floor, not the peak. At the BOM layer: cooling design that targets the floor, whether that is a heatsink, a heatpipe to the enclosure wall, a small forced-air loop with a filtered intake, or an undervolt that trades 15% of peak performance for a much higher steady-state floor. At the model layer: a smaller or more efficient model that fits inside the sustained envelope rather than the peak envelope. The right answer is usually a combination of all three, and it is almost never something the cloud-AI team can pattern-match from previous deployments.
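Measuring the floor rather than the peak is mostly a matter of how the soak-test log is summarised. The sketch below assumes one throughput sample per minute and defines the floor as the lowest five-sample rolling mean; both choices, the trace itself, and the 35 fps requirement are illustrative assumptions.

```python
def summarise_soak(fps_samples, window=5):
    """Summarise a soak test given one throughput sample per minute.

    Returns the peak sample and the sustained floor, taken here as the
    lowest rolling mean over `window` consecutive samples.
    """
    if len(fps_samples) < window:
        raise ValueError("need at least `window` samples")
    rolling = [
        sum(fps_samples[i:i + window]) / window
        for i in range(len(fps_samples) - window + 1)
    ]
    return {"peak_fps": max(fps_samples), "floor_fps": min(rolling)}


# Made-up 30-minute trace: 6 minutes at 60 fps, a step down to 40 fps,
# then a slow decline of 0.5 fps per minute.
trace = [60.0] * 6 + [40.0] * 4 + [40.0 - 0.5 * i for i in range(1, 21)]

report = summarise_soak(trace)
required_fps = 35.0  # hypothetical line requirement
ships = report["floor_fps"] >= required_fps
print(report, "meets requirement:", ships)
```

A ten-minute lab run over the same trace would report 60 fps and pass. The 30-minute summary reports a floor near 31 fps and fails, which is the number the line will actually live with.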
Vectis AI — the connected-vehicle data platform we run — taught me how brutal this is on automotive timelines. A perception module that has to operate from −40 °C to +85 °C ambient, inside a metal enclosure, with no active cooling, on 12 V vehicle power has a sustainable thermal budget of maybe 8 to 12 watts depending on the mounting location. That is one tier of silicon. The model has to fit inside that tier. Choosing the model after choosing the hardware is the right order; choosing the model first and then discovering that the silicon for it does not survive the operating envelope is the failure mode I see most often on vehicle programmes. ISO 26262 has views on what "survives the envelope" means for an ASIL-B or higher item, and those views are not negotiable with a marketing data sheet.
The non-negotiable: benchmark in the enclosure, at temperature, for the full sustained duration. If you do not have those numbers, you do not have a system you can ship.
Failure mode 3 — Sensor drift the model never trained on
The third failure mode is the slowest and the most expensive to diagnose. Models trained on a curated dataset implicitly assume that the sensors producing the data in production have the same characteristics as the sensors that produced the training data. In a controlled environment that assumption holds. In a real deployment, sensors drift, get knocked, get replaced, get installed in a different mounting, see different weather, accumulate dust on the lens, switch firmware revisions, and quietly change the statistical properties of their output in ways the model has never seen.
The symptoms creep. A vision model that was 95% accurate at week one is at 91% by week six, 84% by month three, and bafflingly inconsistent by month four. There is no single bug to point to. The model has not been retrained, the silicon has not changed, and the inputs look superficially the same, but the distribution of pixel values has shifted by a fraction of a standard deviation, and the model's decision boundary, which sat right at the edge of that distribution, is now classifying differently. A lidar model on an autonomous truck starts misclassifying low-reflectivity targets after the truck has been operating in a region where heavy road-salt residue fogs the dome. An audio model on a defect-detection station starts firing false positives after the line replaced its compressor and the ambient acoustic signature shifted by three decibels.
The first defence is detection. The system has to compute, on the device, a statistical summary of the input distribution at inference time and compare it to the training distribution's summary. The simplest version is a per-channel mean and variance check on the input tensor. The more sophisticated version is a dedicated drift detector (a small autoencoder, a population stability index, or a Kolmogorov-Smirnov test on a feature embedding) that emits a single drift score per minute and pushes a compressed signal upstream when the score crosses a threshold. The drift detector lives at the edge, runs in milliseconds, and operates whether or not the cloud link is available.
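The simplest version, the per-channel mean and variance check, fits in a few lines. The sketch below uses only the standard library; the thresholds (a mean shift of 0.25 reference standard deviations, a spread ratio between 0.8 and 1.25) and the channel names are assumptions to be tuned per deployment, not recommended values.

```python
from statistics import fmean, pstdev


def channel_drift_scores(reference, live):
    """Per-channel mean shift and spread ratio against training-time statistics.

    `reference` maps channel name -> (mean, std) recorded at training time.
    `live` maps channel name -> list of values seen in the current window.
    """
    scores = {}
    for channel, (ref_mean, ref_std) in reference.items():
        values = live[channel]
        scores[channel] = {
            # Shift of the mean, in units of the reference standard deviation.
            "mean_shift": abs(fmean(values) - ref_mean) / ref_std,
            # Ratio of live spread to reference spread; 1.0 means unchanged.
            "std_ratio": pstdev(values) / ref_std,
        }
    return scores


def drifted(scores, max_mean_shift=0.25, std_band=(0.8, 1.25)):
    """Return the channels whose statistics fall outside the allowed band."""
    lo, hi = std_band
    return sorted(
        ch for ch, s in scores.items()
        if s["mean_shift"] > max_mean_shift or not lo <= s["std_ratio"] <= hi
    )


reference = {"r": (100.0, 10.0), "g": (120.0, 10.0)}
live = {
    "r": [94.0, 114.0] * 50,    # mean has moved from 100 to 104
    "g": [110.0, 130.0] * 50,   # unchanged
}
scores = channel_drift_scores(reference, live)
flagged = drifted(scores)
print(flagged, scores["r"])
```

The red channel has moved by 0.4 of a standard deviation, which no one would spot by eye and which is enough to trip the check.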
The second defence is process. Sensors are physical assets and they are abused. A camera mount on a busbar inspection station will be hit by a forklift at some point. A microphone on a wind-turbine nacelle will accumulate ice. A radar on a vehicle will be partially occluded by a sticker the driver applied at week four. Every sensor in the deployed fleet needs a maintenance cadence — a documented procedure for re-calibration, a lens-cleaning interval, a re-mount inspection, a firmware-version audit. The cadence is part of the system design; without it, the model is operating on a steadily worsening input and there is no human-in-the-loop signal that the inputs have degraded.
The third defence is data. The edge system has to be able to capture, on demand, examples of inputs that the drift detector flagged — not the full raw stream, which the link cannot carry, but a representative sample with metadata. Those samples become the seed of the next retraining round. Without them the team is reduced to guessing at why the model drifted, and the retraining produces a model that is once again brittle to the next drift event.
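One way to keep a representative sample inside a fixed storage budget is reservoir sampling, sketched here. The text above does not prescribe a technique, so this is a swapped-in choice; the class name, the capacity of 8, and the metadata fields are hypothetical.

```python
import random


class FlaggedSampleBuffer:
    """Keep a bounded, uniformly random sample of flagged inputs (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.items = []
        self._rng = random.Random(seed)

    def offer(self, payload, metadata):
        """Consider one flagged input; it may replace an earlier one."""
        self.seen += 1
        record = {"payload": payload, "meta": metadata}
        if len(self.items) < self.capacity:
            self.items.append(record)
            return
        # Keep the new record with probability capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = record


buf = FlaggedSampleBuffer(capacity=8)
for i in range(500):
    buf.offer(payload=f"frame_{i}", metadata={"drift_score": 0.3, "seq": i})

print(len(buf.items), "kept out of", buf.seen)
```

Every flagged frame has the same chance of ending up in the buffer regardless of when it arrived, so an overnight drift event is not crowded out by whatever happened in the last ten minutes before upload.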
A practical anchor: at Auralink the chargers in the field produce a daily drift report — a small artefact summarising the input-distribution statistics across the fleet, broken down by region and by site type — that lands in the ops team's dashboard. The cadence is daily because the dynamics in question (seasonal weather, regional electricity mix, vehicle population shifts) move on a weekly-to-monthly timescale, and a daily summary is the resolution that catches them before they become a customer-facing incident. A monthly summary would catch the problem after it had cost a customer a contract.
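The shape of such a report is simple. The sketch below is not Auralink's actual artefact; it assumes each device uploads one drift score per day along with hypothetical `region` and `site_type` fields, and groups them the way the paragraph above describes.

```python
from collections import defaultdict
from statistics import fmean


def daily_drift_report(device_scores):
    """Aggregate per-device drift scores by (region, site_type).

    `device_scores` is an iterable of dicts with the hypothetical keys
    "region", "site_type" and "drift_score".
    """
    groups = defaultdict(list)
    for row in device_scores:
        groups[(row["region"], row["site_type"])].append(row["drift_score"])
    return {
        key: {"devices": len(vals), "mean": fmean(vals), "max": max(vals)}
        for key, vals in sorted(groups.items())
    }


# Illustrative rows only.
rows = [
    {"region": "north", "site_type": "motorway", "drift_score": 0.10},
    {"region": "north", "site_type": "motorway", "drift_score": 0.30},
    {"region": "south", "site_type": "depot", "drift_score": 0.05},
]
report = daily_drift_report(rows)
for key, summary in report.items():
    print(key, summary)
```

Reporting the maximum alongside the mean keeps a single badly drifted site from being averaged away by its healthy neighbours.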
Failure mode 4 — OTA update path that bricks devices
The most expensive failure mode and the one most often left for "version 2." An over-the-air update pipeline that does not have rigorous safety properties will eventually push an update that fails in the field. The failure can be partial — the model loads but the inference path crashes — or catastrophic — the device boots into an unrecoverable state and has to be physically retrieved. The cost difference between the two is two orders of magnitude.
Four patterns make the difference.
- A/B partitioning: the device holds two complete system images and updates the inactive one before switching the boot pointer, so a failed update means the device boots into the previous image rather than into nothing.
- Cryptographic signing of every update artefact: the public key is fused into the device at manufacture, so an attacker who compromises the update server cannot push arbitrary firmware.
- Automatic rollback on health-check failure: the first boot after an update has to satisfy a health probe within a defined window or the boot pointer reverts.
- Staged rollouts: a new update reaches 1% of the fleet first, then 10%, then 100%, with a circuit breaker that halts the rollout if the health-probe failure rate on the 1% cohort exceeds a threshold.
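Three of these four patterns can be shown in a toy simulation: A/B slots, health-check rollback, and a staged rollout with a circuit breaker. Signature verification is deliberately left out. The class and function names and the 5% failure threshold are assumptions; real implementations live in the bootloader and the update service, not in application code like this.

```python
class ABDevice:
    """Toy device with two image slots and a boot pointer."""

    def __init__(self, image):
        self.slots = {"A": image, "B": None}
        self.active = "A"

    def running_image(self):
        return self.slots[self.active]

    def apply_update(self, new_image, health_check):
        """Write the inactive slot, switch, and revert if the probe fails."""
        previous = self.active
        inactive = "B" if previous == "A" else "A"
        self.slots[inactive] = new_image   # the running image is never touched
        self.active = inactive             # switch the boot pointer
        if not health_check(new_image):    # first-boot health probe
            self.active = previous         # automatic rollback
            return False
        return True


def staged_rollout(fleet, new_image, health_check,
                   stages=(0.01, 0.10, 1.0), max_failure_rate=0.05):
    """Update the fleet cohort by cohort; halt if a cohort fails too often."""
    done = 0
    succeeded = 0
    for fraction in stages:
        target = max(1, round(len(fleet) * fraction))
        cohort = fleet[done:target]
        if not cohort:
            continue
        results = [d.apply_update(new_image, health_check) for d in cohort]
        succeeded += sum(results)
        done = target
        failure_rate = results.count(False) / len(cohort)
        if failure_rate > max_failure_rate:   # circuit breaker
            return {"halted_at": fraction, "updated": succeeded}
    return {"halted_at": None, "updated": succeeded}


fleet = [ABDevice("v1") for _ in range(200)]
good = staged_rollout(fleet, "v2", lambda image: True)
bad = staged_rollout(fleet, "v3-bad", lambda image: image != "v3-bad")
print(good, bad)
```

The bad image reaches two devices, both roll back to the image they were running, and the other 198 never see it.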
None of these are exotic. All four are table-stakes patterns from the embedded world that have been around for fifteen years — RAUC, Mender, Android's A/B updates all encode versions of them. The mistake teams make is treating them as nice-to-haves that they will add once the product is mature. The correct order is to wire all four into the bring-up of the first development device, before any model has been trained, so the OTA path is exercised continuously throughout development. An OTA path that has only ever shipped to ten devices in the lab is not validated; the first real-world rollout is the test, and the cost of a failed test on a thousand-device fleet is the bill that decides whether the product survives.
The pre-mortem checklist
Before any Physical AI system reaches a customer site, the engineering team should be able to answer the following twelve questions with concrete artefacts. If any answer is "we will figure it out later," the system is not ready to leave the lab.
- Has the deployed model been validated at the deployment quantisation, on a calibration set sampled from production-like inputs? Not the FP32 benchmark — the actual quantised artefact on the actual hardware.
- What is the per-class accuracy on the long-tail classes, and is the calibration-set hash tracked in the model registry alongside the weights?
- What is the sustained throughput, measured inside the production enclosure, at the worst-case ambient temperature, over a 30-minute window? Not the lab peak.
- What is the cooling design, and what is the silicon's steady-state junction temperature under worst-case load?
- What drift detector is running on each device, and what triggers an alert? Per-channel statistics, embedding-based, autoencoder — any of these, but something.
- What is the sensor maintenance cadence, and who owns it on the customer side?
- What samples does the device capture when the drift detector fires, and how do they reach the retraining pipeline?
- What does the OTA path look like in detail: A/B partitions, signed updates, health-check rollback, staged rollouts? Has it been exercised across at least 50 development cycles?
- What is the safe state if the inference path fails, and what triggers entering it? The model going dark is not a safe state if the host system is mid-action.
- What is the audit log retention, and what fields are captured per inference? Article 12 of the EU AI Act has views on this, and so does any team that has had to forensically reconstruct an incident.
- What is the worst-case time to recovery if a device fails in the field, including the manual retrieval path?
- Has a human-in-the-loop runbook been written for the first three production incidents, and has it been rehearsed?
The pattern across all twelve questions is the same. The model is one component of a system, and the system has to be designed for the failure modes that production will impose on it. The engineering work that produces good answers to these questions is the work that makes the difference between a pilot that ships and one that stalls — and it is the same work that the classification mistakes article describes from the regulatory side. Engineering rigour and conformity rigour are the same artefacts seen from two angles.
The system is the deliverable, not the model
The four failure modes here — quantisation regression, thermal throttling, sensor drift, OTA failure — are not exotic. They are the load-bearing problems that show up in nearly every Physical AI deployment I have walked into in the last three years. They are well known in the embedded and automotive worlds. They are unfamiliar to teams whose ML reflexes were formed in cloud environments, and that unfamiliarity is the cost gap between Physical AI engineering and cloud-AI engineering.
The remedy is the same in every case: design the whole system around the model instead of designing the model in isolation. The Physical AI Stack we publish (six layers covering hardware abstraction, real-time runtime, model artefacts, observability, governance, and the operational team) exists to make these failure modes addressable as engineering work rather than as production fire-drills. The pilot-to-production hardening engagements we run with industrial teams target exactly these four failure modes, sequenced against the 90-day clock that decides whether a pilot ever ships.
The model is the part the demo videos show. The system is the part that has to survive the next three years on a customer site. Build the second one, and the first one stops being where the project dies.
