Linear Ensembles Erase LLM Watermarks: The Fragility of Distributional Perturbations in Production AI Systems

Why multi-model access breaks statistical watermarking and what it means for enterprise AI governance, compliance, and security

TL;DR

Linear ensembles of just two models reduce watermark detection rates from >99% to <5%, erasing statistical signatures through distributional averaging ("Linear Ensembles Wash Away Watermarks").
Physical AI systems (robotics, edge AI, multi-agent workflows) are high-risk: ensembles emerge naturally from redundancy, fallback models, and sensor fusion.
EU AI Act and NIST AI RMF compliance is at risk: watermarking schemes assume single-model access, but production systems are inherently multi-model.
Mitigation requires trade-offs: cryptographic watermarking survives ensembles but adds hardware dependencies; adaptive schemes improve robustness but increase latency.

Introduction: The Watermarking Paradox in the Age of Model Proliferation

The rapid democratization of large language models (LLMs) has created an urgent governance challenge: how do we reliably distinguish AI-generated content from human-authored text? Watermarking emerged as the leading technical solution, embedding imperceptible statistical signatures into token distributions to enable post-hoc detection. Early schemes like red-green lists (Kirchenbauer et al., 2023) and exponential minimum sampling (Aaronson, 2023) demonstrated near-perfect detection rates (>99% true positive rate at <1% false positive rate) under controlled conditions. By 2025, watermarking had transitioned from academic curiosity to enterprise mandate, with the EU AI Act (Article 50 in the adopted text; Article 52 in the original proposal) explicitly requiring "technical measures to identify AI-generated content" and NIST's AI Risk Management Framework (AI RMF 1.0) recommending watermarking as a core transparency mechanism for high-risk systems (NIST AI RMF).

Yet this governance success story contains a critical flaw: watermarking schemes assume single-model access. In practice, modern AI systems rarely expose a single model. Instead, they deploy linear ensembles—weighted combinations of multiple LLMs—to optimize for cost, latency, redundancy, and specialization. A production system might route queries to:

A 7B-parameter model for low-latency edge inference (e.g., on NVIDIA Jetson Orin)
A 70B-parameter model for high-accuracy cloud inference
A fine-tuned specialist model for domain-specific tasks (e.g., legal or medical)
A fallback model when primary systems are unavailable

When users access these models concurrently (e.g., via API load balancing) or sequentially (e.g., via agentic workflows), the resulting text is effectively drawn from a linear combination of watermarked distributions rather than from any single one. The research presented in "Linear Ensembles Wash Away Watermarks" demonstrates that this trivial operation, averaging token logits, erases watermarks with near-certainty. A single linear ensemble of just two models reduces detection rates from >99% to <5%, even when the watermarking schemes are otherwise robust to paraphrasing, translation, and adversarial attacks.

The Physical AI Stack: Where Watermark Fragility Becomes a Safety Risk

This vulnerability is not merely an academic concern—it directly impacts Physical AI systems where watermarking is increasingly deployed for safety, compliance, and traceability. Consider the Physical AI Stack:

[Diagram: the Physical AI Stack, including the SENSE, COMPUTE, REASON, ORCHESTRATE, and ACT layers referenced below]

In this stack, watermarking is often applied at the REASON layer (e.g., to trace LLM-generated action plans) or the ACT layer (e.g., to audit robotic commands). However, linear ensembles are ubiquitous in Physical AI:

Edge-Cloud Hybrid Inference (SENSE → COMPUTE → REASON)
- A robotics system might use a small on-device model (e.g., 7B parameters on Jetson Orin) for real-time obstacle avoidance and a large cloud model (e.g., 70B parameters) for high-level planning.
- The final action plan is a weighted combination of both models' outputs, erasing watermarks.
Multi-Agent Orchestration (ORCHESTRATE → REASON)
- A manufacturing cell might deploy specialized agents (e.g., one for quality inspection, one for predictive maintenance).
- The orchestrator (e.g., ROS 2 or Kubernetes) merges their outputs into a unified command stream, destroying watermark signals.
Fallback and Redundancy (COMPUTE → REASON → ACT)
- If the primary model fails (e.g., due to network latency), a fallback model takes over.
- The resulting text is a mixture of two watermarked distributions, which makes detection against either model's key unreliable.

The Watermarking Paradox: Governance vs. Reality

The core paradox is this: watermarking schemes are designed for a world where users interact with a single model, but production systems are inherently multi-model. This mismatch creates three critical failure modes:

False Negatives in Compliance Audits
- Under the EU AI Act, high-risk AI systems must "enable the identification of AI-generated content" (EU AI Act, Article 50 in the adopted text; Article 52 in the original proposal).
- A manufacturing robot using a linear ensemble of two watermarked models would produce undetectable outputs, violating compliance despite good-faith efforts.
Safety Risks in Physical AI
- Watermarking is often used to trace the origin of robotic commands (e.g., to debug failures or assign liability).
- If a linear ensemble erases the watermark, root cause analysis becomes impossible, creating safety blind spots in autonomous systems.
Adversarial Exploitation
- Attackers can bypass watermarking by querying multiple models and averaging their token-level outputs (logits or log-probabilities, where the API exposes them).
- This is far cheaper and more reliable than adversarial attacks like paraphrasing or token substitution.

The Timeline: From Academic Curiosity to Production Crisis

The evolution of LLM watermarking and its collision with linear ensembles can be traced through four distinct phases:

[Diagram: timeline of the four phases of LLM watermarking]

The Core Vulnerability: Why Linear Ensembles Break Watermarks

To understand why linear ensembles are so effective at erasing watermarks, we must examine how watermarking schemes work at the token distribution level. Most schemes operate by perturbing the logits of the LLM's output distribution. For example:

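Red-Green Lists (Kirchenbauer et al., 2023): At each step, the vocabulary is pseudorandomly partitioned into a "green" (favored) list and a "red" (disfavored) list, seeded by the preceding token(s). During generation, the logits of green tokens are boosted by a fixed bias (e.g., +2.0), making them more likely to be sampled. A detector that knows the seed counts green tokens and flags text with a statistically unlikely excess.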
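Exponential Minimum Sampling (Aaronson, 2023): A keyed pseudorandom function assigns each candidate token $i$ a value $r_i \in [0, 1]$ at each token position, and the sampler picks the token that maximizes $r_i^{1/p_i}$, where $p_i$ is the model's probability for that token. The output distribution is unchanged on average, but the chosen tokens correlate with the key, creating a detectable statistical bias.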
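To make the green-list mechanism concrete, here is a minimal sketch in plain Python. It follows the Kirchenbauer et al. convention in which green tokens receive the bias. Everything else is an assumption made for illustration: the toy vocabulary size, the random Gaussian logits standing in for a real LLM, and the helper names `green_list`, `generate`, and `z_score` are all hypothetical and not taken from any paper or library.

```python
import hashlib
import math
import random

VOCAB_SIZE = 200  # hypothetical toy vocabulary
GAMMA = 0.5       # fraction of the vocabulary on the green list
DELTA = 2.0       # logit bias added to green tokens


def green_list(prev_token: int, key: str) -> set[int]:
    """Derive this position's green list from the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def sample_token(logits: list[float], rng: random.Random) -> int:
    """Sample one token id from softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(key: str, length: int, rng: random.Random, delta: float = DELTA) -> list[int]:
    """Generate tokens from a stand-in model (random logits) with a green-list bias."""
    tokens = [0]  # fixed start token
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        for token_id in green_list(tokens[-1], key):
            logits[token_id] += delta
        tokens.append(sample_token(logits, rng))
    return tokens


def z_score(tokens: list[int], key: str) -> float:
    """One-proportion z-test on the number of green tokens in the sequence."""
    n = len(tokens) - 1
    hits = sum(tokens[i] in green_list(tokens[i - 1], key) for i in range(1, len(tokens)))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    rng = random.Random(0)
    text = generate("secret-key", 200, rng)
    print(z_score(text, "secret-key"))  # large and positive for watermarked text
    print(z_score(text, "other-key"))   # close to zero under the wrong key
```

Detection here reduces to a threshold on the z-score, which is why anything that lowers the share of green tokens in the output weakens the detector.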

The critical insight from "Linear Ensembles Wash Away Watermarks" is that these perturbations are additive in logit space. When two watermarked models are combined via a linear ensemble, the resulting logits are:

$$\text{logits}_{\text{ensemble}} = \alpha \cdot \text{logits}_{\text{model1}} + (1 - \alpha) \cdot \text{logits}_{\text{model2}}$$

where $\alpha \in [0, 1]$ is the ensemble weight. The watermark signal, a fixed bias added to a key-dependent subset of tokens, is diluted by the averaging operation: a token favored by only one model keeps just that model's share of the bias ($\alpha b$ or $(1 - \alpha) b$ for a bias $b$). The paper formalizes this intuition with a theoretical bound: for any watermarking scheme that adds a fixed bias $b$ to a subset of tokens, the detection rate $D$ for a linear ensemble of $k$ models satisfies:

$$D \leq \frac{1}{2} + \frac{1}{2} \cdot \text{erf}\left(\frac{b \sqrt{k}}{2 \sigma}\right)$$

where $\sigma$ is the standard deviation of the logits under the null hypothesis (no watermark). For $k=2$ and typical values of $b$ and $\sigma$, the paper reports that detection rates collapse to <5% ("Linear Ensembles Wash Away Watermarks").
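The dilution effect can be worked through in closed form for a deliberately simple case. The sketch below is not the paper's analysis. It assumes flat base logits, two green-list models with independent green lists covering a fraction `gamma` of the vocabulary, and the same bias `delta` (the $b$ above) in both. The function names are hypothetical. Under those assumptions, a token that is green for model 1 only keeps a bias of `alpha * delta` after averaging, and the expected share of model 1's green tokens follows from a four-way softmax.

```python
import math


def green_fraction(alpha: float, delta: float, gamma: float = 0.5) -> float:
    """Expected share of sampled tokens on model 1's green list after logit averaging.

    Toy assumptions (not from the paper): flat base logits, two models with
    independent green lists of size gamma, and the same bias delta in both.
    """
    # (vocabulary share, bias after averaging, counts as green under model 1's key)
    groups = [
        (gamma * gamma, delta, True),                       # green for both keys
        (gamma * (1 - gamma), alpha * delta, True),         # green for key 1 only
        ((1 - gamma) * gamma, (1 - alpha) * delta, False),  # green for key 2 only
        ((1 - gamma) ** 2, 0.0, False),                     # red for both keys
    ]
    total = sum(share * math.exp(bias) for share, bias, _ in groups)
    green = sum(share * math.exp(bias) for share, bias, is_green in groups if is_green)
    return green / total


def expected_z(fraction: float, n_tokens: int, gamma: float = 0.5) -> float:
    """Expected z-score of a green-token count test over n_tokens tokens."""
    return (fraction - gamma) * math.sqrt(n_tokens / (gamma * (1 - gamma)))


if __name__ == "__main__":
    for alpha in (1.0, 0.7, 0.5, 0.0):
        frac = green_fraction(alpha, delta=2.0)
        print(alpha, round(frac, 3), round(expected_z(frac, n_tokens=200), 2))
```

With `delta = 2.0` and `gamma = 0.5`, the green share under model 1's key falls from about 0.88 for the single model (`alpha = 1.0`) to about 0.73 for an equal-weight ensemble (`alpha = 0.5`), and to chance level (0.5) when model 1 has no weight. This toy setting only illustrates the direction of the effect. It does not reproduce the detection rates reported in the paper, which depend on real model distributions and the detector's threshold.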

Real-World Implications: Where Linear Ensembles Are Inevitable

Linear ensembles are not a theoretical edge case—they are the default architecture in modern AI systems. Below are three high-impact scenarios where watermark fragility becomes a critical risk:

1. Edge-Cloud Hybrid Robotics

Use Case: A warehouse robot uses a 7B-parameter model on-device (Jetson Orin) for real-time navigation and a 70B-parameter model in the cloud for high-level task planning.
Ensemble Mechanism: The final action plan is a weighted average of both models' outputs (e.g., 70% cloud, 30% edge).
Watermark Failure: The cloud model's watermark is diluted by the edge model's output, making detection unreliable ("Linear Ensembles Wash Away Watermarks").
Safety Risk: If the robot causes an accident, liability cannot be assigned because the watermark is erased.

2. Multi-Agent Orchestration

Use Case: A manufacturing cell deploys three specialized agents:
1. A quality inspection agent (fine-tuned for defect detection)
2. A predictive maintenance agent (fine-tuned for equipment monitoring)
3. A task planning agent (general-purpose LLM)
Ensemble Mechanism: The orchestrator (e.g., ROS 2) merges their outputs into a unified command stream.
Watermark Failure: Each agent's watermark is averaged out in the final command ("Linear Ensembles Wash Away Watermarks").
Compliance Risk: The system violates **EU Machinery Regulation (EU