AUTOGATE: LLM-Driven Clock Gating for RTL Power

A new framework pairs machine-learning waveform analysis with LLM-based RTL rewriting to automate fine-grain clock gating, and its results show why the easy wins shrink fast on already-optimized production silicon.

Fine-grain clock gating is one of the oldest and most reliable tricks for cutting dynamic power in a digital chip: when a register or block is not doing useful work in a given cycle, you stop toggling its clock so it stops burning switching energy. The technique is well understood, but applying it well across a large design is still surprisingly manual — an engineer has to understand the workload, find the registers that idle, and rewrite the RTL to gate them without breaking functional correctness. A new arXiv preprint introduces AUTOGATE, which the authors describe as the first agentic framework for industry-grade RTL power optimization, and the interesting story is less the headline number than how the numbers shrink as the designs get more realistic.
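
The basic saving can be illustrated with a toy model. The sketch below is not from the preprint: it counts how many clock edges reach a single register with and without an enable-based gate, and the function name `clock_edges_delivered` and the sample enable pattern are made up for illustration.

```python
def clock_edges_delivered(enable, gated):
    """Count the cycles in which the register's clock pin actually toggles.

    enable: list of 0/1 flags, one per cycle; 1 means the register must
            capture new data in that cycle.
    gated:  if True, the clock reaches the register only when enable is 1.
    """
    if gated:
        return sum(1 for e in enable if e)
    return len(enable)


# Hypothetical activity: the register does useful work in 2 of 8 cycles.
enable = [1, 0, 0, 0, 1, 0, 0, 0]

ungated = clock_edges_delivered(enable, gated=False)
gated = clock_edges_delivered(enable, gated=True)
saved_fraction = 1 - gated / ungated

print(ungated, gated, saved_fraction)  # 8 2 0.75
```

This counts clock-pin activity only. Real dynamic power also depends on capacitance, voltage, and the overhead of the gating cell itself, so the 75% here is a ceiling for this toy pattern, not a power estimate.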

The two problems AUTOGATE sets out to solve are specific. First, large language models cannot ingest a waveform trace that spans millions of cycles — the trace is far too long for any context window, and most of it is uninformative. Second, naively letting an LLM rewrite a big hierarchical codebase risks both functional breakage and an unmanageable scale. AUTOGATE's answer is a division of labor: machine learning handles the data problem, and the LLM handles the code-transformation problem.

"Fine-grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual." (arXiv preprint 2606.17461)

That sentence states the gap the work targets. The technique is effective but labor-intensive, and labor-intensive optimization gets applied unevenly. Finding and gating idle registers is exactly the kind of repetitive, pattern-bound engineering work that an automated flow is well suited to attack, provided correctness is preserved.

ML to read the waveforms, an LLM to rewrite the RTL

The architecture is a machine-learning and LLM co-design that, in the authors' words, bridges waveform-level analysis and RTL rewriting. An ML-based clustering algorithm distills raw toggling traces into compact, structured representations. Instead of asking a language model to read millions of cycles of switching activity — which it cannot do — the clustering step summarizes which signals toggle together and when they idle, producing a small representation that can guide where clock gating will pay off. The LLM then uses that guidance to identify and apply gating opportunities in the source, rewriting the RTL rather than reasoning over the raw waveform.
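
To make the idea of distilling traces concrete, here is a minimal sketch under stated assumptions. It is not the authors' clustering algorithm, which the abstract does not specify. It reduces each per-cycle toggle trace to one active/idle bit per window, then greedily groups signals whose signatures match. The signal names, the window size, and the helper names `activity_signature` and `cluster_signals` are all hypothetical.

```python
def activity_signature(trace, window):
    """Reduce a per-cycle toggle trace (list of 0/1) to one bit per window:
    1 if the signal toggled at all in that window, 0 if it stayed idle."""
    return tuple(
        1 if any(trace[i:i + window]) else 0
        for i in range(0, len(trace), window)
    )


def cluster_signals(traces, window, max_distance=0):
    """Greedy clustering: a signal joins the first cluster whose
    representative signature differs in at most max_distance windows."""
    clusters = []  # list of (representative_signature, [signal names])
    for name in sorted(traces):
        sig = activity_signature(traces[name], window)
        for rep, members in clusters:
            if sum(a != b for a, b in zip(rep, sig)) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((sig, [name]))
    return clusters


# Hypothetical toggle traces: 1 means the signal changed value in that cycle.
traces = {
    "fifo_wr_ptr": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "fifo_data":   [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    "status_reg":  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
}

for signature, members in cluster_signals(traces, window=4):
    print(signature, members)
# (1, 0, 1) ['fifo_data', 'fifo_wr_ptr']
# (0, 1, 0) ['status_reg']
```

Twelve cycles per signal collapse to three bits per cluster, and the two FIFO signals end up in one group that idles in the middle window. A summary of that size is the kind of thing that could plausibly fit in a prompt where the raw trace could not.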

To handle scale, AUTOGATE uses a hierarchical multi-agent architecture that decomposes a large design into independently optimizable modules. This is the part of the design that addresses the second drawback the authors name — preserving correctness across deep design hierarchies — because each module can be optimized and checked in isolation before the results are coordinated across the hierarchy. The pattern of decomposing a problem so that no single agent has to hold the whole thing in context is a recurring theme in agentic engineering tools, and clock gating, with its naturally modular structure, is a reasonable fit.
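
One way to picture module-by-module optimization is a bottom-up walk over the design tree that keeps a rewrite only when a check accepts it. The sketch below is an illustration under assumptions, not AUTOGATE's architecture. The `Module` class, the `rewrite` callback (standing in for an LLM agent), and the `is_equivalent` callback (standing in for a real equivalence check) are all hypothetical stubs.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    rtl: str
    children: list = field(default_factory=list)


def optimize_hierarchy(module, rewrite, is_equivalent, log):
    """Optimize children first, then the module itself, and keep a rewrite
    only if the equivalence check accepts it."""
    for child in module.children:
        optimize_hierarchy(child, rewrite, is_equivalent, log)
    candidate = rewrite(module)
    if is_equivalent(module.rtl, candidate):
        module.rtl = candidate
        log.append((module.name, "accepted"))
    else:
        log.append((module.name, "rejected"))


# Stub rewrite: tag the RTL as gated. Stub check: reject anything marked
# "async", standing in for a rewrite that fails verification.
rewrite = lambda m: m.rtl + " +cg"
is_equivalent = lambda old, new: "async" not in new

top = Module("top", "top_rtl", [
    Module("alu", "alu_rtl"),
    Module("uart", "uart_rtl async"),
])
log = []
optimize_hierarchy(top, rewrite, is_equivalent, log)
print(log)
# [('alu', 'accepted'), ('uart', 'rejected'), ('top', 'accepted')]
```

A rejected module keeps its original RTL, so a failed rewrite in one leaf does not block the rest of the tree.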

The results, read honestly

The numbers are where this gets instructive. On the small-design suite, AUTOGATE reduces dynamic power by 49.31% on average — a large figure. On larger industrial-scale designs the gains are much smaller: 19.34% on NVDLA and 7.96% on BlackParrot, two well-known open-source hardware designs, and up to 6.86% on highly optimized proprietary production designs. The right way to read that gradient is not as a disappointment but as a reality check. Small designs have lots of low-hanging fruit because they were never aggressively power-tuned; a production chip that has already been through professional power optimization has had most of its easy gating done by hand, so an automated pass can only find what humans missed. A single-digit reduction on a heavily optimized production design is arguably a more impressive result than 49% on a toy, precisely because the baseline is so much harder to beat.

Several caveats apply, in keeping with the principle that announced is not shipping. This is a preprint, and dynamic power is one axis among several: clock gating adds gating logic and can affect timing and area, none of which the abstract quantifies. The single-digit production-design figure is reported as "up to," meaning it is a best case rather than a typical one. And the load-bearing claim across all of this is functional correctness: a power reduction that comes at the cost of a behavioral change is no saving at all, so the value of the whole approach rests on the verification that the multi-agent decomposition is designed to enable. The abstract asserts correctness preservation as a design goal; independent evaluation is where that claim gets tested.

It is worth dwelling on why clock gating in particular is a workload-dependent optimization, because that is what makes the ML half of the design necessary rather than decorative. Whether a register can be safely gated in a given cycle depends on whether it is doing useful work, and that depends entirely on what the chip is running. A block that idles during one workload may toggle constantly during another, so a gating decision made without reference to representative activity is a guess. That is the role of the toggling traces: they are the empirical record of what actually switches, and when. The challenge is that those traces are enormous — millions of cycles across thousands of signals — and most of the information in them is redundant. The clustering step is what turns that flood into something compact enough to reason about, which is precisely why the authors say the LLM never has to process raw waveform data directly. The architecture is, in effect, an admission that the two halves of the problem demand different tools: statistics for the data, language modeling for the code.
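
The workload dependence is easy to see with a toy calculation. The sketch below, using made-up register names, workload names, and enable traces, computes the fraction of idle cycles per register per workload. It illustrates the argument above and is not taken from the preprint.

```python
def idle_fraction(enable):
    """Fraction of cycles in which the register is not asked to update."""
    return enable.count(0) / len(enable)


def idle_by_workload(activity):
    """Map each register to its idle fraction under each workload."""
    return {
        reg: {name: idle_fraction(trace) for name, trace in workloads.items()}
        for reg, workloads in activity.items()
    }


# Hypothetical enable traces (1 = register updates) for two workloads.
activity = {
    "dma_addr": {"boot": [1, 0, 0, 0], "stream": [0, 0, 1, 0]},
    "mac_acc":  {"boot": [0, 0, 0, 0], "stream": [1, 1, 1, 1]},
}

print(idle_by_workload(activity))
# {'dma_addr': {'boot': 0.75, 'stream': 0.75}, 'mac_acc': {'boot': 1.0, 'stream': 0.0}}
```

Judged on the `boot` trace alone, `mac_acc` looks like the best gating target in the design. Under `stream` it never idles, so gating it would add logic with no payoff for that workload.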

The broader signal is that EDA — the unglamorous tooling layer that turns intent into silicon — is becoming a serious target for ML-plus-LLM automation, and clock gating is a sensible beachhead because it is modular, well-defined, and verifiable. The realistic takeaway from AUTOGATE is the shape of its curve: automated optimization delivers its biggest wins exactly where human effort has been thinnest, and shrinks to the margins on the production silicon that has already been worked over. For teams deciding where to point these tools, that gradient is the actual finding. Follow the tool, not the chip: the leverage in power optimization increasingly sits in the flow, and how far it scales onto already-tuned production designs is the question that matters.
