2026-05-31
Clock tree synthesis tries to deliver the clock edge to every flip-flop at the same instant — zero skew is the ideal. But sometimes, deliberately making the clock arrive later at the receiving flip-flop is the only way to close timing. This trick is called useful skew, and it's one of the cleverest moves in the physical design playbook.
Consider two flip-flops in series with combinational logic between them. The launch flop (FF1) drives logic into the capture flop (FF2). The setup equation is:
T_clk ≥ T_clk-to-q + T_logic + T_setup − T_skew
where T_skew = T_clk_FF2 − T_clk_FF1. If you delay the clock to FF2 (positive skew), you effectively borrow time from the downstream path and give it to this one. The catch: that downstream path now has less time, because FF2 launches later into its next stage.
So useful skew is zero-sum. You're not creating time; you're moving slack from a path with surplus to a path that's failing. The CTS tool inserts a small buffer chain (a "skew buffer") into one branch of the clock tree to shift its arrival time. How much delay each cell adds depends on the process node and drive strength: roughly ten picoseconds or so at advanced nodes, up to 100 ps or more at older ones.
Real-world example: A 2 GHz CPU pipeline stage has a critical path through an ALU adder at 510 ps, but the period is only 500 ps. Setup fails by 10 ps. The next stage (a register file read) takes only 380 ps, leaving 120 ps of slack. The CTS tool inserts a 15 ps skew buffer on the clock to the ALU's capture flop. Now the ALU path has 515 ps available (passes with 5 ps margin), and the register-file path has 485 ps available (still passes, with 105 ps margin). Both paths now meet timing.
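The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the function name `setup_slacks` is made up for illustration, and it assumes each quoted path delay already includes clk-to-q and setup time, and that the clocks of the flops on either side of the skewed one are left alone.

```python
# Setup slack for two back-to-back pipeline stages when the clock of the
# flop between them is delayed. All times are in picoseconds.
# Assumption: each path delay already includes clk-to-q and setup time.

def setup_slacks(period_ps, upstream_delay_ps, downstream_delay_ps, skew_ps):
    """Return (upstream_slack, downstream_slack) in ps.

    skew_ps is the extra clock delay at the shared flop: it captures the
    upstream path later and launches the downstream path later.
    """
    upstream_slack = (period_ps + skew_ps) - upstream_delay_ps
    downstream_slack = (period_ps - skew_ps) - downstream_delay_ps
    return upstream_slack, downstream_slack


PERIOD = 500        # 2 GHz clock
ALU_PATH = 510      # ALU adder path from the example
REGFILE_PATH = 380  # register-file read path from the example

before = setup_slacks(PERIOD, ALU_PATH, REGFILE_PATH, skew_ps=0)
after = setup_slacks(PERIOD, ALU_PATH, REGFILE_PATH, skew_ps=15)

print("no skew:   ", before)  # (-10, 120)
print("15 ps skew:", after)   # (5, 105)
print("total slack unchanged:", sum(before) == sum(after))  # True
```

The last line is the zero-sum property in miniature: the 15 ps gained by the ALU path is exactly the 15 ps given up by the register-file path.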
The danger: useful skew eats into hold margin. If FF2's clock is delayed too much, FF1's fast data can race through the logic and arrive at FF2 before FF2's delayed clock edge has finished capturing the previous data. Rule of thumb: never apply more useful skew than (T_clk-to-q + T_logic_min − T_hold), where T_logic_min is the shortest path through the logic. Tools enforce this as a hold check at every corner.
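Here is a minimal sketch of that hold bound. The helper names and the numbers (40 ps clk-to-q, 30 ps minimum logic delay, 20 ps hold) are hypothetical, picked only to make the arithmetic easy to follow; a real tool takes these from the library and evaluates them per corner.

```python
# Upper bound on useful skew from the hold constraint, in picoseconds.
# All numeric values below are hypothetical, chosen only for illustration.

def max_useful_skew(t_clk_to_q_ps, t_logic_min_ps, t_hold_ps):
    """Largest capture-clock delay that still meets hold at FF2."""
    return t_clk_to_q_ps + t_logic_min_ps - t_hold_ps


def hold_slack(t_clk_to_q_ps, t_logic_min_ps, t_hold_ps, skew_ps):
    """Hold slack at FF2; negative means fast data races through."""
    return max_useful_skew(t_clk_to_q_ps, t_logic_min_ps, t_hold_ps) - skew_ps


# Hypothetical fast-corner numbers for the FF1 -> FF2 path.
T_CQ, T_LOGIC_MIN, T_HOLD = 40, 30, 20

limit = max_useful_skew(T_CQ, T_LOGIC_MIN, T_HOLD)
slack_15 = hold_slack(T_CQ, T_LOGIC_MIN, T_HOLD, 15)
slack_60 = hold_slack(T_CQ, T_LOGIC_MIN, T_HOLD, 60)

print("max useful skew:", limit)         # 50
print("hold slack at 15 ps:", slack_15)  # 35
print("hold slack at 60 ps:", slack_60)  # -10
```

With these made-up numbers, a 15 ps shift is comfortably safe, while a 60 ps shift would turn a setup fix into a hold violation.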
Modern CTS tools (Cadence Innovus, Synopsys ICC2) do this automatically during concurrent clock and data optimization (CCOpt in Innovus, CCD in ICC2), shaving 5–15% off the achievable clock period on tight designs without changing a single line of RTL.
