Heavy ZK: Circuit Anatomy and Prover Optimization for Shielded NAVCoin Swaps

Companion to the private NAV OTC swaps design and the shielded-swap proven-live record. This post dissects the circuit that makes private NAVCoin swaps work, analyzes why proving is slow on the current devnet prover path, and lays out a benchmark-driven optimization roadmap toward mainnet-grade proving performance.

In this design, NAVCoins are assets that trade to NAV. The circuit treats each NAVCoin as an asset identity committed inside an Orchard-style note, then proves conservation and spend validity without revealing asset identity, value, or owner.

Scope and evidence level

Before the optimization story, here is the current status precisely:

The shielded swap circuit is implemented and has been proven live on a 6-validator WAN devnet with a single a651 ↔ pfUSDC shielded swap.
The implementation has gone through internal code review; 10 findings were fixed.
The design spec has been TIH-reviewed and rewritten, but the implementation has not completed an external third-party audit.
It is not mainnet-deployed, not load-tested, and not production-audited.
CPU speedup numbers below are now measured on this circuit and this 32-vCPU host. GPU numbers remain targets/projections until an ICICLE-Halo2 prover is integrated and benchmarked.

The engineering conclusion changed after measurement: the original minutes-level path was mostly repeated key generation, not proof verification. After in-process key caching and reducing the circuit from K = 16 to K = 15, the best measured CPU hot path is about 5.8 seconds to prove and 66 ms to verify. That is close to, but not under, the <5s CPU target.

The circuit at a glance

The shielded NAVCoin swap circuit is a Halo2 zk-SNARK over the Pallas/Vesta Pasta curve suite. It originally shipped at K = 16; the optimization sprint proved that it fits safely at K = 15:

2^15 = 32,768 circuit rows

For comparison, the Zcash Orchard action circuit in the orchard crate is built at K = 11, or 2,048 rows, so this swap circuit is 16 times larger in rows at K = 15. Row count alone is not a complete proving-time predictor (column count, lookup density, commitment scheme, backend implementation, CPU cache behavior, and parallelism all matter), but the comparison is still informative: both circuits use the same Pasta curves and Halo2 proving system, and a 2^15-row circuit remains small by Halo2 standards.

The circuit is designed to prove, in zero knowledge, that a shielded swap is valid:

two input notes are spent from an anchored commitment tree;
two output notes are created;
per-asset value conservation holds;
the spender is authorized;
nullifiers are derived correctly;
action data is bound to the chain domain;
public nullifiers and output commitments are distinct;

without revealing which assets, how much value, or which spend authority is involved.

Circuit anatomy: every constraint

1. Sinsemilla note commitments, `×4`: 2 inputs + 2 outputs

Each note carries a 1597-bit commitment message segmented into Sinsemilla pieces:

pool_domain || asset_tag_lo || asset_tag_hi || g_d || pk_d || value || rho || psi

The message is packed as six 250-bit pieces plus one final partial piece, committed under the domain:

postfiat.asset_orchard.note_commit.v1
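The packing arithmetic above can be checked with a few lines of Rust. This is a minimal sketch of the piece-length calculation only; `piece_lengths` is a hypothetical helper, not a function from the codebase. Sinsemilla itself pads its input to a multiple of its 10-bit chunk size, and how the final piece is padded here is not covered by this sketch.

```rust
/// Split a message of `total_bits` into pieces of at most `piece_bits` bits.
/// Returns the bit length of every piece, in order.
/// Hypothetical helper for illustration; not taken from the codebase.
fn piece_lengths(total_bits: usize, piece_bits: usize) -> Vec<usize> {
    assert!(piece_bits > 0, "piece size must be positive");
    // Full pieces first.
    let mut lengths = vec![piece_bits; total_bits / piece_bits];
    // Then one partial piece if the message does not divide evenly.
    let remainder = total_bits % piece_bits;
    if remainder > 0 {
        lengths.push(remainder);
    }
    lengths
}
```

For the 1597-bit message and 250-bit pieces, `piece_lengths(1597, 250)` yields six entries of 250 followed by a final 97-bit piece.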

The circuit re-derives each note commitment from private witness data:

asset_id, value, rho, psi, recipient

and wires the resulting commitment into the rest of the circuit. For input notes, the derived commitment feeds the Merkle path and nullifier derivation. For output notes, the derived commitment is constrained to the public output commitment in the instance.

This is the asset-typing extension. Standard Orchard notes bind recipient, value, rho, and psi. These notes additionally bind:

asset_tag_lo || asset_tag_hi

where the asset tag is a 256-bit SHA3-384-truncated commitment to the asset identity, split into two 128-bit limbs. That asset binding is what lets the circuit enforce per-asset conservation without revealing the asset.
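The two-limb split exists because a 256-bit value does not fit in a single Pallas base field element, which is about 255 bits wide. A minimal sketch of the split follows. The byte order (little-endian, low limb first) and the function names are assumptions made for illustration; the source does not specify the encoding.

```rust
/// Split a 32-byte asset tag into (lo, hi) 128-bit limbs.
/// Assumption: little-endian byte order with the low limb in bytes 0..16.
fn split_asset_tag(tag: [u8; 32]) -> (u128, u128) {
    let mut lo = [0u8; 16];
    let mut hi = [0u8; 16];
    lo.copy_from_slice(&tag[..16]);
    hi.copy_from_slice(&tag[16..]);
    (u128::from_le_bytes(lo), u128::from_le_bytes(hi))
}

/// Inverse of `split_asset_tag`: rebuild the 32-byte tag from its limbs.
fn join_asset_tag(lo: u128, hi: u128) -> [u8; 32] {
    let mut tag = [0u8; 32];
    tag[..16].copy_from_slice(&lo.to_le_bytes());
    tag[16..].copy_from_slice(&hi.to_le_bytes());
    tag
}
```

Each limb is then small enough to be range-checked to 128 bits and absorbed into the commitment message as its own field.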

2. Merkle anchor verification, `×2`: both inputs

Each input note must be proven to exist in the 32-layer Orchard commitment tree. The circuit recomputes the root from the note commitment, note position, and authentication path, then constrains it to equal the public anchor:

for each input note:
  compute_root(cmx, position, auth_path[32]) == public_anchor

This is the “you cannot spend a note that was not in the anchored tree” constraint.
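The root recomputation can be sketched outside the circuit as follows. `toy_hash` is a deliberately non-cryptographic stand-in for the real Merkle hash, and `u64` nodes stand in for field elements; both are assumptions made so the example stays self-contained. Only the control flow (walk up the tree, using each position bit to choose left or right) mirrors the constraint.

```rust
/// Toy two-to-one hash standing in for the real Merkle hash.
/// NOT cryptographic; used only to make the example self-contained.
fn toy_hash(layer: u32, left: u64, right: u64) -> u64 {
    let mut h = 0x9E37_79B9_7F4A_7C15u64 ^ u64::from(layer);
    for x in [left, right] {
        h = (h ^ x).wrapping_mul(0x0000_0100_0000_01B3);
        h ^= h >> 29;
    }
    h
}

/// Recompute the root from a leaf, its position, and its sibling path.
/// `auth_path[0]` is the sibling at the leaf level.
fn compute_root(leaf: u64, position: u32, auth_path: &[u64]) -> u64 {
    assert!(auth_path.len() <= 32, "position only carries 32 bits");
    let mut node = leaf;
    for (layer, sibling) in auth_path.iter().enumerate() {
        // Bit `layer` of the position says whether `node` is a right child.
        let is_right = (position >> layer) & 1 == 1;
        node = if is_right {
            toy_hash(layer as u32, *sibling, node)
        } else {
            toy_hash(layer as u32, node, *sibling)
        };
    }
    node
}
```

The circuit performs the same walk over 32 layers and then constrains the result to equal the public anchor, so a wrong leaf, position, or sibling produces a different root and an unsatisfied constraint.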

3. Nullifier derivation, `×2`

Each spent note produces a public nullifier. The circuit derives the nullifier from the nullifier deriving key (nk), rho, and the note commitment, using the nullifier Poseidon domain, then constrains it to equal the public nullifier:

nf = Poseidon^nullifier_domain(nk, rho, cmx) == public_nf

Consensus maintains the nullifier set. A reused nullifier is rejected.
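The consensus-side rule is simple enough to sketch directly. The type below is a hypothetical illustration, not the consensus implementation, and the 32-byte nullifier encoding is an assumption.

```rust
use std::collections::HashSet;

/// Minimal nullifier set: a spend is accepted only if its nullifier is new.
/// Hypothetical type; nullifiers are assumed to be 32-byte encodings.
#[derive(Default)]
struct NullifierSet {
    seen: HashSet<[u8; 32]>,
}

impl NullifierSet {
    /// Records the nullifier and returns true if it was unseen.
    /// Returns false for a reused nullifier, i.e. a double spend.
    fn try_spend(&mut self, nf: [u8; 32]) -> bool {
        self.seen.insert(nf)
    }
}
```

Because the nullifier is derived deterministically from the note, spending the same note twice always produces the same nullifier, and the second attempt is rejected without revealing which note it was.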

4. Per-asset value conservation

This is the core swap constraint.

For each asset present in the inputs, the sum of input values for that asset must equal the sum of output values for that same asset:

for each asset a in inputs:
  Σ(input_value where asset == a) == Σ(output_value where asset == a)

The current layout implements this with a permutation/select gate controlled by q_conservation, enforcing that the two input legs map to the two output legs with value conservation. Combined with the asset-tag binding inside the note commitment, this prevents:

value inflation;
swapping one asset tag for another;
mismatched input/output assets;
merge/split attacks across different asset identities.

The equality relationships are proven over private asset tags. The public verifier sees only commitments, nullifiers, anchors, and action-binding data.
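In the clear, the conservation rule reduces to comparing per-asset totals. The sketch below checks it over plaintext legs; it illustrates the relation being proven, not the gate layout, and the `Leg` type and function names are hypothetical. In the circuit the same relation is enforced over private witnesses.

```rust
use std::collections::HashMap;

/// A note leg in the clear: (asset tag, value). Hypothetical type for
/// illustration; in the circuit both fields are private witnesses.
type Leg = ([u8; 32], u64);

/// Sum values per asset tag, using u128 accumulators to avoid u64 overflow.
fn totals(legs: &[Leg]) -> HashMap<[u8; 32], u128> {
    let mut sums = HashMap::new();
    for (tag, value) in legs {
        *sums.entry(*tag).or_insert(0u128) += u128::from(*value);
    }
    sums
}

/// Conservation holds when the per-asset input totals equal the
/// per-asset output totals, asset by asset.
fn conserves(inputs: &[Leg], outputs: &[Leg]) -> bool {
    totals(inputs) == totals(outputs)
}
```

A two-party swap passes this check because each asset only changes owner: one party's asset-A note and the other's asset-B note go in, and an asset-A note and an asset-B note of the same values come out to the opposite recipients.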

5. Spend authorization

Each input’s spend authority is proven through Orchard-style key relationships and RedPallas randomized verification keys:

rk = ak + [α]G         (randomized verification key)
pk_d = [ivk] · g_d     (diversified public key)
ivk = Commit^ivk_rivk(ak, nk)  (incoming viewing key)

The circuit proves the relevant ECC/key-derivation relationships hold for the note being spent. Consensus verifies RedPallas signatures over H_sig, the spend-authority hash that binds the full action transcript.

Together, the circuit and consensus signature check prove that the note owner authorized the spend without revealing the spend key.

6. Action binding: `H_action` + `swap_binding_hash`

The design uses a two-layer binding model, with the Halo2 verifier check tying the two layers together:

Layer 1, consensus: consensus recomputes swap_binding_hash, H_action, and H_sig from canonical action fields.
Layer 2, circuit: the circuit re-derives H_action from the public instance and constrains it to instance rows 17/18.
Verifier check: the Halo2 verifier checks the proof against the exact public instance constructed by consensus.

consensus recomputes: pool_domain, eo_hash, H_action, swap_binding_hash, H_sig
circuit constrains:   H_action(public_instance) == instance[17..18]
verifier checks:      proof verifies against the consensus-constructed instance

Changing any action field — anchor, nullifier, randomized verification key, output commitment, fee, encrypted output, or domain data — changes the public instance. The old proof no longer verifies. Replaying across chains fails because the domain binding changes pool_domain and therefore the action hash.

7. Public distinctness

The circuit enforces public distinctness for nullifiers and output commitments:

nf_old[0] != nf_old[1]       (no same note spent twice in one action)
cmx_new[0] != cmx_new[1]     (no duplicate output commitment)

This is enforced with a nonzero-difference gate. It is defense-in-depth alongside consensus-level nullifier-set and commitment-set duplicate checks.
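A nonzero-difference gate relies on a field fact: a - b has a multiplicative inverse exactly when a != b. The prover supplies the inverse as a witness and the gate checks (a - b) * inv == 1. The sketch below shows that relation over a toy prime field (2^61 - 1) standing in for the Pallas base field; the modulus and all function names are assumptions chosen to keep the example self-contained.

```rust
/// Toy prime modulus (2^61 - 1) standing in for the Pallas base field.
const P: u128 = (1u128 << 61) - 1;

/// Field multiplication; operands are reduced first, so the product fits in u128.
fn mul(a: u128, b: u128) -> u128 {
    (a % P) * (b % P) % P
}

/// Modular exponentiation by squaring.
fn pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1u128;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Prover side: the witness is the inverse of (a - b), via Fermat's little
/// theorem. No witness exists when a == b.
fn diff_inverse(a: u128, b: u128) -> Option<u128> {
    let diff = (a % P + P - b % P) % P;
    if diff == 0 { None } else { Some(pow(diff, P - 2)) }
}

/// Gate side: the constraint (a - b) * inv == 1.
fn gate_holds(a: u128, b: u128, inv: u128) -> bool {
    let diff = (a % P + P - b % P) % P;
    mul(diff, inv) == 1
}
```

When the two values are equal the difference is zero, no choice of `inv` can satisfy the gate, and the proof cannot be generated.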

8. Range checks + nonzero gates

The circuit enforces:

Asset tag nonzero: the asset tag pair (asset_tag_lo, asset_tag_hi) is constrained not to be all-zero. The current layout allocates inverse-gate checks over the tag limbs to enforce this nonzero predicate.
Value nonzero: each swap value must be nonzero.
128-bit range checks on asset-tag limbs via lookup range tables.
64-bit range checks on values.

Together, the value range check plus nonzero check gives:

1 <= value <= 2^64 - 1

9. Domain binding

The circuit binds the proof to the intended chain and circuit domain:

chain_id
genesis_hash
protocol_version
pool_id      = asset-orchard-v1
circuit_id   = shielded_swap.asset_conservation.v1
note_version

These domain values are absorbed into H_action as constants through fixed-column domain-tag gates. A proof from a different chain, genesis, pool, circuit, or note version fails verification against the consensus-constructed instance.

Constraint count summary

Component	Approximate row contribution
Sinsemilla note commitment, `×4`	~8,000 rows
Merkle path verification, `×2`, depth 32	~3,500 rows
Poseidon nullifier derivation, `×2`	~1,200 rows
ECC spend authorization, `×2`	~2,000 rows
Conservation + distinctness	~200 rows
Range checks + nonzero	~500 rows
`H_action` / `swap_binding_hash`	~800 rows
Total approximate usage	~16,000–18,000 rows

The circuit now fits inside K = 15:

2^15 = 32,768 rows

The sprint tested the next lower size as well. K = 14 fails with NotEnoughRowsAvailable, so K = 15 is the smallest viable parameter set without reducing constraints.
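The relationship between row usage and K can be sketched as below. The component figures are the approximate estimates from the table above; `reserved_rows` is a hypothetical allowance for rows the layout cannot use (Halo2 reserves some rows per column for blinding, and layouts carry slack). The measured NotEnoughRowsAvailable failure at K = 14 is the authoritative result; this sketch only shows why estimates this close to 2^14 = 16,384 leave no margin.

```rust
/// Approximate per-component row estimates from the summary table.
const COMPONENT_ROWS: [(&str, u64); 7] = [
    ("sinsemilla_note_commit_x4", 8_000),
    ("merkle_path_x2", 3_500),
    ("poseidon_nullifier_x2", 1_200),
    ("ecc_spend_auth_x2", 2_000),
    ("conservation_distinctness", 200),
    ("range_nonzero", 500),
    ("action_binding", 800),
];

/// Sum of the approximate component estimates.
fn estimated_rows() -> u64 {
    COMPONENT_ROWS.iter().map(|(_, rows)| *rows).sum::<u64>()
}

/// Smallest k such that 2^k >= used_rows + reserved_rows.
/// `reserved_rows` is a hypothetical allowance, not a measured value.
fn min_k(used_rows: u64, reserved_rows: u64) -> u32 {
    let needed = used_rows + reserved_rows;
    let mut k = 0u32;
    while (1u64 << k) < needed {
        k += 1;
    }
    k
}
```

The raw estimate sits just under 2^14, so a couple of hundred unusable rows, or the upper end of the 16,000 to 18,000 range, is enough to push the circuit to K = 15.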

Why proving is slow: the honest analysis

The original devnet prover path took minutes per proof on the stock CPU path. That was slow for a K = 16 circuit, and the sprint measured why.

The main reason was repeated key generation. The one-shot CLI path rebuilt the full proving key and verifying key around each proof:

K=16 cold path:
  proving-key build   341,879 ms
  proof generation     10,515 ms
  verifying-key build  18,081 ms
  proof verification       88 ms

The proof itself was seconds, not minutes. The key build made the operator-visible flow feel like minutes.
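The share of the cold path spent on key construction follows directly from the four measured stages. The sketch below uses only the K = 16 numbers listed above; the struct and method names are hypothetical.

```rust
/// Timings for one cold invocation, in milliseconds.
struct ColdPath {
    pk_build_ms: u64,
    prove_ms: u64,
    vk_build_ms: u64,
    verify_ms: u64,
}

impl ColdPath {
    /// Sum of the four listed stages.
    fn total_ms(&self) -> u64 {
        self.pk_build_ms + self.prove_ms + self.vk_build_ms + self.verify_ms
    }

    /// Fraction of the listed time spent building keys rather than proving
    /// or verifying.
    fn keygen_fraction(&self) -> f64 {
        (self.pk_build_ms + self.vk_build_ms) as f64 / self.total_ms() as f64
    }
}

/// Measured K = 16 cold path from the listing above.
const K16_COLD: ColdPath = ColdPath {
    pk_build_ms: 341_879,
    prove_ms: 10_515,
    vk_build_ms: 18_081,
    verify_ms: 88,
};
```

For these numbers `K16_COLD.keygen_fraction()` is about 0.97: roughly 97% of the listed cold-path time is key construction, which is why caching keys matters more than any prover tuning for the operator-visible latency.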

Within proof generation itself, the three main bottlenecks are:

MSM: multi-scalar multiplication;
FFT: polynomial evaluation/interpolation;
field arithmetic: hot-loop Pasta field operations.

After key caching and K = 15, the measured hot path is:

K=15 hot path:
  proof generation      5,780 ms
  proof verification       66 ms
  proof bytes           6,816

Bottleneck 1: serial or under-parallelized MSM

MSM is usually the dominant prover cost, often on the order of 60–80% of proving time in PLONKish systems, depending on circuit shape and commitment scheme.

If MSMs run on one core, a 16-core or 32-core machine is mostly idle. Mature proving backends use parallel Pippenger-style MSM implementations, split work across cores, and tune bucket accumulation and memory layout.

The measurement corrected an earlier suspicion: stock halo2_proofs = "0.3.2" in this workspace does expose and enable multicore through maybe-rayon/Rayon. A single-thread control proved it:

K=16 default Rayon prove_ms        10,515
K=16 RAYON_NUM_THREADS=1 prove_ms  69,389
measured prove speedup              6.60x

Multicore is real and material. It is not sufficient by itself.
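One way to read "not sufficient by itself" is parallel efficiency. The sketch below computes the speedup from the two measured timings and, under an Amdahl's-law model, the implied serial fraction. Treating the 32 vCPUs as 32 equal threads is an assumption, and Amdahl's law is a crude model that ignores memory bandwidth and scheduling effects, so the result is an indication rather than a profile.

```rust
/// Speedup of the multi-threaded run over the single-threaded run.
fn speedup(single_ms: f64, multi_ms: f64) -> f64 {
    single_ms / multi_ms
}

/// Parallel efficiency: achieved speedup divided by thread count.
fn efficiency(speedup: f64, threads: f64) -> f64 {
    speedup / threads
}

/// Serial fraction s implied by Amdahl's law, S = 1 / (s + (1 - s) / n),
/// solved for s: s = (1/S - 1/n) / (1 - 1/n).
fn amdahl_serial_fraction(speedup: f64, threads: f64) -> f64 {
    (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)
}
```

With the measured 69,389 ms and 10,515 ms, the speedup is about 6.6×, the efficiency on 32 threads is about 21%, and the implied serial fraction is about 12%. Under this model, adding cores alone cannot bring the proof much below the serial share of the single-thread time.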

Bottleneck 2: serial or under-parallelized FFT

FFT work is usually the second major prover cost, often 15–30% depending on circuit layout.

Halo2 proving requires polynomial transformations over the evaluation domain. If those FFTs run serially, proving time scales poorly even for a relatively small circuit. Optimized Halo2 stacks parallelize FFTs and improve cache behavior.

Again, this is not just a configuration detail. The optimized Zcash/ECC-style Halo2 stack is already the dependency line here; further gains require profiling and either circuit-level work or a different proving backend.

Bottleneck 3: generic or less-optimized field arithmetic

The innermost loop is Pasta field arithmetic:

Pallas base field operations
Pallas scalar field operations
Vesta/Pallas curve operations

Depending on crate lineage, build flags, and CPU target, the prover may be using generic Rust big-integer paths rather than platform-tuned assembly/intrinsics. Optimized Pasta backends can materially improve MSM, FFT, and hash-gadget performance.

This is a real engineering gap: it is not enough to say “use more cores” if each field operation is also slower than it needs to be.

The combined effect

A small circuit with an unoptimized prover can be much slower than a larger circuit with a mature prover stack.

A plausible performance gap is:

serial MSM/FFT       → large multicore loss
generic field ops    → additional constant-factor loss
cache/layout issues  → additional backend loss

The earlier “50–100×” class of gap should be retired as a CPU claim. The measured CPU improvement from the low-risk sprint is much narrower but real: key caching removes repeated keygen in long-lived processes, and K = 15 brings hot proof generation to about 5.8s.

The actionable point is narrower and stronger: the next CPU step is flamegraph-level profiling of the remaining 5.8s, especially MSM, FFT, Sinsemilla, Merkle, lookup, and field-arithmetic costs.

Optimization results

The optimization path became empirical:

baseline measurement
→ multicore determination
→ in-process key cache
→ K=15 circuit parameter reduction
→ backend/fork decision
→ GPU prover scope

Tier 0: baseline and multicore determination

The old framing was “one Cargo feature gives 16–32×.” That was too strong.

Measured result:

halo2_proofs 0.3.2 default = ["batch", "multicore"]
multicore = ["maybe-rayon/threads"]

K=16 default Rayon:
  prove_ms       10,515
  verify_ms          88

K=16 single-thread:
  prove_ms       69,389
  verify_ms         637

Multicore is already enabled and worth about 6.6× on proof generation versus one thread. The remaining problem was not a missing feature flag.

Tier 1: key-cache hot path

The largest operational win was removing repeated key construction in long-lived processes.

Measured K = 16 hot-cache result:

cold proving-key lookup    341,142 ms
first proof                 10,054 ms
cold verifying-key lookup   20,095 ms
first verify                    94 ms
hot proving-key lookup           0 ms
second proof                 9,909 ms
hot verifying-key lookup          0 ms
second verify                   91 ms

This does not make a one-shot CLI invocation fast, because the process still has to build the key once. It does make repeated swaps in a long-lived prover or validator process fast enough that the proof itself becomes the bottleneck.
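The in-process caching pattern can be sketched with the standard library's `OnceLock`. This is a minimal illustration under stated assumptions, not the codebase's cache: `Keys` is a placeholder for the proving and verifying key pair, the build step is a stand-in for key generation, and the build counter exists only to make the behavior observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Placeholder for an expensive-to-build key pair (hypothetical type).
struct Keys {
    k: u32,
}

/// In-process cache: keys are built at most once per cache instance.
struct KeyCache {
    keys: OnceLock<Arc<Keys>>,
    builds: AtomicUsize,
}

impl KeyCache {
    fn new() -> Self {
        KeyCache { keys: OnceLock::new(), builds: AtomicUsize::new(0) }
    }

    /// Returns the cached keys, building them on the first call only.
    fn get_or_build(&self, k: u32) -> Arc<Keys> {
        let keys = self.keys.get_or_init(|| {
            self.builds.fetch_add(1, Ordering::SeqCst);
            Arc::new(Keys { k }) // stand-in for the slow key generation
        });
        // This sketch caches a single parameter set; reject mismatched requests.
        assert_eq!(keys.k, k, "cache holds keys for a different K");
        Arc::clone(keys)
    }

    /// Number of times the build step actually ran.
    fn build_count(&self) -> usize {
        self.builds.load(Ordering::SeqCst)
    }
}
```

The first call pays the build cost and every later call in the same process returns the shared `Arc`, which matches the measured pattern of a slow cold lookup followed by 0 ms hot lookups. A production cache would likely key on K and circuit identity rather than hold one entry.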

Tier 2: circuit-level K reduction

The circuit fit at K = 15 and failed at K = 14.

Measured K = 15 result:

K=15 cold path:
  pk_build_ms     330,005
  prove_ms          5,841
  vk_build_ms      10,233
  verify_ms            63
  proof_bytes       6,816

K=15 hot path:
  proof generation  5,780 ms
  verification         66 ms

The K reduction produced a measured 1.80× cold proof speedup (10,515 ms to 5,841 ms) and 1.71× hot proof speedup (9,909 ms to 5,780 ms) versus K = 16.

Tier 3: backend/fork decision

The current dependency is already halo2_proofs 0.3.2 from https://github.com/zcash/halo2, the Zcash/ECC line used by Orchard. The visible alternative halo2-axiom 0.5.1 is not a safe drop-in for this sprint: it is a KZG/trusted-setup, nightly-only fork from the Axiom/PSE line.

So the CPU backend decision is:

keep halo2_proofs 0.3.2
keep multicore enabled
do not migrate to a different proof-system backend inside this sprint

Tier 4: GPU acceleration with ICICLE-Halo2

The GPU path is to offload MSM and FFT work to CUDA kernels using an ICICLE-Halo2-style backend.

The reason this is attractive is straightforward: MSM and FFT are exactly the workloads where GPUs can outperform CPUs when the circuit is large enough and data movement is managed well.

Published ICICLE/Halo2-related benchmarks report large speedups for compatible workloads on NVIDIA GPUs. For this circuit, that remains an external benchmark signal, not an internal result.

There are integration questions to answer:

Pasta curve support and backend compatibility;
transcript/proof compatibility with the consensus verifier;
memory transfer overhead;
whether K = 15 is large enough to saturate the GPU;
end-to-end time including witness loading and proof serialization;
deterministic deployment and reproducible builds.

The current measured CPU hot proof is about 5.8s. The GPU target is <2s, with sub-second as the stretch target, but it remains unmeasured until the ICICLE branch runs on real GPU hardware.

GPU provisioning: crypto-native compute

The crypto-native proving architecture remains compelling:

Akash Network: lease NVIDIA GPU capacity through an on-chain deployment flow, with providers bidding on the workload.
io.net: access an aggregated GPU marketplace through API-driven provisioning.

The intended StakeHub flow is:

StakeHub leases GPU capacity
→ deploys the ICICLE-Halo2 prover container
→ feeds the witness into the prover environment
→ receives the proof
→ submits the shielded swap action to PFTL
→ closes the lease

This keeps proof generation off-chain while using crypto-native compute markets for hardware provisioning.

One operational caveat matters: the prover sees the witness. A decentralized GPU marketplace supplies compute, but it does not automatically make witness handling trustless. StakeHub must treat the prover environment as a sensitive execution boundary, using hardened containers, ephemeral keys, encrypted transport, strict logs, and, where appropriate, TEEs or operator-controlled hardware.

Tier 5: deeper circuit-level optimizations

The low-risk K reduction is done. Deeper tuning now means changing gadget layout, not just parameters.

Candidate circuit optimizations:

Reduce gadget rows enough to approach K=14. K=14 currently fails, so this requires real constraint reduction.
Optimize the Sinsemilla gadget layout. Note commitments dominate the row count. Any reduction in message packing, lookup usage, or fixed-table layout can pay off.
Improve lookup table efficiency. Range checks and Sinsemilla lookups should be checked for table width, column usage, and row pressure.
Batch proving. If multiple swaps share structure or anchor data, proving several actions in one circuit may amortize fixed costs.

Further speedup from circuit-level work is possible, but it is now more invasive than the K=15 change and should be driven by a flamegraph/row-usage profile.

Tier 6: proving service architecture

For mainnet-scale operation, proving should be treated as an off-chain service, not something consensus performs.

The clean separation is:

PFTL      = verifies proofs in consensus
StakeHub  = orchestrates proving and transaction submission
GPU layer = supplies compute

A concrete flow:

user / StakeHub
→ lease GPU capacity on Akash or io.net
→ run Halo2 / ICICLE-Halo2 prover container
→ generate proof off-chain
→ submit proof + action to PFTL
→ PFTL verifies proof
→ close compute lease

PFTL does not need a transaction type for GPU leasing. GPU leasing is operator tooling. Consensus only needs the public instance, proof, and verification key semantics.

Optimization path summarized

CPU numbers here are measured on the PostFiat AssetOrchard circuit and this 32-vCPU host. GPU remains scoped, not measured.

Tier	Optimization	Measured / target result	Status
0	Baseline and multicore determination	K=16 proof `10.515s`; cold path `370.7s`; default Rayon is `6.60×` faster than one thread for proving	Measured
1	In-process key cache	hot key lookup `0ms`; K=16 hot proof+verify about `10.0s`	Landed
2	K=15 reduction	K=15 hot proof `5.780s`; verify `66ms`; proof `6,816 bytes`	Landed
3	Backend/fork decision	no safe drop-in fork; keep Zcash/ECC `halo2_proofs 0.3.2`	Decided
4	GPU backend via ICICLE-Halo2	target `<2s`; stretch sub-second	Scoped, unmeasured
5	Deeper circuit tuning	target: close remaining `0.8–0.9s` CPU gap	Future profiling

The speedups should not be blindly multiplied. Bottlenecks shift after each tier. For example, once MSM is parallelized, FFT or hashing may dominate; once GPU transfer overhead is included, a small K = 15 circuit may gain less than a larger circuit.
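The reason speedups do not multiply is the usual Amdahl bound: accelerating one stage only shrinks that stage's share of the runtime. The sketch below computes the overall speedup when a given fraction of the work is accelerated by a given factor. The 70% share used in the note is an illustrative value taken from the generic 60–80% MSM range quoted earlier, not a measured profile of this circuit.

```rust
/// Overall speedup when `fraction` of the runtime is accelerated by `factor`
/// and the rest is unchanged: 1 / ((1 - fraction) + fraction / factor).
fn overall_speedup(fraction: f64, factor: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / factor)
}
```

If a stage holding 70% of the runtime became 10× faster, the whole proof would be about 2.7× faster, and even an infinitely fast stage would cap the gain near 3.3×. That is why each tier needs to be re-measured after the previous one lands.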

Benchmark evidence

The measured sprint artifacts are:

docs/status/zk-prover-baseline-benchmark.md
docs/status/zk-prover-multicore-determination.md
docs/status/zk-prover-key-cache-optimization.md
docs/status/zk-prover-k15-circuit-optimization.md
docs/status/zk-prover-circuit-deep-triage.md
docs/status/zk-prover-backend-fork-decision.md
docs/status/icicle-gpu-prover-scope.md
docs/status/zk-prover-optimization-results.md

The release soundness regression stayed green after the optimization:

cargo test -p postfiat-privacy-orchard \
  swap_consensus_verifier_accepts_real_proof_and_rejects_forged_nonconservation \
  --release -- --ignored --nocapture

The K=15 metadata pin test also stayed green:

cargo test -p postfiat-privacy-orchard \
  swap_full_shape_key_metadata_is_pinned_and_consistent \
  --release -- --ignored --nocapture

What this means

The shielded NAVCoin swap circuit is not just a paper sketch. It is implemented, code-reviewed internally, and proven on a WAN devnet. It includes Sinsemilla note commitments, Merkle anchor verification, Poseidon nullifier derivation, ECC spend authorization, per-asset conservation, action binding, public distinctness, range checks, and domain binding.

But the correct status is:

devnet-proven
code-reviewed internally
pending external audit
pending mainnet load testing
pending deeper CPU profiling and GPU prover benchmarks

The proving slowdown is now better understood. The minutes-level path was mostly cold proving-key generation. The proof itself took about 10.5s at K=16 and about 5.8s at K=15. Multicore is already active. Closing the remaining CPU gap requires either circuit/gadget optimization or a different proving backend.

The crypto-native GPU path remains strategically important. StakeHub can orchestrate off-chain proving using Akash or io.net GPU capacity, generate Halo2 proofs with an optimized backend, and submit only the proof and public action data to PFTL. PFTL remains the verifier; StakeHub handles execution; the GPU layer supplies compute.

That is the right separation of concerns. CPU projections have now been replaced with measurements; GPU remains the next measurement gap.

Implementation: postfiatl1v2 branch navcoin-market-ops-envelope, crates/privacy_orchard/src/asset_orchard_circuit.rs (the circuit), asset_orchard_sinsemilla.rs (the note-commitment gadget), verify.rs (the consensus verifier). Design spec: docs/specs/asset-orchard-swap-circuit-design-v2.md (TIH-reviewed, GPT-5.5-pro-rewritten). Current evidence: internal code review with 10 findings fixed, a live a651 ↔ pfUSDC shielded swap on the 6-validator WAN devnet, and measured CPU prover benchmarks on 2026-06-20. Pending: external audit, mainnet deployment, load testing, deeper CPU profiling, and GPU prover benchmarks.