
DeepPass: Making LLMs Think Twice

Zero-cost layer duplication that improves LLM reasoning without training, new parameters, or extra memory.

Jie Tao, Linwei Zhang

University of Florida

March 2026

+13.55 over baseline on 72B
+7.27 on Gemma-3-27B
0 extra VRAM
0 training needed

LLMs don't pay enough attention

There's a now-famous example that attracted a lot of attention online. Someone asked a language model: "I live 20 feet from the car wash. I want to wash my car. Should I walk or drive?" Most LLMs confidently answer "drive." They pattern-match "car wash" to "drive car there" and never stop to consider that 20 feet is barely across a parking lot.

This is not a knowledge problem. The model knows what 20 feet means. It is an attention problem. The model did not look hard enough at the "20 feet" part before jumping to a conclusion.

A fascinating paper from December 2024 by Patel et al. showed something remarkable: simply repeating the user's prompt twice in the input dramatically improves accuracy on these kinds of questions for non-reasoning LLMs. The model does not need new knowledge. It just needs another chance to look more carefully at what is already there.

This was the seed of our entire project. What if instead of repeating the prompt, we repeat the thinking? What if we take certain transformer layers and run them a second time?

What layer duplication actually does

A transformer model processes your input through a stack of layers, typically 28 to 80 depending on the model. Each layer has two parts: an attention mechanism (which looks across all tokens and decides what is relevant to what) and an FFN, or feed-forward network (which retrieves stored knowledge and transforms the signal).

Layer duplication takes a contiguous block of layers and runs them twice. Same weights, no training, no changes. The second pass sees a slightly refined input: the output of the first pass. Think of it like re-reading a paragraph after you have already gotten the gist. You notice things you missed the first time.
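Mechanically, duplication is just a change in execution order. A minimal sketch of that idea (the `blocks` format and function name are our own illustration, not the paper's code):

```python
def duplicated_order(num_layers, blocks):
    """Expand a layer execution order so each (start, end) block runs twice.

    `blocks` is a list of inclusive (start, end) index pairs, e.g. [(45, 51)].
    Layers are referenced by index, so "duplication" is just repeating
    indices: the weights themselves stay shared and VRAM is unchanged.
    """
    starts = {s: e for s, e in blocks}
    order, i = [], 0
    while i < num_layers:
        if i in starts:
            seg = list(range(i, starts[i] + 1))
            order.extend(seg + seg)  # run the block twice, back to back
            i = starts[i] + 1
        else:
            order.append(i)
            i += 1
    return order
```

For a toy 6-layer model, `duplicated_order(6, [(1, 2)])` yields `[0, 1, 2, 1, 2, 3, 4, 5]`: layers 1 and 2 run twice, everything else once.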

[Figure: Layer Duplication. Standard model: 12 layers, single pass.]

Credit where it is due: this technique was discovered by David Ng (Repeat Yourself / RYS), who found that duplicating layers 45-52 on a 72B model improved benchmark scores noticeably, with zero training cost. We were surprised and excited when we found his work. It was the same principle behind our looped transformer in ExecFormer, but applied to the backbone itself.

The problem was figuring out which blocks to duplicate. A 72B model has 80 layers, giving thousands of possible configurations. Brute-force evaluation was not an option on our hardware budget.

Finding the right blocks to duplicate

If one duplicated block helps, why not duplicate all 80 layers? Because the benefit is block-specific. Some blocks improve the model. Others destroy it. And duplicating too many at once causes interference, where improvements cancel out or compound into noise.

We needed a way to search for which blocks help, and then stack multiple good blocks without them fighting each other.

Spectral screening: minutes, not hours

Evaluating every possible block configuration takes hours of GPU time per candidate. We needed a cheap filter. The idea: before running expensive benchmarks, measure how much each candidate block changes the model's internal representations when duplicated. We developed SBUID (Spectral Block Utility via Impact and Displacement):

$$\text{SBUID} = \text{BLOOD}_{\text{impact}} - \lambda \cdot \rho$$

where λ = 6000.

BLOOD_impact measures how much the duplication changes downstream layer behavior (good, meaning the block is doing something). Displacement (rho) measures how much the representations move indiscriminately (bad, meaning the block is just adding noise). Subtracting the noise isolates the useful signal.

SBUID screening results (72B model):

  • Spearman r = 0.515, p = 0.008 correlation with actual benchmark performance
  • Cross-validated r = 0.664
  • Runtime: ~20 minutes to screen all candidates on a 72B model
  • Narrows thousands of candidates down to ~25 worth testing
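The screening score itself is a one-liner; ranking candidates by it is the whole filter. A sketch (function names are ours, and the impact/displacement numbers are stand-ins for the real spectral measurements):

```python
LAMBDA = 6000.0  # displacement penalty weight used in the SBUID formula

def sbuid(blood_impact, displacement):
    """SBUID = BLOOD_impact - lambda * rho. Higher = more promising block."""
    return blood_impact - LAMBDA * displacement

def top_candidates(measurements, k=25):
    """Keep the k best blocks by SBUID.

    `measurements` maps a block (start, end) to its (impact, rho) pair,
    as produced by some upstream spectral probe (not shown here).
    """
    ranked = sorted(measurements,
                    key=lambda b: sbuid(*measurements[b]),
                    reverse=True)
    return ranked[:k]
```

With λ = 6000, even a tiny displacement dominates: a block with impact 10.0 and ρ = 0.001 nets only 4.0.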

Dual-probe evaluation

For the top ~25 candidates from spectral screening, we evaluate each with a dual probe: 16 hard arithmetic questions with partial-credit scoring (~5 min per config on 72B, ~90 sec on 27B) and 20 emotional intelligence questions (~60 sec per config). The combined score is math × 50 + EQ × 0.5, so each probe contributes roughly 50 points on a 0-to-100 scale.

The dual probe catches both reasoning improvement (math) and generation quality preservation (EQ-bench). A block that helps math but destroys coherent text generation would score poorly overall.
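The combined score follows directly from that weighting. A sketch, assuming the math probe yields a fraction in [0, 1] and the EQ probe a score in [0, 100]:

```python
def combined_score(math_frac, eq_score):
    """Dual-probe score: math in [0, 1], EQ in [0, 100], each worth ~50 pts."""
    assert 0.0 <= math_frac <= 1.0 and 0.0 <= eq_score <= 100.0
    return math_frac * 50.0 + eq_score * 0.5
```

For example, `combined_score(1.0, 61.0)` returns 80.5: perfect math contributes 50 points, and the EQ probe contributes the rest.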

Greedy stacking: the breakthrough

Previous work (including Ng's) only tested single blocks. We asked: can we duplicate multiple blocks simultaneously?

Our first attempt was straightforward: pick the two best individual blocks and duplicate both. The result was interference. The blocks were chosen independently and fought each other.

So we tried something different. Greedy iterative selection:

1

Find the best single block

Screen the original model, evaluate top candidates, apply the winner.

2

Find the best complementary block

Screen the modified model (with block 1 already applied). The screening now sees the new dynamics and finds blocks that are complementary, not just individually strong.

3

Repeat until diminishing returns

Screen the doubly-modified model for a third block. Stop when adding another block no longer improves the score.

The key insight: the second block is chosen after the first is applied, so the screening sees the modified dynamics. It finds blocks that work together, not just blocks that happen to be individually good.
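The loop itself is plain greedy search; all the machinery lives in the evaluation step. A sketch with the screening and dual probe abstracted behind callables (the names are ours):

```python
def greedy_stack(model, candidates, apply_block, evaluate, max_blocks=4):
    """Greedy iterative block selection.

    Crucially, each round re-evaluates candidates against the *modified*
    model, so later blocks are chosen to complement earlier ones.
    """
    chosen, best = [], evaluate(model)
    while len(chosen) < max_blocks:
        remaining = [b for b in candidates if b not in chosen]
        if not remaining:
            break
        scored = [(evaluate(apply_block(model, b)), b) for b in remaining]
        score, block = max(scored)
        if score <= best:  # diminishing returns: stop
            break
        model, best = apply_block(model, block), score
        chosen.append(block)
    return chosen, best
```

In a toy setting where `evaluate` just sums per-block values, the loop picks positive-value blocks in descending order and stops before any block that would lower the score.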

Gemma-3-27B: the search in action

We ran the full pipeline on Google's Gemma 3 27B Instruct (62 transformer layers). The baseline combined score was 80.54.

Step | Config | Score | Delta
Baseline | – | 80.54 | –
Step 1 | Single: (20,21) | 83.76 | +3.22
Step 2 | Pair: (0,2)+(12,13) | 85.92 | +5.38
Step 3 | Triple: (0,2)+(12,13)+(47,48) | 87.80 | +7.27

The pattern that emerged was striking: early (0,2) + mid (12,13) + late (47,48). Three blocks from completely different regions of the network, each contributing something different. The early block refines the initial embedding representation. The mid block strengthens feature extraction. The late block polishes the final reasoning steps.

Qwen2-72B: stacking goes further

On the larger 72B model, stacking delivered even bigger gains. Starting from Ng's original single-block result:

Config | Score | Delta
Baseline | 70.52 | –
Ng's single block (45,52) | 76.76 | +6.24
Our best pair (0,7)+(45,52) | 79.91 | +9.39
Whisper quad (4 blocks) | 82.58 | +12.06
Per-layer alpha triple | 84.07 | +13.55

From 70.52 (baseline) to 84.07 with our best configuration. That is +13.55 points, and +7.31 over Ng's original single-block result. Total pipeline cost: ~8 GPU-hours. Ng's brute force would require 3,241 evaluations. Ours: ~70. A 46x speedup with a better result.

The volume knob: alpha tuning

Running a block twice at full strength is like turning the volume to 11. Sometimes that is great. Sometimes it distorts. We needed finer control.

The alpha equation

We introduced a blending parameter alpha at the seam between the first and second pass:

$$h_{\text{out}} = h_1 + \alpha \cdot (h_2 - h_1)$$

The second pass produces a "correction" (h2 minus h1). Alpha controls how much of that correction to apply. At alpha = 1.0, you get standard duplication. At alpha = 0.0, you skip the second pass entirely. At alpha = 0.1, you get what we call "whisper mode": just a tiny nudge from the second pass.

Think of it like a mixing console:

alpha = 0.0: Muted. Second pass ignored.
alpha = 0.1: "Whisper." Gentle nudge from the second pass.
alpha = 0.5: Half blend. 50/50 between first and second pass.
alpha = 1.0: Full duplication. Standard approach.
alpha = 1.3: Boosted. Second pass correction amplified.
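At the seam, the blend is elementwise. A sketch for plain Python lists (a real implementation would operate on hidden-state tensors):

```python
def alpha_blend(h1, h2, alpha):
    """h_out = h1 + alpha * (h2 - h1).

    alpha=0 keeps the first pass, alpha=1 is full duplication, and
    values in between (or slightly above 1) scale the correction.
    """
    return [a + alpha * (b - a) for a, b in zip(h1, h2)]
```

For example, `alpha_blend([1.0], [3.0], 0.5)` returns `[2.0]`, halfway between the two passes.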

Why alpha matters for stacking

When you stack multiple duplicated blocks, each one perturbs the signal. The first block's perturbation gets amplified by the second, and so on. Without alpha control, stacking more than two blocks usually destroys the model.

Whisper alphas (alpha = 0.02 to 0.15) on additional blocks solved this. The first block runs at full strength. The second at 0.15. The third at 0.05. Each adds a gentle refinement without destabilizing the signal.

Per-layer alpha: each layer is different

The real breakthrough came from realizing that each layer within a duplicated block has a different optimal alpha. For the 72B model's block (45,52), seven layers, the optimal per-layer alphas told a fascinating story:

Layer | Alpha | Interpretation
45 (L0) | 1.1 | Slight boost
46 (L1) | 1.0 | Standard
47 (L2) | 0.5 | Dampen (destructive FFN)
48 (L3) | 1.3 | Strong boost
49 (L4) | 1.0 | Standard
50 (L5) | 0.9 | Slight dampen
51 (L6) | 1.1 | Slight boost

Layer 47's alpha of 0.5 is particularly telling. It needs to be dampened because its FFN is destructive (more on this in the next section). Layer 48's alpha of 1.3 means its correction is so valuable that we actually want to amplify it beyond the raw second-pass output.

Result: single block with 7 per-layer alphas reached a combined score of 82.77 (vs 76.76 with uniform alpha = 1.0). That is +6.01 just from tuning 7 numbers.

Efficient search with Bayesian optimization

Tuning 7 to 21 alphas by grid search takes 300+ evaluations. We used Bayesian optimization (Optuna's Tree-structured Parzen Estimator), which models which alpha values are promising based on past evaluations and intelligently explores the search space. 60 evaluations reached a score of 83.97, within 0.1 points of the grid search optimum at 84.07. A 5x speedup.
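We won't reproduce TPE in a few lines, but the outer loop looks the same whatever the sampler. Here is a random-search stand-in over per-layer alphas (in the real pipeline, Optuna's TPE would replace the uniform sampling with informed proposals):

```python
import random

def search_alphas(n_layers, score_fn, n_trials=60, seed=0):
    """Search per-layer alphas in [0, 1.5], keeping the best vector.

    `score_fn` stands in for a full dual-probe evaluation of the model
    with those alphas applied.
    """
    rng = random.Random(seed)
    best_alphas, best_score = None, float("-inf")
    for _ in range(n_trials):
        alphas = [rng.uniform(0.0, 1.5) for _ in range(n_layers)]
        score = score_fn(alphas)
        if score > best_score:
            best_alphas, best_score = alphas, score
    return best_alphas, best_score
```

Running 60 trials against a toy objective (distance from the all-ones alpha vector) shows the basic contract: more trials never yield a worse best score under the same seed.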

Why it works (and when it doesn't)

This is where the story gets really interesting, and where we owe a debt to 3Blue1Brown's beautiful explanation of how LLMs store facts in MLP layers.

FFNs as associative memory

Each FFN layer in a transformer acts like an associative memory: a lookup table of sorts. When a representation comes in, the FFN's gate neurons activate in a specific pattern, and the output is a weighted combination of stored "value vectors." Each stored pattern is like a memory well. Think of a marble rolling on a landscape of hills and valleys, where each valley is a stored fact.

In the SwiGLU architecture that most modern LLMs use:

$$\text{FFN}(u) = W_{\text{out}} \cdot \left(\text{silu}(W_{\text{gate}} \cdot u) \odot (W_{\text{up}} \cdot u)\right)$$

Each intermediate channel is a memory cell: a query-sensitive gate (which facts are relevant?) multiplied by a value vector (what does that fact say?).
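A scalar-level sketch of that SwiGLU cell for a single token vector, using plain lists as matrix rows (illustrative only; real FFNs are large batched matmuls):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def silu(x):
    """silu(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_ffn(u, w_gate, w_up, w_out):
    """FFN(u) = W_out @ (silu(W_gate @ u) * (W_up @ u))."""
    gate = [silu(dot(row, u)) for row in w_gate]   # which facts are relevant?
    up = [dot(row, u) for row in w_up]             # what do those facts say?
    hidden = [g * v for g, v in zip(gate, up)]     # gated memory readout
    return [dot(row, hidden) for row in w_out]
```

A zeroed gate row silences its memory cell entirely: `swiglu_ffn([1.0], [[0.0]], [[2.0]], [[1.0]])` returns `[0.0]`, because silu(0) = 0 gates the value vector to nothing.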

The basin-crossing problem

When the second pass runs, the input to each FFN is slightly different from the first pass, because the attention mechanism has refined it. Usually this is great for attention (it gets to look again, more carefully). But for the FFN, this slight perturbation can be catastrophic.

The marble analogy:

Imagine a landscape of memory wells. Each well stores a different fact. Your marble (the representation) sits in the correct well after the first pass. The second pass gives the marble a tiny push. If the well is deep and wide, the marble stays put and the same fact is retrieved. But if the well is shallow, or there is a competing well very close by, the push knocks the marble into the wrong well. The model retrieves a nearby but incorrect fact.

This is the FFN re-retrieval hypothesis: the second pass can corrupt factual recall by crossing basin boundaries in the FFN's energy landscape.

The evidence

We decomposed duplication into attention-only and FFN-only components on the 72B model. On layer 47 specifically: full duplication (attention + FFN) scored 77.45. Attention-only duplication (skip FFN second pass) scored 80.35. The FFN is actively destructive, and removing it from the second pass improved the score by 2.90 points.

We measured Jaccard instability: how much the FFN gate firing pattern changes between the first and second pass.

Layer | Gate Stability | Interpretation
45 | 0.354 | Very unstable
46 | 0.393 | Unstable
47 | 0.466 | Moderate (most destructive)
48 | 0.507 | Moderate
49 | 0.584 | More stable
50 | 0.606 | Stable
51 | 0.612 | Most stable

The correlation between instability and FFN harm: Pearson r = -0.89. The more the gates change between passes, the more the FFN hurts. This is exactly what the basin-crossing theory predicts. Changed gates mean the marble is landing in different wells.
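Gate stability here is just the Jaccard similarity between the sets of gates that fire on each pass. A sketch (the activation threshold is our simplification of however "firing" was defined in the measurements):

```python
def gate_stability(gates1, gates2, threshold=0.0):
    """Jaccard similarity of active-gate sets between two passes.

    1.0 = identical firing patterns (marble in the same wells);
    values near 0 = the second pass retrieves from different wells.
    """
    a = {i for i, g in enumerate(gates1) if g > threshold}
    b = {i for i, g in enumerate(gates2) if g > threshold}
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For gate activations `[1, 0, 1, 0]` vs `[1, 1, 0, 0]`, the active sets are {0, 2} and {0, 1}, giving a Jaccard similarity of 1/3.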

Reasoning benefits, factual recall suffers

Full lm-eval benchmarks on standard datasets confirmed the pattern:

Task | Effect | Verdict
IFEval (reasoning) | +2.3% | Improves
MuSR (reasoning) | +1.3% | Improves
BBH (reasoning) | +0.97% | Improves
MMLU-PRO (factual) | -0.80% | Hurts
MATH Hard (factual) | +0.38% | Flat

Reasoning tasks improve because attention re-computation helps the model look more carefully. Factual tasks degrade because FFN re-retrieval corrupts stored knowledge.

The scale effect

This mechanism is scale-dependent, and the reasons are intuitive:

Property | 9B Model | 72B Model
Second-pass norm inflation | 42% | 4%
Cosine similarity (h1 vs h2) | 0.975 | 0.997
Memory wells per neuron | Fewer, wider | More, narrower

Larger models store more facts in superposition (more marbles crammed into the same landscape). The wells are narrower and closer together, so even a tiny push from the second pass can knock a marble into the wrong well. This explains why per-layer alpha tuning is so critical on large models, and why layer 47's FFN specifically needs to be dampened.
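The two numeric rows of that table come from simple diagnostics on the pass outputs. A sketch for plain vectors:

```python
import math

def pass_divergence(h1, h2):
    """Return (norm inflation of pass 2 over pass 1, cosine similarity)."""
    n1 = math.sqrt(sum(x * x for x in h1))
    n2 = math.sqrt(sum(x * x for x in h2))
    cosine = sum(a * b for a, b in zip(h1, h2)) / (n1 * n2)
    inflation = (n2 - n1) / n1
    return inflation, cosine
```

For example, `pass_divergence([3.0, 4.0], [6.0, 8.0])` returns `(1.0, 1.0)`: the second vector is twice as long (100% norm inflation) but points in exactly the same direction.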

The attention side is purely positive

While the FFN story is about potential harm, the attention story is positive. The second pass gives the attention mechanism another chance to notice relationships it missed (like "20 feet" being relevant to transportation choice), refine which tokens are attended to given slightly updated representations, and sharpen the signal for downstream layers.

This connects directly back to the "Repeat the Prompt Twice" finding. Repetition helps because it gives the model more opportunities to attend to what matters.

It's not a one-model trick

We were curious whether this generalizes. So we tested across five architectures, spanning different model families (Qwen, Gemma), sizes (9B to 72B), and even architecture types (dense vs Mixture-of-Experts):

Model | Params | Layers | Baseline | Best Config | Gain
Qwen2-72B | 72B | 80 | 70.52 | Triple + per-layer alpha | +13.55
Gemma3-27B | 27B | 62 | 80.54 | Triple (0,2)+(12,13)+(47,48) | +7.27
Qwen3.5-27B | 27B | 64 | 42.86 | Triple | +37.19
Qwen3-30B MoE | 30B | 48 | 27.76 | Single best | +12.66
Qwen3.5-9B | 9B | 32 | – | Limited benefit | Small

Larger models benefit more, and stacking more blocks helps more on larger models. The Qwen3.5-27B result (+37.19) is particularly striking, suggesting that some models have significant untapped potential in their existing weights.

What this means

Zero training cost. No gradient updates, no data needed. Just rearrange the execution order.
Zero extra memory. Duplicated layers share weights with the originals. VRAM usage is identical.
Minimal speed cost. 4-20% slower depending on configuration size.
Works with quantization. 4-bit NF4 quantization preserves the duplication benefit. A 72B model runs in 59GB with duplication enabled.

Connection to ExecFormer

The looped transformer in ExecFormer is, in a sense, layer duplication taken to its logical extreme. Each of the 16 iterations through our shared-weight block is equivalent to duplicating that block 16 times. Greedy stacking gives us a way to apply the same idea to the Gemma backbone itself, potentially pushing our F1 even further without increasing inference cost.

Open questions

We want to be honest about what we do not know yet. The FFN re-retrieval hypothesis is supported by strong correlations (Pearson r = -0.89) but has not been tested with causal interventions. The cross-validation holds (+2.49 on unseen questions vs +2.83 on the training set), but the sample sizes are small.

The biggest open question is adaptive gating: can we learn, at inference time, which inputs benefit from duplication and which do not? Right now we duplicate unconditionally. A lightweight gate that decides per-input whether to run the second pass could give us the reasoning benefit on hard problems while preserving speed on easy ones.

There is also the question of sublayer-selective duplication. Our results strongly suggest that repeating attention (beneficial) while skipping or dampening the FFN (potentially harmful) is the optimal strategy. Our hybrid configurations already show this works, achieving strong results with 35-65% less additional compute than full-block duplication.

We think this is a step toward adaptive computation: the idea that models should think harder on difficult problems and breeze through easy ones. Layer duplication is a crude but effective form of this. We are excited to see where it goes.

References

  1. R. Patel et al., "Repeat the Prompt Twice: Improving LLM Reasoning Without Training," arXiv:2512.14982, Dec 2024.
  2. D. N. Ng, "Repeat Yourself (RYS): Layer Duplication for LLMs," HuggingFace, 2024.
  3. 3Blue1Brown, "How LLMs Store Facts in MLP Layers," YouTube, 2024.
  4. Google, "Gemma 3 Technical Report," Google DeepMind, 2025.
  5. J. Tao, L. Zhang, "DeepPass: Greedy Spectral Stacking for LLM Layer Duplication," University of Florida, 2026.