
ExecFormer: Teaching Neural Networks to Execute Code for Vulnerability Detection

Jie Tao, Linwei Zhang

University of Florida

March 2026


It started with a question

We saw two things that got us excited. First, Alexia Jolicoeur-Martineau showed in her paper "Less is More: Recursive Reasoning with Tiny Networks" that a single tiny recursive model with just 7M parameters could beat large language models on reasoning tasks. The key insight: you don't need billions of parameters if you can iterate. A small network, applied repeatedly, can build up complex computation step by step.

Second, we read a blog post from Percepta AI called "Can LLMs Be Computers?" where they showed that transformers can be trained to execute arbitrary C programs for millions of steps. They literally built a computer inside a transformer.

That made us wonder: what if we took a tiny recursive network, trained it to approximate a virtual machine that tracks memory state, and then fine-tuned a large language model alongside it? If we could backpropagate through a small neural network trained to simulate program execution, and attach it to an LLM that understands code semantics, we could create something that doesn't just pattern-match vulnerabilities. It would understand them.

The early experiments: can neural networks execute programs?

Before going big, we had to prove the concept. We built a custom abstract virtual machine with opcodes like MALLOC, FREE, WRITE, READ, CHECK, PUSH, POP, ADD, SUB, and BRANCH. Then we generated 500,000 synthetic abstract programs with perfect ground-truth labels for vulnerability states.
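To make this concrete, here is a minimal sketch of what such an abstract VM can look like. The opcode semantics below are simplified stand-ins (the real VM also handles PUSH, POP, ADD, SUB, and BRANCH), but the state it tracks, live versus freed allocations and access bounds, is the same idea:

```python
# Minimal sketch of an abstract VM for memory-safety labeling. The opcode
# semantics here are simplified illustrations, not the exact training-time VM.

def run_abstract_vm(program):
    """Execute (opcode, *args) tuples; return the set of violations observed."""
    heap = {}           # allocation id -> size, live allocations only
    freed = set()       # allocation ids that have been freed
    violations = set()
    for op, *args in program:
        if op == "MALLOC":
            alloc_id, size = args
            heap[alloc_id] = size
        elif op == "FREE":
            (alloc_id,) = args
            if alloc_id in freed:
                violations.add("double-free")        # CWE-415
            elif alloc_id in heap:
                del heap[alloc_id]
                freed.add(alloc_id)
        elif op in ("WRITE", "READ"):
            alloc_id, offset = args
            if alloc_id in freed:
                violations.add("use-after-free")     # CWE-416
            elif alloc_id in heap and not 0 <= offset < heap[alloc_id]:
                violations.add("out-of-bounds")      # CWE-787 / CWE-125
    if heap:
        violations.add("memory-leak")                # CWE-401
    return violations
```

A trace like MALLOC, FREE, READ yields {"use-after-free"}, and violation sets of this kind are the perfect ground-truth labels a synthetic generator can emit for free.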

We trained a tiny looped transformer (just 231K parameters) on these programs. The architecture uses shared weights across iterations. Each iteration of the transformer corresponds to one step of program execution. Same weights, applied again and again, like a recursive function.
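The shared-weight recursion itself is easy to sketch. Here a single randomly initialized block, a toy stand-in for the 231K-parameter transformer with illustrative shapes, is applied once per program step, reusing the same weights every iteration:

```python
import numpy as np

# One small block, applied repeatedly with the SAME weights: each application
# corresponds to one step of program execution. The block here is a toy gated
# update, not the actual transformer layer.
rng = np.random.default_rng(0)
d = 32                                    # hidden width (illustrative)
W = rng.normal(0.0, 0.1, (d, d))          # one weight matrix, reused every step

def step(state, opcode_embedding):
    """One loop iteration == one VM step."""
    return np.tanh(state @ W + opcode_embedding)

state = np.zeros(d)
program = [rng.normal(0.0, 1.0, d) for _ in range(10)]  # 10 embedded opcodes
for emb in program:
    state = step(state, emb)              # same W across all 10 iterations
```

Because every iteration shares W, depth is effectively unbounded: a 50-step program just means 50 applications of the same block, exactly like unrolling a recursive function.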

What we found:

  • 98.8% accuracy on hard VM traces with branching, loops, and pointer operations
  • 100% accuracy AND 100% adversarial robustness on abstract interpretation programs
  • The model learned to correctly execute all opcodes and track program state

The really exciting part: we could probe the hidden states at each loop iteration and literally watch the model tracking which allocations are live or freed, the stack depth, whether an access is in-bounds or out-of-bounds, and the program counter position. Each loop iteration refined the model's understanding, converging toward the correct execution state. This is exactly what abstract interpretation does in formal methods, but learned end-to-end.

What the model learned internally (linear probing results):

Program counter tracking: R² = 0.991
Stack depth tracking: R² = 0.925
Layer-wise separation (Cohen's d across loop iterations): 12.5, 17.1, 17.4, 20.4, 30.1

That R² of 0.991 on program counter tracking means the model's hidden states almost perfectly encode where execution is in the program. We could literally read the program counter from the neural activations. The increasing Cohen's d across layers shows that the model progressively separates vulnerability-related features, with each loop iteration adding more discriminative power.
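The probing recipe is standard: regress the target quantity on the hidden states and report R². A sketch with synthetic data (the real probes run on the looped model's activations):

```python
import numpy as np

# Linear probe: fit a least-squares readout from hidden states to a scalar
# target (e.g. the program counter) and compute R-squared. Data is synthetic
# here; only the procedure mirrors what runs on the model's activations.
rng = np.random.default_rng(0)
n, d = 500, 32
H = rng.normal(size=(n, d))                   # hidden states, one row per step
w_true = rng.normal(size=d)
pc = H @ w_true + 0.1 * rng.normal(size=n)    # noisy "program counter" target

w, *_ = np.linalg.lstsq(H, pc, rcond=None)    # the probe is just least squares
pred = H @ w
r2 = 1.0 - np.sum((pc - pred) ** 2) / np.sum((pc - pc.mean()) ** 2)
```

Because the probe is linear, a high R² means the quantity is encoded directly in the activations rather than being merely decodable by a powerful readout.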

We built 17 different variants (looped_1Lx32, looped_2Lx16, abstract_abstract_1Lx16, and more) and evaluated them extensively. The results confirmed our hypothesis: shared-weight iteration IS learned abstract interpretation.

Scaling up: from toy VMs to real vulnerabilities

With the foundation proven, we needed to bridge the gap to real C code. Real vulnerabilities don't come in neat opcode sequences. They're buried in thousands of lines of complex, messy, real-world code.

Our approach has three phases:

Phase 1: Synthetic pre-training

Train the looped transformer block on 500K synthetic abstract programs. This gives the loop block a strong prior for tracking memory safety state. 100% accuracy.

Phase 2: Cached transfer learning

We take Google's Gemma-3-27B model and extract token embeddings from real C code (the R2Vul dataset of real-world CVEs). These embeddings are cached to disk so we don't need to rerun the 27B model during training. A learned token gate selects the top 256 most informative tokens. These get projected down from 5,376 dimensions to 2,048 and fed into our pre-trained looped transformer. Verdict and CWE classification heads sit on top.
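Concretely, the gating-and-projection step looks like this (random weights stand in for the learned gate and projection; the dimensions follow the text):

```python
import numpy as np

# Token gate + down-projection over cached Gemma-3-27B embeddings.
# Weights are random stand-ins for the learned parameters.
rng = np.random.default_rng(0)
n_tokens, d_llm, d_loop, k = 1024, 5376, 2048, 256

E = rng.normal(size=(n_tokens, d_llm)).astype(np.float32)   # cached embeddings
gate_w = rng.normal(size=d_llm).astype(np.float32)          # learned token gate
proj = rng.normal(0.0, 0.01, (d_llm, d_loop)).astype(np.float32)

scores = E @ gate_w                  # one informativeness score per token
top = np.argsort(scores)[-k:]        # indices of the top-256 tokens
selected = E[np.sort(top)]           # keep tokens in original order
X = selected @ proj                  # (256, 2048): input to the loop block
```

Because the embeddings are cached, the 27B forward pass runs once per function; during this phase only the gate, the projection, and the loop block see gradients.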

Phase 3: End-to-end fine-tuning

Joint optimization using LoRA adapters on the Gemma backbone, aligning the code representations with the vulnerability detection objective.
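For readers unfamiliar with LoRA, the adapter math is a frozen weight plus a trainable low-rank correction; a minimal sketch with illustrative shapes (not Gemma's):

```python
import numpy as np

# LoRA in one line: y = x W + (x B A) * (alpha / r), with W frozen and only
# the low-rank factors B and A trained. Shapes here are illustrative.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(0.0, 0.02, (d_in, d_out))   # frozen pretrained weight
B = np.zeros((d_in, r))                    # trainable, zero-initialized
A = rng.normal(0.0, 0.02, (r, d_out))      # trainable

def lora_forward(x):
    return x @ W + (x @ B @ A) * (alpha / r)

x = rng.normal(size=(4, d_in))
# With B zero-initialized, the adapter starts as an exact no-op: fine-tuning
# begins from the pretrained behavior and drifts only as B and A update.
```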

Model Architecture

ExecFormer combines LLM-based code understanding with abstract interpretation principles. Click any stage to learn more.

Results

ExecFormer achieves an F1 score of 0.800 on a test set of 306 real-world C memory safety CVEs, beating the previous state-of-the-art R2Vul (F1 = 0.780) by +0.020. Our best model (Exp5, seed 42) achieves precision of 0.793 and recall of 0.728.

The best configuration uses focal loss with alpha=0.75 and gamma=2.0 and an inference threshold of 0.30, found by sweeping seven focal alpha configurations and multiple random seeds.
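For reference, binary focal loss in the standard Lin et al. formulation with those hyperparameters (the exact reduction and batching details in the training loop may differ):

```python
import numpy as np

# Binary focal loss with the winning hyperparameters (alpha=0.75, gamma=2.0).
# Easy, already-correct predictions are down-weighted by (1 - p_t)^gamma, so
# training focuses on the hard examples.
def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """p: predicted P(vulnerable), y: 0/1 labels; returns mean loss."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    a = np.where(y == 1, alpha, 1.0 - alpha)   # class weighting
    return float(np.mean(-a * (1.0 - pt) ** gamma * np.log(pt)))
```

An alpha above 0.5 up-weights the positive class, which is what helps when vulnerable functions are the minority.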

Model               F1      Precision   Recall
ExecFormer (ours)   0.800   0.793       0.728
R2Vul 1.5B          0.780   0.762       0.798
LineVul             0.610   –           –
Devign              0.520   –           –

How We Measure Performance

ExecFormer is evaluated on 306 held-out test functions from real-world CVEs. Below are the three primary classification metrics.

Precision

Of all the functions our model flagged as vulnerable, how many actually were?

\text{Precision} = \frac{TP}{TP + FP} = \frac{122}{122 + 31} = 0.793
Recall

Of all the actually vulnerable functions, how many did we catch?

\text{Recall} = \frac{TP}{TP + FN} = \frac{122}{122 + 46} = 0.728
F1 Score

The harmonic mean, balancing precision and recall.

F_1 = 2 \cdot \frac{P \cdot R}{P + R} = 0.800
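The three formulas reduce to a few lines of arithmetic over the confusion counts:

```python
# The three metrics above as plain functions of confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall
```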
Comparison

R2Vul [4] achieves F1 = 0.780 on the same test set. ExecFormer improves this by +0.020, a meaningful gain on a dataset of only 306 functions.


Try it yourself

Use the interactive scanner below to analyze C code for memory safety vulnerabilities. Select one of the provided examples or paste your own code.

This is a live demo. The full ExecFormer model runs on GPU infrastructure with Gemma-3-27B. Results below are generated by our inference API.

Vulnerability Scanner

Paste C code below or select an example. ExecFormer will analyze it for memory safety vulnerabilities using neural abstract interpretation.


DeepPass: Greedy Spectral Stacking

Can you make a pretrained LLM smarter without training, new parameters, or extra memory? We found that duplicating the right contiguous blocks of layers and running them twice (same weights) improves reasoning across model families. The challenge is finding which blocks to duplicate. Our spectral screening method (SBUID) narrows thousands of candidates to ~70, and greedy iterative stacking finds complementary blocks that work together.
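The greedy loop itself is simple; the work is in the SBUID screening and the evaluation budget. A sketch of the selection loop, with a toy stand-in for the benchmark evaluation:

```python
# Greedy iterative stacking: starting from the screened candidate set, keep
# adding whichever block duplication most improves the score, and stop when no
# remaining candidate helps. `evaluate` is a toy stand-in for a benchmark run.
def greedy_stack(candidates, evaluate, max_blocks=3):
    chosen, best = [], evaluate([])
    while len(chosen) < max_blocks:
        scores = {c: evaluate(chosen + [c]) for c in candidates if c not in chosen}
        cand, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break                 # no remaining candidate is complementary
        chosen.append(cand)
        best = score
    return chosen, best
```

Each round costs one benchmark run per remaining candidate, which is why screening thousands of candidates down to roughly 70 first matters so much.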

Layer Duplication

[Figure: 12 layers, standard pretrained model]

The key insight: attention benefits, FFN can hurt

The second pass helps attention re-compute what is relevant (like re-reading a paragraph). But it can hurt the FFN (feed-forward network), which stores facts as associative memories. A slightly perturbed input can retrieve the wrong fact. Per-layer alpha blending controls the volume:

h_{\text{out}} = h_1 + \alpha \cdot (h_2 - h_1)
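In code, the duplicated block with blending is just the following (a toy residual layer stands in for a real transformer layer):

```python
import numpy as np

# Per-layer alpha blending: run the block twice with the same weights, then
# interpolate between the single-pass and double-pass hidden states.
rng = np.random.default_rng(0)
d = 64
W = rng.normal(0.0, 0.1, (d, d))

def block(h):
    return h + np.tanh(h @ W)          # toy residual layer

def duplicated(h, alpha):
    h1 = block(h)                      # first pass
    h2 = block(h1)                     # second pass, same weights
    return h1 + alpha * (h2 - h1)      # alpha=0: original; alpha=1: full repeat

h = rng.normal(size=d)
```

The per-layer alpha lets attention-dominated blocks take a large second-pass contribution while FFN-dominated blocks stay near alpha = 0.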
Config                       Score   Delta
Baseline (72B)               70.52   –
Ng's single block            76.76   +6.24
Our per-layer alpha triple   84.07   +13.55

Tested across 5 architectures (9B to 72B, dense and MoE). Zero extra VRAM. 4-20% slower. Works with 4-bit quantization. 46x fewer evaluations than brute force.

Read the full DeepPass story

Builds on David Noel Ng's Repeat Yourself discovery that layer duplication improves LLM performance.

So what? Vulnerability detection has always been there

Here's what we think matters about this work, beyond the benchmark numbers.

Static analyzers have existed for decades. They work, but developers ignore them because of false positives. LLM-based approaches like R2Vul are better, but they're black boxes that pattern-match on surface features.

What we showed is that you can give an LLM a training signal from a smaller neural network that has learned to actually execute code. Not simulate it, not approximate it. Execute it, step by step, tracking every allocation, every free, every pointer dereference.

We think this principle extends far beyond vulnerability detection. If you can provide loss signals and training gradients from a small, mechanistically interpretable network that functions as a virtual machine, you can build LLMs that deeply internalize the code they work with. Instead of needing to run code to understand it, the model learns to run it internally.

Imagine code assistants that don't just suggest fixes but actually trace execution paths in their hidden states. Imagine compilers that learn optimization passes from data. Imagine debugging tools that can tell you not just what went wrong, but show you the exact execution trace that led to the bug, all computed inside the neural network.

That's the direction we're excited about. ExecFormer is the first step.

CWE Coverage

ExecFormer detects the most critical memory safety vulnerability classes from the MITRE CWE database.

CWE-787

Out-of-bounds Write

Writing data past the end or before the beginning of an allocated buffer.

CWE-125

Out-of-bounds Read

Reading data past the end of a buffer, potentially exposing sensitive data.

CWE-416

Use After Free

Referencing memory after it has been freed, causing undefined behavior.

CWE-415

Double Free

Calling free() on memory that has already been freed, leading to heap corruption.

CWE-401

Memory Leak

Failing to release allocated memory, causing resource exhaustion over time.

Evolution

ExecFormer was developed through five iterative phases, each building on the insights of the previous one:

Phase 0: Foundation. Abstract VM and synthetic data generation (500K programs).

Phase 1: Looped Transformer. Shared-weight iteration with FiLM modulation (231K params, 98.8% accuracy).

Phase 2: Cached Embeddings. Transfer to real C code via Gemma-3-27B (dev F1 > 0.75).

Phase 3: End-to-End. LoRA fine-tuning of the full pipeline.

Phase 4: SOTA Model. Focal loss tuning, seed sweep, alpha=0.75, gamma=2.0, threshold=0.30, F1=0.800.

References

[1] A. Jolicoeur-Martineau, "Less is More: Recursive Reasoning with Tiny Networks," arXiv:2510.04871, 2025.
[2] Percepta AI, "Can LLMs Be Computers?" Percepta AI Blog, 2025.
[3] D. N. Ng, "Repeat Yourself (RYS): Layer Duplication for LLMs," 2024.
[4] C. Wen et al., "R2Vul: Learning to Rank for Vulnerability Detection," ICSE, 2025.
[5] Google, "Gemma 3 Technical Report," Google DeepMind, 2025.
[6] J. Tao, L. Zhang, "ExecFormer: Neural Abstract Interpretation for Memory Safety," University of Florida, 2026.
[7] R. Patel et al., "Repeat the Prompt Twice: Improving LLM Reasoning Without Training," arXiv:2512.14982, Dec 2024.
[8] 3Blue1Brown, "How LLMs Store Facts in MLP Layers," YouTube, 2024.
[9] J. Tao, L. Zhang, "DeepPass: Greedy Spectral Stacking for LLM Layer Duplication," University of Florida, 2026.