CS292C-15: Computer-Aided Reasoning for Software
Lecture 15: AI/LLMs for Verification, Synthesis & Repair

by Yu Feng

Agenda
1. Formal Methods Foundations: explore how formal methods operate through search & ranking mechanisms at their core
2. LLM Integration: examine how LLMs can enhance traditional heuristics with more robust, adaptive approaches
3. Domain Applications: investigate three critical domains (verification, synthesis, and repair) through concrete, executable examples
4. Research Directions: analyze emerging challenges and promising research opportunities in the field
Top-Down View: Search Space & Ranking
Flow: Spec → generate (ranking heuristic) → Candidate Program / Proof / Patch
Traditional Formal Methods
  • Handmade grammars, cost functions, greedy templates
  • Weighted clause learning, genetic fitness, random walks
  • Domain-specific heuristics
LLM-Enhanced Approach
  • Data-driven priors to propose candidates
  • Statistical models to rank candidates
  • Replaces ad-hoc heuristics with learned patterns
Where LLMs Plug In
Input: Spec & Tests
Starting point with specifications and test cases
LLM First Guess
Language model generates initial candidate solution
Formal Checker
SMT / proof kernel / tests verify correctness
Feedback Loop
If check fails → counterexample fed back to LLM
Cycle continues until proven correct or budget exhausted
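A minimal sketch of this guess-and-check loop in C. The LLM and the checker are simulated with toy stubs (llm_propose and formally_check are illustrative names, not a real API), so the control flow runs on its own:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-ins: in a real pipeline llm_propose would call an LLM
 * and formally_check an SMT solver or proof kernel; both are
 * simulated here so the sketch is self-contained. */
static const char *guesses[] = { "s == i", "s == i*(i+1)/2" };
static int next_guess = 0;

/* "LLM": emits its next-most-likely candidate invariant. */
static const char *llm_propose(const char *feedback) {
    (void)feedback;  /* a real LLM would condition on the counterexample */
    return guesses[next_guess++];
}

/* "Checker": accepts only the provably correct candidate. */
static bool formally_check(const char *candidate) {
    return strcmp(candidate, "s == i*(i+1)/2") == 0;
}

int main(void) {
    const char *feedback = NULL;
    for (int budget = 2; budget > 0; budget--) {
        const char *c = llm_propose(feedback);  /* statistical prior */
        if (formally_check(c)) {                /* soundness gate */
            printf("verified: %s\n", c);
            return 0;
        }
        feedback = c;  /* failed candidate becomes the next prompt's hint */
    }
    printf("budget exhausted\n");
    return 1;
}

In a real pipeline the counterexample would be serialized into the next prompt; here it is threaded through only to show the loop structure.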
Part A: LLM-Aided Verification
Verification traditionally relies on formal logic to prove program correctness against specifications.
LLMs can assist by:
  • Generating verification conditions from informal specs
  • Proposing invariants for complex loops
  • Translating between specification formats
  • Explaining verification failures in natural language
Specifications
Formal & informal requirements
LLM Processing
Generate verification artifacts
Formal Verification
Prove correctness with SMT solvers
Interpretation
Explain results in natural language
Task = Search for Proof Objects
Automated search for proof objects must navigate complex spaces while maintaining formal guarantees.
Search Space Components
  • Tactic sequences for proof construction
  • Loop invariants for program verification
  • Inductive measures for termination proofs
Ranking Mechanisms
  • Evaluate which tactics likely close proof goals
  • Prioritize invariants with highest probability of verification
  • Score candidate proofs based on simplicity and generality
Division of Labor
  • LLMs provide statistical priors over proof space
  • SMT solvers and ITPs guarantee logical soundness
  • Counterexamples guide refinement of proof candidates
Example 1 – Loop Invariant Suggestion
For correctness verification, we need to identify a loop invariant:
int sum_n(int n) {
    int s = 0, i = 0;
    while (i < n) {   /* invariant: ? */
        i++;
        s += i;
    }
    return s;         /* want: s == n*(n+1)/2 */
}
The LLM can suggest the invariant s == i*(i+1)/2 (together with 0 <= i <= n) by recognizing the pattern of summing consecutive integers.
Why this works:
  • Initialization: i=0, s=0, and 0*(0+1)/2 = 0 ✓
  • Preservation: each iteration increments i and adds the new value of i to s, so s == i*(i+1)/2 still holds
  • Exit: i=n, so the invariant gives s = n*(n+1)/2 ✓
Without a correct invariant, formal verification would fail to prove the implementation meets the specification.
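Before attempting a formal proof, a cheap sanity check is to assert the candidate invariant at the loop head and exit; a minimal sketch (dynamic testing only, not a proof):

#include <assert.h>
#include <stdio.h>

/* sum_n with the suggested invariant s == i*(i+1)/2 asserted at the
 * loop head. A verifier would prove it for all n >= 0; the asserts
 * merely test it dynamically, but they reject wrong candidates fast. */
int sum_n(int n) {
    int s = 0, i = 0;
    while (i < n) {
        assert(s == i * (i + 1) / 2);  /* candidate loop invariant */
        i++;
        s += i;
    }
    assert(s == i * (i + 1) / 2);      /* still holds at exit, where i == n */
    return s;
}

int main(void) {
    for (int n = 0; n <= 1000; n++)
        assert(sum_n(n) == n * (n + 1) / 2);  /* the specification */
    puts("invariant held on all tested inputs");
    return 0;
}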
Example 2 - Smart Contract Invariant
Smart contracts require invariants to ensure security and correctness of financial transactions.
contract TokenVault {
    mapping(address => uint) balances;
    uint totalSupply;
    // Invariant: sum(balances) == totalSupply
    function transfer(address to, uint amount) public {
        require(balances[msg.sender] >= amount);
        balances[msg.sender] -= amount;
        balances[to] += amount;
    }
}
Why this invariant matters:
  • Prevents token creation or destruction during transfers
  • Ensures conservation of assets across all operations
  • Protects against economic exploits that could drain funds
Without verifying this invariant holds before and after each transaction, the contract could allow double-spending or token inflation, potentially leading to millions in financial losses.
Learning Contract Invariants Using Reinforcement Learning. ASE, 2022
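The invariant can also be checked on a model of the contract outside Solidity. Below is a toy C model (addresses become array indices; the names are illustrative) that asserts sum(balances) == totalSupply before and after every transfer, which is exactly what a contract verifier would prove for all reachable states:

#include <assert.h>
#include <stdio.h>

#define ACCOUNTS 4

/* Toy model of TokenVault: addresses are array indices. */
static unsigned balances[ACCOUNTS];
static unsigned total_supply;

static unsigned sum_balances(void) {
    unsigned s = 0;
    for (int a = 0; a < ACCOUNTS; a++) s += balances[a];
    return s;
}

/* transfer() guarded by the conservation invariant. */
static int transfer(int from, int to, unsigned amount) {
    assert(sum_balances() == total_supply);  /* pre: invariant holds */
    if (balances[from] < amount) return 0;   /* require(...) */
    balances[from] -= amount;
    balances[to]   += amount;
    assert(sum_balances() == total_supply);  /* post: nothing minted or burned */
    return 1;
}

int main(void) {
    balances[0] = 70; balances[1] = 30;
    total_supply = 100;
    transfer(0, 1, 50);
    transfer(1, 2, 80);
    printf("invariant preserved; total = %u\n", sum_balances());
    return 0;
}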
Example 3: o3 LLM → CVE-2025-37899 in Linux ksmbd
  • Context – ksmbd is the in-kernel SMB 3 server.
  • Researcher Sean Heelan fed ~3.3k LoC of the session-setup command handler (call depth 3) to OpenAI's o3 model.
  • Prompt goal: "find use-after-free vulnerabilities".
  • o3 output pinpointed a path where a sess->user object is freed under one thread and later accessed by another, yielding a remote use-after-free (UAF) in the logoff code path.
  • Assigned CVE-2025-37899; fix merged upstream.
  • Significance: first public case of an LLM independently reasoning about concurrency to discover a zero-day in the Linux kernel.
PART B: LLM-Guided Synthesis
Task = Search for Program in DSL
1. Search Space: grammar-generated expressions / sketches with holes
2. Ranking: which sketch likely realizes the specification?
3. Division of Labor: LLM writes a high-probability sketch; the solver fills the holes
Example 4 – Max of Two (C + Sketch)
LLMs can efficiently search the program space by generating sketches with holes that a solver can complete:

// Initial sketch with missing implementation
int max(int x, int y) { ?? }

// LLM-generated sketch with holes
int max(int x, int y) {
    if (x ??cmp?? y) return ??sel1??;
    else return ??sel2??;
}

// Completed implementation after solving
int max(int x, int y) {
    if (x > y) return x;
    else return y;
}

The LLM identifies the high-level structure (comparison + conditional return), while the solver determines the specific operator (>) and return values (x, y) to satisfy the max specification.
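The solver's half of this division of labor can be as simple as enumeration: try every way of filling the holes and keep the first completion that satisfies the specification on all test inputs. A brute-force sketch (a real synthesizer such as Sketch would encode the holes as SMT constraints instead):

#include <stdio.h>

/* Hole domains: ??cmp?? ranges over {<, >, <=, >=},
 * ??sel1??/??sel2?? range over {x, y}. */
static int cmp(int op, int x, int y) {
    switch (op) {
        case 0:  return x < y;
        case 1:  return x > y;
        case 2:  return x <= y;
        default: return x >= y;
    }
}
static int sel(int s, int x, int y) { return s == 0 ? x : y; }

/* The sketch with holes (op, s1, s2) left for the solver. */
static int sketch(int op, int s1, int s2, int x, int y) {
    return cmp(op, x, y) ? sel(s1, x, y) : sel(s2, x, y);
}

/* Specification: the result equals the max of x and y. */
static int spec_ok(int op, int s1, int s2) {
    int tests[][2] = { {0,0}, {1,2}, {2,1}, {-3,5}, {7,-7} };
    for (int i = 0; i < 5; i++) {
        int x = tests[i][0], y = tests[i][1];
        if (sketch(op, s1, s2, x, y) != (x > y ? x : y)) return 0;
    }
    return 1;
}

int main(void) {
    const char *ops[]  = { "<", ">", "<=", ">=" };
    const char *sels[] = { "x", "y" };
    for (int op = 0; op < 4; op++)
        for (int s1 = 0; s1 < 2; s1++)
            for (int s2 = 0; s2 < 2; s2++)
                if (spec_ok(op, s1, s2)) {
                    printf("if (x %s y) return %s; else return %s;\n",
                           ops[op], sels[s1], sels[s2]);
                    return 0;
                }
    return 1;
}

Several completions satisfy the spec (the enumerator finds the equivalent x < y variant first); the LLM's contribution is that only 16 candidates need checking rather than the full space of C programs.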
LLM-Guided Synthesis — When It Shines
Sweet-spot: Tons of Repetitive Training Data
Task: Auto-transform spreadsheet entries
"[email protected]" → "Alice Smith" "[email protected]" → "Bob Jones" ⋯ 100k similar rows
LLM strength:
  • Sees thousands of near-identical patterns in its pre-training corpus
  • Instantly suggests FlashFill-style DSL:
    Capitalize(Replace(Split(email,"@")[0], "_", " "))
  • Minimal logical reasoning; pattern frequency dominates
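The suggested DSL program translates directly into straight-line code. A C rendition for the rows above (name_from_email is an illustrative name; Capitalize is interpreted as title-casing each word):

#include <ctype.h>
#include <stdio.h>

/* C rendition of Capitalize(Replace(Split(email,"@")[0], "_", " ")). */
static void name_from_email(const char *email, char *out, size_t n) {
    size_t j = 0;
    int start_of_word = 1;
    for (const char *p = email; *p && *p != '@' && j + 1 < n; p++) {
        char c = (*p == '_') ? ' ' : *p;  /* Replace("_", " ") */
        out[j++] = start_of_word ? (char)toupper((unsigned char)c) : c;
        start_of_word = (c == ' ');       /* capitalize the next word */
    }
    out[j] = '\0';                        /* stop at '@': Split(email,"@")[0] */
}

int main(void) {
    char name[64];
    name_from_email("alice_smith@example.com", name, sizeof name);
    printf("%s\n", name);   /* Alice Smith */
    name_from_email("bob_jones@example.com", name, sizeof name);
    printf("%s\n", name);   /* Bob Jones */
    return 0;
}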
LLM-Guided Synthesis — When It Struggles
Weak-spot: Sparse Data, Deep Reasoning
Task: Synthesize insert function for a red-black tree from specification
Preserve: (1) binary-search order (2) red-black invariants
Why LLMs falter:
  • Very limited training data for complex data structures like verified red-black trees
  • Must handle multiple operations (color flips, rotations) while maintaining several invariants
  • LLMs alone often miss critical corner cases or produce incorrect solutions
Rule of thumb: Use an LLM to pattern-match in data-rich domains, but pair it with formal reasoning when specifications are sparse and correctness hinges on complex global logic.
PART C: LLM-Driven Repair
Task = Search for Patches to Fix Buggy Code
Search Space
Space of possible edits (insertions, deletions, modifications)
Ranking
Which patches are most likely to address the error without introducing new bugs?
Division of Labor
LLM Contribution
Propose high-probability patches based on error messages and context
Program Analysis
Verify correctness of patches using formal methods
Validation
Test suites validate functionality is preserved
LLM-Powered Program Repair — Best vs. Worst Scenarios
Where LLMs Excel (Data-Rich, Repetitive Fix Patterns)
Example — Null-Pointer Guard Patch
- int len = strlen(s);   // crashes when s == NULL
+ if (!s) return -1;
+ int len = strlen(s);
Thousands of open-source software commits add a one-line null-check or length-check in virtually identical contexts. LLMs trained on GitHub learn this "patch idiom" and propose the correct guard immediately; test suites simply need to confirm the fix works.
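Applied to a whole function, the patched idiom plus a quick regression check might look like this (safe_strlen is an illustrative name):

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Patched function: the one-line guard turns a crash into an error
 * code, the idiom LLMs have seen in thousands of commits. */
int safe_strlen(const char *s) {
    if (!s) return -1;         /* + guard added by the patch */
    int len = (int)strlen(s);  /* no longer reachable with s == NULL */
    return len;
}

int main(void) {
    assert(safe_strlen(NULL) == -1);  /* previously crashed */
    assert(safe_strlen("abc") == 3);  /* old behavior preserved */
    puts("patch validated");
    return 0;
}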
Where LLMs Struggle
Sparse Data, Deep Reasoning Required
Example: Cross-Thread Use-After-Free in Linux ksmbd
Bug occurs only when an auth failure thread frees sess->user and another thread dereferences it under rare timing conditions.
Complex Fix Required
Patch must add ref-counting or split critical sections—requiring deep reasoning about kernel locking, memory lifetimes, and ordering semantics.
Beyond LLM Capabilities Alone
Such fixes are rare in public corpora and require global concurrency analysis → pure LLMs often hallucinate or miss the real invariant.
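To make "add ref-counting" concrete, here is a generic C11 sketch of the idiom such a fix relies on; the names (session_user, user_get, user_put) are hypothetical, and this is not the actual ksmbd patch:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Generic ref-counting idiom (hypothetical names, NOT the real ksmbd
 * fix): each thread pins the object before use, so a concurrent free
 * cannot pull it out from under the other thread. */
struct session_user {
    atomic_int refcount;
    /* ... user state ... */
};

static struct session_user *user_get(struct session_user *u) {
    atomic_fetch_add(&u->refcount, 1);  /* pin while in use */
    return u;
}

static void user_put(struct session_user *u) {
    if (atomic_fetch_sub(&u->refcount, 1) == 1)
        free(u);                        /* last reference frees */
}

int main(void) {
    struct session_user *u = malloc(sizeof *u);
    atomic_init(&u->refcount, 1);             /* owner's reference */
    struct session_user *mine = user_get(u);  /* "second thread" pins */
    user_put(u);     /* owner drops: object survives with refcount 1 */
    user_put(mine);  /* last user drops: freed exactly once */
    puts("no use-after-free");
    return 0;
}

Spotting that this discipline (or an equivalent lock split) is needed requires reasoning about all interleavings, which is where the formal side of the loop earns its keep.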
Take-away: Let LLMs auto-patch common "cookie-cutter" defects, but keep formal tools in the loop for concurrency, logic, and other data-sparse bugs.
PART D: Pain Points & Challenges
Scalability Issues
Current approaches struggle with large codebases and complex specifications, often requiring significant computational resources to verify real-world systems.
Expressiveness vs. Automation Tradeoff
More expressive logics and specifications provide better modeling but reduce automation potential, creating tension between verification power and ease of use.
LLM Hallucinations
Language models may suggest plausible but incorrect invariants or program sketches, requiring robust validation mechanisms to ensure soundness.
Domain Knowledge Transfer
Translating domain-specific requirements into formal specifications remains challenging, creating a semantic gap between developer intent and verifiable properties.
Data Quality and Availability
The effectiveness of LLM-aided formal methods is limited by the amount and quality of available training data, especially in specialized domains where high-quality annotated examples of specifications, proofs, and program properties are scarce.
Hallucination & Unsound Suggestions
Non-existent APIs
LLMs may suggest fictional functions like "fast_mod_mul" that don't exist in actual libraries, leading to implementation errors.
Partial Proofs
Models often generate incomplete verification proofs using placeholders like "Admitted" instead of complete formal arguments.
Incorrect Constants
Critical errors in suggested cryptographic code with wrong constant values, potentially compromising security properties.
Spec Ambiguity & Over-Fitting
LLMs may generate patches that delete functionality while still satisfying tests, exploiting incomplete specifications.
Specification Quality Matters
Need richer specifications or hidden test suites to prevent LLMs from producing technically correct but undesired solutions.
Verification as Guardrails
Formal verification helps ensure LLM-proposed solutions maintain critical properties beyond simple test passing.
Key Take-Aways
Formal Methods = Search + Ranking
LLMs supply strong data-driven priors for guiding the search process
Application Areas
Verification → invariants & tactics; Synthesis → sketches; Repair → patches
Ensuring Correctness
Always keep a formal checker in the loop for soundness
Challenges
Current pain points: hallucination, specification gaps, scalability, auditability
Thank You!
CS292C: Computer-Aided Reasoning for Software
We appreciate your participation and engagement throughout this course.
We hope the concepts and techniques covered will be valuable in your future research and development work.