StackBench

How StackBench Works

This guide explains StackBench’s internal architecture, pipeline stages, and data flow. Understanding how StackBench works helps you troubleshoot issues, interpret results, and contribute to the project.

Architecture Overview

StackBench implements a five-stage pipeline that transforms repositories into actionable insights about coding agent performance:

Repository → Clone → Extract → Execute → Analyze → Results
     ↓         ↓        ↓        ↓         ↓
   GitHub   Workspace  Use Cases  Code   Insights

Each benchmark run progresses through these distinct phases, with state management at every step so that runs are reliable and resumable.

Pipeline Stages

Stage 1: Clone - Repository Setup

Command: stackbench clone <repo-url>

Purpose: Create isolated workspace and prepare repository for analysis

What happens:

  1. Generate unique run ID (UUID) and create directory structure
  2. Clone target repository to ./data/<run-id>/repo/
  3. Remove non-documentation files - Keep only .md and .mdx files
  4. Initialize run context with configuration and metadata
  5. Scan for documentation files in specified folders
  6. Validate repository structure and prepare for extraction

Why remove code files? StackBench focuses on testing how coding agents use documentation alone to implement features. By removing source code files, we ensure agents rely solely on the documentation (.md/.mdx files) that remains in the workspace, not on existing implementations they could copy.

This provides a pure test of documentation quality and agent comprehension.
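
As a rough sketch of this stage (the helper name and layout below are hypothetical, not StackBench's actual code), the clone step boils down to:

# Illustrative sketch of the clone stage, following the steps listed above.
import subprocess
import uuid
from pathlib import Path

KEEP_SUFFIXES = {".md", ".mdx"}

def clone_run(repo_url: str, base_dir: str = "./data") -> Path:
    run_dir = Path(base_dir) / str(uuid.uuid4())            # 1. unique run ID
    repo_dir = run_dir / "repo"
    (run_dir / "data").mkdir(parents=True, exist_ok=True)   # future benchmark artifacts

    subprocess.run(["git", "clone", repo_url, str(repo_dir)], check=True)  # 2. clone

    # 3. keep only documentation files so agents rely on docs, not source code
    for path in repo_dir.rglob("*"):
        if ".git" in path.parts:
            continue
        if path.is_file() and path.suffix not in KEEP_SUFFIXES:
            path.unlink()
    return run_dir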

Key components:

Files created:

./data/<run-id>/
├── repo/                    # Cloned repository
├── run_context.json        # Run configuration and state
└── data/                   # Future benchmark artifacts

Stage 2: Extract - Use Case Generation

Command: stackbench extract <run-id>

Purpose: Generate realistic coding tasks from library documentation

What happens:

  1. Scan documentation files (.md, .mdx) in specified folders
  2. Analyze content with DSPy using OpenAI
  3. Extract library-specific patterns and common use cases
  4. Generate diverse use cases covering different complexity levels
  5. Validate use case structure and requirements
  6. Store structured data in <run-id>/data/use_cases.json

Key components:

Extraction stages within this phase:

  1. Document scanning - Find all .md/.mdx files in specified folders
  2. Content preprocessing - Clean and structure documentation content
  3. DSPy analysis - AI-powered extraction using structured prompts
  4. Use case validation - Pydantic model validation and quality checks
  5. JSON serialization - Save structured use cases to use_cases.json

DSPy workflow:

# Simplified extraction process (function and class names are illustrative)
docs_content = scan_markdown_files(repo_path, include_folders)    # gather .md/.mdx content
use_cases = dspy_extractor.extract_use_cases(docs_content)        # DSPy-powered extraction
validated_cases = [UseCase.validate(case) for case in use_cases]  # Pydantic validation
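
Under the hood, the DSPy step can be pictured as a signature plus a module. The sketch below is an assumption about the shape of that program (field names and the model identifier are examples, not StackBench's actual signatures):

import dspy

# Example model identifier; the configured OpenAI model may differ.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractUseCases(dspy.Signature):
    """Extract realistic, implementable coding use cases from library documentation."""

    docs_content: str = dspy.InputField(desc="Concatenated .md/.mdx documentation")
    use_cases: list[str] = dspy.OutputField(desc="One structured use case per entry")

extractor = dspy.ChainOfThought(ExtractUseCases)
# docs_content is the markdown text gathered in the scan step above
prediction = extractor(docs_content=docs_content)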

Generated use case structure:

{
  "name": "Basic Query Implementation",
  "elevator_pitch": "Demonstrates core querying patterns...",
  "target_audience": "Developers new to the library",
  "complexity_level": "Beginner",
  "functional_requirements": [
    "Import the main query module",
    "Create a query with user input",
    "Execute and return results"
  ],
  "user_stories": [
    "As a developer, I want to create simple queries..."
  ],
  "system_design": "Follow repository patterns for query handling",
  "architecture_pattern": "Factory pattern for query creation"
}
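
This structure maps naturally onto a Pydantic model. A sketch along these lines, with field names taken from the example above (the real model may add extra fields or validation rules):

import json
from pathlib import Path

from pydantic import BaseModel

class UseCase(BaseModel):
    name: str
    elevator_pitch: str
    target_audience: str
    complexity_level: str
    functional_requirements: list[str]
    user_stories: list[str]
    system_design: str
    architecture_pattern: str

# Assuming use_cases.json stores a JSON list of these objects;
# model_validate is the Pydantic v2 spelling of the v1 validate() used above.
raw_cases = json.loads(Path("./data/<run-id>/data/use_cases.json").read_text())
cases = [UseCase.model_validate(raw) for raw in raw_cases]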

Stage 3: Execute - Implementation Phase

Purpose: Generate code solutions through coding agents

This stage has two distinct workflows depending on agent type:

IDE Agents (Manual Execution)

Agents: Cursor, VS Code with AI extensions

Workflow:

  1. Generate formatted prompts for human operators
  2. Provide execution guidance and target file locations
  3. Wait for manual completion of all use cases
  4. Validate solution files exist before proceeding

Human workflow:

# Get formatted prompt
stackbench print-prompt <run-id> -u 1 --copy

# Manual execution in IDE:
# 1. Open repository in Cursor
# 2. ⚠️ WAIT for Cursor indexing to complete (critical for context)
# 3. Paste prompt in chat
# 4. Let agent explore and implement
# 5. Saves to: ./data/<run-id>/data/use_case_1/solution.py

# Repeat for all use cases...
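
Before moving on to analysis, a check along these lines confirms that every use case has a solution file (a sketch of step 4 above; the CLI's own validation may differ in detail):

from pathlib import Path

def missing_solutions(data_dir: Path, use_case_count: int) -> list[Path]:
    """Return the solution.py paths that have not been written yet."""
    expected = [data_dir / f"use_case_{i}" / "solution.py"
                for i in range(1, use_case_count + 1)]
    return [path for path in expected if not path.exists()]

# e.g. missing_solutions(Path("./data/<run-id>/data"), use_case_count=5)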

CLI Agents (Automated Execution)

Agents: OpenAI API, local LLMs, Claude Code

Workflow (Coming Soon):

  1. Create execution environment for each use case
  2. Execute agent with API calls providing use case context
  3. Generate solution files automatically
  4. Log execution output and error messages
  5. Track completion status and performance metrics

Automated workflow:

# Planned automated execution (interface names are illustrative)
for use_case in use_cases:
    # Each call is expected to write solution.py and log output/errors
    result = agent.execute_use_case(use_case, run_context)

Directory structure after execution:

./data/<run-id>/
├── repo/                    # Original repository
├── data/
│   ├── use_cases.json      # Generated use cases
│   ├── use_case_1/
│   │   └── solution.py     # Implementation
│   ├── use_case_2/
│   │   └── solution.py     # Implementation
│   └── ...
└── run_context.json        # Updated with execution status

Stage 4: Analyze Individual - Per-Use-Case Analysis

Command: stackbench analyze <run-id> (processes all use cases)

Purpose: Evaluate each implementation for quality, correctness, and library usage

What happens:

  1. Test code executability by running each solution.py
  2. Analyze library usage patterns (real vs mocked implementations)
  3. Extract documentation consultation from code comments
  4. Evaluate implementation quality using AI analysis
  5. Generate structured assessment for each use case
  6. Save individual analysis to JSON files
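
A sketch of the executability check in step 1 (simplified; the real analysis also captures richer context for the AI evaluation):

import subprocess
import sys

def test_executability(solution_path: str, timeout: int = 120) -> dict:
    """Run a solution.py and record whether it executes cleanly (illustrative helper)."""
    try:
        proc = subprocess.run(
            [sys.executable, solution_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"is_executable": False, "stdout": "", "stderr": f"timed out after {timeout}s"}
    return {
        "is_executable": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }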

Analysis dimensions:

Key components:

Claude Code Integration:

Generated analysis files:

./data/<run-id>/data/
├── use_case_1/
│   ├── solution.py
│   ├── use_case_1_analysis.json         # Structured analysis results
│   └── use_case_1_analysis_messages.json   # Complete Claude Code conversation
├── use_case_2/
│   ├── solution.py  
│   ├── use_case_2_analysis.json         # Structured analysis results
│   └── use_case_2_analysis_messages.json   # Complete Claude Code conversation
└── ...

Analysis file contents:

Stage 5: Analyze Overall - Final Report Generation

Purpose: Synthesize individual analyses into comprehensive insights

What happens:

  1. Aggregate individual analysis results from all use cases
  2. Calculate overall success metrics (pass/fail, success rate)
  3. Identify common failure patterns across use cases
  4. Generate actionable insights for library maintainers
  5. Create dual output formats (JSON + Markdown)
  6. Mark run as completed

Key metrics:

Report generation workflow:

# Simplified overall analysis (helper names are illustrative)
individual_analyses = load_all_individual_analyses(run_context)

overall_analysis = {
    "pass_fail_status": calculate_overall_status(individual_analyses),
    "success_rate": calculate_success_rate(individual_analyses),
    "common_failures": identify_failure_patterns(individual_analyses),
    "library_insights": generate_targeted_insights(individual_analyses)
}

# Generate dual outputs
save_json_report(overall_analysis, "results.json")
save_markdown_report(overall_analysis, "results.md")
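
As a concrete example of the success-rate metric above, a helper might look like this (the "passed" field name is an assumption about the per-use-case analysis JSON):

def calculate_success_rate(individual_analyses: list[dict]) -> float:
    """Fraction of use cases whose individual analysis passed."""
    if not individual_analyses:
        return 0.0
    # "passed" is an assumed field name in use_case_*_analysis.json
    passed = sum(1 for analysis in individual_analyses if analysis.get("passed"))
    return passed / len(individual_analyses)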

Final output files:

./data/<run-id>/
├── results.json            # Structured data for programmatic access
├── results.md              # Human-readable analysis report
├── run_context.json        # Final run state (completed)
└── data/                   # All execution artifacts

Run Context and State Management

RunContext Architecture

Core components:

Phase Progression

Runs progress through seven distinct phases:

created → cloned → extracted → execution → analysis_individual → analysis_overall → completed
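
One way to picture this sequence is as an ordered enum with a single advance step (a sketch, not the actual RunContext code):

from enum import Enum

class Phase(str, Enum):
    CREATED = "created"
    CLONED = "cloned"
    EXTRACTED = "extracted"
    EXECUTION = "execution"
    ANALYSIS_INDIVIDUAL = "analysis_individual"
    ANALYSIS_OVERALL = "analysis_overall"
    COMPLETED = "completed"

PHASE_ORDER = list(Phase)

def next_phase(current: Phase) -> Phase:
    """Advance to the following phase; completed is terminal."""
    index = PHASE_ORDER.index(current)
    return PHASE_ORDER[min(index + 1, len(PHASE_ORDER) - 1)]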

Phase transitions:

Phase completion criteria:

State Persistence

RunContext persistence:

{
  "config": {
    "repository_url": "https://github.com/user/repo",
    "include_folders": ["docs", "examples"],
    "agent_type": "cursor",
    "dspy_settings": {...}
  },
  "status": {
    "current_phase": "analysis_individual",
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T11:45:00Z",
    "use_case_count": 5,
    "execution_completed": true,
    "individual_analysis_completed": false,
    "overall_analysis_completed": false
  },
  "directories": {
    "run_dir": "./data/abc123.../",
    "repo_dir": "./data/abc123.../repo/",
    "data_dir": "./data/abc123.../data/"
  },
  "errors": []
}
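
Because this file is updated as the run progresses, a command can pick up where the run left off by reading the persisted state first; roughly (field names as shown above):

import json
from pathlib import Path

def load_current_phase(run_dir: str) -> str:
    """Read the persisted phase so a command can resume an interrupted run."""
    context = json.loads((Path(run_dir) / "run_context.json").read_text())
    return context["status"]["current_phase"]

# e.g. load_current_phase("./data/<run-id>") -> "analysis_individual"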

Agent Type Differences

IDE Agents

Characteristics:

Execution flow:

setup → print-prompt → manual execution → analyze

CLI Agents (Future)

Characteristics:

Execution flow:

run → automatic execution → analyze

Directory Structure Deep Dive

Complete Run Directory

./data/<run-id>/
├── repo/                           # Cloned repository (read-only)
│   ├── README.md
│   ├── docs/
│   └── examples/
├── data/                           # Benchmark execution data
│   ├── use_cases.json             # Generated use cases (extract stage)
│   ├── use_case_1/
│   │   ├── solution.py             # Implementation (execute stage)
│   │   └── use_case_1_analysis.json  # Analysis (individual analysis)
│   ├── use_case_2/
│   │   ├── solution.py
│   │   └── use_case_2_analysis.json
│   └── ...
├── run_context.json               # Run state and configuration
├── results.json                   # Final structured results
└── results.md                     # Final analysis report

File Dependencies

Stage dependencies:

run_context.json ← Always maintained
    ↓
repo/ ← Clone stage
    ↓  
data/use_cases.json ← Extract stage
    ↓
data/use_case_*/solution.py ← Execute stage
    ↓
data/use_case_*/use_case_*_analysis.json ← Individual analysis
    ↓
results.json + results.md ← Overall analysis
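
These dependencies also describe what must exist on disk before each stage can run. A compact way to express that (a hypothetical mapping that mirrors the chain above, not StackBench's actual checks):

from pathlib import Path

# Files a stage needs before it can start (one use case shown;
# in practice every use_case_* directory is checked).
STAGE_PREREQUISITES = {
    "extract": ["run_context.json", "repo"],
    "execute": ["data/use_cases.json"],
    "analysis_individual": ["data/use_case_1/solution.py"],
    "analysis_overall": ["data/use_case_1/use_case_1_analysis.json"],
}

def missing_prerequisites(run_dir: str, stage: str) -> list[str]:
    """Return the relative paths a stage still needs before it can start."""
    root = Path(run_dir)
    return [rel for rel in STAGE_PREREQUISITES[stage] if not (root / rel).exists()]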

Next Steps

Contributing to StackBench

Understanding the pipeline stages helps you contribute effectively:

The modular architecture makes it easy to enhance individual stages without affecting the overall pipeline.