How StackBench Works
This guide explains StackBench’s internal architecture, pipeline stages, and data flow. Understanding how StackBench works helps you troubleshoot issues, interpret results, and contribute to the project.
Architecture Overview
StackBench implements a five-stage pipeline that transforms repositories into actionable insights about coding agent performance:
Repository  →  Clone   →   Extract   →  Execute  →  Analyze  →  Results
    ↓            ↓            ↓            ↓           ↓
  GitHub     Workspace    Use Cases      Code      Insights
Each benchmark run progresses through distinct phases with comprehensive state management ensuring reliability and resumability.
Pipeline Stages
Stage 1: Clone - Repository Setup
Command: stackbench clone <repo-url>
Purpose: Create isolated workspace and prepare repository for analysis
What happens:
- Generate unique run ID (UUID) and create directory structure
- Clone target repository to ./data/<run-id>/repo/
- Remove non-documentation files - keep only .md and .mdx files
- Initialize run context with configuration and metadata
- Scan for documentation files in specified folders
- Validate repository structure and prepare for extraction
Why remove code files? StackBench focuses on testing how coding agents use documentation alone to implement features. By removing source code files, we ensure agents rely solely on:
- README files and documentation
- API references and guides
- Examples and tutorials
- Usage patterns described in docs
This provides a pure test of documentation quality and agent comprehension.
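To make the documentation-only constraint concrete, here is a minimal sketch of the pruning step. The prune_non_docs helper and DOC_EXTENSIONS constant are illustrative names, not StackBench's actual API; the real logic is handled by the RepositoryManager.
from pathlib import Path

DOC_EXTENSIONS = {".md", ".mdx"}  # the only file types kept after cloning

def prune_non_docs(repo_dir: str) -> None:
    """Delete every file that is not Markdown documentation (illustrative sketch)."""
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() not in DOC_EXTENSIONS:
            path.unlink()  # empty directories are left in place for simplicity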
Key components:
- RepositoryManager: Handles Git operations and workspace setup
- RunContext: Tracks run state and configuration
- Directory structure: Isolated environment per benchmark run
Files created:
./data/<run-id>/
├── repo/ # Cloned repository
├── run_context.json # Run configuration and state
└── data/ # Future benchmark artifacts
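A rough sketch of how this layout could be created using only the standard library; the actual setup is handled by RepositoryManager and RunContext, and the run_context.json fields shown here are simplified assumptions.
import json
import uuid
from pathlib import Path

run_id = str(uuid.uuid4())                 # unique run ID
run_dir = Path("data") / run_id
(run_dir / "repo").mkdir(parents=True)     # clone target
(run_dir / "data").mkdir()                 # future benchmark artifacts
(run_dir / "run_context.json").write_text(
    json.dumps({"run_id": run_id, "current_phase": "created"}, indent=2)
)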
Stage 2: Extract - Use Case Generation
Command: stackbench extract <run-id>
Purpose: Generate realistic coding tasks from library documentation
What happens:
- Scan documentation files (.md, .mdx) in specified folders
- Analyze content with DSPy using OpenAI
- Extract library-specific patterns and common use cases
- Generate diverse use cases covering different complexity levels
- Validate use case structure and requirements
- Store structured data in <run-id>/data/use_cases.json
Key components:
- DSPy Extractor: AI-powered content analysis and use case generation
  - Located in: src/stackbench/extractors/extractor.py
  - Uses: src/stackbench/extractors/modules.py for DSPy modules
  - Signatures: src/stackbench/extractors/signatures.py for prompt structures
- Pydantic Models: Structured validation of generated use cases
  - Defined in: src/stackbench/extractors/models.py
- Token Management: Efficient processing of large documentation sets
  - Utilities in: src/stackbench/extractors/utils.py
Extraction stages within this phase:
- Document scanning - Find all .md/.mdx files in specified folders
- Content preprocessing - Clean and structure documentation content
- DSPy analysis - AI-powered extraction using structured prompts
- Use case validation - Pydantic model validation and quality checks
- JSON serialization - Save structured use cases to use_cases.json
DSPy workflow:
# Simplified extraction process
docs_content = scan_markdown_files(repo_path, include_folders)
use_cases = dspy_extractor.extract_use_cases(docs_content)
validated_cases = [UseCase.validate(case) for case in use_cases]
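The prompt structures live in src/stackbench/extractors/signatures.py. As a hypothetical illustration of the DSPy pattern only (the class name, fields, and model string below are assumptions, not the project's actual signatures):
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model name is illustrative

class ExtractUseCases(dspy.Signature):
    """Generate realistic coding use cases from library documentation."""
    documentation: str = dspy.InputField(desc="Markdown content from the cloned repository")
    use_cases: list[str] = dspy.OutputField(desc="One self-contained use case description per entry")

extractor = dspy.ChainOfThought(ExtractUseCases)
result = extractor(documentation="...docs content here...")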
Generated use case structure:
{
"name": "Basic Query Implementation",
"elevator_pitch": "Demonstrates core querying patterns...",
"target_audience": "Developers new to the library",
"complexity_level": "Beginner",
"functional_requirements": [
"Import the main query module",
"Create a query with user input",
"Execute and return results"
],
"user_stories": [
"As a developer, I want to create simple queries..."
],
"system_design": "Follow repository patterns for query handling",
"architecture_pattern": "Factory pattern for query creation"
}
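The structure above maps naturally onto a Pydantic model. The actual models are defined in src/stackbench/extractors/models.py; this is a simplified mirror of the JSON fields, assuming plain string and list types.
from pydantic import BaseModel

class UseCase(BaseModel):
    name: str
    elevator_pitch: str
    target_audience: str
    complexity_level: str
    functional_requirements: list[str]
    user_stories: list[str]
    system_design: str
    architecture_pattern: str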
Stage 3: Execute - Implementation Phase
Purpose: Generate code solutions through coding agents
This stage has two distinct workflows depending on agent type:
IDE Agents (Manual Execution)
Agents: Cursor, VS Code with AI extensions
Workflow:
- Generate formatted prompts for human operators
- Provide execution guidance and target file locations
- Wait for manual completion of all use cases
- Validate solution files exist before proceeding
Human workflow:
# Get formatted prompt
stackbench print-prompt <run-id> -u 1 --copy
# Manual execution in IDE:
# 1. Open repository in Cursor
# 2. ⚠️ WAIT for Cursor indexing to complete (critical for context)
# 3. Paste prompt in chat
# 4. Let agent explore and implement
# 5. Saves to: ./data/<run-id>/data/use_case_1/solution.py
# Repeat for all use cases...
CLI Agents (Automated Execution)
Agents: OpenAI API, local LLMs, Claude Code
Workflow (Coming Soon):
- Create execution environment for each use case
- Execute agent with API calls providing use case context
- Generate solution files automatically
- Log execution output and error messages
- Track completion status and performance metrics
Automated workflow:
# Planned automated execution
for use_case in use_cases:
result = agent.execute_use_case(use_case, run_context)
Directory structure after execution:
./data/<run-id>/
├── repo/ # Original repository
├── data/
│ ├── use_cases.json # Generated use cases
│ ├── use_case_1/
│ │ └── solution.py # Implementation
│ ├── use_case_2/
│ │ └── solution.py # Implementation
│ └── ...
└── run_context.json # Updated with execution status
Stage 4: Analyze Individual - Per-Use-Case Analysis
Command: stackbench analyze <run-id>
Purpose: Evaluate each implementation for quality, correctness, and library usage
What happens:
- Test code executability by running each solution.py
- Analyze library usage patterns (real vs mocked implementations)
- Extract documentation consultation from code comments
- Evaluate implementation quality using AI analysis
- Generate structured assessment for each use case
- Save individual analysis to JSON files
Analysis dimensions:
- Code Executability: Does the code run without errors?
- Library Usage: Real APIs vs mock/fake implementations
- Documentation Consultation: Evidence of using library docs
- Quality Assessment: Completeness, clarity, accuracy scores
- Failure Analysis: Specific errors and root causes
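A minimal sketch of the first dimension, code executability, assuming a plain subprocess call; the returned field names and the timeout are illustrative, and the real check lives in the analyzer.
import subprocess

def check_executability(solution_path: str, timeout: int = 120) -> dict:
    """Run a solution file and record whether it exits cleanly (illustrative)."""
    try:
        proc = subprocess.run(
            ["python", solution_path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"is_executable": proc.returncode == 0, "stderr": proc.stderr[-2000:]}
    except subprocess.TimeoutExpired:
        return {"is_executable": False, "stderr": f"timed out after {timeout}s"}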
Key components:
- IndividualAnalyzer: Per-use-case analysis using Claude Code CLI
  - Located in: src/stackbench/analyzers/individual_analyzer.py
  - Uses Claude Code CLI subprocess calls for analysis
  - Configurable model and worker settings
- Code Execution Testing: Runtime validation by executing solution.py files
- Pattern Recognition: Identifying common success/failure patterns
- Quality Scoring: Structured evaluation metrics
Claude Code Integration:
- CLI Tool: Requires npm install -g @anthropic-ai/claude-code
- API Key: Uses the ANTHROPIC_API_KEY environment variable
- Model: Configurable via CLAUDE_MODEL (default: claude-sonnet-4)
- Parallel Workers: Configurable via ANALYSIS_MAX_WORKERS (default: 3)
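A short sketch of how these settings could be resolved with the standard library; the environment variable names and defaults come from the list above, but the exact loading code in StackBench may differ.
import os

api_key = os.environ["ANTHROPIC_API_KEY"]                       # required
model = os.environ.get("CLAUDE_MODEL", "claude-sonnet-4")       # analysis model
max_workers = int(os.environ.get("ANALYSIS_MAX_WORKERS", "3"))  # parallel analyses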
Generated analysis files:
./data/<run-id>/data/
├── use_case_1/
│ ├── solution.py
│ ├── use_case_1_analysis.json # Structured analysis results
│ └── use_case_1_analysis_messages.json # Complete Claude Code conversation
├── use_case_2/
│ ├── solution.py
│ ├── use_case_2_analysis.json # Structured analysis results
│ └── use_case_2_analysis_messages.json # Complete Claude Code conversation
└── ...
Analysis file contents:
- use_case_N_analysis.json: Structured analysis results (executability, library usage, quality scores)
- use_case_N_analysis_messages.json: Complete Claude Code CLI conversation log with all messages, tool calls, and responses for debugging and transparency
Command: stackbench analyze <run-id> (processes all use cases)
Stage 5: Analyze Overall - Final Report Generation
Purpose: Synthesize individual analyses into comprehensive insights
What happens:
- Aggregate individual analysis results from all use cases
- Calculate overall success metrics (pass/fail, success rate)
- Identify common failure patterns across use cases
- Generate actionable insights for library maintainers
- Create dual output formats (JSON + Markdown)
- Mark run as completed
Key metrics:
- Pass/Fail Status: Overall library readiness for coding agents
- Success Rate: Percentage of successfully completed use cases
- Common Failures: Top error patterns with frequency analysis
- Library-Specific Insights: Targeted recommendations for improvement
Report generation workflow:
individual_analyses = load_all_individual_analyses(run_context)
overall_analysis = {
"pass_fail_status": calculate_overall_status(individual_analyses),
"success_rate": calculate_success_rate(individual_analyses),
"common_failures": identify_failure_patterns(individual_analyses),
"library_insights": generate_targeted_insights(individual_analyses)
}
# Generate dual outputs
save_json_report(overall_analysis, "results.json")
save_markdown_report(overall_analysis, "results.md")
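The helper names above come from the snippet; as one possible, hypothetical implementation of the aggregation step, assuming each individual analysis exposes is_executable and failure_category fields:
from collections import Counter

def calculate_success_rate(individual_analyses: list[dict]) -> float:
    """Share of use cases whose implementation ran successfully (illustrative fields)."""
    if not individual_analyses:
        return 0.0
    passed = sum(1 for a in individual_analyses if a.get("is_executable"))
    return passed / len(individual_analyses)

def identify_failure_patterns(individual_analyses: list[dict]) -> list[tuple[str, int]]:
    """Most frequent failure categories across use cases (illustrative fields)."""
    failures = Counter(
        a.get("failure_category", "unknown")
        for a in individual_analyses
        if not a.get("is_executable")
    )
    return failures.most_common(5)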
Final output files:
./data/<run-id>/
├── results.json # Structured data for programmatic access
├── results.md # Human-readable analysis report
├── run_context.json # Final run state (completed)
└── data/ # All execution artifacts
Run Context and State Management
RunContext Architecture
Core components:
- RunConfig: Repository URL, include folders, agent type, DSPy settings
- RunStatus: Phase tracking, completion flags, execution counts, error logs
- Directory Management: Automatic path resolution for all artifacts
- Persistence: Auto-saves state changes to run_context.json
Phase Progression
Runs progress through seven distinct phases:
created → cloned → extracted → execution → analysis_individual → analysis_overall → completed
Phase transitions:
- Automatic advancement when completion criteria are met
- Validation checks before each phase transition
- Error tracking with detailed timestamps
- Resume capability from any interrupted phase
Phase completion criteria:
- execution: All use cases have execution status (success or failure)
- analysis_individual: All executable use cases have analysis files
- analysis_overall: Final report files generated
- completed: All work finished successfully
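A hypothetical sketch of how phase advancement could work once a phase's criteria are met; the real logic is part of RunContext, and save() here stands in for its auto-persistence.
PHASES = [
    "created", "cloned", "extracted", "execution",
    "analysis_individual", "analysis_overall", "completed",
]

def advance_phase(run_context) -> str:
    """Move to the next phase after validation checks pass (illustrative)."""
    current = run_context.status.current_phase
    if current == PHASES[-1]:
        return current                      # already completed
    run_context.status.current_phase = PHASES[PHASES.index(current) + 1]
    run_context.save()                      # persist to run_context.json
    return run_context.status.current_phase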
State Persistence
RunContext persistence:
{
"config": {
"repository_url": "https://github.com/user/repo",
"include_folders": ["docs", "examples"],
"agent_type": "cursor",
"dspy_settings": {...}
},
"status": {
"current_phase": "analysis_individual",
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T11:45:00Z",
"use_case_count": 5,
"execution_completed": true,
"individual_analysis_completed": false,
"overall_analysis_completed": false
},
"directories": {
"run_dir": "./data/abc123.../",
"repo_dir": "./data/abc123.../repo/",
"data_dir": "./data/abc123.../data/"
},
"errors": []
}
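Because the state is plain JSON, resuming a run only requires re-reading this file. A minimal sketch using the standard library (the real persistence is handled inside RunContext):
import json
from pathlib import Path

def load_run_context(run_dir: str) -> dict:
    """Read the persisted run state, e.g. to resume an interrupted run."""
    return json.loads((Path(run_dir) / "run_context.json").read_text())

def save_run_context(run_dir: str, context: dict) -> None:
    """Write state changes back so every phase transition is durable."""
    (Path(run_dir) / "run_context.json").write_text(json.dumps(context, indent=2))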
Agent Type Differences
IDE Agents
Characteristics:
- Human-in-the-loop: Requires manual interaction
- Repository context: Full access to cloned repository
- Interactive refinement: Can iterate and improve implementations
- Real-world simulation: Mirrors actual developer workflow
Execution flow:
setup → print-prompt → manual execution → analyze
CLI Agents (Future)
Characteristics:
- Fully automated: No human intervention required
- API-driven: Direct integration with AI services
- Batch processing: Can process multiple use cases in parallel
- Consistent execution: Eliminates human variability
Execution flow:
run → automatic execution → analyze
Directory Structure Deep Dive
Complete Run Directory
./data/<run-id>/
├── repo/ # Cloned repository (read-only)
│ ├── README.md
│ ├── docs/
│ └── examples/
├── data/ # Benchmark execution data
│ ├── use_cases.json # Generated use cases (extract stage)
│ ├── use_case_1/
│ │ ├── solution.py # Implementation (execute stage)
│ │ └── use_case_1_analysis.json # Analysis (individual analysis)
│ ├── use_case_2/
│ │ ├── solution.py
│ │ └── use_case_2_analysis.json
│ └── ...
├── run_context.json # Run state and configuration
├── results.json # Final structured results
└── results.md # Final analysis report
File Dependencies
Stage dependencies:
run_context.json ← Always maintained
↓
repo/ ← Clone stage
↓
data/use_cases.json ← Extract stage
↓
data/use_case_*/solution.py ← Execute stage
↓
data/use_case_*/use_case_*_analysis.json ← Individual analysis
↓
results.json + results.md ← Overall analysis
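One way to express this dependency chain in code, as a hypothetical pre-flight check before resuming a stage; the paths mirror the diagram above, but StackBench's own validation may differ.
from pathlib import Path

def artifact_status(run_dir: str) -> dict[str, bool]:
    """Report which stage artifacts already exist in a run directory (illustrative)."""
    run = Path(run_dir)
    data = run / "data"
    return {
        "repo cloned": (run / "repo").exists(),
        "use cases extracted": (data / "use_cases.json").exists(),
        "solutions present": any(data.glob("use_case_*/solution.py")),
        "individual analyses done": any(data.glob("use_case_*/use_case_*_analysis.json")),
        "final report generated": (run / "results.json").exists(),
    }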
Next Steps
- Getting Started - Try the complete workflow
- CLI Commands - Complete command reference
Contributing to StackBench
Understanding the pipeline stages helps you contribute effectively:
- Agent implementations: Add evaluation for more coding agents
- Benchmark tasks: Add new types of tasks to expand what the benchmark evaluates (e.g. use of APIs via API docs)
- Metrics: Enhance quality assessment by adding or improving evaluation metrics
The modular architecture makes it easy to enhance individual stages without affecting the overall pipeline.