How StackBench Works
This guide explains StackBench’s internal architecture, pipeline stages, and data flow. Understanding how StackBench works helps you troubleshoot issues, interpret results, and contribute to the project.
Architecture Overview
StackBench implements a five-stage pipeline that transforms repositories into actionable insights about coding agent performance:
Repository  →  Clone   →   Extract   →  Execute  →  Analyze  →  Results
    ↓            ↓            ↓            ↓           ↓
  GitHub     Workspace    Use Cases      Code      Insights
Each benchmark run progresses through distinct phases with comprehensive state management ensuring reliability and resumability.
Pipeline Stages
Stage 1: Clone - Repository Setup
Command: stackbench clone <repo-url>
Purpose: Create isolated workspace and prepare repository for analysis
What happens:
- Generate unique run ID (UUID) and create directory structure
- Clone target repository to ./data/<run-id>/repo/
- Remove non-documentation files - keep only .md and .mdx files
- Initialize run context with configuration and metadata
- Scan for documentation files in specified folders
- Validate repository structure and prepare for extraction
Why remove code files? StackBench focuses on testing how coding agents use documentation alone to implement features. By removing source code files, we ensure agents rely solely on:
- README files and documentation
- API references and guides
- Examples and tutorials
- Usage patterns described in docs
This provides a pure test of documentation quality and agent comprehension.
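To make the documentation-only constraint concrete, here is a minimal sketch of the pruning step. The prune_non_docs helper and DOC_EXTENSIONS constant are illustrative names, not StackBench's actual API; the real logic is handled by the RepositoryManager.
from pathlib import Path

DOC_EXTENSIONS = {".md", ".mdx"}  # the only file types kept after cloning

def prune_non_docs(repo_dir: str) -> None:
    """Delete every file that is not Markdown documentation (illustrative sketch)."""
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() not in DOC_EXTENSIONS:
            path.unlink()  # empty directories are left in place for simplicity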
Key components:
- RepositoryManager: Handles Git operations and workspace setup
- RunContext: Tracks run state and configuration
- Directory structure: Isolated environment per benchmark run
Files created:
./data/<run-id>/
├── repo/ # Cloned repository
├── run_context.json # Run configuration and state
└── data/ # Future benchmark artifacts
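A rough sketch of how this layout could be created using only the standard library; the actual setup is handled by RepositoryManager and RunContext, and the run_context.json fields shown here are simplified assumptions.
import json
import uuid
from pathlib import Path

run_id = str(uuid.uuid4())                 # unique run ID
run_dir = Path("data") / run_id
(run_dir / "repo").mkdir(parents=True)     # clone target
(run_dir / "data").mkdir()                 # future benchmark artifacts
(run_dir / "run_context.json").write_text(
    json.dumps({"run_id": run_id, "current_phase": "created"}, indent=2)
)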
Stage 2: Extract - Use Case Generation
Command: stackbench extract <run-id>
Purpose: Generate realistic coding tasks from library documentation
What happens:
- Scan documentation files (.md, .mdx) in specified folders
- Analyze content with DSPy using OpenAI
- Extract library-specific patterns and common use cases
- Generate diverse use cases covering different complexity levels
- Validate use case structure and requirements
- Store structured data in <run-id>/data/use_cases.json
Key components:
- DSPy Extractor: AI-powered content analysis and use case generation
  - Located in: src/stackbench/extractors/extractor.py
  - Uses: src/stackbench/extractors/modules.py for DSPy modules
  - Signatures: src/stackbench/extractors/signatures.py for prompt structures
- Pydantic Models: Structured validation of generated use cases
  - Defined in: src/stackbench/extractors/models.py
- Token Management: Efficient processing of large documentation sets
  - Utilities in: src/stackbench/extractors/utils.py
Extraction stages within this phase:
- Document scanning - Find all .md/.mdx files in specified folders
- Content preprocessing - Clean and structure documentation content
- DSPy analysis - AI-powered extraction using structured prompts
- Use case validation - Pydantic model validation and quality checks
- JSON serialization - Save structured use cases to use_cases.json
DSPy workflow:
# Simplified extraction process
docs_content = scan_markdown_files(repo_path, include_folders)
use_cases = dspy_extractor.extract_use_cases(docs_content)
validated_cases = [UseCase.validate(case) for case in use_cases]
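The prompt structures live in src/stackbench/extractors/signatures.py. As a hypothetical illustration of the DSPy pattern only (the class name, fields, and model string below are assumptions, not the project's actual signatures):
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model name is illustrative

class ExtractUseCases(dspy.Signature):
    """Generate realistic coding use cases from library documentation."""
    documentation: str = dspy.InputField(desc="Markdown content from the cloned repository")
    use_cases: list[str] = dspy.OutputField(desc="One self-contained use case description per entry")

extractor = dspy.ChainOfThought(ExtractUseCases)
result = extractor(documentation="...docs content here...")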
Generated use case structure:
{
"name": "Basic Query Implementation",
"elevator_pitch": "Demonstrates core querying patterns...",
"target_audience": "Developers new to the library",
"complexity_level": "Beginner",
"functional_requirements": [
"Import the main query module",
"Create a query with user input",
"Execute and return results"
],
"user_stories": [
"As a developer, I want to create simple queries..."
],
"system_design": "Follow repository patterns for query handling",
"architecture_pattern": "Factory pattern for query creation"
}
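The structure above maps naturally onto a Pydantic model. The actual models are defined in src/stackbench/extractors/models.py; this is a simplified mirror of the JSON fields, assuming plain string and list types.
from pydantic import BaseModel

class UseCase(BaseModel):
    name: str
    elevator_pitch: str
    target_audience: str
    complexity_level: str
    functional_requirements: list[str]
    user_stories: list[str]
    system_design: str
    architecture_pattern: str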
Stage 3: Execute - Implementation Phase
Purpose: Generate code solutions through coding agents
This stage has two distinct workflows depending on agent type:
IDE Agents (Manual Execution)
Agents: Cursor, VS Code with AI extensions
Workflow:
- Generate formatted prompts for human operators
- Provide execution guidance and target file locations
- Wait for manual completion of all use cases
- Validate solution files exist before proceeding
Human workflow:
# Get formatted prompt
stackbench print-prompt <run-id> -u 1 --copy
# Manual execution in IDE:
# 1. Open repository in Cursor
# 2. ⚠️ WAIT for Cursor indexing to complete (critical for context)
# 3. Paste prompt in chat
# 4. Let agent explore and implement
# 5. Saves to: ./data/<run-id>/data/use_case_1/solution.py
# Repeat for all use cases...
CLI Agents (Automated Execution)
Agents: OpenAI API, local LLMs, Claude Code
Workflow (Coming Soon):
- Create execution environment for each use case
- Execute agent with API calls providing use case context
- Generate solution files automatically
- Log execution output and error messages
- Track completion status and performance metrics
Automated workflow:
# Planned automated execution
for use_case in use_cases:
result = agent.execute_use_case(use_case, run_context)
Directory structure after execution:
./data/<run-id>/
├── repo/ # Original repository
├── data/
│ ├── use_cases.json # Generated use cases
│ ├── use_case_1/
│ │ └── solution.py # Implementation
│ ├── use_case_2/
│ │ └── solution.py # Implementation
│ └── ...
└── run_context.json # Updated with execution status
Stage 4: Analyze Individual - Per-Use-Case Analysis
Command: stackbench analyze <run-id>
Purpose: Evaluate each implementation for quality, correctness, and library usage
What happens:
- Test code executability by running each solution.py
- Analyze library usage patterns (real vs mocked implementations)
- Extract documentation consultation from code comments
- Evaluate implementation quality using AI analysis
- Generate structured assessment for each use case
- Save individual analysis to JSON files
Analysis dimensions:
- Code Executability: Does the code run without errors?
- Library Usage: Real APIs vs mock/fake implementations
- Documentation Consultation: Evidence of using library docs
- Quality Assessment: Completeness, clarity, accuracy scores
- Failure Analysis: Specific errors and root causes
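A minimal sketch of the first dimension, code executability, assuming a plain subprocess call; the returned field names and the timeout are illustrative, and the real check lives in the analyzer.
import subprocess

def check_executability(solution_path: str, timeout: int = 120) -> dict:
    """Run a solution file and record whether it exits cleanly (illustrative)."""
    try:
        proc = subprocess.run(
            ["python", solution_path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"is_executable": proc.returncode == 0, "stderr": proc.stderr[-2000:]}
    except subprocess.TimeoutExpired:
        return {"is_executable": False, "stderr": f"timed out after {timeout}s"}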
Key components:
- IndividualAnalyzer: Per-use-case analysis using Claude Code CLI
  - Located in: src/stackbench/analyzers/individual_analyzer.py
  - Uses Claude Code CLI subprocess calls for analysis
  - Configurable model and worker settings
- Code Execution Testing: Runtime validation by executing solution.py files
- Pattern Recognition: Identifying common success/failure patterns
- Quality Scoring: Structured evaluation metrics
Claude Code Integration:
- CLI Tool: Requires npm install -g @anthropic-ai/claude-code
- API Key: Uses the ANTHROPIC_API_KEY environment variable
- Model: Configurable via CLAUDE_MODEL (default: claude-sonnet-4)
- Parallel Workers: Configurable via ANALYSIS_MAX_WORKERS (default: 3)
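A short sketch of how these settings could be resolved with the standard library; the environment variable names and defaults come from the list above, but the exact loading code in StackBench may differ.
import os

api_key = os.environ["ANTHROPIC_API_KEY"]                       # required
model = os.environ.get("CLAUDE_MODEL", "claude-sonnet-4")       # analysis model
max_workers = int(os.environ.get("ANALYSIS_MAX_WORKERS", "3"))  # parallel analyses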
Generated analysis files:
./data/<run-id>/data/
├── use_case_1/
│ ├── solution.py
│ ├── use_case_1_analysis.json # Structured analysis results
│ └── use_case_1_analysis_messages.json # Complete Claude Code conversation
├── use_case_2/
│ ├── solution.py
│ ├── use_case_2_analysis.json # Structured analysis results
│ └── use_case_2_analysis_messages.json # Complete Claude Code conversation
└── ...
Analysis file contents:
- use_case_N_analysis.json: Structured analysis results (executability, library usage, quality scores)
- use_case_N_analysis_messages.json: Complete Claude Code CLI conversation log with all messages, tool calls, and responses for debugging and transparency
Command: stackbench analyze <run-id> (processes all use cases)
Stage 5: Analyze Overall - Final Report Generation
Purpose: Synthesize individual analyses into comprehensive insights
What happens:
- Aggregate individual analysis results from all use cases
- Calculate overall success metrics (pass/fail, success rate)
- Identify common failure patterns across use cases
- Generate actionable insights for library maintainers
- Create dual output formats (JSON + Markdown)
- Mark run as completed
Key metrics:
- Pass/Fail Status: Overall library readiness for coding agents
- Success Rate: Percentage of successfully completed use cases
- Common Failures: Top error patterns with frequency analysis
- Library-Specific Insights: Targeted recommendations for improvement
Report generation workflow:
individual_analyses = load_all_individual_analyses(run_context)
overall_analysis = {
"pass_fail_status": calculate_overall_status(individual_analyses),
"success_rate": calculate_success_rate(individual_analyses),
"common_failures": identify_failure_patterns(individual_analyses),
"library_insights": generate_targeted_insights(individual_analyses)
}
# Generate dual outputs
save_json_report(overall_analysis, "results.json")
save_markdown_report(overall_analysis, "results.md")
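The helper names above come from the snippet; as one possible, hypothetical implementation of the aggregation step, assuming each individual analysis exposes is_executable and failure_category fields:
from collections import Counter

def calculate_success_rate(individual_analyses: list[dict]) -> float:
    """Share of use cases whose implementation ran successfully (illustrative fields)."""
    if not individual_analyses:
        return 0.0
    passed = sum(1 for a in individual_analyses if a.get("is_executable"))
    return passed / len(individual_analyses)

def identify_failure_patterns(individual_analyses: list[dict]) -> list[tuple[str, int]]:
    """Most frequent failure categories across use cases (illustrative fields)."""
    failures = Counter(
        a.get("failure_category", "unknown")
        for a in individual_analyses
        if not a.get("is_executable")
    )
    return failures.most_common(5)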
Final output files:
./data/<run-id>/
├── results.json # Structured data for programmatic access
├── results.md # Human-readable analysis report
├── run_context.json # Final run state (completed)
└── data/ # All execution artifacts
Run Context and State Management
RunContext Architecture
Core components:
- RunConfig: Repository URL, include folders, agent type, DSPy settings
- RunStatus: Phase tracking, completion flags, execution counts, error logs
- Directory Management: Automatic path resolution for all artifacts
- Persistence: Auto-saves state changes to run_context.json
Phase Progression
Runs progress through seven distinct phases:
created → cloned → extracted → execution → analysis_individual → analysis_overall → completed
Phase transitions:
- Automatic advancement when completion criteria are met
- Validation checks before each phase transition
- Error tracking with detailed timestamps
- Resume capability from any interrupted phase
Phase completion criteria:
- execution: All use cases have execution status (success or failure)
- analysis_individual: All executable use cases have analysis files
- analysis_overall: Final report files generated
- completed: All work finished successfully
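A hypothetical sketch of how phase advancement could work once a phase's criteria are met; the real logic is part of RunContext, and save() here stands in for its auto-persistence.
PHASES = [
    "created", "cloned", "extracted", "execution",
    "analysis_individual", "analysis_overall", "completed",
]

def advance_phase(run_context) -> str:
    """Move to the next phase after validation checks pass (illustrative)."""
    current = run_context.status.current_phase
    if current == PHASES[-1]:
        return current                      # already completed
    run_context.status.current_phase = PHASES[PHASES.index(current) + 1]
    run_context.save()                      # persist to run_context.json
    return run_context.status.current_phase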
State Persistence
RunContext persistence:
{
"config": {
"repository_url": "https://github.com/user/repo",
"include_folders": ["docs", "examples"],
"agent_type": "cursor",
"dspy_settings": {...}
},
"status": {
"current_phase": "analysis_individual",
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-15T11:45:00Z",
"use_case_count": 5,
"execution_completed": true,
"individual_analysis_completed": false,
"overall_analysis_completed": false
},
"directories": {
"run_dir": "./data/abc123.../",
"repo_dir": "./data/abc123.../repo/",
"data_dir": "./data/abc123.../data/"
},
"errors": []
}
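Because the state is plain JSON, resuming a run only requires re-reading this file. A minimal sketch using the standard library (the real persistence is handled inside RunContext):
import json
from pathlib import Path

def load_run_context(run_dir: str) -> dict:
    """Read the persisted run state, e.g. to resume an interrupted run."""
    return json.loads((Path(run_dir) / "run_context.json").read_text())

def save_run_context(run_dir: str, context: dict) -> None:
    """Write state changes back so every phase transition is durable."""
    (Path(run_dir) / "run_context.json").write_text(json.dumps(context, indent=2))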
Agent Type Differences
IDE Agents
Characteristics:
- Human-in-the-loop: Requires manual interaction
- Repository context: Full access to cloned repository
- Interactive refinement: Can iterate and improve implementations
- Real-world simulation: Mirrors actual developer workflow
Execution flow:
setup → print-prompt → manual execution → analyze
CLI Agents (Future)
Characteristics:
- Fully automated: No human intervention required
- API-driven: Direct integration with AI services
- Batch processing: Can process multiple use cases in parallel
- Consistent execution: Eliminates human variability
Execution flow:
run → automatic execution → analyze
Directory Structure Deep Dive
Complete Run Directory
./data/<run-id>/
├── repo/ # Cloned repository (read-only)
│ ├── README.md
│ ├── docs/
│ └── examples/
├── data/ # Benchmark execution data
│ ├── use_cases.json # Generated use cases (extract stage)
│ ├── use_case_1/
│ │ ├── solution.py # Implementation (execute stage)
│ │ └── use_case_1_analysis.json # Analysis (individual analysis)
│ ├── use_case_2/
│ │ ├── solution.py
│ │ └── use_case_2_analysis.json
│ └── ...
├── run_context.json # Run state and configuration
├── results.json # Final structured results
└── results.md # Final analysis report
File Dependencies
Stage dependencies:
run_context.json ← Always maintained
↓
repo/ ← Clone stage
↓
data/use_cases.json ← Extract stage
↓
data/use_case_*/solution.py ← Execute stage
↓
data/use_case_*/use_case_*_analysis.json ← Individual analysis
↓
results.json + results.md ← Overall analysis
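One way to express this dependency chain in code, as a hypothetical pre-flight check before resuming a stage; the paths mirror the diagram above, but StackBench's own validation may differ.
from pathlib import Path

def artifact_status(run_dir: str) -> dict[str, bool]:
    """Report which stage artifacts already exist in a run directory (illustrative)."""
    run = Path(run_dir)
    data = run / "data"
    return {
        "repo cloned": (run / "repo").exists(),
        "use cases extracted": (data / "use_cases.json").exists(),
        "solutions present": any(data.glob("use_case_*/solution.py")),
        "individual analyses done": any(data.glob("use_case_*/use_case_*_analysis.json")),
        "final report generated": (run / "results.json").exists(),
    }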
Next Steps
- Getting Started - Try the complete workflow
- CLI Commands - Complete command reference
Contributing to StackBench
Understanding the pipeline stages helps you contribute effectively:
- Agent implementations: Add evaluation for more coding agents
- Benchmark tasks: Add new types of tasks to expand what the benchmark evaluates (e.g. use of APIs via API docs)
- Metrics: Enhance quality assessment by adding or improving evaluation metrics
The modular architecture makes it easy to enhance individual stages without affecting the overall pipeline.