Claude Code Debugger
Debugging memory that stores incidents and extracts patterns so you never fix the same bug twice.
The Problem
I noticed Claude Code would struggle with similar difficult bugs across sessions. A database connection timeout in week one would require the same diagnostic steps as a timeout in week three. No memory between sessions meant re-diagnosing root causes, re-testing fixes, and re-learning that connection pooling configuration was the culprit. The debugging intelligence was ephemeral.
Why It Matters
Difficult bugs consume disproportionate time. A subtle race condition might take four hours to diagnose the first time, but should take minutes the second time if you remember the pattern. Without memory, AI coding assistants reset to zero knowledge every session. They can’t learn that your Docker networking always needs host mode for local database access, or that your API gateway times out after exactly 29 seconds under load.
What I Built
Claude Code Debugger stores debugging incidents with symptom, root cause (confidence 0-1), fix approach, file changes, verification status, and tags. When investigating a new bug, parallel retrieval runs four search strategies simultaneously: exact match, tag match, fuzzy similarity (Jaro-Winkler), and category match. Running the four strategies concurrently rather than sequentially yields a 4x speedup, surfacing relevant past incidents in under 200ms.
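The fan-out can be sketched as below: each strategy is an async function returning scored candidates, `Promise.all` runs them concurrently, and results merge with deduplication by incident id. The strategy bodies, scores, and the token-overlap stand-in for Jaro-Winkler are illustrative, not the tool's actual implementation.

```javascript
// Token-overlap similarity: a simple stand-in for the real Jaro-Winkler match.
function overlap(a, b) {
  const A = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  return [...A].filter(t => B.has(t)).length / Math.max(A.size, 1);
}

async function parallelRetrieve(query, incidents) {
  const strategies = [
    // Exact symptom match.
    async () => incidents.filter(i => i.symptom === query)
      .map(i => ({ id: i.id, score: 1.0, via: 'exact' })),
    // Tag match.
    async () => incidents.filter(i => (i.tags ?? []).some(t => query.toLowerCase().includes(t)))
      .map(i => ({ id: i.id, score: 0.8, via: 'tag' })),
    // Fuzzy similarity (stand-in metric).
    async () => incidents.map(i => ({ id: i.id, score: 0.9 * overlap(query, i.symptom), via: 'fuzzy' }))
      .filter(r => r.score > 0.5),
    // Category match.
    async () => incidents.filter(i => i.category && query.toLowerCase().includes(i.category))
      .map(i => ({ id: i.id, score: 0.6, via: 'category' })),
  ];
  // All four run concurrently: latency is the slowest strategy, not the sum.
  const hits = (await Promise.all(strategies.map(s => s()))).flat();
  // Deduplicate by incident id, keeping the best score per incident.
  const best = new Map();
  for (const h of hits) {
    if (!best.has(h.id) || h.score > best.get(h.id).score) best.set(h.id, h);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```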
Auto-pattern extraction triggers when three or more similar incidents are detected (Jaccard similarity >0.7). The system generates reusable solution templates. For example, after fixing “API timeout” three times in different services, it extracts a pattern: check gateway timeout config, verify downstream service health, add circuit breaker. Future timeout incidents inherit this template as the starting point.
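A minimal sketch of the trigger logic, computing Jaccard similarity over symptom token sets; the 0.7 threshold and three-incident minimum mirror the description above, while the function names and tokenization are assumptions for illustration:

```javascript
// Split text into a set of lowercase word tokens.
function tokens(text) {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function jaccard(a, b) {
  const A = tokens(a), B = tokens(b);
  const inter = [...A].filter(t => B.has(t)).length;
  const union = new Set([...A, ...B]).size;
  return union === 0 ? 0 : inter / union;
}

// Returns the similar incidents if there are enough to trigger extraction,
// else null (thresholds match the text: similarity > 0.7, at least 3 incidents).
function extractionCandidates(seed, incidents, threshold = 0.7, minCluster = 3) {
  const similar = incidents.filter(i => jaccard(seed.symptom, i.symptom) > threshold);
  return similar.length >= minCluster ? similar : null;
}
```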
Parallel assessment (v1.4.0) spawns domain-specific assessors concurrently. Database issues trigger a database assessor, frontend bugs trigger a frontend assessor, API failures trigger an API assessor, and performance problems trigger a performance assessor. Each runs independently, scores the incident, and returns specialized context. This eliminates sequential bottlenecks where generic assessment wastes time.
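Domain routing can be sketched as a map from category to async assessor, with matching assessors spawned concurrently; the assessor bodies and returned fields here are illustrative placeholders, not the real assessors:

```javascript
// One async assessor per domain; each scores the incident and returns
// specialized context. Bodies are stand-ins for illustration.
const assessors = {
  database: async () => ({ domain: 'database', score: 0.9, hint: 'check pool config' }),
  frontend: async () => ({ domain: 'frontend', score: 0.7, hint: 'check render loop' }),
  api: async () => ({ domain: 'api', score: 0.8, hint: 'check gateway timeout' }),
  performance: async () => ({ domain: 'performance', score: 0.6, hint: 'profile hot path' }),
};

async function assess(incident) {
  // Spawn only the assessors whose domain the incident touches, all at once.
  const matched = incident.domains.filter(d => assessors[d]);
  return Promise.all(matched.map(d => assessors[d](incident)));
}
```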
Memory Architecture
The tool offers two storage modes. Local mode (.claude/memory/) keeps incidents scoped to a single project, ideal for proprietary codebases. Shared mode (~/.claude-code-debugger/) pools incidents across projects, useful for learning framework-specific patterns that apply broadly.
Quality scoring weights root cause analysis at 30%, fix details at 30%, verification at 20%, and documentation at 20%. High-quality incidents (score above 0.7) rank higher in search results. Low-quality incidents (score below 0.4) trigger warnings suggesting additional detail before storage.
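The weighted score reduces to a dot product of the weights above with per-component sub-scores; this sketch assumes each component is already a 0-1 sub-score, and the function and field names are illustrative:

```javascript
// Weights from the text: root cause 30%, fix details 30%,
// verification 20%, documentation 20%.
const WEIGHTS = { rootCause: 0.3, fixDetails: 0.3, verification: 0.2, documentation: 0.2 };

// Weighted sum of 0-1 sub-scores; missing components count as 0.
function qualityScore(parts) {
  return Object.entries(WEIGHTS)
    .reduce((sum, [key, w]) => sum + w * (parts[key] ?? 0), 0);
}

// Banding thresholds from the text: >0.7 ranks higher, <0.4 warns.
function qualityBand(score) {
  if (score > 0.7) return 'high';
  if (score < 0.4) return 'low';
  return 'medium';
}
```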
Trace Ingestion
Adapters ingest traces from OpenTelemetry, Sentry, LangChain, and browser performance APIs. An OpenTelemetry span with error status automatically creates an incident draft with stack trace, operation name, and duration. Sentry breadcrumbs become incident tags. LangChain run traces capture prompt failures with full input/output for debugging prompt engineering issues.
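An adapter for the OpenTelemetry case might look like the following sketch. The span field names loosely follow the OTLP JSON shape (status, nanosecond timestamps, exception events), but the status encoding and the draft schema here are assumptions, not the tool's actual adapter:

```javascript
// Turn an error-status span into an incident draft; non-error spans are skipped.
function spanToIncidentDraft(span) {
  if (span.status?.code !== 'ERROR') return null; // simplified status check
  return {
    symptom: span.status.message || span.name,
    operation: span.name,
    // OTel timestamps are in nanoseconds; convert to milliseconds.
    durationMs: (span.endTimeUnixNano - span.startTimeUnixNano) / 1e6,
    // Stack traces travel on "exception" span events per OTel conventions.
    stackTrace: span.events?.find(e => e.name === 'exception')
      ?.attributes?.['exception.stacktrace'] ?? null,
    tags: ['otel'],
    verified: false,
  };
}
```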
Testing and Distribution
The package includes 45 end-to-end tests covering incident CRUD, parallel retrieval, pattern extraction, and trace ingestion. npm trusted publishing with provenance ensures the package on npm matches the GitHub source commit. Each release includes automated tests, version bumping, and changelog generation.
Technical Decisions
Parallel retrieval required careful queue management. I implemented a concurrent executor with configurable max parallelism (default 4) to avoid overwhelming the file system on large codebases. Each strategy runs in an isolated promise, results merge with deduplication by incident ID, and final ranking combines scores across strategies.
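A minimal concurrency-limited executor capturing the queue management described above: tasks start only while fewer than `limit` are in flight, and results come back in input order. This is a sketch of the idea, not the tool's executor:

```javascript
// Run async task factories with at most `limit` in flight at once.
async function runLimited(tasks, limit = 4) {
  const results = new Array(tasks.length);
  let next = 0;
  // Each worker pulls the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results; // results stay in input order regardless of completion order
}
```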
The Natural library handles fuzzy matching and Jaccard similarity calculations. I benchmarked Levenshtein distance, Jaro-Winkler, and cosine similarity for symptom matching. Jaro-Winkler performed best for short phrases with typos (common in error messages), while cosine similarity worked better for long-form descriptions.
Pattern extraction uses clustering with DBSCAN (density-based spatial clustering). Incidents become vectors using TF-IDF on symptom and fix text. Dense clusters (minimum 3 incidents, epsilon 0.3) trigger pattern extraction. The centroid incident becomes the template, with variations from cluster members noted as alternatives.
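The vectorization step can be sketched as follows: TF-IDF over the incident text, with cosine distance (1 minus cosine similarity) as the metric DBSCAN's epsilon neighborhood would be computed against. The clustering itself is elided here, and the exact TF-IDF weighting formula is an assumption:

```javascript
// Build a sparse TF-IDF vector (Map of token -> weight) per document.
function tfidfVectors(docs) {
  const tokenized = docs.map(d => d.toLowerCase().split(/\W+/).filter(Boolean));
  const df = new Map(); // document frequency per token
  for (const toks of tokenized) {
    for (const t of new Set(toks)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  const N = docs.length;
  return tokenized.map(toks => {
    const tf = new Map();
    for (const t of toks) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map();
    for (const [t, f] of tf) vec.set(t, (f / toks.length) * Math.log(1 + N / df.get(t)));
    return vec;
  });
}

// Cosine distance between two sparse vectors: 0 = identical, 1 = orthogonal.
function cosineDistance(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [t, v] of a) { na += v * v; if (b.has(t)) dot += v * b.get(t); }
  for (const v of b.values()) nb += v * v;
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```

With epsilon 0.3, two incidents land in the same neighborhood when their TF-IDF vectors' cosine distance is at most 0.3.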
Commander provides the CLI interface with subcommands for store, search, extract, status, and mine. The mine command processes recent audit trail files (last 7 days by default), auto-extracting incidents from Claude Code’s execution logs. This passive capture requires zero manual effort.