Benchmarks

How does DeepSource compare with other AI review tools? The OpenSSF CVE Benchmark helps answer that question.

DeepSource is an AI code review platform purpose-built for deep code review and for providing comprehensive feedback on code quality and security to AI agents. On the public OpenSSF CVE Benchmark, our novel hybrid analysis engine delivers industry-leading performance. We're excited to present our results, our methodology, and the raw, verifiable data behind them.

Security Review


Accuracy on OpenSSF CVE Benchmark

DeepSource: 82.42%
OpenAI Codex: 81.21%
Devin Review: 80.61%
Cursor BugBot: 78.79%
Greptile: 73.94%
Claude Code: 71.52%
CodeRabbit: 61.21%
Semgrep (CE): 58.18%

DeepSource leads on overall accuracy at 82.42%, edging out Codex (81.21%) and Devin Review (80.61%).

The benchmark consists of code and metadata for over 200 real-life security vulnerabilities in JavaScript and TypeScript, which have been validated and fixed in open-source projects. It evaluates tools on two key metrics: their ability to detect the vulnerability (avoiding false negatives) and their ability to recognize the validated patch (avoiding false positives).

We're choosing accuracy as the hero metric for our evaluation. Accuracy measures how often the agent gets it right: detecting real vulnerabilities in vulnerable code, and recognizing that patched code is actually fixed.

Here are the full results across all tools we evaluated:

| | DeepSource | Cursor BugBot | Semgrep | CodeRabbit | Claude Code† | Greptile | Devin | Codex‡ |
|---|---|---|---|---|---|---|---|---|
| Diffs reviewed | 165 | 165 | 165 | 165 | 165 | 165 | 165 | 165 |
| Accuracy | 82.42% | 78.79% | 58.18% | 61.21% | 71.52% | 73.94% | 80.61% | 81.21% |
| Precision | 90.77% | 74.23% | 74.07% | 100% | 90.7% | 85.45% | 89.06% | 94.74% |
| Recall | 71.95% | 87.80% | 24.39% | 21.95% | 47.56% | 57.32% | 69.51% | 65.85% |
| F1 Score | 80.27% | 80.45% | 36.70% | 36.00% | 62.40% | 68.61% | 78.08% | 77.70% |
| Avg. time per diff | 143.77s | 189.88s | 90s | 124.81s | 43.92s | 165s | 148.2s | 208.68s |

† Claude Code was tested using Opus 4.5. ‡ Codex was tested using GPT-5-Codex.

On F1 score, the metric that balances detection rate against false alarms, DeepSource scores 80.27%, within 0.2 points of Cursor Bugbot's 80.45%. The difference: Bugbot achieves its F1 through higher recall (87.80%) at the cost of significantly lower precision (74.23%), meaning more noise per real finding.

DeepSource maintains 90.77% precision while still catching 71.95% of vulnerabilities. This profile is better suited to production CI/CD where false positives erode developer trust.
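For reference, all four metrics follow their standard definitions over the per-entry binary verdicts described in the judging protocol below. A minimal sketch (the dataclass and field names are illustrative, not our actual scoring code):

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int  # vulnerable entry, CVE correctly flagged
    fp: int  # patched entry, incorrectly flagged as still vulnerable
    tn: int  # patched entry, correctly left unflagged
    fn: int  # vulnerable entry, CVE missed

def metrics(c: Counts) -> dict:
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
    }
```

Plugging the reported precision and recall into the F1 formula reproduces the table above, e.g. 2 × 0.9077 × 0.7195 / (0.9077 + 0.7195) ≈ 80.27% for DeepSource.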

Claude Code was the fastest at 43.92s average per review, while DeepSource came in at 143.77s with a total benchmark cost of $21.24 across all 165 entries.

Secrets Detection


F1 Score on proprietary secrets detection dataset

DeepSource: 92.78%
Gitleaks: 75.62%
Detect-Secrets: 54.35%
TruffleHog: 41.22%

DeepSource dominates the secrets benchmark with a 92.78% F1 score. It detected 453 out of 518 secrets with only 6 false positives — yielding 98.69% precision and 87.45% recall.

By contrast, Detect-Secrets matched that recall (87.45%) but at the expense of 696 false positives, pushing its precision down to 39.43%. TruffleHog was the most conservative, catching only 27.41% of secrets but with 83.04% precision. The gap here is stark: DeepSource is the only engine that simultaneously maintains >98% precision and >87% recall, making it production-ready without manual triage.

| | DeepSource | Gitleaks | Detect-Secrets | TruffleHog |
|---|---|---|---|---|
| Perfect Matches | 453 | 303 | 377 | 121 |
| Partial Matches | 0 | 18 | 76 | 21 |
| Missed Secrets | 65 | 197 | 65 | 376 |
| False Positives | 6 | 10 | 696 | 29 |
| Accuracy | 87.45% | 58.49% | 72.78% | 23.36% |
| Precision | 98.69% | 96.98% | 39.43% | 83.04% |
| Recall | 87.45% | 61.97% | 87.45% | 27.41% |
| F1 Score | 92.78% | 75.62% | 54.35% | 41.22% |
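The figures in the table appear consistent with accuracy counting only perfect matches, and precision and recall counting both perfect and partial matches, against the 518 ground-truth secrets. A hedged sketch of that reading (the function and its signature are illustrative, not our scoring script):

```python
TOTAL_SECRETS = 518  # ground-truth secrets in the dataset

def secrets_metrics(perfect: int, partial: int, false_positives: int) -> dict:
    detected = perfect + partial                     # any match, exact or partial
    precision = detected / (detected + false_positives)
    recall = detected / TOTAL_SECRETS
    return {
        "accuracy": perfect / TOTAL_SECRETS,         # exact matches only
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# e.g. DeepSource: secrets_metrics(453, 0, 6)  -> 98.69% precision, 87.45% recall
# e.g. Gitleaks:   secrets_metrics(303, 18, 10) -> 96.98% precision, 61.97% recall, 75.62% F1
```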

Methodology

To ensure a fair and reproducible comparison across all providers, we developed a standardized evaluation harness built around real-world CVE data. Each tool was given the same set of vulnerable code samples, sourced from the OpenSSF CVE dataset, and evaluated on its ability to surface the known security issues. Every provider's pipeline was fully automated, with consistent PR structures, identical code diffs, and a unified judging framework applied to all results.

The dataset consists of 165 entries spanning both unfixed (vulnerable) and fixed (patched) variants of each CVE, allowing us to measure both true positive detection and false positive suppression.

Below, we detail how each provider was integrated and invoked.

Shared eval infrastructure

For all PR-based review tools (Greptile, Bugbot, Codex, and Devin), we used a common repository scaffolding approach (a code sketch follows the list):

  • Repository creation: For each CVE entry, a fresh GitHub repository was created under a dedicated organization. The main branch was initialized with the entire source repository excluding the vulnerable file(s), establishing a clean baseline.
  • Feature branch construction: A feature branch (add-code) was then created with the vulnerable file(s) added, making the security-relevant code appear as new additions in the PR diff, exactly as a real contributor's code would surface during review.
  • Draft PR creation: A draft pull request was opened from the feature branch to main, triggering whichever automated review bot was installed on the repository.
  • Comment fetching: After allowing each tool time to complete its analysis, a separate fetcher script retrieved all bot-generated comments (inline review comments, issue comments, and PR review summaries) via the GitHub API. Comments were filtered by bot username and parsed into a normalized format.
  • Timing measurement: For each entry, we measured the analysis latency, the time elapsed between the trigger event (PR creation or explicit @bot review command) and the first bot response. This captures end-to-end wall-clock time including any queuing, indexing, or inference delays.
  • State management and resumability: All runners maintained a JSON state file tracking the status of each entry (processing, completed, failed), enabling safe resumption after interruptions and batch-level rate limiting to stay within API quotas.
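As a concrete illustration, here is a minimal sketch of the scaffolding using the GitHub REST API directly. The organization name, token handling, and helper names are illustrative; only the branch name, draft-PR flow, and comment endpoints mirror the steps above:

```python
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": "Bearer <token>", "Accept": "application/vnd.github+json"}
ORG = "cve-bench-org"  # hypothetical dedicated organization

def create_repo(name: str) -> dict:
    # Fresh private repository per CVE entry; main is initialized (and the
    # vulnerable files pushed to the add-code branch) with plain git afterwards.
    r = requests.post(f"{API}/orgs/{ORG}/repos", headers=HEADERS,
                      json={"name": name, "private": True, "auto_init": True})
    r.raise_for_status()
    return r.json()

def open_draft_pr(repo: str, head: str = "add-code", base: str = "main") -> dict:
    # The draft PR makes the vulnerable code appear as new additions in the diff
    # and triggers whichever review bot is installed on the repository.
    r = requests.post(f"{API}/repos/{ORG}/{repo}/pulls", headers=HEADERS,
                      json={"title": "Add code", "head": head, "base": base, "draft": True})
    r.raise_for_status()
    return r.json()

def fetch_bot_comments(repo: str, pr_number: int, bot_login: str) -> list[dict]:
    # Inline review comments live under /pulls/{n}/comments, general discussion
    # under /issues/{n}/comments; both are filtered by the bot's username.
    inline = requests.get(f"{API}/repos/{ORG}/{repo}/pulls/{pr_number}/comments",
                          headers=HEADERS).json()
    issue = requests.get(f"{API}/repos/{ORG}/{repo}/issues/{pr_number}/comments",
                         headers=HEADERS).json()
    return [c for c in inline + issue if c["user"]["login"] == bot_login]
```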

Tool-specific details

The tools expose different interfaces for performing code reviews, including CLIs, REST APIs, and pull request comments. Where a tool offered an interface other than pull request comments, we used it to avoid VCS-provider rate limits from opening PRs and retrieving comments.

DeepSource

DeepSource was evaluated using our (unreleased) code review API that powers our platform runs. Each request contained a repository bundle and the corresponding diff to analyze, and the API returned every issue it detected in that diff. We analyzed 25 diffs in parallel at a time and measured analysis duration as the API latency.

Greptile

Greptile was evaluated as a fully automated PR review bot. The runner script handled repository creation, branch setup, and PR opening, with Greptile's GitHub App automatically triggering analysis on each new PR. The PR body included @coderabbitai ignore to prevent cross-contamination from other installed bots.

Comment retrieval: The script fetched both inline review comments and general issue comments, filtering by Greptile's bot username (greptileai). The analysis time was calculated as the delta from the first non-Greptile comment (the trigger, typically an @greptileai review mention) to the first Greptile response. Both main and retry runs were supported via separate state files to handle transient failures without corrupting existing results.

  • Trigger mechanism: Automatic on PR creation (GitHub App webhook)
  • Comment types collected: Inline review comments, issue comments
  • Rate limiting: Batches of 5 PRs with 5-second inter-batch delays; 3 concurrent repo creations
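A sketch of the latency calculation described above, assuming comments are plain GitHub API objects with created_at timestamps and user logins:

```python
from datetime import datetime

BOT_LOGIN = "greptileai"

def parse_ts(comment: dict) -> datetime:
    # GitHub returns ISO-8601 UTC timestamps, e.g. "2025-01-15T12:34:56Z".
    return datetime.fromisoformat(comment["created_at"].replace("Z", "+00:00"))

def analysis_latency(comments: list[dict]) -> float:
    ordered = sorted(comments, key=parse_ts)
    trigger = next(c for c in ordered if c["user"]["login"] != BOT_LOGIN)   # e.g. "@greptileai review"
    response = next(c for c in ordered if c["user"]["login"] == BOT_LOGIN)  # first Greptile comment
    return (parse_ts(response) - parse_ts(trigger)).total_seconds()
```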

Cursor Bugbot

Bugbot (Cursor's PR review agent) followed the same PR scaffolding approach with private repositories. The runner script created repositories, pushed code, and opened PRs that Bugbot would automatically review.

Comment retrieval: A script parsed Bugbot's distinctive comment format, which embeds file location metadata in hidden HTML comments (<!-- LOCATIONS START ... LOCATIONS END -->). These embedded location references were extracted using regex to produce structured file#L{start}-L{end} mappings, providing precise line-level attribution for each finding. The fetcher also collected PR reviews (top-level review summaries) in addition to inline and issue comments.

  • Trigger mechanism: Automatic on PR creation
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Semaphore-based concurrency (3 parallel), sequential batch processing
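A sketch of the location extraction. Only the HTML-comment delimiters and the file#L{start}-L{end} shape come from Bugbot's output as described above; the exact contents of the hidden block are assumed here:

```python
import re

LOCATIONS_RE = re.compile(r"<!-- LOCATIONS START(.*?)LOCATIONS END -->", re.DOTALL)
FILE_RANGE_RE = re.compile(r"(?P<path>[^\s#]+)#L(?P<start>\d+)-L(?P<end>\d+)")

def extract_locations(comment_body: str) -> list[dict]:
    locations = []
    for block in LOCATIONS_RE.findall(comment_body):
        for m in FILE_RANGE_RE.finditer(block):
            locations.append({
                "file": m.group("path"),
                "start_line": int(m.group("start")),
                "end_line": int(m.group("end")),
            })
    return locations
```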

OpenAI Codex

Codex required a slightly different workflow since it operates as a comment-triggered agent rather than an automatic PR webhook. The script created duplicate branches (add-code-codex) from the existing add-code branches and opened separate draft PRs for Codex to review. A manual @codex review comment was then posted on each PR to trigger the analysis.

Comment retrieval: A script handled Codex's multi-modal response pattern: when Codex finds issues, it posts inline review comments; when it finds none, it posts an issue comment stating no problems were detected. The fetcher sorted all Codex-authored items chronologically and used the last item's type to determine the final verdict: if the last activity was inline review comments, those were collected as findings; if it was an issue comment, the entry was recorded as having zero detections. Analysis time was measured from the latest @codex review trigger to the latest Codex response.

  • Trigger mechanism: Manual @codex review comment on PR
  • Bot usernames: chatgpt-codex-connector, chatgpt-codex-connector[bot]
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Batches of 5 with 2-second delays; 3 concurrent PR creations
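A sketch of that last-item verdict logic; the kind labels distinguishing inline review comments from issue comments are our own normalization, not GitHub field names:

```python
CODEX_LOGINS = {"chatgpt-codex-connector", "chatgpt-codex-connector[bot]"}

def codex_findings(items: list[dict]) -> list[dict]:
    # `items` mixes inline review comments (kind="review_comment") and
    # issue comments (kind="issue_comment"), each with a created_at timestamp.
    codex_items = sorted(
        (i for i in items if i["user"]["login"] in CODEX_LOGINS),
        key=lambda i: i["created_at"],
    )
    if not codex_items:
        return []
    if codex_items[-1]["kind"] == "issue_comment":
        return []  # last activity was a "no problems detected" comment: zero findings
    return [i for i in codex_items if i["kind"] == "review_comment"]
```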

Devin Review

Devin Review (by Cognition) was integrated similarly to Codex: separate branches (add-code-devin) and PRs were created via a Python script. Reviews were triggered using browser automation via Playwright, which transforms GitHub PR URLs into Devin Review URLs by replacing github.com with devinreview.com and opens each PR in a browser tab to initiate analysis.

The trigger script requires manual GitHub OAuth login on first run, then maintains a persistent Chromium profile for subsequent authenticated sessions. PRs are processed in batches of 5, with a 30-second wait after loading each batch to allow Devin's backend to register the review requests before closing tabs.
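A sketch of the trigger flow. The profile directory is illustrative; the URL rewrite and the 30-second settle time per batch of five follow the description above:

```python
import time
from playwright.sync_api import sync_playwright

def trigger_devin_reviews(pr_urls: list[str], profile_dir: str = ".devin-profile") -> None:
    with sync_playwright() as p:
        # A persistent Chromium profile keeps the GitHub OAuth session from the
        # first (manual) login available to later runs.
        ctx = p.chromium.launch_persistent_context(profile_dir, headless=False)
        for start in range(0, len(pr_urls), 5):              # batches of 5 PRs
            pages = []
            for url in pr_urls[start:start + 5]:
                page = ctx.new_page()
                page.goto(url.replace("github.com", "devinreview.com"))
                pages.append(page)
            time.sleep(30)   # let Devin's backend register the review requests
            for page in pages:
                page.close()
        ctx.close()
```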

Comment retrieval: We used a dedicated trigger state file to record the exact timestamp of each review request. This allowed precise latency measurement from the trigger to Devin's first response, accounting for timezone normalization between local trigger timestamps and GitHub's UTC-based comment timestamps. The fetcher also captured Devin's PR review summaries (posted via the GitHub Reviews API with submitted_at timestamps), which Devin uses to indicate when no issues are found.

  • Trigger mechanism: Manual review request via browser automation (Playwright)
  • Bot username: devin-ai-integration[bot]
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Batches of 5 with 3-second delays

Claude Code

Unlike the PR-based tools, Claude Code was evaluated as a local CLI agent operating directly on the repository filesystem. A script ran Claude Code against each CVE's source repository without creating any GitHub PRs or branches.

Analysis workflow: For each entry, the runner performed an in-place diff simulation within the actual repository (a code sketch follows the steps):

  1. The target file(s) were removed and the removal was committed.
  2. The files were restored from the original commit as staged changes.
  3. A git diff --cached was generated, producing the exact diff that a PR reviewer would see.
  4. Claude Code was invoked with a comprehensive security audit prompt via a custom slash command (/security-review), with the diff and file list embedded directly in the prompt.
  5. After analysis, the repository was restored to its original state via git reset --hard and stash recovery.
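A minimal sketch of steps 1 through 5 with plain git subprocess calls. How the diff and file list are embedded into the /security-review prompt is elided; the invocation flags are those listed further below:

```python
import subprocess

def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout

def review_entry(repo: str, target_files: list[str]) -> str:
    original_head = git(repo, "rev-parse", "HEAD").strip()
    try:
        git(repo, "rm", "--", *target_files)                        # 1. remove the target file(s)
        git(repo, "commit", "-m", "temp: remove target files")      #    and commit the removal
        git(repo, "checkout", original_head, "--", *target_files)   # 2. restore them as staged changes
        diff = git(repo, "diff", "--cached")                        # 3. the diff a PR reviewer would see
        # 4. invoke Claude Code with the security-review slash command; embedding
        #    the diff and file list into the prompt is omitted in this sketch.
        result = subprocess.run(
            ["claude", "/security-review", "--permission-mode", "acceptEdits"],
            cwd=repo, capture_output=True, text=True,
        )
        return result.stdout
    finally:
        git(repo, "reset", "--hard", original_head)                 # 5. restore the original state
```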

Prompt design: Claude Code received a detailed, structured prompt specifying the security categories to examine (injection, auth bypass, crypto issues, code execution, data exposure), severity guidelines, confidence scoring thresholds (minimum 0.7), explicit exclusions (DoS, on-disk secrets, rate limiting), and a mandatory JSON output schema. The prompt also instructed a three-phase analysis methodology: repository context research, comparative analysis against existing security patterns, and vulnerability assessment with data flow tracing.

Actual Prompt: Link

Output parsing: Claude Code's raw text output was parsed for embedded JSON containing structured findings with file paths, line numbers, severity levels, category labels, exploit scenarios, and confidence scores.

  • Invocation: claude /security-review --permission-mode acceptEdits (CLI, local execution)
  • Batch processing: 10 entries per batch, 5 parallel batches, 5-second inter-group delay
  • Timing: Wall-clock elapsed time from subprocess launch to completion

CodeRabbit

CodeRabbit was evaluated using its CLI tool (coderabbit review --plain), running locally against the same in-repo diff simulation used for Claude Code. The script followed an identical git manipulation workflow: remove target files → commit removal → restore from original commit → stage changes → run review.

Authentication and rate limits: CodeRabbit's CLI enforces a limit of 5 reviews per hour, so the runner processed entries in strict batches of 5 with a mandatory 1-hour sleep between batches, making it the most time-constrained provider in the evaluation.

Output parsing: CodeRabbit produces plain-text output with =-delimited sections, each containing structured fields (File:, Line:, Type:, Comment:, Prompt for AI Agent:). A script extracted these into normalized issue objects with file paths, line ranges, comment text, and suggested AI prompts.

  • Invocation: coderabbit review --plain (CLI, local execution)
  • Batch processing: 5 entries per batch, 1-hour inter-batch delay (rate limited)
  • Timing: Wall-clock elapsed time from subprocess launch to completion
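A sketch of the output parsing. The delimiter width and single-line field values are assumptions; the field names are the ones listed above:

```python
import re

SECTION_SPLIT = re.compile(r"^=+\s*$", re.MULTILINE)   # runs of "=" separate findings
FIELDS = ("File", "Line", "Type", "Comment", "Prompt for AI Agent")

def parse_coderabbit(plain_output: str) -> list[dict]:
    issues = []
    for section in SECTION_SPLIT.split(plain_output):
        issue = {}
        for field in FIELDS:
            m = re.search(rf"^{re.escape(field)}:\s*(.+)$", section, re.MULTILINE)
            if m:
                issue[field.lower().replace(" ", "_")] = m.group(1).strip()
        if issue:
            issues.append(issue)
    return issues
```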

Semgrep

Semgrep served as the static analysis baseline, a deterministic, rule-based scanner with no LLM component. A script ran semgrep scan with --config auto (Semgrep's curated rule registry) directly against each CVE's vulnerable file(s).

Unlike the other providers, Semgrep required no git manipulation, PR creation, or prompt engineering. Each file was scanned independently, and the JSON output was parsed for rule matches including rule ID, severity, message, and precise start/end positions.

Parallelism: Semgrep's lightweight, deterministic execution allowed for significantly higher concurrency, up to 10 concurrent scans via asyncio semaphore, with no rate limiting concerns. Both unfixed and fixed variants were processed, split evenly when a maximum CVE count was specified.

  • Invocation: semgrep scan <files> --json --config auto
  • Parallelism: Up to 10 concurrent scans
  • Timing: Extracted from Semgrep's internal profiling_times.total_time field, falling back to wall-clock time
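A sketch of the concurrent runner and JSON parsing, using field names from Semgrep's standard --json schema (results[].check_id, .path, .start.line, .end.line, .extra):

```python
import asyncio
import json

async def scan(files: list[str], sem: asyncio.Semaphore) -> list[dict]:
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            "semgrep", "scan", *files, "--json", "--config", "auto",
            stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
    return [
        {
            "rule_id": r["check_id"],
            "severity": r["extra"]["severity"],
            "message": r["extra"]["message"],
            "path": r["path"],
            "start_line": r["start"]["line"],
            "end_line": r["end"]["line"],
        }
        for r in json.loads(stdout)["results"]
    ]

async def scan_all(entries: list[list[str]]) -> list[list[dict]]:
    sem = asyncio.Semaphore(10)   # up to 10 concurrent scans
    return await asyncio.gather(*(scan(files, sem) for files in entries))
```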

Artifacts

All judged results from the security benchmark are publicly available in the DeepSourceCorp/benchmarks repository.

| Tool | Judged Results |
|---|---|
| Claude Code | claude-code.jsonl |
| CodeRabbit | coderabbit.jsonl |
| Codex | codex.jsonl |
| Cursor BugBot | cursor-bugbot.jsonl |
| DeepSource | deepsource.jsonl |
| Devin | devin.jsonl |
| Greptile | greptile.jsonl |
| Semgrep | semgrep.jsonl |

Each JSONL file contains the evaluation output from the LLM judge (Claude Opus 4.5), which compares each tool's detected issues against the known CVE vulnerability description and determines whether findings are True Positives, False Positives, True Negatives, or False Negatives.

Scoring and Judging Protocol

Judge: Claude Opus 4.5

Each CVE/variant pair is scored as a single binary outcome — True Positive, False Positive, True Negative, or False Negative. Regardless of how many comments, findings, or files a tool flags on a given entry, the question is singular: did this tool correctly identify the specific CVE vulnerability? A tool that posts 15 comments on a PR receives the same credit as one that posts a single precise finding, and there is no partial credit or weighted scoring.

Matching rules. A detected issue counts as a "hit" when it satisfies three criteria simultaneously: (1) it describes the same security impact as the CVE (e.g., remote code execution, data exfiltration, authentication bypass), (2) it involves the same attack pattern (e.g., unsanitized user input reaching a SQL query, path traversal via user-controlled filename), and (3) it identifies the exact vulnerability instance described by the CVE — not a similar issue elsewhere in the codebase. Line numbers need not match exactly; semantic equivalence suffices. A finding that identifies the vulnerable function or describes the exploitable pattern qualifies even if line references differ due to formatting.

Edge cases. If a tool posts multiple comments and at least one correctly identifies the CVE, the entry is scored as a True Positive. Generic warnings (e.g., "add input validation") do not count unless they specifically describe the CVE's attack vector. Findings in the correct vulnerability category but pointing at the wrong location (e.g., "XSS in login form" when the CVE is "XSS in comment field") are not matches. Commentary on fix quality or partial fix criticism does not constitute detection of the original vulnerability.

LLM judge (prompt). All verdicts were produced by the LLM judge (Claude Opus 4.5), which received a structured prompt containing the CVE description (fetched from the OSV API when available, otherwise from the dataset's original explanation), the variant label (fixed or unfixed), and all detected issues with their file paths and explanations. The judge was instructed to determine whether any detected issue matches the exact CVE vulnerability — not just the same category — focusing on security outcome equivalence rather than terminology (e.g., "injection" ≈ "unsanitized input" ≈ "uncontrolled data in query"). For fixed variants, the judge verified that findings describe the original vulnerability rather than concerns about fix completeness. Critically, the judge did not see the tool's name, eliminating potential bias toward or against specific vendors. The judge output structured JSON with per-issue reasoning and an overall cve_matches_any_issue verdict.
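Putting the protocol together, each judged entry reduces to one cell of the confusion matrix. A minimal sketch of that mapping, using the variant labels described above:

```python
def classify(variant: str, cve_matches_any_issue: bool) -> str:
    # Unfixed (vulnerable) variants reward detection; fixed (patched) variants
    # reward staying quiet about the already-remediated CVE.
    if variant == "unfixed":
        return "TP" if cve_matches_any_issue else "FN"
    return "FP" if cve_matches_any_issue else "TN"
```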

Dataset

The benchmark corpus comprises 165 CVE entries filtered from the broader OpenSSF vulnerability dataset. Each entry was required to meet three criteria:

  • Dual-commit availability: the CVE must have both prePatch and postPatch commits documented, enabling construction of both vulnerable (unfixed) and remediated (fixed) variants
  • Ground truth present: the CVE must include a non-empty weaknesses field (typically CWE identifiers), providing authoritative classification of the vulnerability type for evaluation purposes, and
  • Tractable file size: the affected file(s) must not exceed 1,000 lines of code, keeping benchmark execution duration reasonable across all tools and enabling consistent analysis depth.

These filters prioritize evaluation quality over dataset size. Every entry has verifiable ground truth and reproducible vulnerable/fixed states.
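A sketch of the filter, assuming each dataset entry carries the prePatch, postPatch, and weaknesses fields named above; the line-count field is hypothetical and stands in for however file size is derived:

```python
MAX_LINES = 1000

def eligible(entry: dict) -> bool:
    has_both_commits = bool(entry.get("prePatch")) and bool(entry.get("postPatch"))
    has_ground_truth = bool(entry.get("weaknesses"))      # e.g. CWE identifiers
    # "file_line_counts" is a hypothetical field: lines of code per affected file.
    tractable = all(loc <= MAX_LINES for loc in entry.get("file_line_counts", []))
    return has_both_commits and has_ground_truth and tractable

# corpus = [e for e in all_entries if eligible(e)]   # yields the 165-entry benchmark
```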

Unlike synthetic benchmarks where vulnerabilities are artificially constructed, every entry in this dataset represents a real security incident: code that shipped to production, was discovered to be vulnerable, received a CVE identifier, and was subsequently patched by maintainers. This grounds the evaluation in practical security relevance rather than theoretical vulnerability patterns.
