Benchmarks

How does DeepSource compare with other AI review tools? The OpenSSF CVE Benchmark helps answer that question.

DeepSource is an AI code review platform purpose-built for deep code review and for providing comprehensive feedback on code quality and security to AI agents. On the public OpenSSF CVE Benchmark, our novel hybrid analysis engine delivers industry-leading performance. We're excited to present our results, our methodology, and the raw, verifiable data behind them.

Security Review


Accuracy on OpenSSF CVE Benchmark

DeepSource: 82.42%
OpenAI Codex: 81.21%
Devin Review: 80.61%
Cursor BugBot: 78.79%
Greptile: 73.94%
Claude Code: 71.52%
CodeRabbit: 61.21%
Semgrep (CE): 58.18%

DeepSource leads on overall accuracy at 82.42%, edging out Codex (81.21%) and Devin Review (80.61%).

The benchmark consists of code and metadata for over 200 real-life security vulnerabilities in JavaScript and TypeScript, which have been validated and fixed in open-source projects. It evaluates tools on two key metrics: their ability to detect the vulnerability (avoiding false negatives) and their ability to recognize the validated patch (avoiding false positives).

We're choosing accuracy as the hero metric for our evaluation. Accuracy measures how often the agent gets it right: detecting real vulnerabilities in vulnerable code, and recognizing that patched code is actually fixed.

Here are the full results across all tools we evaluated:

| | DeepSource | Cursor BugBot | Semgrep | CodeRabbit | Claude Code† | Greptile | Devin | Codex‡ |
|---|---|---|---|---|---|---|---|---|
| Diffs reviewed | 165 | 165 | 165 | 165 | 165 | 165 | 165 | 165 |
| Accuracy | 82.42% | 78.79% | 58.18% | 61.21% | 71.52% | 73.94% | 80.61% | 81.21% |
| Precision | 90.77% | 74.23% | 74.07% | 100% | 90.7% | 85.45% | 89.06% | 94.74% |
| Recall | 71.95% | 87.80% | 24.39% | 21.95% | 47.56% | 57.32% | 69.51% | 65.85% |
| F1 Score | 80.27% | 80.45% | 36.70% | 36.00% | 62.40% | 68.61% | 78.08% | 77.70% |
| Avg. time per diff | 143.77s | 189.88s | 90s | 124.81s | 43.92s | 165s | 148.2s | 208.68s |

† Claude Code was tested using Opus 4.5. ‡ Codex was tested using GPT-5-Codex.

On F1 score, the metric that balances detection rate against false alarms, DeepSource scores 80.27%, within 0.2 points of Cursor Bugbot's 80.45%. The difference: Bugbot achieves its F1 through higher recall (87.80%) at the cost of significantly lower precision (74.23%), meaning more noise per real finding.

DeepSource maintains 90.77% precision while still catching 71.95% of vulnerabilities. This profile is better suited to production CI/CD where false positives erode developer trust.
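For reference, all four metrics follow their standard definitions over the per-entry binary verdicts described in the judging protocol below. A minimal sketch (the dataclass and field names are illustrative, not our actual scoring code):

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int  # vulnerable entry, CVE correctly flagged
    fp: int  # patched entry, incorrectly flagged as still vulnerable
    tn: int  # patched entry, correctly left unflagged
    fn: int  # vulnerable entry, CVE missed

def metrics(c: Counts) -> dict:
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
    }
```

Plugging the reported precision and recall into the F1 formula reproduces the table above, e.g. 2 × 0.9077 × 0.7195 / (0.9077 + 0.7195) ≈ 80.27% for DeepSource.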

Claude Code was the fastest at 43.92s average per review, while DeepSource came in at 143.77s with a total benchmark cost of $21.24 across all 165 entries.

Secrets Detection


F1 Score on proprietary secrets detection dataset

DeepSource: 92.78%
Gitleaks: 75.62%
Detect-Secrets: 54.35%
TruffleHog: 41.22%

DeepSource dominates the secrets benchmark with a 92.78% F1 score. It detected 453 out of 518 secrets with only 6 false positives — yielding 98.69% precision and 87.45% recall.

By contrast, Detect-Secrets matched that recall (87.45%) but at the expense of 696 false positives, pushing its precision down to 39.43%. TruffleHog was the most conservative, catching only 27.41% of secrets but with 83.04% precision. The gap here is stark: DeepSource is the only engine that simultaneously maintains >98% precision and >87% recall, making it production-ready without manual triage.

| | DeepSource | Gitleaks | Detect-Secrets | TruffleHog |
|---|---|---|---|---|
| Perfect Matches | 453 | 303 | 377 | 121 |
| Partial Matches | 0 | 18 | 76 | 21 |
| Missed Secrets | 65 | 197 | 65 | 376 |
| False Positives | 6 | 10 | 696 | 29 |
| Accuracy | 87.45% | 58.49% | 72.78% | 23.36% |
| Precision | 98.69% | 96.98% | 39.43% | 83.04% |
| Recall | 87.45% | 61.97% | 87.45% | 27.41% |
| F1 Score | 92.78% | 75.62% | 54.35% | 41.22% |
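The figures in the table appear consistent with accuracy counting only perfect matches, and precision and recall counting both perfect and partial matches, against the 518 ground-truth secrets. A hedged sketch of that reading (the function and its signature are illustrative, not our scoring script):

```python
TOTAL_SECRETS = 518  # ground-truth secrets in the dataset

def secrets_metrics(perfect: int, partial: int, false_positives: int) -> dict:
    detected = perfect + partial                     # any match, exact or partial
    precision = detected / (detected + false_positives)
    recall = detected / TOTAL_SECRETS
    return {
        "accuracy": perfect / TOTAL_SECRETS,         # exact matches only
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# e.g. DeepSource: secrets_metrics(453, 0, 6)  -> 98.69% precision, 87.45% recall
# e.g. Gitleaks:   secrets_metrics(303, 18, 10) -> 96.98% precision, 61.97% recall, 75.62% F1
```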

Methodology

To ensure a fair and reproducible comparison across all providers, we developed a standardized evaluation harness built around real-world CVE data. Each tool was given the same set of vulnerable code samples, sourced from the OpenSSF CVE dataset, and evaluated on its ability to surface the known security issues. Every provider's pipeline was fully automated, with consistent PR structures, identical code diffs, and a unified judging framework applied to all results.

The dataset consists of 165 entries spanning both unfixed (vulnerable) and fixed (patched) variants of each CVE, allowing us to measure both true positive detection and false positive suppression.

Below, we detail how each provider was integrated and invoked.

Shared eval infrastructure

For all PR-based review tools (Greptile, Bugbot, Codex, and Devin), we used a common repository scaffolding approach (a code sketch follows the list):

  • Repository creation: For each CVE entry, a fresh GitHub repository was created under a dedicated organization. The main branch was initialized with the entire source repository excluding the vulnerable file(s), establishing a clean baseline.
  • Feature branch construction: A feature branch (add-code) was then created with the vulnerable file(s) added, making the security-relevant code appear as new additions in the PR diff, exactly as a real contributor's code would surface during review.
  • Draft PR creation: A draft pull request was opened from the feature branch to main, triggering whichever automated review bot was installed on the repository.
  • Comment fetching: After allowing each tool time to complete its analysis, a separate fetcher script retrieved all bot-generated comments (inline review comments, issue comments, and PR review summaries) via the GitHub API. Comments were filtered by bot username and parsed into a normalized format.
  • Timing measurement: For each entry, we measured the analysis latency, the time elapsed between the trigger event (PR creation or explicit @bot review command) and the first bot response. This captures end-to-end wall-clock time including any queuing, indexing, or inference delays.
  • State management and resumability: All runners maintained a JSON state file tracking the status of each entry (processing, completed, failed), enabling safe resumption after interruptions and batch-level rate limiting to stay within API quotas.
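As a concrete illustration, here is a minimal sketch of the scaffolding using the GitHub REST API directly. The organization name, token handling, and helper names are illustrative; only the branch name, draft-PR flow, and comment endpoints mirror the steps above:

```python
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": "Bearer <token>", "Accept": "application/vnd.github+json"}
ORG = "cve-bench-org"  # hypothetical dedicated organization

def create_repo(name: str) -> dict:
    # Fresh private repository per CVE entry; main is initialized (and the
    # vulnerable files pushed to the add-code branch) with plain git afterwards.
    r = requests.post(f"{API}/orgs/{ORG}/repos", headers=HEADERS,
                      json={"name": name, "private": True, "auto_init": True})
    r.raise_for_status()
    return r.json()

def open_draft_pr(repo: str, head: str = "add-code", base: str = "main") -> dict:
    # The draft PR makes the vulnerable code appear as new additions in the diff
    # and triggers whichever review bot is installed on the repository.
    r = requests.post(f"{API}/repos/{ORG}/{repo}/pulls", headers=HEADERS,
                      json={"title": "Add code", "head": head, "base": base, "draft": True})
    r.raise_for_status()
    return r.json()

def fetch_bot_comments(repo: str, pr_number: int, bot_login: str) -> list[dict]:
    # Inline review comments live under /pulls/{n}/comments, general discussion
    # under /issues/{n}/comments; both are filtered by the bot's username.
    inline = requests.get(f"{API}/repos/{ORG}/{repo}/pulls/{pr_number}/comments",
                          headers=HEADERS).json()
    issue = requests.get(f"{API}/repos/{ORG}/{repo}/issues/{pr_number}/comments",
                         headers=HEADERS).json()
    return [c for c in inline + issue if c["user"]["login"] == bot_login]
```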

Tool-specific details

The tools expose different interfaces for performing code reviews, including CLIs, REST APIs, and pull request comments. Where a tool offered an interface other than pull request comments, we used it to avoid VCS-provider rate limits from opening PRs and retrieving comments.

DeepSource

DeepSource was evaluated using our (unreleased) code review API that powers our platform runs. Each request contained a repository bundle and the corresponding diff to analyze, and the API returned every issue it detected in that diff. We analyzed 25 diffs in parallel at a time and measured analysis duration as the API latency.

Greptile

Greptile was evaluated as a fully automated PR review bot. The runner script handled repository creation, branch setup, and PR opening, with Greptile's GitHub App automatically triggering analysis on each new PR. The PR body included @coderabbitai ignore to prevent cross-contamination from other installed bots.

Comment retrieval: The script fetched both inline review comments and general issue comments, filtering by Greptile's bot username (greptileai). The analysis time was calculated as the delta from the first non-Greptile comment (the trigger, typically an @greptileai review mention) to the first Greptile response. Both main and retry runs were supported via separate state files to handle transient failures without corrupting existing results.

  • Trigger mechanism: Automatic on PR creation (GitHub App webhook)
  • Comment types collected: Inline review comments, issue comments
  • Rate limiting: Batches of 5 PRs with 5-second inter-batch delays; 3 concurrent repo creations
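A sketch of the latency calculation described above, assuming comments are plain GitHub API objects with created_at timestamps and user logins:

```python
from datetime import datetime

BOT_LOGIN = "greptileai"

def parse_ts(comment: dict) -> datetime:
    # GitHub returns ISO-8601 UTC timestamps, e.g. "2025-01-15T12:34:56Z".
    return datetime.fromisoformat(comment["created_at"].replace("Z", "+00:00"))

def analysis_latency(comments: list[dict]) -> float:
    ordered = sorted(comments, key=parse_ts)
    trigger = next(c for c in ordered if c["user"]["login"] != BOT_LOGIN)   # e.g. "@greptileai review"
    response = next(c for c in ordered if c["user"]["login"] == BOT_LOGIN)  # first Greptile comment
    return (parse_ts(response) - parse_ts(trigger)).total_seconds()
```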

Cursor Bugbot

Bugbot (Cursor's PR review agent) followed the same PR scaffolding approach with private repositories. The runner script created repositories, pushed code, and opened PRs that Bugbot would automatically review.

Comment retrieval: A script parsed Bugbot's distinctive comment format, which embeds file location metadata in hidden HTML comments (<!-- LOCATIONS START ... LOCATIONS END -->). These embedded location references were extracted using regex to produce structured file#L{start}-L{end} mappings, providing precise line-level attribution for each finding. The fetcher also collected PR reviews (top-level review summaries) in addition to inline and issue comments.

  • Trigger mechanism: Automatic on PR creation
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Semaphore-based concurrency (3 parallel), sequential batch processing
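A sketch of the location extraction. Only the HTML-comment delimiters and the file#L{start}-L{end} shape come from Bugbot's output as described above; the exact contents of the hidden block are assumed here:

```python
import re

LOCATIONS_RE = re.compile(r"<!-- LOCATIONS START(.*?)LOCATIONS END -->", re.DOTALL)
FILE_RANGE_RE = re.compile(r"(?P<path>[^\s#]+)#L(?P<start>\d+)-L(?P<end>\d+)")

def extract_locations(comment_body: str) -> list[dict]:
    locations = []
    for block in LOCATIONS_RE.findall(comment_body):
        for m in FILE_RANGE_RE.finditer(block):
            locations.append({
                "file": m.group("path"),
                "start_line": int(m.group("start")),
                "end_line": int(m.group("end")),
            })
    return locations
```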

OpenAI Codex

Codex required a slightly different workflow since it operates as a comment-triggered agent rather than an automatic PR webhook. The script created duplicate branches (add-code-codex) from the existing add-code branches and opened separate draft PRs for Codex to review. A manual @codex review comment was then posted on each PR to trigger the analysis.

Comment retrieval: A script handled Codex's multi-modal response pattern: when Codex finds issues, it posts inline review comments; when it finds none, it posts an issue comment stating no problems were detected. The fetcher sorted all Codex-authored items chronologically and used the last item's type to determine the final verdict: if the last activity was inline review comments, those were collected as findings; if it was an issue comment, the entry was recorded as having zero detections. Analysis time was measured from the latest @codex review trigger to the latest Codex response.

  • Trigger mechanism: Manual @codex review comment on PR
  • Bot usernames: chatgpt-codex-connector, chatgpt-codex-connector[bot]
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Batches of 5 with 2-second delays; 3 concurrent PR creations
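A sketch of that last-item verdict logic; the kind labels distinguishing inline review comments from issue comments are our own normalization, not GitHub field names:

```python
CODEX_LOGINS = {"chatgpt-codex-connector", "chatgpt-codex-connector[bot]"}

def codex_findings(items: list[dict]) -> list[dict]:
    # `items` mixes inline review comments (kind="review_comment") and
    # issue comments (kind="issue_comment"), each with a created_at timestamp.
    codex_items = sorted(
        (i for i in items if i["user"]["login"] in CODEX_LOGINS),
        key=lambda i: i["created_at"],
    )
    if not codex_items:
        return []
    if codex_items[-1]["kind"] == "issue_comment":
        return []  # last activity was a "no problems detected" comment: zero findings
    return [i for i in codex_items if i["kind"] == "review_comment"]
```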

Devin Review

Devin Review (by Cognition) was integrated similarly to Codex: separate branches (add-code-devin) and PRs were created via a Python script. Reviews were triggered using browser automation via Playwright, which transforms GitHub PR URLs into Devin Review URLs by replacing github.com with devinreview.com and opens each PR in a browser tab to initiate analysis.

The trigger script requires manual GitHub OAuth login on first run, then maintains a persistent Chromium profile for subsequent authenticated sessions. PRs are processed in batches of 5, with a 30-second wait after loading each batch to allow Devin's backend to register the review requests before closing tabs.
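A sketch of the trigger flow. The profile directory is illustrative; the URL rewrite and the 30-second settle time per batch of five follow the description above:

```python
import time
from playwright.sync_api import sync_playwright

def trigger_devin_reviews(pr_urls: list[str], profile_dir: str = ".devin-profile") -> None:
    with sync_playwright() as p:
        # A persistent Chromium profile keeps the GitHub OAuth session from the
        # first (manual) login available to later runs.
        ctx = p.chromium.launch_persistent_context(profile_dir, headless=False)
        for start in range(0, len(pr_urls), 5):              # batches of 5 PRs
            pages = []
            for url in pr_urls[start:start + 5]:
                page = ctx.new_page()
                page.goto(url.replace("github.com", "devinreview.com"))
                pages.append(page)
            time.sleep(30)   # let Devin's backend register the review requests
            for page in pages:
                page.close()
        ctx.close()
```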

Comment retrieval: We used a dedicated trigger state file to record the exact timestamp of each review request. This allowed precise latency measurement from the trigger to Devin's first response, accounting for timezone normalization between local trigger timestamps and GitHub's UTC-based comment timestamps. The fetcher also captured Devin's PR review summaries (posted via the GitHub Reviews API with submitted_at timestamps), which Devin uses to indicate when no issues are found.

  • Trigger mechanism: Manual review request via browser automation (Playwright)
  • Bot username: devin-ai-integration[bot]
  • Comment types collected: Inline review comments, issue comments, PR review summaries
  • Rate limiting: Batches of 5 with 3-second delays

Claude Code

Unlike the PR-based tools, Claude Code was evaluated as a local CLI agent operating directly on the repository filesystem. A script ran Claude Code against each CVE's source repository without creating any GitHub PRs or branches.

Analysis workflow: For each entry, the runner performed an in-place diff simulation within the actual repository (a code sketch follows the steps):

  1. The target file(s) were removed and the removal was committed.
  2. The files were restored from the original commit as staged changes.
  3. A git diff --cached was generated, producing the exact diff that a PR reviewer would see.
  4. Claude Code was invoked with a comprehensive security audit prompt via a custom slash command (/security-review), with the diff and file list embedded directly in the prompt.
  5. After analysis, the repository was restored to its original state via git reset --hard and stash recovery.
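A minimal sketch of steps 1 through 5 with plain git subprocess calls. How the diff and file list are embedded into the /security-review prompt is elided; the invocation flags are those listed further below:

```python
import subprocess

def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout

def review_entry(repo: str, target_files: list[str]) -> str:
    original_head = git(repo, "rev-parse", "HEAD").strip()
    try:
        git(repo, "rm", "--", *target_files)                        # 1. remove the target file(s)
        git(repo, "commit", "-m", "temp: remove target files")      #    and commit the removal
        git(repo, "checkout", original_head, "--", *target_files)   # 2. restore them as staged changes
        diff = git(repo, "diff", "--cached")                        # 3. the diff a PR reviewer would see
        # 4. invoke Claude Code with the security-review slash command; embedding
        #    the diff and file list into the prompt is omitted in this sketch.
        result = subprocess.run(
            ["claude", "/security-review", "--permission-mode", "acceptEdits"],
            cwd=repo, capture_output=True, text=True,
        )
        return result.stdout
    finally:
        git(repo, "reset", "--hard", original_head)                 # 5. restore the original state
```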

Prompt design: Claude Code received a detailed, structured prompt specifying the security categories to examine (injection, auth bypass, crypto issues, code execution, data exposure), severity guidelines, confidence scoring thresholds (minimum 0.7), explicit exclusions (DoS, on-disk secrets, rate limiting), and a mandatory JSON output schema. The prompt also instructed a three-phase analysis methodology: repository context research, comparative analysis against existing security patterns, and vulnerability assessment with data flow tracing.

Actual Prompt: Link

Output parsing: Claude Code's raw text output was parsed for embedded JSON containing structured findings with file paths, line numbers, severity levels, category labels, exploit scenarios, and confidence scores.

  • Invocation: claude /security-review --permission-mode acceptEdits (CLI, local execution)
  • Batch processing: 10 entries per batch, 5 parallel batches, 5-second inter-group delay
  • Timing: Wall-clock elapsed time from subprocess launch to completion

CodeRabbit

CodeRabbit was evaluated using its CLI tool (coderabbit review --plain), running locally against the same in-repo diff simulation used for Claude Code. The script followed an identical git manipulation workflow: remove target files → commit removal → restore from original commit → stage changes → run review.

Authentication and rate limits: CodeRabbit's CLI enforces a limit of 5 reviews per hour, so the runner processed entries in strict batches of 5 with a mandatory 1-hour sleep between batches, making it the most time-constrained provider in the evaluation.

Output parsing: CodeRabbit produces plain-text output with =-delimited sections, each containing structured fields (File:, Line:, Type:, Comment:, Prompt for AI Agent:). A script extracted these into normalized issue objects with file paths, line ranges, comment text, and suggested AI prompts.

  • Invocation: coderabbit review --plain (CLI, local execution)
  • Batch processing: 5 entries per batch, 1-hour inter-batch delay (rate limited)
  • Timing: Wall-clock elapsed time from subprocess launch to completion
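A sketch of the output parsing. The delimiter width and single-line field values are assumptions; the field names are the ones listed above:

```python
import re

SECTION_SPLIT = re.compile(r"^=+\s*$", re.MULTILINE)   # runs of "=" separate findings
FIELDS = ("File", "Line", "Type", "Comment", "Prompt for AI Agent")

def parse_coderabbit(plain_output: str) -> list[dict]:
    issues = []
    for section in SECTION_SPLIT.split(plain_output):
        issue = {}
        for field in FIELDS:
            m = re.search(rf"^{re.escape(field)}:\s*(.+)$", section, re.MULTILINE)
            if m:
                issue[field.lower().replace(" ", "_")] = m.group(1).strip()
        if issue:
            issues.append(issue)
    return issues
```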

Semgrep

Semgrep served as the static analysis baseline, a deterministic, rule-based scanner with no LLM component. A script ran semgrep scan with --config auto (Semgrep's curated rule registry) directly against each CVE's vulnerable file(s).

Unlike the other providers, Semgrep required no git manipulation, PR creation, or prompt engineering. Each file was scanned independently, and the JSON output was parsed for rule matches including rule ID, severity, message, and precise start/end positions.

Parallelism: Semgrep's lightweight, deterministic execution allowed for significantly higher concurrency, up to 10 concurrent scans via asyncio semaphore, with no rate limiting concerns. Both unfixed and fixed variants were processed, split evenly when a maximum CVE count was specified.

  • Invocation: semgrep scan <files> --json --config auto
  • Parallelism: Up to 10 concurrent scans
  • Timing: Extracted from Semgrep's internal profiling_times.total_time field, falling back to wall-clock time
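A sketch of the concurrent runner and JSON parsing, using field names from Semgrep's standard --json schema (results[].check_id, .path, .start.line, .end.line, .extra):

```python
import asyncio
import json

async def scan(files: list[str], sem: asyncio.Semaphore) -> list[dict]:
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            "semgrep", "scan", *files, "--json", "--config", "auto",
            stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
    return [
        {
            "rule_id": r["check_id"],
            "severity": r["extra"]["severity"],
            "message": r["extra"]["message"],
            "path": r["path"],
            "start_line": r["start"]["line"],
            "end_line": r["end"]["line"],
        }
        for r in json.loads(stdout)["results"]
    ]

async def scan_all(entries: list[list[str]]) -> list[list[dict]]:
    sem = asyncio.Semaphore(10)   # up to 10 concurrent scans
    return await asyncio.gather(*(scan(files, sem) for files in entries))
```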

Artifacts

All judged results from the security benchmark are publicly available in the DeepSourceCorp/benchmarks repository.

| Tool | Judged Results |
|---|---|
| Claude Code | claude-code.jsonl |
| CodeRabbit | coderabbit.jsonl |
| Codex | codex.jsonl |
| Cursor BugBot | cursor-bugbot.jsonl |
| DeepSource | deepsource.jsonl |
| Devin | devin.jsonl |
| Greptile | greptile.jsonl |
| Semgrep | semgrep.jsonl |

Each JSONL file contains the evaluation output from the LLM judge (Claude Opus 4.5), which compares each tool's detected issues against the known CVE vulnerability description and determines whether findings are True Positives, False Positives, True Negatives, or False Negatives.

Scoring and Judging Protocol

Judge: Claude Opus 4.5

Each CVE/variant pair is scored as a single binary outcome — True Positive, False Positive, True Negative, or False Negative. Regardless of how many comments, findings, or files a tool flags on a given entry, the question is singular: did this tool correctly identify the specific CVE vulnerability? A tool that posts 15 comments on a PR receives the same credit as one that posts a single precise finding, and there is no partial credit or weighted scoring.

Matching rules. A detected issue counts as a "hit" when it satisfies three criteria simultaneously: (1) it describes the same security impact as the CVE (e.g., remote code execution, data exfiltration, authentication bypass), (2) it involves the same attack pattern (e.g., unsanitized user input reaching a SQL query, path traversal via user-controlled filename), and (3) it identifies the exact vulnerability instance described by the CVE — not a similar issue elsewhere in the codebase. Line numbers need not match exactly; semantic equivalence suffices. A finding that identifies the vulnerable function or describes the exploitable pattern qualifies even if line references differ due to formatting.

Edge cases. If a tool posts multiple comments and at least one correctly identifies the CVE, the entry is scored as a True Positive. Generic warnings (e.g., "add input validation") do not count unless they specifically describe the CVE's attack vector. Findings in the correct vulnerability category but pointing at the wrong location (e.g., "XSS in login form" when the CVE is "XSS in comment field") are not matches. Commentary on fix quality or partial fix criticism does not constitute detection of the original vulnerability.

LLM judge (prompt). All verdicts were produced by the LLM judge (Claude Opus 4.5), which received a structured prompt containing the CVE description (fetched from the OSV API when available, otherwise from the dataset's original explanation), the variant label (fixed or unfixed), and all detected issues with their file paths and explanations. The judge was instructed to determine whether any detected issue matches the exact CVE vulnerability — not just the same category — focusing on security outcome equivalence rather than terminology (e.g., "injection" ≈ "unsanitized input" ≈ "uncontrolled data in query"). For fixed variants, the judge verified that findings describe the original vulnerability rather than concerns about fix completeness. Critically, the judge did not see the tool's name, eliminating potential bias toward or against specific vendors. The judge output structured JSON with per-issue reasoning and an overall cve_matches_any_issue verdict.
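Putting the protocol together, each judged entry reduces to one cell of the confusion matrix. A minimal sketch of that mapping, using the variant labels described above:

```python
def classify(variant: str, cve_matches_any_issue: bool) -> str:
    # Unfixed (vulnerable) variants reward detection; fixed (patched) variants
    # reward staying quiet about the already-remediated CVE.
    if variant == "unfixed":
        return "TP" if cve_matches_any_issue else "FN"
    return "FP" if cve_matches_any_issue else "TN"
```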

Dataset

The benchmark corpus comprises 165 CVE entries filtered from the broader OpenSSF vulnerability dataset. Each entry was required to meet three criteria:

  • Dual-commit availability: the CVE must have both prePatch and postPatch commits documented, enabling construction of both vulnerable (unfixed) and remediated (fixed) variants
  • Ground truth present: the CVE must include a non-empty weaknesses field (typically CWE identifiers), providing authoritative classification of the vulnerability type for evaluation purposes, and
  • Tractable file size: the affected file(s) must not exceed 1,000 lines of code, keeping benchmark execution duration reasonable across all tools and enabling consistent analysis depth.

These filters prioritize evaluation quality over dataset size. Every entry has verifiable ground truth and reproducible vulnerable/fixed states.
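A sketch of the filter, assuming each dataset entry carries the prePatch, postPatch, and weaknesses fields named above; the line-count field is hypothetical and stands in for however file size is derived:

```python
MAX_LINES = 1000

def eligible(entry: dict) -> bool:
    has_both_commits = bool(entry.get("prePatch")) and bool(entry.get("postPatch"))
    has_ground_truth = bool(entry.get("weaknesses"))      # e.g. CWE identifiers
    # "file_line_counts" is a hypothetical field: lines of code per affected file.
    tractable = all(loc <= MAX_LINES for loc in entry.get("file_line_counts", []))
    return has_both_commits and has_ground_truth and tractable

# corpus = [e for e in all_entries if eligible(e)]   # yields the 165-entry benchmark
```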

Unlike synthetic benchmarks where vulnerabilities are artificially constructed, every entry in this dataset represents a real security incident: code that shipped to production, was discovered to be vulnerable, received a CVE identifier, and was subsequently patched by maintainers. This grounds the evaluation in practical security relevance rather than theoretical vulnerability patterns.
