We Tested 4 Secret Scanners on 8 Real Repos. The Results Were Eye-Opening.
We scanned 2.2 million lines of code across 8 popular open-source repositories using Gitleaks, TruffleHog, GitGuardian, and Puaro. Here are the raw numbers — and what they actually mean.
You've probably heard the advice: "Add a secret scanner to your pipeline." Good advice. But which one? And once it's running — can you actually trust what it tells you?
We ran a benchmark to find out.
We took 8 popular open-source repositories, scanned 2.2 million lines of code, and ran four tools against all of it: Gitleaks, TruffleHog, GitGuardian, and Puaro. Then we looked at the raw numbers.
The results surprised us — not because any tool was secretly bad, but because of how differently they answered the same question: "Is this a real security threat?"
The problem nobody talks about: alert noise
Before the numbers, a quick framing.
When a secret scanner flags something, it's saying: "There might be a sensitive credential here." That flag is either correct (a real threat) or incorrect (a false positive — a test value, a placeholder, a code example).
The problem? Most tools fire alerts without telling you which is which. You get a list of 8,449 items and you're on your own.
That's not a small problem. It's a fundamental one.
Alert fatigue is real. When developers see hundreds of false positives, they start ignoring alerts entirely. The tool that cried wolf doesn't protect you — it just creates noise.
The headline number: OpenSSL
We start with OpenSSL — the most important cryptography library in the world, used in almost every application that handles secure connections. If anything is a real-world stress test for a scanner, it's this.
Here's what each tool flagged:
| Tool | Alerts fired | What it told you about them |
|---|---|---|
| Gitleaks | 8,449 | Nothing. A flat list. |
| GitGuardian | 675 | Nothing. A flat list. |
| TruffleHog | 349 | Nothing. A flat list. |
| Puaro | 15 | Severity level + reasoning for every single one. |
Puaro found 15 alerts. That's a 99.8% noise reduction compared to Gitleaks.
More importantly: of those 15, we found 5 actual threats — 2 critical and 3 high severity. Every finding came with an explanation of why it matters.
Here's an example. Two private cryptographic keys were found in OpenSSL's fuzzing test suite:
"Contains a standard PEM header for an EC private key. Part of a fuzzing harness for OpenSSL using static test keys. Private keys are highly sensitive assets that should never be committed to version control."
And one more, rated HIGH:
"PEM-encoded DSA private key in a fuzzing test suite. Non-production context, but committing raw private keys risks accidental propagation to production."
Notice what's happening: the AI isn't just pattern-matching. It's reading the context — which file, what kind of project, what the code around it looks like — and then explaining its reasoning in plain English.
The full benchmark: all 8 repos
We didn't just test OpenSSL. Here's the complete data across 8 repositories — ranging from crypto libraries to intentional vulnerability demos.
| Repository | Gitleaks | TruffleHog | GitGuardian | Puaro |
|---|---|---|---|---|
| openssl | 8,449 | 349 | 675 | 15 (2 critical, 3 high) |
| trufflehog | 1,211 | 1,793 | 1,216 | 1,619 (all low) |
| gitleaks | 2 | 181 | 320 | 413 (all low) |
| bitcoin | 28 | 0 | 15 | 90 (all low) |
| juice-shop | 49 | 3 | 131 | 51 (1 high, 50 low) |
| wrongsecrets | 38 | 19 | 40 | 40 (all low) |
| AgentDefense | 14 | 1 | 12 | 7 (4 high, 2 medium) |
| BenchmarkJava | 2 | 194 | 16 | 1 (1 low) |
| TOTAL | 9,793 | 2,540 | 2,425 | 2,236 |
A few things stand out immediately.
Gitleaks fires almost 10,000 alerts across 8 repos. That's not a typo. In a single repo (OpenSSL), it fires 8,449 times. If your security team is working through Gitleaks output manually, that's not a security process — that's a full-time job.
TruffleHog found zero threats in bitcoin. The Bitcoin Core codebase is one of the most security-conscious projects in existence. It has real test keys and sensitive data in its repo. TruffleHog missed all of it.
Puaro found 7 high and medium severity issues in AgentDefense that every other tool either missed or buried in noise.
A study in intelligence: OWASP WrongSecrets
One data point deserves special attention because it shows something the raw numbers alone don't capture.
OWASP WrongSecrets is a project specifically designed to contain fake secrets for training purposes. It's full of obviously fake credentials with names like youCantHandleThisSecret. Every tool found roughly the same number of alerts:
- Gitleaks: 38
- TruffleHog: 19
- GitGuardian: 40
- Puaro: 40
Similar counts. Very different output.
GitGuardian, Gitleaks, and TruffleHog hand you a list. They say "these look like secrets" and stop there.
Puaro classifies every single one as Low severity and explains why:
"The repository name 'wrongsecrets' is a known benchmark tool. The value is a human-readable phrase rather than a high-entropy credential. Risk is minimal."
That's the difference between a fire alarm and a fire fighter. One tells you something might be burning. The other tells you it's probably the toaster, not the building.
The goal of secret scanning isn't to find everything that looks like a secret. It's to find the things that are actually dangerous — and tell you clearly which ones those are.
The key numbers at a glance

- 99.8% noise reduction on OpenSSL (15 vs 8,449 alerts)
- 2.2 million lines of code scanned across 8 repos
- 5 real threats confirmed in OpenSSL (2 critical + 3 high)
- The only tool in the benchmark to classify severity or explain findings
Why the numbers look this way: how Puaro works
The benchmark results aren't an accident. They come from a fundamentally different approach to scanning.
Most secret scanners work in one step: search through code for strings that match known patterns. If it matches, fire an alert. End of story.
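To make that concrete, here is roughly what a one-step scanner amounts to, sketched with a toy rule set (these two regexes are illustrative stand-ins, not any real tool's patterns):

```python
import re

# Toy rules in the style of a one-step scanner: match -> alert, no context.
RULES = {
    "private-key": re.compile(r"-----BEGIN (?:RSA |EC |DSA )?PRIVATE KEY-----"),
    "generic-secret": re.compile(r"(?i)(?:secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
}

def scan(text: str) -> list[tuple[str, int]]:
    """Return (rule_name, line_number) for every match -- each one is an alert."""
    alerts = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                alerts.append((name, lineno))
    return alerts

sample = 'password = "hunter2hunter2"\nx = 1\n'
print(scan(sample))  # fires on line 1 regardless of whether it's a test fixture
```

Notice there is no notion of severity, file context, or intent anywhere in that loop. That's the whole design.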
Puaro runs four stages before declaring anything a finding.
Stage 1: Filter out the obvious noise
Before even looking for secrets, Puaro throws away everything that clearly isn't dangerous. Test directories, documentation folders, example files, placeholder values, code comments. The vast majority of what a generic scanner would flag never even reaches the next stage.
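A minimal sketch of this kind of pre-filter, using hypothetical directory and placeholder lists rather than Puaro's actual rules:

```python
from pathlib import PurePosixPath

# Illustrative noise lists -- hypothetical, not Puaro's real configuration.
NOISY_DIRS = {"test", "tests", "docs", "examples", "fixtures"}
PLACEHOLDERS = {"changeme", "your-api-key", "xxxx", "example", "dummy"}

def is_obvious_noise(path: str, value: str) -> bool:
    """Drop candidates living in test/docs paths or carrying placeholder values."""
    parts = {p.lower() for p in PurePosixPath(path).parts}
    if parts & NOISY_DIRS:
        return True
    return value.lower() in PLACEHOLDERS

print(is_obvious_noise("tests/fixtures/key.pem", "whatever"))       # True
print(is_obvious_noise("src/config.py", "changeme"))                # True
print(is_obvious_noise("src/config.py", "AKIAABCD1234EFGH5678"))    # False
```

Anything that returns True here never reaches the later, more expensive stages.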
Stage 2: Find real candidates
From what's left, Puaro uses provider-specific detection patterns — targeting the actual formats of real credentials across dozens of services — to find genuine suspects. This is far more targeted than generic pattern matching.
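Two well-known public credential formats show why provider-specific shapes are tighter than generic "looks random" rules (these are standard documented formats; real scanners carry dozens more):

```python
import re

# Well-known public credential shapes: AWS access key IDs are a 4-letter
# prefix plus 16 uppercase alphanumerics; classic GitHub PATs are ghp_ + 36.
PROVIDER_PATTERNS = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github-pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

def find_candidates(text: str) -> list[str]:
    """Return the names of provider patterns that match somewhere in text."""
    return [name for name, pat in PROVIDER_PATTERNS.items() if pat.search(text)]

# AWS's own documented example key ID matches; arbitrary strings don't.
print(find_candidates("key = AKIAIOSFODNN7EXAMPLE"))  # ['aws-access-key-id']
```

A generic entropy rule would flag thousands of random-looking strings; a format rule like this only fires on strings that could actually be that provider's credential.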
Stage 3: Ask AI to reason about it
Here's the step the other tools skip. Puaro's AI reasoning engine evaluates each candidate in full context: which file is this in, what kind of project is this, what does the surrounding code look like?
The AI doesn't just say "this looks like a key." It reasons about it:
- Is this a test fixture?
- Is it a configuration default that ships with many projects?
- Is the entropy high enough to be a real credential?
- Does the surrounding code suggest this is a real usage or a documentation example?
The output is a structured verdict: severity (critical, high, medium, low), a confidence level, and a plain-English explanation.
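The entropy check is the easiest of these signals to make concrete. Shannon entropy measures how random a string looks per character; human-readable phrases score noticeably lower than machine-generated credentials. A quick sketch (the comparison, not any particular threshold, is the point):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character: H = -sum(p * log2(p)) over char frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A human-readable phrase vs. a random-looking token of similar length.
phrase = "youCantHandleThisSecret"
token = "fR9xK2mQ7vLp4sW8zTn3bYc6"
print(round(shannon_entropy(phrase), 2))
print(round(shannon_entropy(token), 2))
```

The phrase repeats common letters, so its per-character entropy sits well below the token's, which is one reason a reasoning layer can confidently downgrade WrongSecrets-style values.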
Stage 4: Trace where it goes
For real findings, Puaro goes one step further. It uses code analysis to map the "blast radius" — where is this secret used? Does it flow to a network call? Does it get logged anywhere? What would an attacker be able to do with it?
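As a rough illustration of what "tracing where it goes" means, here is a toy Python-AST walk that flags sink calls fed a named secret variable. This is a crude stand-in for real dataflow analysis, and the sink list is hypothetical:

```python
import ast

# Illustrative sink names only -- a real analysis resolves imports and dataflow.
SINKS = {"post", "get", "info", "debug", "print"}

def trace_secret(source: str, secret_name: str) -> list[tuple[str, int]]:
    """Return (call_name, line) for every sink call that receives the secret."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            args = node.args + [kw.value for kw in node.keywords]
            uses_secret = any(
                isinstance(a, ast.Name) and a.id == secret_name for a in args
            )
            if name in SINKS and uses_secret:
                hits.append((name, node.lineno))
    return hits

code = (
    "api_key = load_key()\n"
    "requests.post(url, headers=api_key)\n"
    "logger.info(api_key)\n"
)
print(trace_secret(code, "api_key"))  # the key reaches a network call and a logger
```

Even this toy version answers a question a flat alert list can't: does the secret actually flow anywhere an attacker could reach?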
The result: By the time the AI makes a decision, it's only evaluating a highly filtered set of genuine suspects — not your entire codebase. That's why the output is signal, not noise.
"Isn't this just asking ChatGPT about your code?"
Fair question. The short answer is no.
A "GPT wrapper" approach would send your entire codebase to an AI and ask it to find secrets. That approach has real problems: it's expensive, it's slow, AI models can invent findings that don't exist (called hallucinations), and it gives you no way to verify the results.
Puaro's approach is the opposite. The AI only sees candidates that have already passed through three rounds of deterministic filtering. By the time AI reasoning happens, the hard work is done — the AI is the final check, not the primary scanner.
| | GPT Wrapper | Puaro (Compound AI) |
|---|---|---|
| How it works | Code → AI → Results | Code → Filters → Patterns → AI → Analysis → Results |
| AI sees | 100% of your code | Only pre-filtered suspects |
| Hallucination risk | High — no verification layer | Low — deterministic filters run first |
| Reliability | Entirely dependent on the LLM | Multi-layer system with built-in redundancy |
What only Puaro gives you
Here's the feature comparison from the benchmark:
| Capability | Puaro | Gitleaks | TruffleHog | GitGuardian |
|---|---|---|---|---|
| AI-powered detection | ✓ | ✗ | ✗ | partial |
| Severity classification (Critical / High / Medium / Low) | ✓ | ✗ | ✗ | ✗ |
| Plain-English reason for every finding | ✓ | ✗ | ✗ | ✗ |
| Secret lifecycle analysis (where it flows in code) | ✓ | ✗ | ✗ | ✗ |
| Remediation guidance per finding | ✓ | ✗ | ✗ | ✗ |
| Real-time PR scanning | ✓ | CI only | ✓ | ✓ |
| Secret flow visualization | ✓ | ✗ | ✗ | ✗ |
Every other tool in this benchmark outputs a flat list. You get a location ("line 42 in config.js") and a type ("potential AWS key"). No severity. No reason. No guidance.
Puaro is the only tool in the benchmark that answers the question developers actually need answered: "Should I be worried about this right now?"
What this means in practice
Let's make this concrete.
Imagine your team scans a 200,000-line codebase. Here's what you're likely looking at with each tool:
- Gitleaks → You receive 1,000+ alerts with no severity and no context. Your most junior developer gets assigned to triage. They close half as false positives (guessing). The other half sit unresolved for weeks.
- TruffleHog → Fewer alerts, because it verifies credentials against providers. But verification only works for credential types it knows about; anything else gets missed or shown unverified.
- GitGuardian → A cleaner interface than the others, but the same fundamental problem: no reasoning, no severity, no guidance. A list is still a list.
- Puaro → Alerts come with severity levels and explanations. Your team acts on critical and high findings first; everything else is triaged automatically. The developer who receives the alert knows exactly what to do.
The bottom line: A finding you understand is one you can fix. A finding you can't understand is just noise — and noise gets ignored.
The honest takeaway
The other tools in this benchmark aren't bad. Gitleaks is fast and free, which makes it a reasonable pre-commit gate. TruffleHog's credential verification is genuinely useful for deep historical audits. GitGuardian has a polished interface and solid detection.
But none of them answer the question that matters: "Is this actually dangerous?"
That gap — between finding something and understanding it — is where most security breaches actually happen. Not because the tool missed the secret, but because the alert was buried in noise and nobody got to it in time.
The benchmark numbers tell that story clearly. 9,793 alerts from Gitleaks across 8 repos, 8,449 of them on OpenSSL alone. Puaro fired 15 on that same repo, every one explained, every one classified.
You can't act on 10,000 alerts. You can act on 15.
Want to see what Puaro finds in your own codebase? Start scanning free — setup takes under 5 minutes, no credit card required.