General Liquidity

Introducing SharpeBench

Give a thousand monkeys a quarter of market data and one of them will look like Renaissance. Rank traders by return over a short window and you are mostly ranking the variance of noise, not skill.

Today we are open-sourcing SharpeBench, a luck-robust benchmark for AI trading agents. It is a single deterministic binary that takes any agent, in any language, and scores it not on how much it made, but on whether its edge is real. The crates are on crates.io and the methodology is below.

We built it because the field is being measured on sand. A 2026 audit of nineteen LLM-trading studies found that zero reached full reproducibility, only two used time-consistent train and test splits, and exactly one modelled transaction costs. When the scoreboard rewards the luckiest run on the friendliest window, that is what the research optimises for. For a coding agent, a flattering benchmark wastes some time. For a capital agent it actively selects the strategy most likely to blow up, the one whose backtest looked best because it overfit hardest.

Leaderboards Rank Luck

Finance has known for a long time that performance and skill are not the same thing. When Barras, Scaillet and Wermers applied a false-discovery correction to two decades of mutual funds, they found that roughly three quarters had no genuine alpha at all, and that most of the apparent winners were false positives. Fama and French reached the same conclusion with a different method: once you account for the sheer number of funds, very few beat their benchmark by more than luck would predict. The lesson is old and well tested. A track record is evidence, but a short one, drawn from a large pool of candidates, is weak evidence.

The current generation of AI-trading benchmarks inherits none of this caution. FinBen, one of the most cited, ranks GPT-4 first on its trading task, yet its own reported Sharpe ratio is 1.51 with a confidence interval of plus or minus 1.08, a band running from about 0.43 to 2.59. An interval that wide cannot separate the agent it places first from the middle of the field. StockBench evaluates on a single four-month window and ranks on cumulative return and drawdown, with no repeated-run reliability and no deflation. QuantBench names overfitting as an open problem and then ranks pipeline performance with no luck-corrected metric to address it.

None of this is surprising once you look at how easy the numbers are to manufacture. Bailey, Borwein, Lopez de Prado and Zhu showed that a high backtest Sharpe is trivially achievable by trying enough configurations: with only a handful of years of data, a few dozen variations are enough to produce a spurious two-plus Sharpe that vanishes out of sample. Harvey and Liu turned this into a practical correction, a Sharpe-ratio haircut that scales with how many strategies were tried. A benchmark that does not apply that haircut is not measuring skill. It is running an overfitting contest and handing the trophy to the best overfitter.

Skill That Survives Deflation

The Sharpe ratio quoted everywhere is just the sample mean of excess returns divided by their standard deviation. It says nothing about how many strategies you tried before you found that one, how long the track record is, or how skewed and fat-tailed the returns are. SharpeBench adds, as ranking gates, the corrections that fix exactly this.

The first gate is the Deflated Sharpe Ratio of Bailey and Lopez de Prado. Its building block, the Probabilistic Sharpe Ratio, asks a sharper question than “how big is the Sharpe?”: given the track length and the return distribution’s skew and kurtosis, what is the probability that the true Sharpe exceeds a benchmark?

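Probabilistic Sharpe Ratio:

$$\widehat{\mathrm{PSR}}(SR^{*}) = \Phi\!\left( \frac{(\widehat{SR} - SR^{*})\,\sqrt{n - 1}}{\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}} \right)$$

Here $\Phi$ is the standard normal CDF, $\widehat{SR}$ is the observed Sharpe, $SR^{*}$ is the benchmark Sharpe, $n$ is the number of return observations, and $\hat{\gamma}_3$ and $\hat{\gamma}_4$ are the skew and kurtosis of the returns.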
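The Probabilistic Sharpe Ratio above can be sketched in a few lines of Rust. This is an illustration, not the SharpeBench kernel: the function names are hypothetical, the moments are population (divide-by-n) moments of per-period returns, and the normal CDF uses a standard polynomial approximation to the error function.

```rust
/// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
/// (absolute error around 1.5e-7, adequate for illustration).
fn norm_cdf(x: f64) -> f64 {
    let z = x.abs() / std::f64::consts::SQRT_2;
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    let erf = 1.0 - poly * (-z * z).exp();
    if x >= 0.0 { 0.5 * (1.0 + erf) } else { 0.5 * (1.0 - erf) }
}

/// Per-period Sharpe, skew and (non-excess) kurtosis of a return series.
/// Assumption: population moments, and returns already net of the risk-free rate.
fn moments(returns: &[f64]) -> (f64, f64, f64) {
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let m2 = returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let m3 = returns.iter().map(|r| (r - mean).powi(3)).sum::<f64>() / n;
    let m4 = returns.iter().map(|r| (r - mean).powi(4)).sum::<f64>() / n;
    let sd = m2.sqrt();
    (mean / sd, m3 / sd.powi(3), m4 / (m2 * m2))
}

/// Probability that the true Sharpe exceeds `sr_star`, given the observed
/// Sharpe `sr`, the track length `n`, and the skew and kurtosis of returns.
fn psr(sr: f64, sr_star: f64, n: usize, skew: f64, kurt: f64) -> f64 {
    let denom = (1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr).sqrt();
    norm_cdf((sr - sr_star) * ((n as f64) - 1.0).sqrt() / denom)
}
```

With these definitions, a per-period Sharpe of 0.1 over 253 observations with normal skew and kurtosis scores about 0.94 against a zero benchmark, and the same Sharpe with a skew of -1 scores lower, because negative skew widens the uncertainty around the estimate.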

The deflated version sets that benchmark to the Sharpe you would expect from the best of N random trials, so every additional strategy you try raises the bar the survivor must clear.

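The deflation benchmark is the expected maximum Sharpe across N trials, scaled by the dispersion of the trial Sharpes:

$$SR^{*} = \sqrt{\mathrm{V}\!\left[\widehat{SR}_k\right]}\,\left( (1 - \gamma)\,\Phi^{-1}\!\left(1 - \frac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1 - \frac{1}{N e}\right) \right)$$

Here $\mathrm{V}[\widehat{SR}_k]$ is the variance of the Sharpe estimates across the $N$ trials, $\gamma \approx 0.5772$ is the Euler-Mascheroni constant, $e$ is Euler's number, and $\Phi^{-1}$ is the inverse of the standard normal CDF.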

The effect is easiest to see on a single track. Hold one agent’s returns fixed and ask what its deflated score is as the number of strategies that were tried before it grows. The score is near certainty when the track stands alone, falls through the rank-eligibility bar once it is the best of a few dozen, and collapses toward zero once it is the best of hundreds. The agent did not get worse. The context did.

A line chart: the deflated Sharpe of one fixed track falls from about 1.0 toward 0 as the number of strategies tried (N, log scale) grows, crossing the 0.95 eligibility bar near N equals 55.
The same return track, scored against an ever-larger pool of trials. It is convincing as a one-off and indistinguishable from luck once it is the survivor of hundreds of attempts. Computed directly with the SharpeBench kernel formulas (skew and kurtosis set to normal for clarity).
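The same experiment can be sketched in Rust. This is an illustration under stated assumptions rather than the SharpeBench kernel: skew and kurtosis are set to normal, the inverse normal CDF is found by bisection on an approximate CDF, and the inputs used in the note below (a per-period Sharpe of 0.15 over 500 observations, trial dispersion of 0.05) are made up, so the sketch will not reproduce the crossing point in the chart.

```rust
const EULER_GAMMA: f64 = 0.5772156649015329;

/// Standard normal CDF (Abramowitz-Stegun 7.1.26 approximation).
fn norm_cdf(x: f64) -> f64 {
    let z = x.abs() / std::f64::consts::SQRT_2;
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    let erf = 1.0 - poly * (-z * z).exp();
    if x >= 0.0 { 0.5 * (1.0 + erf) } else { 0.5 * (1.0 - erf) }
}

/// Inverse normal CDF by bisection on [-10, 10]; slow but easy to check.
fn norm_inv(p: f64) -> f64 {
    let (mut lo, mut hi) = (-10.0_f64, 10.0_f64);
    for _ in 0..100 {
        let mid = 0.5 * (lo + hi);
        if norm_cdf(mid) < p {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}

/// Expected maximum Sharpe among `n_trials` zero-skill trials whose Sharpe
/// estimates have standard deviation `trial_sd`. A lone track gets a zero bar.
fn expected_max_sharpe(n_trials: u32, trial_sd: f64) -> f64 {
    if n_trials <= 1 {
        return 0.0;
    }
    let n = n_trials as f64;
    let a = norm_inv(1.0 - 1.0 / n);
    let b = norm_inv(1.0 - 1.0 / (n * std::f64::consts::E));
    trial_sd * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)
}

/// Deflated Sharpe: the PSR measured against the expected-maximum benchmark.
/// Assumption: skew 0 and kurtosis 3, so the denominator is sqrt(1 + SR^2 / 2).
fn deflated_sharpe(sr: f64, n_obs: usize, n_trials: u32, trial_sd: f64) -> f64 {
    let sr_star = expected_max_sharpe(n_trials, trial_sd);
    let denom = (1.0 + 0.5 * sr * sr).sqrt();
    norm_cdf((sr - sr_star) * ((n_obs as f64) - 1.0).sqrt() / denom)
}
```

Calling `deflated_sharpe(0.15, 500, n, 0.05)` for n = 1, 10, 100 and 1000 gives a strictly falling score: above 0.99 for the lone track and below 0.5 once it is the best of a thousand. The returns never change; only the bar does.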

A single number, however well corrected, can still hide a strategy that works on most runs and detonates on the rest. So the second gate is reliability. SharpeBench borrows pass^k from Sierra’s agent-reliability work, and the distinction matters: pass@k asks whether an agent succeeds at least once in k tries, which flatters; pass^k asks whether it succeeds on every one of k tries, which is what you need before you hand it capital. An agent must clear the per-run bar on every seed and every window, not on average.
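The gap between the two criteria is easy to see in code. The sketch below is illustrative: the function names are hypothetical, and the probability versions assume that runs succeed independently with the same probability, which real seeds and windows need not satisfy.

```rust
/// pass@k: did at least one of the first k runs clear the bar?
fn pass_at_k(runs: &[bool], k: usize) -> bool {
    runs.len() >= k && runs[..k].iter().any(|&ok| ok)
}

/// pass^k: did every one of the first k runs clear the bar?
fn pass_hat_k(runs: &[bool], k: usize) -> bool {
    runs.len() >= k && runs[..k].iter().all(|&ok| ok)
}

/// Chance of passing pass@k if each run succeeds independently with probability p.
fn pass_at_k_prob(p: f64, k: u32) -> f64 {
    1.0 - (1.0 - p).powi(k as i32)
}

/// Chance of passing pass^k under the same independence assumption.
fn pass_hat_k_prob(p: f64, k: u32) -> f64 {
    p.powi(k as i32)
}
```

Under that assumption, an agent that clears the bar on 80% of runs passes pass@8 almost surely (about 0.999997) but passes pass^8 only about 17% of the time.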

The third gate is field-wide significance. Per-agent deflation is not enough when a whole field is snooping the same data, so SharpeBench runs a deterministic stationary bootstrap and reports White’s Reality Check and Hansen’s Superior Predictive Ability test, with Romano and Wolf’s step-down procedure to identify which agents genuinely beat the benchmark once the whole board is accounted for. The question is not whether the leader looks good. It is whether the leader is real after you admit how many agents were tried at once.
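A minimal sketch of the first of those tests, White's Reality Check on a seeded stationary bootstrap, might look like the following. It is not the SharpeBench implementation: the SplitMix64 generator, the function names and the `mean_block` parameter (expected block length, at least 1) are assumptions made for illustration, and Hansen's SPA studentisation and the Romano-Wolf step-down are omitted.

```rust
/// SplitMix64: a tiny seeded PRNG, so every resample is reproducible.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// One stationary-bootstrap index path: blocks start at a uniform index,
/// wrap around the sample, and have geometric length with mean `mean_block`.
fn stationary_indices(n: usize, mean_block: f64, rng: &mut SplitMix64) -> Vec<usize> {
    let mut idx = Vec::with_capacity(n);
    let mut cur = (rng.next_u64() % n as u64) as usize;
    for _ in 0..n {
        idx.push(cur);
        cur = if rng.next_f64() < 1.0 / mean_block {
            (rng.next_u64() % n as u64) as usize
        } else {
            (cur + 1) % n
        };
    }
    idx
}

/// Reality Check p-value for the null "no agent beats the benchmark".
/// `diffs[k][t]` is agent k's return minus the benchmark's return in period t.
fn reality_check_p(diffs: &[Vec<f64>], n_boot: usize, mean_block: f64, seed: u64) -> f64 {
    let n = diffs[0].len();
    let root_n = (n as f64).sqrt();
    let means: Vec<f64> = diffs.iter().map(|d| d.iter().sum::<f64>() / n as f64).collect();
    // Observed statistic: the best agent's scaled mean outperformance.
    let observed = means.iter().fold(f64::NEG_INFINITY, |m, &x| m.max(root_n * x));
    let mut rng = SplitMix64(seed);
    let mut exceed = 0usize;
    for _ in 0..n_boot {
        // The same index path is applied to every agent to keep cross-dependence.
        let idx = stationary_indices(n, mean_block, &mut rng);
        let mut best = f64::NEG_INFINITY;
        for (d, &mean) in diffs.iter().zip(&means) {
            let boot_mean = idx.iter().map(|&i| d[i]).sum::<f64>() / n as f64;
            best = best.max(root_n * (boot_mean - mean));
        }
        if best >= observed {
            exceed += 1;
        }
    }
    exceed as f64 / n_boot as f64
}
```

Because the generator is seeded, the same inputs and seed always give the same p-value. The maximum is taken over all agents inside each resample, which is what makes the test a statement about the whole board rather than about its leader.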

The fourth gate is the one no return-based benchmark has: process discipline. Placing an order that never passed the risk gate, ignoring a drawdown halt, or bypassing a deny-list zeroes the entry, however good the profit and loss looks. Raw return is still reported, alongside alpha and beta attribution, confidence calibration, edge half-life, out-of-sample decay, and how crowded the agent’s positions are relative to the field. Those numbers make a high score legible. None of them can buy a rank.

Raw Return Cannot Buy Rank

Here is the whole thesis in one board. We score three agents on the same data: a skilled momentum agent, a lucky one that swung big, and a bot that matched the skilled return but bypassed its risk gate.

Bar chart of raw return for three agents. lucky-yolo has the tallest bar (0.00411) but is ineligible; skilled-momentum (0.00202) is ranked first; ungated-bot (0.00202) is disqualified for a risk-gate bypass.
Three agents on the same data. lucky-yolo posts the highest raw return and ranks ineligible, because it does not clear the bar on every run. ungated-bot matches the skilled return but placed an order that never passed the risk gate, so its entry is zeroed.
The agent with the highest raw return ranks last. Only the edge that survives deflation, reliability and process discipline is ranked at all.
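The logic of that board can be sketched as a ranking function in which raw return is carried along but never consulted. The types and field names are hypothetical, the raw returns and the 0.95 bar come from the figures above, and the deflated Sharpe values are invented for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Ranked(usize),
    Ineligible,
    Disqualified,
}

struct BoardEntry {
    name: &'static str,
    raw_return: f64, // reported for context, never used for ranking
    deflated_sharpe: f64,
    passed_every_run: bool,
    process_violation: bool,
}

/// Rank only entries that clear every gate, ordered by deflated Sharpe.
fn rank(entries: &[BoardEntry], dsr_bar: f64) -> Vec<(&'static str, Status)> {
    let mut eligible: Vec<&BoardEntry> = entries
        .iter()
        .filter(|e| !e.process_violation && e.passed_every_run && e.deflated_sharpe >= dsr_bar)
        .collect();
    eligible.sort_by(|a, b| b.deflated_sharpe.total_cmp(&a.deflated_sharpe));
    entries
        .iter()
        .map(|e| {
            let status = if e.process_violation {
                Status::Disqualified
            } else if let Some(pos) = eligible.iter().position(|x| x.name == e.name) {
                Status::Ranked(pos + 1)
            } else {
                Status::Ineligible
            };
            (e.name, status)
        })
        .collect()
}

/// The three agents from the chart; deflated Sharpe values are made up.
fn example_board() -> Vec<BoardEntry> {
    vec![
        BoardEntry { name: "skilled-momentum", raw_return: 0.00202, deflated_sharpe: 0.97,
                     passed_every_run: true, process_violation: false },
        BoardEntry { name: "lucky-yolo", raw_return: 0.00411, deflated_sharpe: 0.60,
                     passed_every_run: false, process_violation: false },
        BoardEntry { name: "ungated-bot", raw_return: 0.00202, deflated_sharpe: 0.97,
                     passed_every_run: true, process_violation: true },
    ]
}
```

Calling `rank(&example_board(), 0.95)` ranks skilled-momentum first, marks lucky-yolo ineligible and disqualifies ungated-bot. Changing any `raw_return` value leaves the result untouched, because the function never reads it.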

A benchmark is only as good as its resistance to gaming, so SharpeBench ships a self-audit that fires five known attacks at the scorer on every run. The build fails if any one of them is not demoted.

| Attack on the scorer | Outcome |
| --- | --- |
| luck dressed up as skill | demoted |
| an order that bypasses the risk gate | entry zeroed |
| exploiting the simulator | demoted |
| breaching the stated mandate | demoted |
| buying rank with raw return | rejected |

Deflation Is Necessary, Not Sufficient

It would be easy to overclaim here, so we will not. A deflated Sharpe corrects for the number of trials, the length of the track and the shape of the return distribution. It does nothing, on its own, about look-ahead bias, survivorship, ignored transaction costs, or a strategy that only works in one regime. A 2026 survey of financial multi-agent evaluation catalogues exactly these as five distinct and recurring failure modes: look-ahead, survivorship, backtest overfitting, transaction-cost neglect and regime-shift blindness. Deflation answers one of them.

That is why SharpeBench is not a single statistic but a stack of legs, each closing a different failure mode. The simulator is point-in-time by construction: its data store only ever returns rows at or before the decision time, so look-ahead bias is unrepresentable rather than policed after the fact. It charges fees, models market impact as the agent’s own order moving the price, and applies financing and partial-fill caps, so a strategy that only works with frictionless fills is exposed. It runs across multiple out-of-sample windows on real frozen data, currently crypto majors and US equity indices, with more to come. Consistency across seeds and windows is the reliability leg. Process discipline is the safety leg. Deflation is the luck leg. The argument is the combination, not any one piece.
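One way to make look-ahead unrepresentable is to give the data store a single read path that takes the decision time as an argument. The sketch below illustrates that idea only; the `Row` and `PointInTimeStore` types are hypothetical and are not the SharpeBench data store.

```rust
struct Row {
    ts: u64, // timestamp of the observation
    price: f64,
}

/// A store whose rows are private, so callers cannot reach past `as_of`.
struct PointInTimeStore {
    rows: Vec<Row>, // kept sorted by timestamp
}

impl PointInTimeStore {
    fn new(mut rows: Vec<Row>) -> Self {
        rows.sort_by_key(|r| r.ts);
        Self { rows }
    }

    /// The only way to read data: every row at or before `decision_time`.
    fn as_of(&self, decision_time: u64) -> &[Row] {
        // Rows are sorted, so the predicate is true for a prefix of the slice.
        let end = self.rows.partition_point(|r| r.ts <= decision_time);
        &self.rows[..end]
    }
}
```

An agent handed `store.as_of(t)` cannot see a row stamped later than `t`, whatever it does with the slice, so there is no leak to police afterwards.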

This matches where the wider evaluation field is moving. The same reliability-versus-capability split that motivates pass^k is now a theme across agent research, from Anthropic’s argument for putting error bars on eval scores to a 2026 science of agent reliability that treats it as a separate axis from raw capability and finds that capability gains have not bought dependability. SharpeBench is what that principle looks like when the domain is capital rather than chat.

Forward-Attestation: Trust Without Trusting the Host

Every backtest-based score has the same weakness: you have to trust that the person reporting it did not tune the strategy to the window after seeing it. The holdout-integrity literature treats time-consistent splits as the fix, and the trading audits keep finding that almost no one applies them. SharpeBench takes the idea one step further into a mechanism we think is new for this setting, and which we are putting forward as the project’s contribution rather than as established practice.

Before an evaluation window’s data exists, an agent publishes a cryptographic commitment to a hash of its frozen artifact. There is nothing to overfit, because the data is not out yet. When results are later scored, each entry is chained to the previous one with an HMAC signature that anyone holding the key can verify in constant time. A benchmark can therefore be operated by a single host and still be independently trusted, because trust comes from the signed, reproducible chain rather than from the host’s good word. General Liquidity hosts the board to start, and our own agent competes on it like any other entrant. Publishing a rank against a luck-robust, independently verifiable benchmark is a much stronger claim than a self-reported number.
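The chaining step can be sketched as follows. To stay dependency-free, the sketch swaps HMAC for a keyed FNV-1a digest, which is a toy stand-in with no cryptographic strength; a real chain would use an HMAC over a secure hash. The entry layout and function names are likewise assumptions for illustration, and the sketch does not model the timing of the commitment itself.

```rust
/// FNV-1a over the key followed by the message parts.
/// TOY stand-in for HMAC: NOT cryptographically secure.
fn toy_mac(key: &[u8], parts: &[&[u8]]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for chunk in std::iter::once(key).chain(parts.iter().copied()) {
        for &b in chunk {
            h ^= u64::from(b);
            h = h.wrapping_mul(0x100000001b3);
        }
    }
    h
}

struct ChainEntry {
    agent: String,
    artifact_hash: u64, // hash of the frozen artifact, committed before the window
    score_bits: u64,    // the f64 score stored as raw bits
    prev_tag: u64,      // tag of the previous entry (0 for the first)
    tag: u64,
}

fn entry_tag(key: &[u8], agent: &str, artifact_hash: u64, score_bits: u64, prev_tag: u64) -> u64 {
    let a = artifact_hash.to_le_bytes();
    let s = score_bits.to_le_bytes();
    let p = prev_tag.to_le_bytes();
    toy_mac(key, &[agent.as_bytes(), &a[..], &s[..], &p[..]])
}

/// Append a scored entry, binding it to the entry before it.
fn append(chain: &mut Vec<ChainEntry>, key: &[u8], agent: &str, artifact_hash: u64, score: f64) {
    let prev_tag = chain.last().map_or(0, |e| e.tag);
    let score_bits = score.to_bits();
    let tag = entry_tag(key, agent, artifact_hash, score_bits, prev_tag);
    chain.push(ChainEntry { agent: agent.to_string(), artifact_hash, score_bits, prev_tag, tag });
}

/// Recompute every tag and check that each entry points at its predecessor.
fn verify(chain: &[ChainEntry], key: &[u8]) -> bool {
    let mut prev = 0u64;
    for e in chain {
        let expected = entry_tag(key, &e.agent, e.artifact_hash, e.score_bits, e.prev_tag);
        if e.prev_tag != prev || e.tag != expected {
            return false;
        }
        prev = e.tag;
    }
    true
}
```

Changing a single bit of any stored score makes `verify` fail for that entry, and because each tag feeds into the next, an entry cannot be silently replaced or reordered without recomputing everything after it.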

Reproducibility is enforced underneath all of this. The scoring kernel is pure: no input or output, no system clock, no ambient randomness, and a fixed floating-point reduction order, so the same submission yields a byte-identical score on any machine, forever. Golden-value tests freeze the numbers and a determinism check in continuous integration runs the same suite twice and asserts the results match. The kernel also compiles to WebAssembly, so a TypeScript agent and the canonical Rust benchmark run the identical scorer. The number an agent sees while developing and the number the public board reports cannot drift apart.
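The reduction-order point is worth a concrete illustration, because floating-point addition is not associative. The sketch below uses hypothetical helper names and shows the idea only: a sum whose order is fixed by the input, and a check that compares two runs bit for bit.

```rust
/// Sum strictly left to right, so the sequence of roundings is fixed
/// by the order of the input and nothing else.
fn sum_in_order(xs: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Run a scoring function twice on the same input and require
/// bit-identical results, in the spirit of a CI determinism check.
fn is_deterministic(score: impl Fn(&[f64]) -> f64, returns: &[f64]) -> bool {
    score(returns).to_bits() == score(returns).to_bits()
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_in_order` gives results that differ in the last bit, which is why a reduction that is free to reorder its terms, for example across threads, cannot promise a byte-identical score.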

Why a Trading Benchmark, and Why Now

AI trading agents are proliferating, and the question every operator and allocator actually needs answered is whether a given agent has skill or just luck, over a horizon short enough to matter and a process strict enough to trust with capital. The field is converging on live, real-time arenas, which is progress, but the new entrants still rank realized performance without deflating it for selection. Even the most relevant public board, the FINOS-governed Open FinLLM Leaderboard, measures financial knowledge, things like sentiment, question answering, named-entity recognition and stock-movement classification, scored on accuracy and F1. It has no live trading-performance axis at all.

SharpeBench is the skill-versus-luck trading track that is missing. It is the same discipline we build into General Liquidity’s own agents, deflation, reliability, process gates and forward-attestation, turned into a public and neutral measuring stick. The deeper bet is that finance is where agentic software has to become real fastest, because markets give back a hard, fast, unforgiving signal that most software categories do not. A benchmark that refuses to be gamed by luck is part of how that signal stays honest.

Open, Reproducible, and Yours to Run

SharpeBench is a Rust workspace, open-source today under Apache and MIT, on crates.io (cargo install sharpebench) and GitHub. Agents are external and language-agnostic: a container or an HTTP endpoint, in any language. The harness sends a point-in-time market observation and the agent returns a decision. That is the entire contract. Point it at your agent and rank it into the field next to a luck floor of random monkeys, so you can see the zero-skill distribution your edge has to clear.
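The shape of that contract, and of a zero-skill monkey, can be sketched in Rust. Everything here is hypothetical: the real harness talks to a container or an HTTP endpoint, and the `Observation`, `Decision` and `Agent` names and fields are invented for illustration rather than taken from the SharpeBench crates.

```rust
/// What the harness sends: only information available at the decision time.
struct Observation {
    timestamp: u64,
    last_price: f64,
}

/// What the agent sends back.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Buy,
    Sell,
    Hold,
}

/// The entire contract: one observation in, one decision out.
trait Agent {
    fn decide(&mut self, obs: &Observation) -> Decision;
}

/// A zero-skill "monkey": ignores the observation and picks at random
/// from a seeded generator, so its behaviour is reproducible.
struct Monkey {
    state: u64,
}

impl Agent for Monkey {
    fn decide(&mut self, _obs: &Observation) -> Decision {
        // One SplitMix64 step.
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        match (z ^ (z >> 31)) % 3 {
            0 => Decision::Buy,
            1 => Decision::Sell,
            _ => Decision::Hold,
        }
    }
}

/// Drive any agent over a sequence of observations.
fn run(agent: &mut dyn Agent, observations: &[Observation]) -> Vec<Decision> {
    observations.iter().map(|o| agent.decide(o)).collect()
}
```

Running many monkeys with different seeds through the same windows gives the zero-skill distribution of scores, which is the floor a real agent's edge has to clear.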

The road from here is about breadth and neutrality, not more machinery: more real data across asset classes, a live forward arena where commitments are made before each window opens, and a path to third-party governance so the board belongs to the field rather than to us. The scoring core is done, deterministic and reproducible. The interesting work now is getting the world to compete on it.

Other leaderboards rank the luckiest run over one quarter. SharpeBench ranks the skill that survives deflation, and proves it forward.