Asking Deep Funding Jurors Better Questions
— Deep Funding, Open Source, Public Goods, Funding, UX Design — 7 min read
Deep Funding runs on a simple premise: break down the impossible task of ranking thousands of projects into manageable pairwise comparisons. Ask experts which project is more valuable and by how much, then use those judgments to allocate funding. The approach has elegance, but the data from the first contest reveals a systematic problem. When jurors compare projects, they consistently overestimate differences, by roughly a factor of three on average.
Let me explain. The median multiplier jurors provide is 5x, meaning they believe one project is five times more valuable than another. But when we mathematically reconcile all comparisons to find the best-fit distribution, what I call the "implied consensus,"¹ the median ratio between projects is only 1.8x. This pattern of systematic overestimation aligns with findings from Optimism's Retro Funding, where researchers found that "humans are good at relative comparisons but bad at outright comparisons."
In this article I want to examine this problem and suggest that we can solve it by asking jurors a few additional questions, and giving them tools and the opportunity to improve their answers.
In Deep Funding's first contest, prominent community members compared project pairs. But their comparisons don't add up: when reconciled, the math shows the projects are much more similar in value than the jurors claim.
Over half of all possible project pairs are nearly equal in value (within 2x of each other), but jurors recognize this similarity only 12% of the time.
Multiplier inflation follows predictable patterns
To understand the implied consensus, I calculated consensus weights using the approach originally suggested by Vitalik. According to these weights, 57% of all possible project pairs differ by less than 2x. Therefore, jurors should expect that most of the time they are receiving a pair of projects of roughly comparable value, where careful discrimination matters most. Yet when jurors provide comparisons, only 7.6% of their multipliers fall below 2x. Even more telling, 47.4% of juror comparisons fall between 3x and 10x, while only 19.4% of consensus ratios lie in that range, a 2.4x overrepresentation of moderate differences. The median inflation factor is 2.75x, meaning jurors systematically report differences nearly three times larger than what the consensus weights imply.
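To make the reconciliation step concrete, here is a minimal sketch of one way to derive implied consensus weights: a least-squares fit over log-ratios, solved by gradient descent. This is an illustration of the general idea, not necessarily the exact method used in the contest, and the project names and multipliers are invented for the example:

```python
import math

# Hypothetical juror data: (project_a, project_b, multiplier) meaning
# "project_a is `multiplier` times more valuable than project_b".
comparisons = [
    ("geth", "solc", 3.0),
    ("solc", "web3py", 5.0),
    ("geth", "web3py", 10.0),
    ("web3py", "geth", 0.2),
]

projects = sorted({p for a, b, _ in comparisons for p in (a, b)})
log_w = {p: 0.0 for p in projects}

# Minimize sum of (log w_a - log w_b - log m)^2 by gradient descent:
# the best-fit log-weights reconcile all (possibly contradictory) ratios.
for _ in range(2000):
    grad = {p: 0.0 for p in projects}
    for a, b, m in comparisons:
        err = log_w[a] - log_w[b] - math.log(m)
        grad[a] += 2 * err
        grad[b] -= 2 * err
    for p in projects:
        log_w[p] -= 0.05 * grad[p]

# Normalize so the weights sum to 1, like a funding distribution.
total = sum(math.exp(v) for v in log_w.values())
weights = {p: math.exp(v) / total for p, v in log_w.items()}

# The implied consensus ratio between any two projects:
ratio = weights["geth"] / weights["web3py"]
```

Note how the jurors here directly claim geth is 10x (or, via the reversed pair, 5x) more valuable than web3py, and 15x by transitivity through solc, yet the reconciled ratio lands between those claims: the consensus compresses inconsistent individual judgments.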
I don't think this is random noise. When humans compare quantities without clear reference points, they tend to favor nice round numbers and systematically overestimate ratios. The comparison format itself also primes jurors to find differences: asking "how many times more valuable" one project is than another pushes them to articulate significant differences that may not meaningfully exist.
Surprisingly, Deep Funding's organizers see this inflation as a feature. Previous funding mechanisms struggled with the opposite problem, producing overly flat distributions in which deserving projects weren't adequately differentiated. As documented in OSO's analysis of Retro Funding Round 5, even expert voters tended toward "peanut butter spread" allocations that didn't reflect the considerable differences in value between projects.
But even if this is an improvement over previous approaches, surely we can do better.
Better questions can produce better data
I'd like to propose three alternative questions and tools that could supplement the current process:
Establishing a personal scale through range definition. Before making individual comparisons, ask each juror to identify what they consider the most and least valuable projects in the set, and estimate the ratio between them. If a juror believes the top project is 1000x more valuable than the bottom, we immediately know their personal scale. When they later say one project is 5x more valuable than another, we can interpret that in context. This creates what economists call a numeraire, a consistent unit of measurement that makes all other comparisons interpretable.
Personal Scale Definition
Before making comparisons, help us understand your scale by identifying the extremes:
The power of this approach becomes clear with simple math. If a juror's total range is 10,000x across 45 projects, then adjacent projects in their mental ranking differ by less than 1.3x on average. This context would immediately reveal when a 500x multiplier between two random projects seems unreasonable.
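The arithmetic behind that claim is just a geometric mean, as this small calculation shows:

```python
# If a juror's full range spans 10,000x across 45 projects, the 44 gaps
# between adjacent ranks multiply together to 10,000, so the average
# adjacent-pair ratio is the geometric mean of the total range:
n_projects = 45
total_range = 10_000
avg_adjacent_ratio = total_range ** (1 / (n_projects - 1))
# ≈ 1.23: neighboring projects in the ranking differ by less than 1.3x
# on average, making a casual 500x multiplier look immediately suspect
```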
Actively seeking comparisons between similar projects. The current random pairing approach misses the most informative comparisons. Instead of only asking about random pairs, we could prompt jurors with targeted questions: "Name a project approximately equal in value to Project X, within a factor of two." This directly addresses the gap in our data, where roughly 45% of comparisons should show near-equal values but currently don't.
Similar Project Identification
Instead of random pairings, let's find projects of similar value:
This approach acknowledges that the hardest and most important discriminations are between similar projects. Finding which projects cluster together provides more information than knowing the obvious gap between the best and worst. It also gives jurors a task that feels more natural than quantifying vast differences.
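A targeted pairing prompt could be generated from running weight estimates. The sketch below assumes we already hold rough per-project weights (here invented for illustration) and simply filters for candidates within a factor of two of a target project:

```python
# Hypothetical running weight estimates, e.g. reconciled from the
# comparisons collected so far. Names and values are made up.
estimated_weights = {
    "geth": 0.40, "remix": 0.22, "hardhat": 0.15,
    "solc": 0.12, "ethers": 0.08, "web3py": 0.03,
}

def similar_projects(target, weights, factor=2.0):
    """Projects whose estimated weight is within `factor`x of the target's."""
    w = weights[target]
    return sorted(
        p for p, v in weights.items()
        if p != target and w / factor <= v <= w * factor
    )

# Candidate material for the juror prompt:
# "Is hardhat approximately equal in value to any of these?"
candidates = similar_projects("hardhat", estimated_weights)
```

Because the filter concentrates juror attention on near-equal pairs, each answer carries more information than a random pairing, where the expected ratio is often too lopsided to discriminate carefully.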
Putting comparisons in context. The current system presents pairs in isolation: "Which has been more valuable to Ethereum's success?" But jurors lack the context to calibrate their responses. We could provide that context alongside the question itself.
Comparison Context
Make your comparison with full context of distributions and funding implications:
Which has been more valuable to Ethereum's success?
Alternative Scenarios
See how different multipliers would affect funding:
Enhanced Distribution Visualization
Distribution of all juror multipliers with scenario markers:
When a juror compares two projects, show them multiple reference points immediately. Display where their multiplier falls in the distribution: "A 50x multiplier would be in the 95th percentile of all comparisons." Show what the reconciled consensus suggests: "Aggregate data indicates Project A is 2.3x more valuable than B." Include a trained AI model's opinion as another data point: "Based on patterns in similar comparisons, the model suggests 3x."
Most crucially, translate their comparison into funding implications. A 50x multiplier might result in Project A receiving 25% of total funding while Project B gets 0.5%. Seeing these percentages helps jurors understand what their numerical choice actually means. They might realize their intent was different: perhaps they meant A was meaningfully better but not dominant, or perhaps they actually think B deserves more recognition.
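Both reference points (the percentile readout and the funding translation) are cheap to compute. The sketch below assumes a collected sample of juror multipliers (invented here) and a simple model where the pair splits its combined slice of the pool in the stated ratio:

```python
# Hypothetical set of all juror multipliers collected so far.
all_multipliers = [1.5, 2, 2, 3, 3, 5, 5, 8, 10, 20, 50, 100]

def percentile_of(m, sample):
    """Share of observed multipliers at or below m, as a percentage."""
    return 100 * sum(1 for x in sample if x <= m) / len(sample)

def pair_funding_split(multiplier, combined_share):
    """If A is `multiplier`x more valuable than B, split their combined
    slice of the funding pool in that same ratio."""
    share_a = combined_share * multiplier / (multiplier + 1)
    return share_a, combined_share - share_a

pct = percentile_of(50, all_multipliers)  # where a 50x judgment sits overall
a, b = pair_funding_split(50, 0.255)      # the pair holds 25.5% of the pool
# a = 0.25 and b = 0.005: Project A gets 25% of total funding while
# Project B gets 0.5%, which is what a 50x multiplier actually means
```

Showing these two numbers next to the input field turns an abstract ratio into a concrete allocation before the juror commits to it.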
The interface becomes part of the decision process, not a post-hoc review. Jurors see alternative scenarios (what would 2x, 5x, or 10x look like?) and can adjust in real time. The AI opinion serves as one reference point among many, useful for calibration but not authoritative. Some jurors will maintain extreme multipliers after seeing context, and that's valuable signal. Others will adjust once they understand the implications.
This approach acknowledges that asking "how many times more valuable" without context is like asking someone to estimate distance without reference points. By providing multiple anchors (statistical, algorithmic, and consequential), we help jurors express what they actually believe.
Small changes could yield large improvements
These aren't radical departures from Deep Funding's approach. The core insight remains sound: decompose complex evaluation tasks into simple comparisons. These modifications simply acknowledge what the data reveals about how humans actually make these comparisons.
The systematic inflation we observe isn't a failure of judgment. When we reconcile all comparisons, a coherent consensus emerges, suggesting jurors do share common beliefs about relative value. The challenge is extracting those beliefs through better measurement instruments. The difference between asking "how many times more valuable" versus "find a project of similar value" might seem minor, but it could reduce measurement error significantly.
I believe Deep Funding has shown that comparative judgment can work. The next step is refining the questions we ask to get answers that better reflect what jurors actually believe. With millions of dollars in potential funding at stake, even modest improvements in measurement accuracy translate to better support for the projects that truly drive the ecosystem forward. By asking better questions, we can achieve far better funding of public goods.
Footnotes
1. The "implied consensus" is constructed by reconciling the hundreds of juror comparisons collected to find the best-fit distribution of weights for each project. Like GPS triangulation, where multiple noisy signals reveal a true position, these consensus weights show what jurors collectively believe about relative project values, even when individual comparisons contradict one another. There is no single "correct" approach to forming weights from juror comparisons, and different approaches yield different weight distributions. See this great blog post showing various alternatives. ↩