The evaluation gap: buy-side firms are wiring MCP agents into research workflows without a shared accuracy benchmark
Expert networks and data vendors have plugged into Claude through MCP faster than the buy-side has agreed on how to grade what comes back.

The infrastructure for agentic investment research has shipped. The grading rubric has not.
In the last six months, Guidepoint, Third Bridge, AlphaSense, and Aiera have wired transcript libraries into Anthropic's Model Context Protocol, and Daloopa, Hebbia, and Rogo have plugged structured-data and workflow layers into the same agent surface. A Claude session at a mid-sized hedge fund can now pull an expert-network transcript, a normalised fundamental, and an earnings-call excerpt in a single chain. Procurement and compliance teams have moved quickly to negotiate the connector budgets and audit trails this implies. Evaluation has not. The buy-side is deploying multi-source research agents into production workflows without a shared, public benchmark for what a correct answer looks like, and the vacuum is being filled by a handful of vendor-authored tests that were never designed for the workflow MCP actually enables.
The connector layer shipped before the eval layer
MCP is roughly a year old as a public standard, and buy-side adoption has been unusually fast for a piece of finance infrastructure. The reason is straightforward. MCP gives Claude or a compatible agent a structured way to call out to an external source, retrieve content, and cite it back into the chain. For an expert network, that means an analyst inside Claude can ask for the three most relevant transcripts on automotive Tier 1 supplier exposure to a specific OEM and get them back with provenance. For a data vendor like Daloopa, it means a fundamentals query becomes a tool call rather than a separate UI session. The connector itself is not the interesting object: the interesting object is what the model does with five connectors at once.
This is where the evaluation gap opens. A traditional retrieval benchmark tests whether a system, given a question, returns the right document or the right span. FinanceBench, the most comprehensive public benchmark for financial question-answering, covers 10,000 question-answer pairs over SEC filings and grades whether a model can answer correctly from a known source. FinQA and ConvFinQA, the academic precedents, test numerical reasoning over single financial documents. Daloopa's November 2025 grounding benchmark, the most recent vendor entry, claims a 71-point advantage on retrieval-grounded tasks against its peer set. All of these are useful. None of them tests the workflow the buy-side is actually running.
The workflow the buy-side is actually running looks like this. An analyst at a long-short fund opens Claude, asks a structured question about a covered name, and the agent fans out across three or four MCP-connected sources: an expert-network transcript library for primary qualitative input, a normalised fundamental data feed for numbers, an earnings-call excerpt store for management commentary, and possibly a sell-side research connector for consensus context. The agent composes a response that draws on all four, with citations. The evaluation question is not whether any single retrieval was correct. It is whether the composed answer is correct, whether the citations are faithful to what the underlying sources said, whether the agent silently dropped a contradiction between sources, and whether the same question asked tomorrow returns a consistent answer. No public benchmark grades any of that.
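One of the checks named above, whether the agent silently dropped a contradiction between sources, can be sketched in a few lines. This is a minimal illustration, not a description of any vendor's or firm's method. It assumes that each connector's contribution has already been reduced to a numeric claim about a named metric, and that the agent emits a structured list of the disagreements it flagged. The names `SourceClaim`, `find_conflicts`, and `silently_dropped`, the 2% tolerance, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceClaim:
    source: str   # hypothetical connector label
    metric: str   # what the claim is about
    value: float  # the number the source reported


def find_conflicts(claims, rel_tol=0.02):
    """Map each metric to the sources that disagree on it by more than rel_tol."""
    by_metric = {}
    for claim in claims:
        by_metric.setdefault(claim.metric, []).append(claim)

    conflicts = {}
    for metric, group in by_metric.items():
        values = [c.value for c in group]
        lo, hi = min(values), max(values)
        scale = max(abs(lo), abs(hi))
        # Relative spread between the most extreme values for this metric.
        if scale > 0 and (hi - lo) / scale > rel_tol:
            conflicts[metric] = sorted(c.source for c in group)
    return conflicts


def silently_dropped(conflicts, flagged_metrics):
    """Conflicting metrics that the agent's answer did not flag."""
    return sorted(set(conflicts) - set(flagged_metrics))


# Placeholder figures for illustration only.
claims = [
    SourceClaim("fundamentals_feed", "revenue", 1200.0),
    SourceClaim("expert_transcript", "revenue", 1350.0),
    SourceClaim("fundamentals_feed", "gross_margin_pct", 41.0),
    SourceClaim("earnings_call", "gross_margin_pct", 41.0),
]
conflicts = find_conflicts(claims)
dropped = silently_dropped(conflicts, flagged_metrics=[])
```

With the placeholder data, the two revenue figures differ by more than the tolerance, so `revenue` is reported as a conflict, and because the agent flagged nothing it is also reported as silently dropped. A real grader would need a far richer notion of what counts as the same claim across a transcript and a data feed; the point of the sketch is only that the check operates on the composed answer, not on any single retrieval.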
What the existing benchmarks actually test, and what they miss
It is worth being precise about the existing benchmarks, because they are useful inputs to a category standard even if none of them is the standard.

FinanceBench was released by Patronus AI in 2023 and extended through 2025. It is open, it is large, and it grades on the dimension that matters most for single-source RAG: can the model retrieve the right passage from a filing and answer the question without hallucinating. The Patronus team's own framing acknowledges that even high-end models struggle on the full open-book version of the test. That is a real result, and it has shaped how buy-side firms think about base-model selection. It does not test multi-source composition, conflicting-source resolution, or the agent's willingness to say it does not know.
FinQA and ConvFinQA, both academic, test numerical reasoning over financial tables and conversational follow-up. They are closer to the structure of an analyst's actual work than a pure QA benchmark, but they are still single-document. The published leaderboards are saturated by frontier models, which means they are decreasingly useful for distinguishing production systems.
Daloopa's November 2025 grounding benchmark is the newest entrant and the most explicitly vendor-positioned. The 71-point claim is on a retrieval task against a defined peer set, and the methodology (what was tested, against what, with what prompts) sits inside Daloopa's framing. We are not flagging this as a problem with the work, which is methodologically reasonable as far as the public materials show. We are flagging it as a structural feature of the current market: the most up-to-date public benchmark in the category is authored by a vendor whose product is being tested. That is how categories look before a neutral convenor arrives, not after.
The gap, then, is not that the existing benchmarks are wrong. It is that they were built for a world where one model called one corpus. The MCP-connected world calls many corpora, and the failure modes are different. A model can retrieve correctly from each of four sources and still produce a wrong composite answer. A model can produce a right composite answer for the wrong reasons, by ignoring a source it should have weighted. A model can give two analysts at the same firm different answers to the same question on the same day, because the agent's tool-call order was non-deterministic. None of these failure modes show up on a single-source QA test.
The internal golden-set economy
What is filling the gap inside firms is the internal golden-set evaluation. A handful of analysts at a buy-side firm sit down, write 200 to 500 questions they consider representative of their actual research workflow, and hand-grade the correct answers. The set becomes the firm's internal benchmark. New model versions, new connectors, new prompt templates all get run against it before promotion to production.
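The promotion gate described above can be sketched as follows. This is a simplified illustration under stated assumptions: normalised exact match stands in for the hand grading an analyst would actually do, and the names `GoldenItem`, `pass_rate`, and `may_promote`, along with the placeholder questions and answers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    question: str
    expected: str  # hand-graded reference answer


def normalise(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not fail an item."""
    return " ".join(text.lower().split())


def pass_rate(golden_set, answer_fn):
    """Fraction of golden items where answer_fn's output matches the reference."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(
        normalise(answer_fn(item.question)) == normalise(item.expected)
        for item in golden_set
    )
    return passed / len(golden_set)


def may_promote(candidate_rate, baseline_rate, max_regression=0.0):
    """Gate: the candidate may not score worse than the baseline by more than max_regression."""
    return candidate_rate >= baseline_rate - max_regression


# Placeholder items; a real set would hold 200 to 500 analyst-written questions.
golden = [
    GoldenItem("q1", "answer one"),
    GoldenItem("q2", "answer two"),
    GoldenItem("q3", "answer three"),
    GoldenItem("q4", "answer four"),
]

# Stub standing in for a candidate model/connector/prompt configuration.
candidate_answers = {
    "q1": "Answer  one",
    "q2": "answer two",
    "q3": "answer three",
    "q4": "something else",
}
candidate_rate = pass_rate(golden, lambda q: candidate_answers[q])
```

In this stub the candidate matches three of four references, so `candidate_rate` is 0.75, and `may_promote(candidate_rate, baseline_rate=0.75)` would let it through while a baseline of 1.0 would not. The gate itself is trivial; the expensive part is writing and maintaining the golden set behind it.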
This is sensible practice, and it is what mature ML organisations have always done. It is also, from a category perspective, a coordination failure in slow motion. Every firm is paying the cost of building and maintaining a golden set. None of the sets are comparable to any other firm's. The vendors selling into these firms have no shared target to optimise against, which means they are partly optimising against whichever firm's golden set they have happened to see. And the consultants helping firms stand up these evaluation programs have become a category of their own: a Bloomberg report in October 2025 described ex-bankers charging Wall Street USD 25,000 per day to coach AI rollouts, with evaluation design among the workstreams.
The golden-set economy is rational at the firm level and wasteful at the category level. It also means the evaluation evidence that exists is mostly private. When a CIO asks whether the firm's research agents are getting better, the answer is anchored to a benchmark that exists only inside the firm. When the firm changes vendors, the new vendor's product is graded against a target the vendor has never seen. When regulators eventually ask how the firm validated the agent system it used to generate investment memos, the firm points to its golden set and the regulator has no external referent for whether the set was rigorous.
What a category-level standard would actually have to do
A serious multi-source agent benchmark for buy-side research would have to grade on at least five dimensions that the current public benchmarks do not cover well.
The first is faithful composition. Given a question that requires synthesising information from a transcript, a fundamental, and a filing, does the agent's answer reflect what each source actually said, and does it flag where the sources disagree. The second is source attribution. When the agent cites a source, does the cited passage actually support the claim, or has the model hallucinated a citation that looks plausible but does not appear in the underlying document. The third is calibration. When the agent is asked something it cannot answer from the available sources, does it say so, or does it confabulate. The fourth is consistency. Asked the same question across ten runs, with non-deterministic tool-call ordering, does the agent return materially the same answer. The fifth, and the hardest, is workflow realism. The benchmark questions have to look like the questions an analyst actually asks, which means they have to be co-designed with practitioners and refreshed as the workflow evolves.
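Two of these dimensions, source attribution and consistency, lend themselves to a rough mechanical first pass. The sketch below is illustrative only and makes loud assumptions: citations arrive as (source id, quoted passage) pairs, answers are short strings that can be compared after normalisation, and the function names are hypothetical. Note the limits: a verbatim match shows that a quoted passage exists in the source, not that it supports the claim, and string agreement is a crude proxy for "materially the same answer".

```python
from collections import Counter


def _norm(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def unsupported_citations(citations, sources):
    """Return citations whose quoted passage does not appear verbatim in the cited source.

    citations: list of (source_id, quoted_passage) pairs
    sources:   dict mapping source_id -> full source text
    """
    missing = []
    for source_id, quote in citations:
        body = _norm(sources.get(source_id, ""))
        needle = _norm(quote)
        # An empty quote or an unknown source counts as unsupported.
        if not needle or needle not in body:
            missing.append((source_id, quote))
    return missing


def consistency(answers):
    """Share of runs that agree with the most common answer (1.0 means every run agreed)."""
    if not answers:
        raise ValueError("no runs supplied")
    counts = Counter(_norm(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


# Placeholder source text and citations for illustration only.
sources = {"t1": "Management expects  margins to recover in H2."}
citations = [
    ("t1", "margins to recover in H2"),  # present in t1
    ("t1", "margins will double"),       # not in t1
    ("t2", "anything"),                  # source id unknown
]
bad = unsupported_citations(citations, sources)

# Ten runs of the same question: eight agree, two diverge.
runs = ["Up 4%"] * 8 + ["Up 6%"] * 2
agreement = consistency(runs)
```

Here `bad` holds the second and third citations and `agreement` is 0.8. A benchmark would layer judgement on top of checks like these, for instance a human or model grader deciding whether a passage that does exist actually supports the sentence it is attached to.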
None of these dimensions is novel as a research idea. Faithfulness, attribution, calibration, and consistency are well-studied in the general LLM evaluation literature. The work that has not been done is the buy-side-specific instantiation: a public corpus that combines transcripts, fundamentals, filings, and earnings excerpts in a way that mirrors the MCP-connected stack, with a question set that reflects real analyst workflows, with a grading methodology that is reproducible across vendors.
Three paths to a standard
There are three plausible convenors for a category-level evaluation standard. Each implies a different shape for what good will mean.
The first is a consortium of buy-side firms. This is the cleanest outcome on paper. A group of large asset managers and hedge funds agrees to pool a redacted golden set, fund a neutral third party to maintain it, and require their vendors to publish results against it. The historical analogue is the way the buy-side eventually coordinated on transaction-cost analysis methodology in the 2000s. The friction is that the firms with the most sophisticated golden sets consider them proprietary alpha, and the firms without sophisticated golden sets have nothing to contribute. A consortium standard tends to land at the lowest common denominator of what participants are willing to share.
The second is a regulator or a standards body. The CFA Institute has published AI ethics guidance for investment professionals, AIMA has issued principles for hedge funds, and the UK's Investment Association has done similar work for asset managers. None of these is a benchmark. A regulator-led evaluation standard would carry the most weight and arrive the most slowly, and would likely be framed as a validation requirement rather than a performance leaderboard. The European AI Act's high-risk system provisions are the closest existing scaffolding, and they are about process documentation rather than empirical accuracy testing.
The third is a dominant vendor or platform. Anthropic, as the convenor of MCP, has the natural surface area. A frontier model lab publishing a multi-source financial agent benchmark, with the MCP connectors of the major expert networks and data vendors as the source layer, would establish a de facto standard within months. The trade-off is that a vendor-defined standard inherits the vendor's commercial interests. The same critique that applies to Daloopa's grounding benchmark, that a vendor grading a category it competes in carries a structural conflict, applies more sharply to a platform grading a category it owns.
Our read is that the consortium path is the right outcome and the vendor path is the likely outcome. Standards rarely emerge from the actor with the cleanest incentives. They emerge from the actor with the most leverage and the most urgency. Right now that is a frontier lab, not a buy-side coalition.
What this means for the expert-network layer
The expert networks and data vendors that have shipped MCP connectors are now part of a stack whose accuracy is being judged by their customers using private benchmarks the vendors cannot see. That is an uncomfortable position. A vendor whose transcripts feed a Claude agent that produces a wrong answer will be blamed for the wrong answer, even if the agent's composition logic, not the transcript, was the failure point. The only defence is to be able to point to a public, neutral evaluation showing that the connector behaves correctly under the conditions the customer cares about.
This is the practical reason vendors will end up funding a shared benchmark even if they would prefer not to. The alternative is being graded on a benchmark they did not see, by a customer who has no way to distinguish the connector's contribution from the model's. The faster path is to converge on a public standard, contribute representative source material, and compete on measured quality rather than narrative.
It is also the practical reason the buy-side should want a public standard sooner rather than later. The internal-golden-set economy is expensive, and the cost is rising as the connector surface area grows. A firm running ten MCP connectors today is maintaining a golden set that has to cover the cross product of those connectors. Adding the eleventh connector means revisiting the entire set. The marginal cost of in-house evaluation is increasing faster than the marginal benefit, which is the condition under which firms historically agree to coordinate.
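The scaling argument can be made concrete with a small calculation. This is one reading of "cross product", offered as an assumption rather than the article's definition: it counts the distinct combinations of connectors an answer might draw on, capped at four sources per answer to match the three-or-four-source fan-out described earlier. The function name is hypothetical.

```python
from math import comb


def source_combinations(n_connectors, max_sources=4):
    """Count distinct connector combinations of size 2 up to max_sources."""
    return sum(comb(n_connectors, k) for k in range(2, max_sources + 1))


ten = source_combinations(10)     # 45 pairs + 120 triples + 210 quadruples = 375
eleven = source_combinations(11)  # 55 pairs + 165 triples + 330 quadruples = 550
added = eleven - ten              # 175 new combinations, each including the new connector
```

On this reading, the eleventh connector adds 175 multi-source combinations to the 375 the golden set already had to consider, an increase of roughly 47% from a single addition. A firm will not write questions for every combination, but the count gives a sense of why the maintenance burden grows faster than the connector list.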
What we will be watching
The leading indicators of a shift here are concrete. A consortium announcement from a group of named asset managers, with a third-party convenor and a published methodology, would be the strongest signal. A frontier lab publishing a multi-source financial agent benchmark with named expert-network and data-vendor connectors as the source layer would be the second strongest. A regulator (the SEC's investment management division, the FCA, or ESMA) signalling that AI-generated research outputs require empirical validation against an external benchmark would be the slowest but most consequential. A continued accumulation of vendor-authored benchmarks, each claiming category leadership against a self-selected peer set, would be the signal that the coordination problem is not getting solved and the category is heading for a fragmented evaluation landscape that mostly serves marketing.
The questions a research analyst should be putting to vendors and to internal AI-governance leads in the next two quarters are narrow and answerable. Which public benchmarks does the vendor publish results against, and on what cadence. What is the firm's internal multi-source evaluation methodology, and who designed it. How does the firm grade the composition step, not just the retrieval step. What is the firm's policy when two connected sources disagree, and how is that policy tested. Who, internally, owns the decision to promote a new connector or model version to production, and what evidence do they require. The firms that can answer these questions today are running ahead of the category. The firms that cannot are exposed.