Why AI transcription still hasn't replaced human review in finance

Sub-5% WER on conversational benchmarks is impressive. It's also irrelevant to the part of the workflow that matters.

INFLXD Research · 4 min read

AI transcription vendors have spent the last three years closing the accuracy gap on general speech. OpenAI's Whisper, released in September 2022, set the open-source baseline. AssemblyAI's Universal-2, released in late 2024, and Deepgram's Nova-3, released in early 2025, claim word error rates (WER) under 5% on conversational benchmarks. Google's Chirp models sit in similar territory.

None of this has changed the fact that every major financial institution still puts a human on the final pass of any transcript that touches the investment process.

What the benchmarks miss

WER on conversational audio is a fine number for podcasts and customer service calls. It does not survive contact with a sell-side analyst calling into a Q3 earnings call where the CFO is talking over the IR head about EBITDAX adjustments while the line cuts in and out from a Tokyo hotel.
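
To see why, recall how the metric is built: WER is the word-level edit distance between the machine output and a reference transcript, divided by the length of the reference. A minimal Python sketch, using an invented 21-word sentence rather than real call audio, shows a transcript that clears the sub-5% bar while getting the one term that matters wrong:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences for illustration, not taken from any real call.
reference = ("adjusted EBITDAX for the quarter came in at four hundred "
             "million dollars ahead of our prior guidance range for the year")
hypothesis = reference.replace("EBITDAX", "EBITDA")

score = wer(reference, hypothesis)
print(f"WER: {score:.1%}")  # prints: WER: 4.8%
```

One substitution in 21 words gives a WER of about 4.8%. The metric weights EBITDAX the same as "the"; the analyst's model does not.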

Four failure modes show up consistently:

Domain vocabulary. Whisper, Nova-3, and Universal-2 are trained on web-scale audio, where finance vocabulary is underrepresented. TSMC routinely comes back as "DSMC" or in lowercase. EBITDAX gets flattened to EBITDA. MNPI, NDR, GRR, and segment-specific acronyms misfire often enough that any analyst building a model from raw ASR output is doing free QA work.

Speaker diarization. Two-speaker audio is largely solved. Earnings calls and expert calls routinely involve five to ten participants, with the operator, multiple executives, and a queue of analysts. Diarization accuracy degrades sharply past three speakers, and the failure mode (attributing a guidance comment to the wrong executive) is exactly the kind of error that ends up in an IC memo and then in a complaint.

MNPI redaction. ASR transcribes everything it hears. It does not know that the expert just dropped a customer concentration figure that hasn't been disclosed publicly. Compliance teams at expert networks strip MNPI from transcripts before delivery. That is a human judgement layer, not a model.

Audit trails. An IC memo that cites a transcript needs a defensible chain. "Our ASR vendor's model produced this output" does not survive scrutiny when the number turns out to be wrong. Human-reviewed transcripts come with a reviewer ID, a timestamp, and a sign-off. That's the artefact compliance wants.
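
As an illustration of that artefact, here is a minimal Python sketch of a sign-off record. The class, field, and function names are hypothetical, and the content hash is an added assumption (one common way to tie a sign-off to the exact text reviewed), not a description of any vendor's system:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReviewRecord:
    """Hypothetical sign-off record attached to a reviewed transcript."""
    transcript_sha256: str  # assumption: fingerprint of the reviewed text
    reviewer_id: str
    reviewed_at: datetime
    signed_off: bool


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sign_off(transcript_text: str, reviewer_id: str) -> ReviewRecord:
    """Record who approved this exact text, and when (UTC)."""
    return ReviewRecord(_digest(transcript_text), reviewer_id,
                        datetime.now(timezone.utc), True)


def matches(record: ReviewRecord, transcript_text: str) -> bool:
    """True only if the text is exactly what the reviewer signed off on."""
    return record.signed_off and record.transcript_sha256 == _digest(transcript_text)


record = sign_off("CFO: Guidance is unchanged.", reviewer_id="reviewer-017")
print(matches(record, "CFO: Guidance is unchanged."))  # True
print(matches(record, "CFO: Guidance is changed."))    # False
```

The point of the sketch is the shape of the record: a named reviewer, a time, and a binding to specific text, so that a cited quote can be traced to a person rather than to a model version.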

The economics work

The cost gap is not subtle. Human review runs roughly USD 0.20 to 0.40 per minute of audio. Pure-AI transcription runs USD 0.01 to 0.02 per minute, roughly 10 to 40 times cheaper.
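
Taken at face value, those ranges put the price multiple between 10x and 40x. A rough Python sketch of the arithmetic follows. It works in integer cents to avoid rounding noise and assumes a 60-minute call purely for illustration:

```python
# Per-minute price ranges quoted above, in US cents (low, high).
HUMAN_REVIEW_CENTS = (20, 40)
PURE_AI_CENTS = (1, 2)


def cost_range(minutes: int, per_minute_cents: tuple) -> tuple:
    """Return the (low, high) cost in cents for a call of the given length."""
    low, high = per_minute_cents
    return (minutes * low, minutes * high)


# Narrowest and widest price multiple implied by the two ranges.
min_ratio = HUMAN_REVIEW_CENTS[0] / PURE_AI_CENTS[1]  # 20 / 2 = 10.0
max_ratio = HUMAN_REVIEW_CENTS[1] / PURE_AI_CENTS[0]  # 40 / 1 = 40.0

call_minutes = 60  # assumed call length, not a figure from the article
human = cost_range(call_minutes, HUMAN_REVIEW_CENTS)
ai = cost_range(call_minutes, PURE_AI_CENTS)

print(f"Human review: USD {human[0] / 100:.2f} to {human[1] / 100:.2f}")
print(f"Pure AI:      USD {ai[0] / 100:.2f} to {ai[1] / 100:.2f}")
print(f"Multiple:     {min_ratio:.0f}x to {max_ratio:.0f}x")
```

On those assumptions a one-hour call costs USD 12 to 24 with human review and USD 0.60 to 1.20 without it.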

Financial institutions are paying the premium anyway. The reason is straightforward: a single bad transcript that feeds a wrong number into a model is more expensive than a year of human review across the desk. The SEC has brought enforcement actions tied to inaccurate filings and disclosures often enough that compliance teams treat transcript accuracy as a hard requirement, not a quality preference.

The vendor signal

The clearest read on where the market actually is comes from what the leading vendors ship, not what they claim. Aiera and Quartr both run human-in-the-loop pipelines on their financial transcript products. These are companies whose entire pitch is speed and integration with analyst workflows. If pure ASR were good enough, they would be running it. They are not.

That is the durable signal. Vendors with skin in the game on transcript accuracy have converged on hybrid architectures. The pure-AI players are positioning around adjacent use cases (meeting notes, podcast indexing, customer support) where a hallucinated quarterly figure costs nothing.

What to watch: the next round of ASR releases will likely close more of the WER gap on general benchmarks. Watch whether any vendor publishes finance-specific WER alongside diarization accuracy on audio with five or more speakers. That would be a meaningful disclosure. So far, none has.

Disclosure: Drafted with AI assistance and reviewed by INFLXD editors against the newsroom's editorial rubric. Source links above are the primary factual basis for every claim.

Position B disclosure: INFLXD has commercial relationships with one or more of the companies named in this article. See our editorial disclosures.
