
Building a Voice Archive Search Tool with Deepgram’s STT, Cohere Embeddings, and Pinecone

Learn how to build a voice archive search app by chaining Deepgram (Nova-3) STT → Cohere embeddings → Pinecone vector search in this tutorial. The app (FastHTML + HTMX) lets you upload MP3/WAV or paste a URL, then auto-transcribes, segments with timestamps/speakers, (optionally) redacts personally identifiable information (PII), embeds, and upserts to Pinecone.
By Stephen Oladele

⏩ TL;DR

  • Build a voice archive search app by chaining Deepgram (Nova-3) STT → Cohere embeddings → Pinecone vector search.

  • The app (FastHTML + HTMX) lets you upload MP3/WAV or paste a URL, then auto-transcribes, segments with timestamps/speakers, (optionally) redacts PII, embeds, and upserts to Pinecone.

  • UI highlights: consistent cards, inline loaders, a “Limit to current file” toggle with Clear file scope, a similarity threshold slider, a result count select, and ▶ Play to jump to the exact segment.

  • Benchmarks to watch: WER for transcription, latency + Recall@K/Precision@K for search, target RTF ≤ 1, and sub-300 ms p95 query latency for interactive use.

  • The project repo’s evaluation helpers (nDCG, Recall@K, MRR) let you paste gold IDs and score your search results.

  • Swap/scale components independently (model choice, embedding dims, index params) as your accuracy and throughput needs grow.

Introduction

Voice archives are useless if you can’t search them. Support teams, compliance officers, and product analysts routinely sit on thousands of hours of call recordings and meeting audio—and still can’t instantly surface the 15 seconds that matter. 

Keyword search fails on synonyms, phrasing, accents, or disfluencies. What you need is semantic search over transcripts, powered by accurate, low‑latency speech‑to‑text (STT).

In this guide, you’ll build a voice archive search system that combines Deepgram’s speech‑to‑text API with vector embeddings and a vector database (e.g., Pinecone, Weaviate, FAISS) to deliver meaning‑aware retrieval over massive audio libraries. 

You’ll implement the full pipeline—transcription (with timestamps and diarization), smart chunking, embedding, indexing, and querying with metadata filters and reranking—and learn how to measure search quality beyond WER using IR metrics like nDCG/MRR.

You’ll learn:

  1. Why semantic search over STT transcripts outperforms keyword search for audio archives.

  2. How to design the end-to-end pipeline: Deepgram STT → chunking → embeddings → vector DB (Pinecone) → semantic query.

  3. Practical vector DB choices (Pinecone, Weaviate, FAISS, pgvector) and how to structure your index and metadata (speaker, timestamps, channel, call_id).

  4. Operational/scaling tips: cost controls, re-embedding cadence, privacy/PII handling, evaluation beyond WER.

Here’s what the final app will look like (grab the code from this repo):

Up next: Start by learning why accuracy, diarization, and timestamp fidelity at the STT layer directly determine your downstream retrieval quality.

Most organizations have terabytes of unstructured audio (support calls, sales demos, interviews, and internal meetings). Converting that audio into searchable, structured text is the first step; making it semantically retrievable is what actually unlocks value. 

You transcribe audio using an STT API like Deepgram’s Nova-3, then embed transcript chunks into a vector space so you can retrieve by meaning, not just exact words (“charge reversal” matches “refund request,” “I want my money back,” etc.).

But the STT layer determines your ceiling. For a reliable semantic search pipeline, your transcription must provide:

  • Accurate word timings → to return precise timestamped snippets

  • Speaker diarization → to filter by agent vs. customer, or attach accountability

  • Channel separation → for cleaner attribution in contact centers

  • Punctuation, casing, normalization → to improve embedding quality

  • Language detection & custom vocabulary → to handle multilingual/industry jargon robustly

That’s why Deepgram’s STT (real-time or batch) is a strong foundation: you get timestamps, diarization, filler words, and vocabulary control out of the box, producing transcripts that are embedding-ready for high-recall, context-aware retrieval.

How the Pipeline Works

  1. Deepgram STT produces word-level timestamps, speaker diarization, and custom vocabulary boosts with robust performance in noisy, real-world audio.

  2. Smart chunking (time- or silence-based) groups sentences into 15-30s blocks.

  3. An embedding model converts each chunk into a 1,024-dim vector. This tutorial uses Cohere’s embed-v4.0; OpenAI text-embedding-3-small or an open-source model works too.

  4. Vector DB (Pinecone) indexes embeddings + rich metadata (speaker_id, start_ms, call_id).

  5. Semantic query → top-K matches + rerank → show snippet with timestamp and speaker.

Result: Fast retrieval that respects synonyms, context, and intent.

What Industries Does Fast Voice Archive Semantic Search Matter To?

1️⃣ Customer Support

Search: “frustrated customer asking for a refund after a failed charge” → Retrieve all customer-spoken segments across 50k calls, regardless of phrasing. Metadata filters: speaker=customer, product=ProPlan, sentiment<=-0.6.

2️⃣ Meetings and Knowledge Ops

Search: “decision to delay the launch date” → Return the timestamped snippet + next 30 seconds of context, plus who said it. Helps engineering and PMs skip 90-minute recordings.

3️⃣ Compliance and Risk

Search: “mention of PCI data”, “promises beyond policy” → Hybrid search (dense + keyword) ensures you catch exact legal terms and paraphrases. Combine with PII redaction before indexing to stay compliant.

4️⃣ Recruiting/HR Interviews

Search: “examples of leading cross-functional teams” → Return candidate responses semantically aligned to leadership criteria. (Be explicit about fairness, reproducibility, and the legal context if used for decisions.)

Up next? Build the voice archive semantic search tool! You’ll implement the full pipeline:

Nova-3 STT → diarization/timestamps → chunking → embeddings (Cohere) → vector DB (Pinecone) → hybrid search + reranking → timestamped snippets.

Step-by-Step Guide to Building a Voice Archive Search Tool with Semantic Search STT (Deepgram + Cohere + Pinecone)

In this section, you’ll build an end-to-end voice archive search pipeline that:

  1. Transcribes audio to text (with speakers & timestamps)

  2. Splits transcripts into semantically meaningful chunks

  3. Embeds chunks into vectors

  4. Indexes vectors

  5. Queries the index with a natural-language, semantic search

We’ll combine the following layers: Deepgram Nova-3 for STT (with diarization and timestamps), Cohere embed-v4.0 for 1,024-dim embeddings, Pinecone for the vector index, and FastHTML + HTMX for the web UI.

👉 The finished repo lives here →

We’ll go beyond a toy demo by:

  • Segmenting transcripts using timestamps and speakers (not '. ' splits)

  • Indexing rich metadata (speaker, start/end time, file/session_id)

  • Showing how to scale and evaluate your system

🛠 Prerequisites

We’ll use Python, Deepgram’s SDK, Cohere embeddings, and Pinecone—all wired up in a minimal FastHTML web app. Before getting started, ensure you have:

  • Python 3.10+ installed.

  • API keys from:

  • Deepgram (STT; sign up to get $200 in free credits, which is more than enough for this tutorial, and grab an API key from the developer console)

  • Cohere (embeddings; the trial keys should be enough for this tutorial)

  • Pinecone (vector DB; serverless index host)

By the way, learn how to create a serverless index in this documentation. Here’s the dashboard where you’ll find your serverless index host:

Clone the repo and follow along:

Step 1: Transcribe Audio with Deepgram

The app uses voice_archive.py::transcribe_file_structured() to call Deepgram Nova‑3 with:

  • smart_format, punctuate, utterances, diarize (speaker labels)

  • Word‑level timings so you can segment precisely and jump playback.

  • The helper then formats the response into a speaker-labeled transcript.

Key idea (simplified view of what’s in the repo):
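Here’s a minimal sketch of that call using the Deepgram Python SDK (v3-style method names; older SDK releases expose the same endpoint as `listen.prerecorded` rather than `listen.rest`, and the repo’s helper adds more structure and error handling):

```python
import os
from deepgram import DeepgramClient, PrerecordedOptions

def transcribe_file_structured(path: str):
    """Transcribe a local audio file and return speaker-labeled segments."""
    dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

    options = PrerecordedOptions(
        model="nova-3",
        smart_format=True,
        punctuate=True,
        utterances=True,   # sentence-like units with start/end times
        diarize=True,      # speaker labels on every word/utterance
    )

    with open(path, "rb") as f:
        payload = {"buffer": f.read()}

    response = dg.listen.rest.v("1").transcribe_file(payload, options)

    # Each utterance carries start/end (in seconds), a speaker id, and text.
    segments = []
    for utt in response.results.utterances:
        segments.append({
            "text": utt.transcript,
            "speaker": f"Speaker {utt.speaker}",
            "start": utt.start,
            "end": utt.end,
        })
    return segments
```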

What this does: it sends the file to Deepgram’s prerecorded endpoint with smart formatting, punctuation, diarization, and utterances enabled, then folds the returned utterances (each with start/end times and a speaker label) into segments that are ready for chunking and embedding.

💡 Tip: Deepgram’s prerecorded endpoint accepts files up to 2 GB, with a 10-minute processing cap for Nova/Base models (see docs). Split longer recordings to avoid timeouts and to parallelise transcription.
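One way to do the split (an illustrative FFmpeg invocation; the repo doesn’t prescribe a specific tool):

```bash
# Cut a long recording into 10-minute chunks without re-encoding
ffmpeg -i long_call.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
```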

Step 2: (Optional) PII Redaction Before Indexing

A tiny regex pass replaces obvious emails, phones, SSNs, card numbers, IPv4 with tokens like [EMAIL], [PHONE], etc., before embeddings are computed and stored.
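A sketch of that pass (patterns trimmed for readability; the repo’s version covers a few more formats):

```python
import re

# Order matters: match the most specific patterns before the generic phone pattern.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    """Replace obvious PII with placeholder tokens before embedding and storage."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```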

 Toggle with REDACT_PII=true|false in .env.

(💡 Tip: For production systems, consider Deepgram’s built-in redaction feature to strip PII before vectors are stored.)

Step 3: Chunking the Transcript

For high-precision results, split the transcript into logical segments. In our demo we use simple sentence splits, but you can refine with silence detection or fixed-length windows.
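One refinement worth sketching: merge consecutive utterances from the same speaker until a block reaches roughly 15–30 seconds. This is illustrative only; the demo keeps simple sentence splits:

```python
def chunk_segments(segments, max_window_s: float = 30.0):
    """Merge consecutive same-speaker segments into ~15-30s blocks."""
    chunks, current = [], None
    for seg in segments:
        same_speaker = current and current["speaker"] == seg["speaker"]
        within_window = current and (seg["end"] - current["start"]) <= max_window_s
        if current and same_speaker and within_window:
            # Extend the open chunk with this utterance
            current["text"] += " " + seg["text"]
            current["end"] = seg["end"]
        else:
            # Close the previous chunk and start a new one (speaker change or window full)
            if current:
                chunks.append(current)
            current = dict(seg)  # copy so we don't mutate the input
    if current:
        chunks.append(current)
    return chunks
```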

💡 Tip: For long meetings, consider paragraph-level or topic-based chunking (e.g., every 30 seconds or at speaker changes).

Step 4: Generate Embeddings (Cohere)

Convert each text segment into a 1,024-dim float vector using Cohere’s embed-v4.0 model (see voice_archive.py::generate_embeddings()):
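A sketch of the embedding call with Cohere’s v2 Python client. The environment-variable name and the output_dimension argument are assumptions here; the key points are input_type="search_document" for indexing and an output size that matches your Pinecone index:

```python
import os
import cohere

co = cohere.ClientV2(os.environ["COHERE_API_KEY"])

def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed transcript chunks for indexing (use input_type='search_query' at query time)."""
    resp = co.embed(
        model="embed-v4.0",
        texts=texts,
        input_type="search_document",
        embedding_types=["float"],
        output_dimension=1024,  # assumption: match the 1,024-dim index used in this tutorial
    )
    return resp.embeddings.float_
```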

Why segment size matters:

  • Sentence-level: precise quotes but larger index.

  • Paragraph-level: broader context, fewer vectors.

  • Hybrid: index sentences, aggregate for queries.

Step 5: Index in Pinecone

Upsert each vector into your Pinecone namespace along with its rich metadata (voice_archive.py::upsert_segments()):

  • text, speaker, start, end, file, session
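A sketch of the upsert with the Pinecone Python SDK (v3+ serverless style; the environment-variable names are illustrative):

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host=os.environ["PINECONE_INDEX_HOST"])  # your serverless index host

def upsert_segments(chunks, vectors, file_id, session_id, namespace="voice-archive"):
    """Write one vector per chunk, with the metadata we filter and display on later."""
    records = []
    for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
        records.append({
            "id": f"{file_id}:{i}",          # stable ID so re-uploads overwrite prior vectors
            "values": vec,
            "metadata": {
                "text": chunk["text"],
                "speaker": chunk["speaker"],
                "start": chunk["start"],
                "end": chunk["end"],
                "file": file_id,
                "session": session_id,
            },
        })
    index.upsert(vectors=records, namespace=namespace)
```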

💡 Best practice: Include additional metadata (e.g., {"call_id":..., "speaker":...}) so you can filter later. For long‑term cleanliness, consider stable IDs (e.g., file_hash:i) so re‑uploads overwrite prior vectors.

Step 6: Query the Index

Transform the user’s natural-language query into a vector and fetch the top K matches:
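A sketch of the query path, reusing the co and index clients from the previous steps; the optional metadata filter is what powers the “Limit to current file” toggle:

```python
def search_archive(query: str, top_k: int = 5, file_id: str | None = None):
    """Embed the query, then fetch the nearest transcript chunks from Pinecone."""
    resp = co.embed(
        model="embed-v4.0",
        texts=[query],
        input_type="search_query",   # queries and documents use different input types
        embedding_types=["float"],
        output_dimension=1024,       # assumption: must match the stored vectors
    )
    qvec = resp.embeddings.float_[0]

    results = index.query(
        vector=qvec,
        top_k=top_k,
        include_metadata=True,
        namespace="voice-archive",
        filter={"file": {"$eq": file_id}} if file_id else None,
    )
    for match in results.matches:
        m = match.metadata
        print(f"{match.score:.3f}  [{m['start']:.1f}-{m['end']:.1f}s]  {m['speaker']}: {m['text']}")
    return results
```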

💡Quick UX tip: Display similarity scores, transcript snippets, and link back to the original timestamped audio.

Step 7: Hook into a Web UI (FastHTML + HTMX + Tailwind CSS)

Your app.py wires this all together and you can start the server locally: python app.py
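If you’re curious how little glue that takes, here is a stripped-down FastHTML/HTMX route in the same spirit as app.py (illustrative only; the real app adds the upload form, audio player, and evaluation card):

```python
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # A search box that posts to /search and swaps results into #results via HTMX
    return Titled("Voice Archive Search",
        Form(
            Input(name="q", placeholder="frustrated customer refund request"),
            Button("Search Archives"),
            hx_post="/search", hx_target="#results",
        ),
        Div(id="results"),
    )

@rt("/search")
def post(q: str):
    results = search_archive(q, top_k=5)  # helper from Step 6
    return Div(*[
        P(f"{m.score:.3f} | {m.metadata['text']}") for m in results.matches
    ])

serve()  # FastHTML listens on http://localhost:5001 by default
```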

Browse to http://localhost:5001. You should see an interface similar to this:

1. Upload a .wav/.mp3 or Process URL (direct file link). When processing finishes, you’ll see a “Processing Complete” card and a player. (The card is collapsible to save space; the player is kept persistent.)

2. Search:

  • Enter a natural query (e.g., “frustrated customer refund request”).

  • Choose Results (top‑K) and a Similarity threshold.

  • Click Search Archives → a thin progress bar appears.

  • Results render directly under the search box, each with:

  • Similarity score

  • [start–end] timestamps + speaker chip

  • Transcript snippet

  • ▶ Play (jumps the player to start)

3. Evaluate (optional):

  • Expand “📏 Evaluation (optional)” in the search form.

  • Tick “Show result IDs” and run a search (cards show id: ...).

  • Copy relevant IDs into the textarea (one per line or comma‑separated).

  • Search again → a 📊 Evaluation card shows nDCG@k, Recall@k, MRR.

Here is a video demo of the FastHTML application that wraps the backend logic and displays semantic search results with adjustable parameters:

Here’s what the backend log looks like from the terminal:

📎 Full Minimal Pipeline Script

If you’d rather skip the web UI, run this from your command line:
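Below is a compact sketch stitched together from the helper sketches above, assuming they live in the same module; the repo’s script differs in details like CLI arguments and error handling:

```python
"""Minimal end-to-end pipeline: transcribe -> redact -> chunk -> embed -> index -> query."""
import sys

AUDIO_PATH = sys.argv[1] if len(sys.argv) > 1 else "sample_call.wav"

# 1. Transcribe with Deepgram (speakers + timestamps)
segments = transcribe_file_structured(AUDIO_PATH)

# 2. Optional PII scrub before anything is embedded or stored
for seg in segments:
    seg["text"] = redact_pii(seg["text"])

# 3. Chunk into ~15-30s speaker-consistent blocks
chunks = chunk_segments(segments)
print(f"Transcribed {len(segments)} utterances -> {len(chunks)} chunks")

# 4. Embed and 5. index
vectors = generate_embeddings([c["text"] for c in chunks])
upsert_segments(chunks, vectors, file_id=AUDIO_PATH, session_id="demo-session")

# 6. Query
search_archive("frustrated customer asking for a refund", top_k=5)
```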

The script prints the segment/chunk counts as it indexes, then the top-K matches (score, timestamps, speaker, snippet) for the sample query. A few directions to take it further:

  1. Plug the search API into Slack/MS Teams for instant call recall

  2. Add RAG summaries: feed top-k snippets into GPT-4o to answer free-text questions

  3. Swap Cohere for open-source Jina Embeddings v2 if you need on-prem

  4. Try Weaviate or pgvector to avoid external SaaS if data residency is critical.

  5. Consider streaming real-time transcription for live call-coaching dashboards.

🤖Creator’s Note: Got a working app? Now it’s time to automate how it takes action with those search results.
Plug in a Voice Agent API to get all the benefits of a voice-native agentic system out-of-the-box, so your app can automatically book meetings, schedule tasks, make orders, and perform downstream tasks.

In the next section, learn how you can scale this system!

Scale and Parallelize the Voice Archive Search System

For large archives, batch-process using Python’s concurrent.futures (also in the repo):

The repo’s batch_transcribe() uses a ThreadPoolExecutor to parallelise large backfills:
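Here’s the shape of it (a sketch; the repo’s actual signature and logging differ). Deepgram calls are I/O-bound, so a thread pool works well:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_transcribe(paths, max_workers: int = 5):
    """Transcribe many files concurrently and return {path: segments}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transcribe_file_structured, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                print(f"Failed on {path}: {exc}")
    return results
```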

Test it out:
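For example, pointing it at a folder of recordings (assuming the sketch above):

```python
from glob import glob

archive = batch_transcribe(glob("audio/*.wav"), max_workers=5)
print(f"Transcribed {len(archive)} files")
```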

Scaling and Production Notes for the Voice Archive Search App

Here are ten scaling and production notes you should be aware of:

  1. Segmentation: Aim for ~15–30s per segment (or 100–300 tokens).

  2. Throughput: ~25 × real-time on a modest VM (Nova-3, 5 concurrent). Use async + thread pools (already in repo) or background workers for large archives.

  3. Concurrency: Deepgram allows 100 concurrent Nova requests per project, so tune max_workers accordingly.

  4. Hybrid search: Add BM25/keyword filters for compliance/legal terms.

  5. Re-embedding cadence: Re-embed when you switch embedding models or after major product vocabulary changes.

  6. Privacy: The regex redactor is intentionally small; use a dedicated PII service for regulated workloads.

  7. Quality and Ops: Track WER samples, retrieval metrics (nDCG/Recall/MRR), latency (Pinecone < 100 ms P95), and cost per hour indexed/1k queries.

  8. Cost control: Store original WAVs in cold storage; keep transcripts + vectors hot.

  9. Vector DB: Pinecone serverless scales storage and compute automatically, so watch your plan’s per-project request/QPS limits; if you move to pod-based indexes, upgrade the pod type (e.g., starter → s1) as you grow.

  10. Embeddings: Batch up to 96 sentences per Cohere request; reuse HTTP/2 connections.

Troubleshooting (CLI + UI)

Here are some errors you might encounter, the likely causes, and recommended fixes:

Great! With this tutorial, you now have a reproducible, scalable voice archive search tool. 

Whether you’re surfacing customer complaints, unpacking meeting decisions, or auditing compliance calls, this stack turns hours of audio into an instantly queryable knowledge base.

Benchmark the Voice Archive Search System

A voice‑archive search system has two layers to measure:

  1. Speech‑to‑Text (STT) quality and speed

  2. Semantic retrieval quality and speed

You’ll get the best results by tracking both because errors in STT can cascade into retrieval.

What Metrics Do You Measure?

While every dataset is different, teams typically anchor on a small set of well‑understood metrics and target bands:

For Speech‑to‑Text (STT) (Deepgram)

  • Word Error Rate (WER): Primary quality metric, computed as (substitutions + deletions + insertions) / reference words and usually reported as a percentage. Lower is better; acceptable ranges depend on audio quality and domain.

  • Real‑Time Factor (RTF): Speed metric; processing_time/audio_duration. RTF ≤ 1 means processing keeps up with audio duration.

  • Diarization quality (optional): Track DER (diarization error rate) or at least speaker‑change accuracy if you rely on speaker turns.

  • Formatting quality: Casing, punctuation, numbers; simple pass/fail checks or spot audits often suffice.

  • Operational signals: Error rate, timeouts, and p50/p95 STT latency.

📝 Note: For WER, here are some ranges to help you:

  • Clean/near‑field speech: ~3–8% WER is common for strong models.

  • Conversational/mixed‑quality calls: ~8–15% WER is a realistic target.

  • Far‑field/noisy: 12–25% WER is typical unless you add domain adaptation.

For Semantic Retrieval (Cohere + Pinecone)

  • Recall@K and nDCG@K: Ability to surface relevant chunks; teams often aim for Recall@10 ≥ 0.9 on their internal test sets.

  • MRR: Rewards getting a relevant result very high in the list (already exposed in the repo UI).

  • Latency (p95/p99): User‑perceived speed; for interactive apps, keep p95 ≤ 100–200 ms per query (excluding first‑time cold starts).

  • Throughput (QPS) and index size: Capacity planning signals as your archive grows.

  • Filter effectiveness (when scoping to file/session): fraction of results retained when filters are on.

💡Tips: 

  • Treat p95 latency as your UX guardrail (what most users feel), not just the average.

  • Don’t compare your numbers to generic “leaderboards” without matching audio/domain. Instead, maintain a frozen, representative test set of your own data and track trends against that set release‑to‑release.

How to Measure in this Repo:

1. In the Search panel, open 📏 Evaluation (optional). Paste ground‑truth vector IDs (one per line). The UI will compute nDCG@k, Recall@k, and MRR using the functions in evaluate.py (sketched after this list).


2. To scope retrieval to the most recent file/session (which also improves evaluation accuracy), use the Limit to current file toggle.

3. To collect basic timing, wrap the /search handler with timestamps and log p50/p95 over time.
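Here’s roughly what those helpers compute, with binary relevance over the gold IDs you paste in (evaluate.py in the repo is the source of truth; this is a minimal sketch):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is found)."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, rid in enumerate(ranked_ids[:k]) if rid in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0
```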

Sensible Targets (Use as Starting Points, Then Tune)

These are pragmatic goals for interactive tools; adjust them to your domain and data.

Tooling You Can Use

  • Deepgram usage/latency: view in logs/console; export to your observability stack.

  • Vector store metrics (Pinecone): inspect query latency percentiles and index stats; combine with your app logs to compute Recall@K/nDCG@K using known‑good IDs.

Iteration Plan

Step 1: Own your test sets. Create two small but stable sets:

  • Quality (labeled snippets for recall/ndcg/mrr)

  • Speed (queries for latency/QPS). Refresh only when your data distribution changes.

Step 2: Automate evaluations. Use your existing helpers (recall_at_k, ndcg_at_k, mrr) in a nightly job and store results per commit/model/index config.

Step 3: Tune in measured loops.

  • Reduce WER: domain vocab/phrases, channel‑specific models, audio pre‑processing (VAD/denoise), and prompt‑style hints where supported.

  • Boost recall without blowing up latency: tune top_k, try hybrid (keyword + vector), adjust ANN parameters (e.g., HNSW efSearch), or pre‑filter by metadata (speaker/file/session) to shrink the candidate set.

  • Lower latency: warm caches, pin hot namespaces, consider batch queries for multi‑panel UIs, and right‑size index resources.

Step 4: Collect user signals. Lightweight thumbs‑up/down on results or “was this helpful?” improves future evaluation sets and can drive re‑ranking.

Step 5: Guardrails. Add simple alerts when p95 latency or Recall@10 regresses by >X% from the last release.

📝 Practical note for the repo: the UI already supports basic evaluation inputs and shows Recall@K, MRR, and nDCG. Keeping those visible in PRs (screenshots + numbers) makes performance changes reviewable, not just “looks good.”

Conclusion: How to Build a Voice Archive Semantic Search App With Deepgram, Cohere, and Pinecone

Pairing Deepgram Nova-3 transcription with vector embeddings and Pinecone turns raw audio into a searchable knowledge base. 

Here’s the workflow you built in this guide:

  1. Accurate transcripts: word-level timestamps, speaker turns, and noise-robust models give you clean text for downstream processing.

  2. Meaning-based retrieval: embeddings let users ask “show me churn risk” instead of guessing the exact words the caller used.

  3. Scalable architecture: transcripts are chunked, embedded, and written to a serverless vector index, so growth is just “add pods/replicas.”

  4. Auditable workflows: every match links back to a timestamped audio snippet, satisfying compliance and QA teams.

The UI lets users upload or link files, scope searches to the current file, and evaluate retrieval quality (nDCG/Recall/MRR) against known‑good IDs. 

The net result: less manual scrubbing of recordings, faster insight turnaround, and a foundation you can extend to summarization, RAG, or real-time alerting.

Ready to build your own voice archive search app? Try Deepgram in the Playground or sign up for a free account and get $200 in credits to start using our Speech-to-Text API.

FAQs

Why Use Semantic Search Instead of Keyword Search for Audio?

Speech is messy—synonyms, fillers, and paraphrases abound. Keyword search only finds exact tokens, so “issue with my bill” may miss “problem on my invoice.” Semantic search encodes meaning in vectors, recovering those intent-level matches and typically lifting recall and user satisfaction versus pure keyword filters.

Which Vector Databases Pair Well with Deepgram Transcripts?

Any ANN-capable vector DB that accepts 1,024-d float vectors works. Community favourites include Pinecone (managed, serverless), Weaviate (open source, with hybrid search built in), and pgvector (keeps vectors inside Postgres).

All three expose metadata filters (e.g. file == …) that you can drive from the “Limit to current file” toggle in the repo.

What Are Tips for Higher Accuracy on Noisy Audio?

  1. Model choice: Use noise-robust models like Nova-3; it’s trained on far-field/crosstalk data.

  2. Custom vocabulary: Seed domain names (“LAN port”, “tier-one”) to cut OOV errors.

  3. Pre-clean the file with FFmpeg: normalize volume, denoise, remove long silences, or split >30-min files into smaller chunks before upload.

  4. Benchmark and iterate: Keep a noisy-audio test set; track WER after each tweak.

Does Semantic Search Trade Speed For Accuracy?

Not in practice. With HNSW or IVF-HNSW indexes:

  • Single-vector queries on <10 M items often finish in 10-50 ms.

  • Raising search parameters (e.g. efSearch) bumps recall with a modest latency hit; you can tune for p95 < 200 ms while staying above 90% recall@10.

That’s fast enough for the real-time UX your app surfaces.

Clear File Scope?

If you need to jump back to a global archive search, hit the “Clear file scope” button next to the Limit to current file checkbox—your search form will revert to querying the full namespace.

