
Building a Voice Archive Search Tool with Deepgram’s STT, Cohere Embeddings, and Pinecone

Learn how to build a voice archive search app by chaining Deepgram (Nova-3) STT → Cohere embeddings → Pinecone vector search in this tutorial. The app (FastHTML + HTMX) lets you upload MP3/WAV or paste a URL, then auto-transcribes, segments with timestamps/speakers, (optionally) redacts personally identifiable information (PII), embeds, and upserts to Pinecone.
By Stephen Oladele

⏩ TL;DR

  • Build a voice archive search app by chaining Deepgram (Nova-3) STT → Cohere embeddings → Pinecone vector search.

  • The app (FastHTML + HTMX) lets you upload MP3/WAV or paste a URL, then auto-transcribes, segments with timestamps/speakers, (optionally) redacts PII, embeds, and upserts to Pinecone.

  • UI highlights: consistent cards, inline loaders, a “Limit to current file” toggle with Clear file scope, a similarity threshold slider, a result count select, and ▶ Play to jump to the exact segment.

  • Benchmarks to watch: WER for transcription, latency + Recall@K/Precision@K for search, target RTF ≤ 1, and sub-300 ms p95 query latency for interactive use.

  • The project repo’s evaluation helpers (nDCG, Recall@K, MRR) let you paste gold IDs and score your search results.

  • Swap/scale components independently (model choice, embedding dims, index params) as your accuracy and throughput needs grow.

Introduction

Voice archives are useless if you can’t search them. Support teams, compliance officers, and product analysts routinely sit on thousands of hours of call recordings and meeting audio—and still can’t instantly surface the 15 seconds that matter. 

Keyword search fails on synonyms, phrasing, accents, or disfluencies. What you need is semantic search over transcripts, powered by accurate, low‑latency speech‑to‑text (STT).

In this guide, you’ll build a voice archive search system that combines Deepgram’s speech‑to‑text API with vector embeddings and a vector database (e.g., Pinecone, Weaviate, FAISS) to deliver meaning‑aware retrieval over massive audio libraries. 

You’ll implement the full pipeline—transcription (with timestamps and diarization), smart chunking, embedding, indexing, and querying with metadata filters and reranking—and learn how to measure search quality beyond WER using IR metrics like nDCG/MRR.

You’ll learn:

  1. Why semantic search over STT transcripts outperforms keyword search for audio archives.

  2. How to design the end-to-end pipeline: Deepgram STT → chunking → embeddings → vector DB (Pinecone) → semantic query.

  3. Practical vector DB choices (Pinecone, Weaviate, FAISS, pgvector) and how to structure your index and metadata (speaker, timestamps, channel, call_id).

  4. Operational/scaling tips: cost controls, re-embedding cadence, privacy/PII handling, evaluation beyond WER.

Here’s what the final app will look like (grab the code from this repo):

Up next: Start by learning why accuracy, diarization, and timestamp fidelity at the STT layer directly determine your downstream retrieval quality.

Most organizations have terabytes of unstructured audio (support calls, sales demos, interviews, and internal meetings). Converting that audio into searchable, structured text is the first step; making it semantically retrievable is what actually unlocks value. 

You transcribe audio using an STT API like Deepgram’s Nova-3, then embed transcript chunks into a vector space so you can retrieve by meaning, not just exact words (“charge reversal” matches “refund request,” “I want my money back,” etc.).

But the STT layer determines your ceiling. For a reliable semantic search pipeline, your transcription must provide:

  • Accurate word timings → to return precise timestamped snippets

  • Speaker diarization → to filter by agent vs. customer, or attach accountability

  • Channel separation → for cleaner attribution in contact centers

  • Punctuation, casing, normalization → to improve embedding quality

  • Language detection & custom vocabulary → to handle multilingual/industry jargon robustly

That’s why Deepgram’s STT (real-time or batch) is a strong foundation: you get timestamps, diarization, filler words, and vocabulary control out of the box, producing transcripts that are embedding-ready for high-recall, context-aware retrieval.

How the Pipeline Works

  1. Deepgram STT produces word-level timestamps, speaker diarization, and custom vocabulary boosts with robust performance in noisy, real-world audio.

  2. Smart chunking (time- or silence-based) groups sentences into 15-30s blocks.

  3. An embedding model converts each chunk into a 1,024-dim vector. This tutorial uses Cohere’s embed-v4.0; OpenAI text-embedding-3-small or an open-source model works too.

  4. Vector DB (Pinecone) indexes embeddings + rich metadata (speaker_id, start_ms, call_id).

  5. Semantic query → top-K matches + rerank → show snippet with timestamp and speaker.

Result: Fast retrieval that respects synonyms, context, and intent.

What Industries Does Fast Voice Archive Semantic Search Matter To?

1️⃣ Customer Support

Search: “frustrated customer asking for a refund after a failed charge” → Retrieve all customer-spoken segments across 50k calls, regardless of phrasing. Metadata filters: speaker=customer, product=ProPlan, sentiment<=-0.6.

2️⃣ Meetings and Knowledge Ops

Search: “decision to delay the launch date” → Return the timestamped snippet + next 30 seconds of context, plus who said it. Helps engineering and PMs skip 90-minute recordings.

3️⃣ Compliance and Risk

Search: “mention of PCI data”, “promises beyond policy” → Hybrid search (dense + keyword) ensures you catch exact legal terms and paraphrases. Combine with PII redaction before indexing to stay compliant.

4️⃣ Recruiting/HR Interviews

Search: “examples of leading cross-functional teams” → Return candidate responses semantically aligned to leadership criteria. (Be explicit about fairness, reproducibility, and the legal context if used for decisions.)

Up next? Build the voice archive semantic search tool! You’ll implement the full pipeline:

Nova-3 STT → diarization/timestamps → chunking → embeddings (Cohere) → vector DB (Pinecone) → hybrid search + reranking → timestamped snippets.

Step-by-Step Guide to Building a Voice Archive Search Tool with Semantic Search STT (Deepgram + Cohere + Pinecone)

In this section, you’ll build an end-to-end voice archive search pipeline that:

  1. Transcribes audio to text (with speakers & timestamps)

  2. Splits transcripts into semantically meaningful chunks

  3. Embeds chunks into vectors

  4. Indexes vectors

  5. Queries the index with a natural-language, semantic search

We’ll combine the following layers: Deepgram Nova-3 for STT (with diarization and timestamps), Cohere embed-v4.0 for 1,024-dim embeddings, Pinecone for the vector index, and FastHTML + HTMX for the web UI.

👉 The finished repo lives here →

We’ll go beyond a toy demo by:

  • Segmenting transcripts using timestamps and speakers (not '. ' splits)

  • Indexing rich metadata (speaker, start/end time, file/session_id)

  • Showing how to scale and evaluate your system

🛠 Prerequisites

We’ll use Python, Deepgram’s SDK, Cohere embeddings, and Pinecone—all wired up in a minimal FastHTML web app. Before getting started, ensure you have:

  • Python 3.10+ installed.

  • API keys from:

  • Deepgram (STT; sign up to get $200 in free credits, which is more than enough for this tutorial, and grab an API key from the developer console)

  • Cohere (embeddings; the trial keys should be enough for this tutorial)

  • Pinecone (vector DB; serverless index host)

By the way, learn how to create a serverless index in this documentation. Here’s the dashboard where you’ll find your serverless index host:

Clone the repo and follow along:

Step 1: Transcribe Audio with Deepgram

The app uses voice_archive.py::transcribe_file_structured() to call Deepgram Nova‑3 with:

  • smart_format, punctuate, utterances, diarize (speaker labels)

  • Word‑level timings so you can segment precisely and jump playback.

  • The helper then formats the response into a speaker-labeled transcript.

Key idea (simplified view of what’s in the repo):
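Here’s a minimal sketch of that call using the Deepgram Python SDK (v3-style method names; older SDK releases expose the same endpoint as `listen.prerecorded` rather than `listen.rest`, and the repo’s helper adds more structure and error handling):

```python
import os
from deepgram import DeepgramClient, PrerecordedOptions

def transcribe_file_structured(path: str):
    """Transcribe a local audio file and return speaker-labeled segments."""
    dg = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

    options = PrerecordedOptions(
        model="nova-3",
        smart_format=True,
        punctuate=True,
        utterances=True,   # sentence-like units with start/end times
        diarize=True,      # speaker labels on every word/utterance
    )

    with open(path, "rb") as f:
        payload = {"buffer": f.read()}

    response = dg.listen.rest.v("1").transcribe_file(payload, options)

    # Each utterance carries start/end (in seconds), a speaker id, and text.
    segments = []
    for utt in response.results.utterances:
        segments.append({
            "text": utt.transcript,
            "speaker": f"Speaker {utt.speaker}",
            "start": utt.start,
            "end": utt.end,
        })
    return segments
```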

What this does: it sends the file to Deepgram’s prerecorded endpoint with smart formatting, punctuation, diarization, and utterances enabled, then folds the returned utterances (each with start/end times and a speaker label) into segments that are ready for chunking and embedding.

💡 Tip: Deepgram’s prerecorded endpoint accepts files up to 2 GB, with a 10-minute processing cap for Nova/Base models (see docs). Split longer recordings to avoid timeouts and to parallelise transcription.
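One way to do the split (an illustrative FFmpeg invocation; the repo doesn’t prescribe a specific tool):

```bash
# Cut a long recording into 10-minute chunks without re-encoding
ffmpeg -i long_call.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
```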

Step 2: (Optional) PII Redaction Before Indexing

A tiny regex pass replaces obvious emails, phones, SSNs, card numbers, IPv4 with tokens like [EMAIL], [PHONE], etc., before embeddings are computed and stored.
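A sketch of that pass (patterns trimmed for readability; the repo’s version covers a few more formats):

```python
import re

# Order matters: match the most specific patterns before the generic phone pattern.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    """Replace obvious PII with placeholder tokens before embedding and storage."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```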

 Toggle with REDACT_PII=true|false in .env.

(💡 Tip: For production systems, consider Deepgram’s built-in redaction feature to strip PII before vectors are stored.)

Step 3: Chunking the Transcript

For high-precision results, split the transcript into logical segments. In our demo we use simple sentence splits, but you can refine with silence detection or fixed-length windows.
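One refinement worth sketching: merge consecutive utterances from the same speaker until a block reaches roughly 15–30 seconds. This is illustrative only; the demo keeps simple sentence splits:

```python
def chunk_segments(segments, max_window_s: float = 30.0):
    """Merge consecutive same-speaker segments into ~15-30s blocks."""
    chunks, current = [], None
    for seg in segments:
        same_speaker = current and current["speaker"] == seg["speaker"]
        within_window = current and (seg["end"] - current["start"]) <= max_window_s
        if current and same_speaker and within_window:
            # Extend the open chunk with this utterance
            current["text"] += " " + seg["text"]
            current["end"] = seg["end"]
        else:
            # Close the previous chunk and start a new one (speaker change or window full)
            if current:
                chunks.append(current)
            current = dict(seg)  # copy so we don't mutate the input
    if current:
        chunks.append(current)
    return chunks
```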

💡 Tip: For long meetings, consider paragraph-level or topic-based chunking (e.g., every 30 seconds or at speaker changes).

Step 4: Generate Embeddings (Cohere)

Convert each text segment into a 1,024-dim float vector using Cohere’s embed-v4.0 model (see voice_archive.py::generate_embeddings()):
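A sketch of the embedding call with Cohere’s v2 Python client. The environment-variable name and the output_dimension argument are assumptions here; the key points are input_type="search_document" for indexing and an output size that matches your Pinecone index:

```python
import os
import cohere

co = cohere.ClientV2(os.environ["COHERE_API_KEY"])

def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed transcript chunks for indexing (use input_type='search_query' at query time)."""
    resp = co.embed(
        model="embed-v4.0",
        texts=texts,
        input_type="search_document",
        embedding_types=["float"],
        output_dimension=1024,  # assumption: match the 1,024-dim index used in this tutorial
    )
    return resp.embeddings.float_
```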

Why segment size matters:

  • Sentence-level: precise quotes but larger index.

  • Paragraph-level: broader context, fewer vectors.

  • Hybrid: index sentences, aggregate for queries.

Step 5: Index in Pinecone

Upsert each vector into your Pinecone namespace along with its rich metadata (voice_archive.py::upsert_segments()):

  • text, speaker, start, end, file, session
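A sketch of the upsert with the Pinecone Python SDK (v3+ serverless style; the environment-variable names are illustrative):

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host=os.environ["PINECONE_INDEX_HOST"])  # your serverless index host

def upsert_segments(chunks, vectors, file_id, session_id, namespace="voice-archive"):
    """Write one vector per chunk, with the metadata we filter and display on later."""
    records = []
    for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
        records.append({
            "id": f"{file_id}:{i}",          # stable ID so re-uploads overwrite prior vectors
            "values": vec,
            "metadata": {
                "text": chunk["text"],
                "speaker": chunk["speaker"],
                "start": chunk["start"],
                "end": chunk["end"],
                "file": file_id,
                "session": session_id,
            },
        })
    index.upsert(vectors=records, namespace=namespace)
```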

💡 Best practice: Include additional metadata (e.g., {"call_id":..., "speaker":...}) so you can filter later. For long‑term cleanliness, consider stable IDs (e.g., file_hash:i) so re‑uploads overwrite prior vectors.

Step 6: Query the Index

Transform the user’s natural-language query into a vector and fetch the top K matches:
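A sketch of the query path, reusing the co and index clients from the previous steps; the optional metadata filter is what powers the “Limit to current file” toggle:

```python
def search_archive(query: str, top_k: int = 5, file_id: str | None = None):
    """Embed the query, then fetch the nearest transcript chunks from Pinecone."""
    resp = co.embed(
        model="embed-v4.0",
        texts=[query],
        input_type="search_query",   # queries and documents use different input types
        embedding_types=["float"],
        output_dimension=1024,       # assumption: must match the stored vectors
    )
    qvec = resp.embeddings.float_[0]

    results = index.query(
        vector=qvec,
        top_k=top_k,
        include_metadata=True,
        namespace="voice-archive",
        filter={"file": {"$eq": file_id}} if file_id else None,
    )
    for match in results.matches:
        m = match.metadata
        print(f"{match.score:.3f}  [{m['start']:.1f}-{m['end']:.1f}s]  {m['speaker']}: {m['text']}")
    return results
```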

💡Quick UX tip: Display similarity scores, transcript snippets, and link back to the original timestamped audio.

Step 7: Hook into a Web UI (FastHTML + HTMX + Tailwind CSS)

Your app.py wires this all together and you can start the server locally: python app.py
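If you’re curious how little glue that takes, here is a stripped-down FastHTML/HTMX route in the same spirit as app.py (illustrative only; the real app adds the upload form, audio player, and evaluation card):

```python
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # A search box that posts to /search and swaps results into #results via HTMX
    return Titled("Voice Archive Search",
        Form(
            Input(name="q", placeholder="frustrated customer refund request"),
            Button("Search Archives"),
            hx_post="/search", hx_target="#results",
        ),
        Div(id="results"),
    )

@rt("/search")
def post(q: str):
    results = search_archive(q, top_k=5)  # helper from Step 6
    return Div(*[
        P(f"{m.score:.3f} | {m.metadata['text']}") for m in results.matches
    ])

serve()  # FastHTML listens on http://localhost:5001 by default
```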

Browse to http://localhost:5001. You should see an interface similar to this:

1. Upload a .wav/.mp3 or Process URL (direct file link). When processing finishes, you’ll see a “Processing Complete” card and a player. (The card is collapsible to save space; the player is kept persistent.)

2. Search:

  • Enter a natural query (e.g., “frustrated customer refund request”).

  • Choose Results (top‑K) and a Similarity threshold.

  • Click Search Archives → a thin progress bar appears.

  • Results render directly under the search box, each with:

  • Similarity score

  • [start–end] timestamps + speaker chip

  • Transcript snippet

  • ▶ Play (jumps the player to start)

3. Evaluate (optional):

  • Expand “📏 Evaluation (optional)” in the search form.

  • Tick “Show result IDs” and run a search (cards show id: ...).

  • Copy relevant IDs into the textarea (one per line or comma‑separated).

  • Search again → a 📊 Evaluation card shows nDCG@k, Recall@k, MRR.

Here is a video demo of the FastHTML application that wraps the backend logic and displays semantic search results with adjustable parameters:

Here’s what the backend log looks like from the terminal:

📎 Full Minimal Pipeline Script

If you’d rather skip the web UI, run this from your command line:
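Below is a compact sketch stitched together from the helper sketches above, assuming they live in the same module; the repo’s script differs in details like CLI arguments and error handling:

```python
"""Minimal end-to-end pipeline: transcribe -> redact -> chunk -> embed -> index -> query."""
import sys

AUDIO_PATH = sys.argv[1] if len(sys.argv) > 1 else "sample_call.wav"

# 1. Transcribe with Deepgram (speakers + timestamps)
segments = transcribe_file_structured(AUDIO_PATH)

# 2. Optional PII scrub before anything is embedded or stored
for seg in segments:
    seg["text"] = redact_pii(seg["text"])

# 3. Chunk into ~15-30s speaker-consistent blocks
chunks = chunk_segments(segments)
print(f"Transcribed {len(segments)} utterances -> {len(chunks)} chunks")

# 4. Embed and 5. index
vectors = generate_embeddings([c["text"] for c in chunks])
upsert_segments(chunks, vectors, file_id=AUDIO_PATH, session_id="demo-session")

# 6. Query
search_archive("frustrated customer asking for a refund", top_k=5)
```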

The script prints the segment/chunk counts as it indexes, then the top-K matches (score, timestamps, speaker, snippet) for the sample query. A few directions to take it further:

  1. Plug the search API into Slack/MS Teams for instant call recall

  2. Add RAG summaries: feed top-k snippets into GPT-4o to answer free-text questions

  3. Swap Cohere for open-source Jina Embeddings v2 if you need on-prem

  4. Try Weaviate or pgvector to avoid external SaaS if data residency is critical.

  5. Consider streaming real-time transcription for live call-coaching dashboards.

🤖Creator’s Note: Got a working app? Now it’s time to automate how it takes action with those search results.
Plug in a Voice Agent API to get all the benefits of a voice-native agentic system out-of-the-box, so your app can automatically book meetings, schedule tasks, make orders, and perform downstream tasks.

In the next section, learn how you can scale this system!

Scale and Parallelize the Voice Archive Search System

For large archives, batch-process using Python’s concurrent.futures (also in the repo):

The repo’s batch_transcribe() uses a ThreadPoolExecutor to parallelise large backfills:
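Here’s the shape of it (a sketch; the repo’s actual signature and logging differ). Deepgram calls are I/O-bound, so a thread pool works well:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_transcribe(paths, max_workers: int = 5):
    """Transcribe many files concurrently and return {path: segments}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transcribe_file_structured, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                print(f"Failed on {path}: {exc}")
    return results
```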

Test it out:
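For example, pointing it at a folder of recordings (assuming the sketch above):

```python
from glob import glob

archive = batch_transcribe(glob("audio/*.wav"), max_workers=5)
print(f"Transcribed {len(archive)} files")
```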

Scaling and Production Notes for the Voice Archive Search App

Here are ten scaling and production notes you should be aware of:

  1. Segmentation: Aim for ~15–30s per segment (or 100–300 tokens).

  2. Throughput: ~25 × real-time on a modest VM (Nova-3, 5 concurrent). Use async + thread pools (already in repo) or background workers for large archives.

  3. Concurrency: Deepgram allows 100 concurrent Nova requests per project, so tune max_workers accordingly.

  4. Hybrid search: Add BM25/keyword filters for compliance/legal terms.

  5. Re-embedding cadence: Re-embed when you switch embedding models or after major product vocabulary changes.

  6. Privacy: The regex redactor is intentionally small; use a dedicated PII service for regulated workloads.

  7. Quality and Ops: Track WER samples, retrieval metrics (nDCG/Recall/MRR), latency (Pinecone < 100 ms P95), and cost per hour indexed/1k queries.

  8. Cost control: Store original WAVs in cold storage; keep transcripts + vectors hot.

  9. Vector DB: Pinecone serverless scales storage and compute automatically, so watch your plan’s per-project request/QPS limits; if you move to pod-based indexes, upgrade the pod type (e.g., starter → s1) as you grow.

  10. Embeddings: Batch up to 96 sentences per Cohere request; reuse HTTP/2 connections.

Troubleshooting (CLI + UI)

Here are some errors you might encounter, the likely causes, and recommended fixes:

Great! With this tutorial, you now have a reproducible, scalable voice archive search tool. 

Whether you’re surfacing customer complaints, unpacking meeting decisions, or auditing compliance calls, this stack turns hours of audio into an instantly queryable knowledge base.

Benchmark the Voice Archive Search System

A voice‑archive search system has two layers to measure:

  1. Speech‑to‑Text (STT) quality and speed

  2. Semantic retrieval quality and speed

You’ll get the best results by tracking both because errors in STT can cascade into retrieval.

What Metrics Do You Measure?

While every dataset is different, teams typically anchor on a small set of well‑understood metrics and target bands:

For Speech‑to‑Text (STT) (Deepgram)

  • Word Error Rate (WER): Primary quality metric, computed as (substitutions + deletions + insertions) / reference words and usually reported as a percentage. Lower is better; acceptable ranges depend on audio quality and domain.

  • Real‑Time Factor (RTF): Speed metric; processing_time/audio_duration. RTF ≤ 1 means processing keeps up with audio duration.

  • Diarization quality (optional): Track DER (diarization error rate) or at least speaker‑change accuracy if you rely on speaker turns.

  • Formatting quality: Casing, punctuation, numbers; simple pass/fail checks or spot audits often suffice.

  • Operational signals: Error rate, timeouts, and p50/p95 STT latency.

📝 Note: For WER, here are some ranges to help you:

  • Clean/near‑field speech: ~3–8% WER is common for strong models.

  • Conversational/mixed‑quality calls: ~8–15% WER is a realistic target.

  • Far‑field/noisy: 12–25% WER is typical unless you add domain adaptation.

For Semantic Retrieval (Cohere + Pinecone)

  • Recall@K and nDCG@K: Ability to surface relevant chunks; teams often aim for Recall@10 ≥ 0.9 on their internal test sets.

  • MRR: Rewards getting a relevant result very high in the list (already exposed in the repo UI).

  • Latency (p95/p99): User‑perceived speed; for interactive apps, keep p95 ≤ 100–200 ms per query (excluding first‑time cold starts).

  • Throughput (QPS) and index size: Capacity planning signals as your archive grows.

  • Filter effectiveness (when scoping to file/session): fraction of results retained when filters are on.

💡Tips: 

  • Treat p95 latency as your UX guardrail (what most users feel), not just the average.

  • Don’t compare your numbers to generic “leaderboards” without matching audio/domain. Instead, maintain a frozen, representative test set of your own data and track trends against that set release‑to‑release.

How to Measure in this Repo:

1. In the Search panel, open 📏 Evaluation (optional). Paste ground‑truth vector IDs (one per line). The UI will compute nDCG@k, Recall@k, and MRR using the functions in evaluate.py (sketched after this list).


2. To scope retrieval to the most recent file/session (which also improves evaluation accuracy), use the Limit to current file toggle.

3. To collect basic timing, wrap the /search handler with timestamps and log p50/p95 over time.
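Here’s roughly what those helpers compute, with binary relevance over the gold IDs you paste in (evaluate.py in the repo is the source of truth; this is a minimal sketch):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none is found)."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, rid in enumerate(ranked_ids[:k]) if rid in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0
```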

Sensible Targets (Use as Starting Points, Then Tune)

These are pragmatic goals for interactive tools; adjust them to your domain and data.

Tooling You Can Use

  • Deepgram usage/latency: view in logs/console; export to your observability stack.

  • Vector store metrics (Pinecone): inspect query latency percentiles and index stats; combine with your app logs to compute Recall@K/nDCG@K using known‑good IDs.

Iteration Plan

Step 1: Own your test sets. Create two small but stable sets:

  • Quality (labeled snippets for recall/ndcg/mrr)

  • Speed (queries for latency/QPS). Refresh only when your data distribution changes.

Step 2: Automate evaluations. Use your existing helpers (recall_at_k, ndcg_at_k, mrr) in a nightly job and store results per commit/model/index config.

Step 3: Tune in measured loops.

  • Reduce WER: domain vocab/phrases, channel‑specific models, audio pre‑processing (VAD/denoise), and prompt‑style hints where supported.

  • Boost recall without blowing up latency: tune top_k, try hybrid (keyword + vector), adjust ANN parameters (e.g., HNSW efSearch), or pre‑filter by metadata (speaker/file/session) to shrink the candidate set.

  • Lower latency: warm caches, pin hot namespaces, consider batch queries for multi‑panel UIs, and right‑size index resources.

Step 4: Collect user signals. Lightweight thumbs‑up/down on results or “was this helpful?” improves future evaluation sets and can drive re‑ranking.

Step 5: Guardrails. Add simple alerts when p95 latency or Recall@10 regresses by >X% from the last release.

📝 Practical note for the repo: the UI already supports basic evaluation inputs and shows Recall@K, MRR, and nDCG. Keeping those visible in PRs (screenshots + numbers) makes performance changes reviewable, not just “looks good.”

Conclusion: How to Build a Voice Archive Semantic Search App With Deepgram, Cohere, and Pinecone

Pairing Deepgram Nova-3 transcription with vector embeddings and Pinecone turns raw audio into a searchable knowledge base. 

Here’s the workflow you built in this guide:

  1. Accurate transcripts: word-level timestamps, speaker turns, and noise-robust models give you clean text for downstream processing.

  2. Meaning-based retrieval: embeddings let users ask “show me churn risk” instead of guessing the exact words the caller used.

  3. Scalable architecture: transcripts are chunked, embedded, and written to a serverless vector index, so growth is just “add pods/replicas.”

  4. Auditable workflows: every match links back to a timestamped audio snippet, satisfying compliance and QA teams.

The UI lets users upload or link files, scope searches to the current file, and evaluate retrieval quality (nDCG/Recall/MRR) against known‑good IDs. 

The net result: less manual scrubbing of recordings, faster insight turnaround, and a foundation you can extend to summarization, RAG, or real-time alerting.

Ready to build your own voice archive search app? Try Deepgram in the Playground or sign up for a free account and get $200 in credits to start using our Speech-to-Text API.

FAQs

Why Use Semantic Search Instead of Keyword Search for Audio?

Speech is messy—synonyms, fillers, and paraphrases abound. Keyword search only finds exact tokens, so “issue with my bill” may miss “problem on my invoice.” Semantic search encodes meaning in vectors, recovering those intent-level matches and typically lifting recall and user satisfaction versus pure keyword filters.

Which Vector Databases Pair Well with Deepgram Transcripts?

Any ANN-capable vector DB that accepts 1,024-d float vectors works. Community favourites include Pinecone (managed, serverless), Weaviate (open source, with hybrid search built in), and pgvector (keeps vectors inside Postgres).

All three expose metadata filters (e.g. file == …) that you can drive from the “Limit to current file” toggle in the repo.

What Are Tips for Higher Accuracy on Noisy Audio?

  1. Model choice: Use noise-robust models like Nova-3; it’s trained on far-field/crosstalk data.

  2. Custom vocabulary: Seed domain names (“LAN port”, “tier-one”) to cut OOV errors.

  3. Pre-clean the file with FFmpeg: normalize volume, denoise, remove long silences, or split >30-min files into smaller chunks before upload.

  4. Benchmark and iterate: Keep a noisy-audio test set; track WER after each tweak.

Does Semantic Search Trade Speed For Accuracy?

Not in practice. With HNSW or IVF-HNSW indexes:

  • Single-vector queries on <10 M items often finish in 10-50 ms.

  • Raising search parameters (e.g. efSearch) bumps recall with a modest latency hit; you can tune for p95 < 200 ms while staying above 90% recall@10.

That’s fast enough for the real-time UX your app surfaces.

Clear File Scope?

If you need to jump back to a global archive search, hit the “Clear file scope” button next to the Limit to current file checkbox—your search form will revert to querying the full namespace.

