3 Ways I Gave My Newsletter a Voice (Literally)

I remember the first time I typed a question into a chatbot and got a decent answer. It felt like a small miracle. Then I added a microphone button to my newsletter site, clicked it, and started talking to my own content archive. That felt like a different kind of miracle entirely. The chatbot was no longer a text box on a screen; it was a conversation. This piece walks through the three specific choices I made to give my newsletter a literal voice, focusing on the architecture, the retrieval pipeline, and the speech services that made it all feel real. The core of this transformation was building a newsletter voice chatbot that could understand spoken questions about every issue I had ever written.

The Real-Time Voice Infrastructure That Made It Possible

When I first considered adding voice to my newsletter site, I assumed the technical barrier would be enormous. WebRTC, the technology that powers browser-to-browser audio, is notoriously finicky. Codecs, latency optimization, audio routing, room management — the list of headaches is long. I had no interest in becoming a real-time communication expert. I just wanted someone to be able to click a button and talk to my content.

That is where LiveKit entered the picture. LiveKit is real-time communication infrastructure that you can think of as a programmable Zoom. It abstracts away the WebRTC complexity so you do not have to worry about how audio packets travel from one browser to another. Instead, you focus on what happens inside the room once a user joins.

The part that matters most for building a newsletter voice chatbot is the LiveKit agent framework. You write a Python worker that connects to their cloud service and waits. When a user joins a room, LiveKit dispatches your agent into that room. The agent listens to the user’s microphone, processes the incoming speech, thinks about it, and talks back. All of this happens in real time.
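To make that dispatch model concrete, here is a rough sketch of what such a worker looks like with the LiveKit Agents Python SDK. The class names follow the SDK's documented pattern but may shift between versions, and the actual speech and model plugins are wired in later in this piece.

```python
# Minimal LiveKit agent worker (a sketch; exact class names may vary by SDK version).
from livekit import agents
from livekit.agents import Agent, AgentSession, WorkerOptions, cli


async def entrypoint(ctx: agents.JobContext):
    # LiveKit Cloud dispatches this worker into a room when a visitor joins.
    await ctx.connect()

    # The session ties together VAD, STT, LLM, and TTS (plugin wiring shown later).
    session = AgentSession(
        # vad=..., stt=..., llm=..., tts=...
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="Answer questions about my newsletter archive."),
    )


if __name__ == "__main__":
    # The worker connects outbound to LiveKit Cloud and waits for room dispatches.
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```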

The latency surprised me. It feels like talking to another person, not waiting for a computer to process a request. There is no awkward pause where you wonder if the system crashed. The response comes quickly enough that the conversation flows naturally.

How the Three Pieces Fit Together

The architecture consists of three distinct pieces. The first is the API, a FastAPI server that handles text chat requests and generates LiveKit room tokens whenever someone clicks the microphone button. The second is the voice agent itself, a Python worker running the LiveKit agent SDK. This worker connects outbound to LiveKit Cloud and waits for rooms to be dispatched. Inside the agent, several components work together: voice activity detection using Silero VAD, speech-to-text via Azure Speech Services, an LLM for generating responses, and text-to-speech also through Azure. The third piece is the frontend, an Astro component with a microphone button. Clicking it loads the LiveKit client SDK, requests a room token from the API, and connects to the room via WebRTC. The agent joins, and the conversation begins.
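As an illustration of the token-minting step, here is a minimal FastAPI endpoint sketch using the livekit-api package. The route path, environment variable names, and room-naming scheme are placeholders of my own, not the exact code from the project.

```python
# Token-minting endpoint sketch (FastAPI + the livekit-api package).
import os
import uuid

from fastapi import FastAPI
from livekit import api

app = FastAPI()


@app.post("/voice/token")
async def create_voice_token():
    room_name = f"newsletter-chat-{uuid.uuid4().hex[:8]}"
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity(f"visitor-{uuid.uuid4().hex[:8]}")
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )
    # The frontend joins the room over WebRTC with this token; LiveKit Cloud
    # dispatches the agent into the same room.
    return {"token": token, "room": room_name, "url": os.environ["LIVEKIT_URL"]}
```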

Both the API and the voice agent run inside a single Railway container. A bash script starts both processes at boot. If either process dies, the container exits and Railway automatically restarts it. This keeps the setup simple and maintainable.

The RAG Pipeline That Gave the Chatbot a Memory

A voice chatbot is only useful if it knows what you are talking about. Without a retrieval system, the LLM would have to rely on its training data, which may or may not include the specific content of my newsletter issues. I needed a way for the agent to search through every issue I had ever written and find the most relevant chunks of text in response to a spoken question.

This is where the Retrieval-Augmented Generation (RAG) pipeline comes into play. Every time the user says something, the agent follows a specific sequence of steps. First, it transcribes the speech using Azure Speech Services. The audio stream is converted to text in real time. Next, the agent embeds that transcript using the GitHub Models API with the text-embedding-3-small model, which produces 1536-dimensional vectors. Those vectors are then used to search a SQLite vector database powered by sqlite-vec. The database contains chunks of text from every newsletter issue I have published. The search returns the most semantically similar chunks. The agent then rebuilds its system prompt with this fresh context, generates a response using GPT-4.1-mini via GitHub Models, and speaks that response back using Azure’s neural text-to-speech.
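A stripped-down version of that per-utterance retrieval step looks roughly like this. GitHub Models speaks the OpenAI API, so the standard openai client works against it; the endpoint URL, database path, and table and column names here are assumptions for illustration, not the project's exact schema.

```python
# Per-utterance retrieval sketch: embed the transcript, search sqlite-vec,
# rebuild the system prompt. Table names ("chunks", "chunk_embeddings") and the
# GitHub Models base_url are assumptions.
import os
import sqlite3

import sqlite_vec
from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",  # GitHub Models endpoint (assumed)
    api_key=os.environ["GITHUB_TOKEN"],
)

db = sqlite3.connect("newsletter.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)


def retrieve_context(transcript: str) -> list[str]:
    # Embed the transcript: text-embedding-3-small returns a 1536-dimensional vector.
    emb = client.embeddings.create(model="text-embedding-3-small", input=transcript)
    query_vec = sqlite_vec.serialize_float32(emb.data[0].embedding)

    # Nearest-neighbour search over the sqlite-vec virtual table.
    rows = db.execute(
        """
        SELECT chunks.text
        FROM chunk_embeddings
        JOIN chunks ON chunks.id = chunk_embeddings.rowid
        WHERE chunk_embeddings.embedding MATCH ?
        ORDER BY distance
        LIMIT 5
        """,
        (query_vec,),
    ).fetchall()
    return [text for (text,) in rows]


def build_system_prompt(chunks: list[str]) -> str:
    # Rebuilt on every turn so the context always matches the latest question.
    context = "\n\n".join(chunks)
    return f"You answer questions about my newsletter using this context:\n\n{context}"
```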

This entire pipeline runs per utterance. The agent’s knowledge stays current with whatever the user is asking about. If the first question is about a specific topic and the follow-up is about something completely different, the agent retrieves new context for the second question. It does not get stuck on whatever the first question was.

The Hybrid Retrieval Trick That Solved a Hidden Problem

Vector search is powerful, but it has a blind spot. Semantic similarity cannot understand ordering or recency. If a user asks “what is the latest issue?” or “tell me about issue number fifteen,” the vector database has no way of knowing which chunk corresponds to the most recent publication. The embeddings capture meaning, not chronology.

The solution I found was surprisingly simple. At startup, the agent queries the database for all newsletter issue URLs. It extracts the issue numbers from those URLs and injects a content index directly into every system prompt. The index looks something like this: Available newsletter issues: issue-1, issue-2, issue-3, all the way up to issue-20. It also includes a line stating that the latest issue is issue-20 and that the total number of issues is 20.
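Generating that index is only a few lines of startup code. Here is a sketch; the table and column names are again assumptions, and the point is simply a regex over the stored issue URLs.

```python
# Startup-time content index: pull every issue URL, extract the issue numbers,
# and produce the line that gets injected into every system prompt.
import re
import sqlite3


def build_content_index(db: sqlite3.Connection) -> str:
    urls = [row[0] for row in db.execute("SELECT DISTINCT url FROM chunks")]
    numbers = sorted(
        int(m.group(1)) for url in urls if (m := re.search(r"issue-(\d+)", url))
    )
    issue_list = ", ".join(f"issue-{n}" for n in numbers)
    return (
        f"Available newsletter issues: {issue_list}. "
        f"Latest issue: issue-{numbers[-1]}. Total issues: {len(numbers)}."
    )
```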

Now the LLM gets both semantic context from the vector search and structural metadata it cannot learn from embeddings. If someone asks “what is the latest issue?” the LLM reads the system prompt and knows the answer immediately. If someone asks “tell me about GitHub Copilot,” the vector search finds the relevant chunks. This hybrid retrieval approach handles both types of questions without needing a separate database or a complex query parser.

Why Azure Speech Services for Both Ends of the Voice Pipeline

The voice pipeline has two critical points where quality matters. The first is speech-to-text: turning the user’s spoken words into text the LLM can understand. The second is text-to-speech: turning the LLM’s response back into natural-sounding speech. If either of these steps produces poor quality, the entire conversation feels broken.

I chose Azure Speech Services for both ends of the pipeline. The speech-to-text accuracy is high enough that I rarely need to correct a transcription. It handles background noise surprisingly well and adapts to different speaking speeds. The streaming support means the agent can start processing the audio before the user has finished speaking, which cuts perceived latency in half.

The text-to-speech side uses Azure’s neural voices. The voice I selected is en-US-JennyNeural. It sounds human. Not robotic, not stuck in the uncanny valley. It has natural intonation, appropriate pauses, and a warm tone. When the agent speaks back an answer, it does not sound like a computer reading a script. It sounds like someone who knows what they are talking about.

LiveKit integrates with Azure Speech Services through its plugin system. The livekit-plugins-azure package handles the connection between the voice agent and Azure’s APIs. This meant I did not have to write any custom glue code. The plugin handles streaming, buffering, and error recovery automatically.
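For reference, this is roughly how the Azure plugins slot into the agent session alongside Silero VAD and the LLM. The constructor parameter names follow the livekit-plugins packages and may differ slightly between versions, and the GitHub Models base URL is an assumption.

```python
# Wiring the Azure speech plugins into the agent session (a sketch).
import os

from livekit.agents import AgentSession
from livekit.plugins import azure, openai, silero

session = AgentSession(
    vad=silero.VAD.load(),
    stt=azure.STT(
        speech_key=os.environ["AZURE_SPEECH_KEY"],
        speech_region=os.environ["AZURE_SPEECH_REGION"],
    ),
    tts=azure.TTS(
        voice="en-US-JennyNeural",
        speech_key=os.environ["AZURE_SPEECH_KEY"],
        speech_region=os.environ["AZURE_SPEECH_REGION"],
    ),
    # GPT-4.1-mini via GitHub Models; assumes the OpenAI-compatible plugin
    # accepts a custom base_url.
    llm=openai.LLM(
        model="gpt-4.1-mini",
        base_url="https://models.inference.ai.azure.com",
        api_key=os.environ["GITHUB_TOKEN"],
    ),
)
```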

The Practical Benefit of This Choice

Using a single provider for both speech-to-text and text-to-speech simplified the architecture significantly. I did not have to manage separate API keys, different authentication methods, or conflicting SDK versions. The Azure plugin for LiveKit works with both services out of the box. When I need to change a voice or adjust a recognition model, I update a single configuration file instead of hunting through multiple services.

The quality trade-off was minimal. Azure’s speech-to-text accuracy competes with the best available services, and the neural voices are among the most natural I have tested. For a newsletter voice chatbot where the user expects a human-like interaction, this combination delivers the right balance of accuracy and naturalness.

The Surprising Simplicity of the Whole Setup

If you had told me a year ago that I could build a real-time voice chatbot for my newsletter in a single evening, I would have laughed. The web is full of tutorials that make voice AI sound like something only a team of engineers at a large company could build. The reality is different.

The text chat version of the chatbot took one evening to build. Adding the microphone button and the voice pipeline took another evening. The hardest part was not the code — it was understanding how the pieces fit together. Once I understood that LiveKit handles the real-time communication, Azure handles the speech processing, and GitHub Models handles the embeddings and LLM calls, the architecture became clear.

The single Railway container approach keeps deployment trivial. I do not need a Kubernetes cluster or a complex CI/CD pipeline. A bash script starts both processes, and Railway handles the rest. If the container crashes, it restarts. If I need to update the agent, I push a new build and Railway deploys it.

What This Means for Newsletter Creators

I am not a large operation. I write a newsletter because I enjoy sharing ideas and learning in public. The idea of adding a voice chatbot felt like something that would take weeks of development time and require specialized knowledge I did not have. It did not. The tools have matured to the point where a solo creator can build something that feels genuinely futuristic.

The immediate benefit for readers is obvious. Instead of scrolling through a search bar or trying to remember which issue contained a specific topic, they can ask a question out loud and get an answer with sources. The experience feels more like asking a knowledgeable friend than querying a database. That shift in feel is worth the setup time.

For anyone considering a similar project, the advice is simple: start with the text chatbot first. Get the RAG pipeline working. Make sure the answers are accurate and the sources are clear. Then add the voice layer on top. The voice adds the magic, but the text foundation is what makes the answers useful.

The newsletter voice chatbot I built is not perfect. It occasionally mishears a word or retrieves a chunk that is not quite relevant. But the latency is low enough and the voice is natural enough that the conversation feels real. That is the bar I set for myself, and I crossed it faster than I expected.
