Creating a Text-to-Speech Podcast Generator: A Deep Dive

In this blog post, we'll explore how to create a system that transforms written text into a natural-sounding podcast conversation. Our system uses AI to generate conversational dialogue and then converts it into audio using text-to-speech technology. The entire solution fits in roughly 200 lines of JavaScript, making it a compact yet powerful tool for content creators.

Prerequisites: Setting Up Your API Keys

Before diving into the implementation, you'll need two essential API keys:

An Anthropic API key for accessing Claude, which handles our conversational AI generation
An ElevenLabs API key for the text-to-speech conversion

You can store these in your .env file:

ANTHROPIC_API_KEY=your_anthropic_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
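
A missing key otherwise only shows up later as a failed API request, so it can help to check for both variables at startup. The following is a minimal sketch of such a check; `requireEnv` is a hypothetical helper name, not part of dotenv or the original project.

```javascript
// Hypothetical helper: returns the requested variables, or throws if any are
// missing or blank. Pass a custom `env` object to test it without process.env.
function requireEnv(names, env = process.env) {
  const missing = names.filter(
    (name) => !env[name] || env[name].trim() === "",
  );
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}

// Usage, after dotenv's config() has run:
// const { ANTHROPIC_API_KEY, ELEVENLABS_API_KEY } = requireEnv([
//   "ANTHROPIC_API_KEY",
//   "ELEVENLABS_API_KEY",
// ]);
```

Failing early like this keeps the error message close to its cause instead of burying it inside a rejected request.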

The Main Flow: Orchestrating the Process

First, let's look at how the main components work together:

import { createPodcastScript } from "./llm.js";
import { createAudioStreamFromText } from "./audio.js";
import { config } from "dotenv";

config();

// Example text input
const text = `
  Let me help you explain R strategies in the circular economy...
`;

// Create podcast script and generate audio
const podcastScript = await createPodcastScript(text);
await createAudioStreamFromText(podcastScript, "../output/podcast.mp3");

This code orchestrates our entire process. It takes an input text about circular economy strategies, transforms it into a conversational script, and then generates an audio file. The magic happens in two main steps: script generation and audio creation.

Generating Natural Conversations with AI

Next, let's explore how we create natural-sounding dialogue:

const initialSystemTemplate = `
  You are an experienced podcast host creating authentic, unscripted conversations between Lena and Andy.
  Please begin with a short introduction of yourself, mention your name, then proceed into an introduction
  of the topic and begin with the podcast. At the end both should say goodbye to the audience and see you
  next time.

  Rules:
  Include filler words, repairs, and backchanneling
  Keep punctuation conversational
  Add spontaneous reactions and encouraging sounds
  Use realistic filler words like "you know", "like", "I mean"
  Do not use filler words like "hmm", "uhh", "ahh", "haha"

  Required Format:
  [Natural, unscripted dialogue with fillers]
  The generated strings have to be JSON safe.
  You are only allowed to include natural language.
  {format_instructions}
`;

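import { z } from "zod";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StructuredOutputParser } from "@langchain/core/output_parsers";

// Define the structure for our podcast conversation
const podcastSchema = z.array(
  z.object({
    speaker: z.enum(["Andy", "Lena"]),
    text: z.string(),
  }),
);

// Assumed wiring (not shown in the original excerpt): a LangChain parser,
// prompt, and Claude model. The model name is a placeholder; use whichever
// Claude model your account offers.
const parser = StructuredOutputParser.fromZodSchema(podcastSchema);
const prompt = ChatPromptTemplate.fromMessages([
  ["system", initialSystemTemplate],
  ["human", "{input_text}"],
]);
const model = new ChatAnthropic({ model: "claude-3-5-sonnet-latest" });

// Exported so the main script can import it from "./llm.js"
export async function createPodcastScript(text) {
  const messages = await prompt.formatMessages({
    format_instructions: parser.getFormatInstructions(),
    input_text: text,
  });
  const response = await model.invoke(messages);
  return parser.parse(response.content);
}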

This section handles the transformation of our input text into a natural conversation. The system is instructed to create dialogue that sounds authentic by including common speech patterns like filler words and spontaneous reactions. We use a schema to ensure the output follows a consistent format where each line includes a speaker and their dialogue.
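
To make the expected shape concrete, here is a small sketch of what a parsed script might look like, together with a plain-JavaScript check that mirrors the schema's rules. The `validateScript` function and the sample lines are illustrative assumptions; in the pipeline itself, the zod schema is what performs this validation.

```javascript
const SPEAKERS = ["Andy", "Lena"];

// Hypothetical stand-in for the zod schema: throws on the first malformed
// entry and returns the script unchanged if every line is valid.
function validateScript(script) {
  if (!Array.isArray(script)) {
    throw new TypeError("Script must be an array of lines");
  }
  script.forEach((line, index) => {
    if (line === null || typeof line !== "object") {
      throw new TypeError(`Line ${index} must be an object`);
    }
    if (!SPEAKERS.includes(line.speaker)) {
      throw new TypeError(`Line ${index} has an unknown speaker`);
    }
    if (typeof line.text !== "string") {
      throw new TypeError(`Line ${index} must have a text string`);
    }
  });
  return script;
}

// Made-up sample of the structure the parser is expected to return
const sampleScript = [
  { speaker: "Lena", text: "Hi everyone, I'm Lena, and, you know, welcome back!" },
  { speaker: "Andy", text: "And I'm Andy. So, I mean, today is all about the circular economy." },
];

validateScript(sampleScript);
```

Restricting `speaker` to a fixed set matters later on, because the audio step picks a voice by speaker name.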

Converting Text to Speech

The final piece is turning our script into audio:

import { writeFile } from "node:fs/promises";

// Exported so the main script can import it from "./audio.js"
export async function createAudioStreamFromText(script = [], outputPath) {
  const chunks = [];

  for (let i = 0; i < script.length; i++) {
    const { speaker, text } = script[i];

    // Select voice based on speaker
    let voice;
    if (speaker === "Andy") {
      voice = "TX3LPaxmHKxFdv7VOQHJ";
    } else if (speaker === "Lena") {
      voice = "Xb7hH8MSUJKS3MSD8JSS";
    } else {
      throw new Error(`No voice configured for speaker "${speaker}"`);
    }

    // Generate speech for current line
    const response = await fetch(
      `https://api.elevenlabs.io/v1/text-to-speech/${voice}/stream`,
      {
        method: "POST",
        headers: {
          "xi-api-key": process.env.ELEVENLABS_API_KEY,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          text,
          model_id: "eleven_turbo_v2_5",
          voice_settings: {
            stability: 0.5,
            similarity_boost: 0.5,
            style: 0.05,
            use_speaker_boost: true,
          },
        }),
      },
    );

    // Stop early instead of writing an error response into the audio file
    if (!response.ok) {
      throw new Error(`Text-to-speech request failed: ${response.status}`);
    }

    // Process the audio stream
    const reader = response.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      chunks.push(value);
    }

    // Add natural pause between speakers
    if (i < script.length - 1) {
      const silenceBuffer = createSilenceBuffer();
      chunks.push(silenceBuffer);
    }
  }

  // Assumed final step (not shown in the original excerpt):
  // join all chunks and write them to the output file
  await writeFile(outputPath, Buffer.concat(chunks));
}

This code handles the text-to-speech conversion. For each line in our script, it:

Determines which voice to use based on the speaker
Sends the text to the ElevenLabs text-to-speech API
Receives and processes the audio stream
Adds natural pauses between speakers
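
The first two steps can also be pulled out into a small pure function, which makes the request easy to inspect without calling the API. This is a sketch of one possible refactor, not the code the project uses: `VOICE_IDS` and `buildSpeechRequest` are hypothetical names, while the voice IDs, endpoint, and request body are copied from the function above.

```javascript
// Lookup table replacing the if/else chain; a Map avoids accidental matches
// on inherited object properties.
const VOICE_IDS = new Map([
  ["Andy", "TX3LPaxmHKxFdv7VOQHJ"],
  ["Lena", "Xb7hH8MSUJKS3MSD8JSS"],
]);

// Hypothetical helper: builds the URL and fetch options for one script line.
function buildSpeechRequest({ speaker, text }, apiKey) {
  const voice = VOICE_IDS.get(speaker);
  if (!voice) {
    throw new Error(`No voice configured for speaker "${speaker}"`);
  }
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voice}/stream`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text,
        model_id: "eleven_turbo_v2_5",
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.5,
          style: 0.05,
          use_speaker_boost: true,
        },
      }),
    },
  };
}

// Usage inside the loop:
// const { url, options } = buildSpeechRequest(script[i], process.env.ELEVENLABS_API_KEY);
// const response = await fetch(url, options);
```

Adding a third speaker then only requires one more entry in the table.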

The system even includes a feature to add random-length pauses between speakers, making the conversation feel more natural:

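// Assumption: SILENCE_BUFFER holds one short clip of encoded silence that
// lasts SILENCE_BUFFER_MS milliseconds; neither is shown in this excerpt.
const SILENCE_BUFFER_MS = 100;

function createSilenceBuffer(
  durationMs = 200 + Math.floor(Math.random() * 301),
) {
  // Repeat the clip often enough to cover the requested duration.
  // Dividing by 1000 would always give one copy for 200-500 ms.
  const bufferCopies = Math.ceil(durationMs / SILENCE_BUFFER_MS);
  const buffers = Array(bufferCopies).fill(SILENCE_BUFFER);
  return Buffer.concat(buffers);
}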

This function returns a silence buffer for a pause of 200 to 500 ms, so the gaps between speakers vary instead of repeating at a fixed interval.
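
The default argument relies on a small piece of arithmetic: `Math.random()` returns a value in [0, 1), so `Math.floor(Math.random() * 301)` yields an integer from 0 to 300, and adding 200 gives a result from 200 to 500 inclusive. A sketch that makes the range explicit and accepts an injectable random source for testing might look like this; `randomPauseMs` and its parameters are hypothetical names.

```javascript
// Hypothetical helper: picks an integer pause length in [minMs, maxMs].
// `random` must behave like Math.random and return a value in [0, 1).
function randomPauseMs(minMs = 200, maxMs = 500, random = Math.random) {
  const span = maxMs - minMs + 1; // number of possible integer values
  return minMs + Math.floor(random() * span);
}

// With a fixed random source the result is predictable:
// randomPauseMs(200, 500, () => 0)   returns 200
// randomPauseMs(200, 500, () => 0.5) returns 350
```

Passing the random source as a parameter keeps the default behavior unchanged while making the bounds easy to verify.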

Listen to the result here:

Conclusion

Our podcast generator demonstrates how we can combine AI technologies to transform written content into engaging audio conversations. The system handles everything from generating natural-sounding dialogue to converting it into speech with appropriate pacing and flow. In roughly 200 lines of JavaScript (which can be found here), we've created a tool that opens up new possibilities for content creation and accessibility. Combining Anthropic's Claude for dialogue generation with ElevenLabs for voice synthesis keeps the solution easy to implement and extend.

You can further customize this system by creating your own unique voices using ElevenLabs' voice cloning technology, allowing you to give your podcast a truly distinctive sound. Additionally, if you want to automate content generation, you can build a web scraper to gather source material from your favorite websites or RSS feeds, making it even easier to produce regular podcast content at scale.