Welcome to the journey of building open source PDF service using Angular (Analogjs), Firestore, Cloud Storage, and CloudRun. This project serves as a platform for sharing my knowledge, continually learning best practices, and simultaneously contributing to the community.
Demo: https://pdfun.xyz
GitHub - dalenguyen/pdfun
The solution is built around GCP ecosystem, it’s better to deploy the project on GCP, so it can access their services. There are two parts of the solution:
- Web UI (Analogjs - Angular): handle user interaction
- Backend (Node - Express): process PDF files
Have you ever wanted to turn your scripts, interviews, or even technical documents into engaging, conversational podcasts—without hiring voice actors? Thanks to advancements in AI, you can now synthesize realistic multi-speaker audio using Google Cloud’s Text-to-Speech (TTS) API and Node.js. In this post, I’ll show you how to create a deep-dive, two-person podcast episode with just a few lines of code.
Architecture Overview
In general, there’re three parts of the process:
- Upload PDF file
- Extract text from PDF file
- Convert the PDF to Podcast script
- Synthesize conversions using multi-speaker TTS
Why Multi-Speaker TTS?
Traditional TTS is monolog—one voice, one perspective. But podcasts thrive on dialogue, debate, and personality. Google Cloud’s latest TTS models, especially the en-US-Studio-Multispeaker
voice, let you alternate between speakers, creating a natural back-and-forth that’s perfect for podcasts, interviews, and storytelling.
Prerequisites
- A Google Cloud Platform (GCP) account
- Node.js and npm installed
-
@google-cloud/text-to-speech
&@google-cloud/vertexai
package -
pdf-parse
package
Generate Podcast from PDF
Now, let's break the steps on how to generate Podcast from an uploaded PDF file.
Extract text from PDF
After download the PDF from Cloud storage, we will need to extract the text.
const pdfBuffer = await pdfParse(downloadedPath)
const extractedText = pdfBuffer.text
Generate the script from PDF Text
We will use gemini-2.0-flash
and VertexAI to generate the script from text.
const model = 'gemini-2.0-flash'
const generativeModel = vertexAI.getGenerativeModel({
model,
generationConfig: {
maxOutputTokens: 2048,
temperature: 0.4,
topP: 0.8,
topK: 40,
},
})
const prompt = `Convert the following text into an engaging podcast conversation between two hosts that will be approximately 3-4 minutes long when read aloud.
Guidelines:
1. Keep the script under ${MAX_CHARS} characters to ensure it works with text-to-speech
2. Create a natural dialogue between two hosts: Yen (the expert) and Dale (the curious learner)
3. Structure the conversation to:
- Start with a brief introduction of the topic
- Have Dale ask thoughtful questions about key points
- Let Yen provide clear, engaging explanations
- Include natural transitions between topics
- End with a brief summary and conclusion
4. Make the conversation feel natural and unscripted
5. Focus on the most important points and main ideas
6. Use conversational language and avoid overly formal tone
Text to convert:
${extractedText}`
console.log(`[PDF to Podcast] Sending prompt to Vertex AI (length: ${prompt.length} characters)`)
const result = await generativeModel.generateContent({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
})
const script = result.response.candidates[0].content.parts[0].text
Break the generated script to turns
We need to have a different format for each voice and speaker.
// Split the conversation into turns
const conversationTurns = cleanScript
.split(/(?=(?:Yen:|Dale:))/)
.map((turn) => {
if (turn.startsWith('Yen:')) {
const text = turn.replace(/^Yen:\s*/, '').trim()
return text
? {
text,
speaker: 'Y',
}
: null
} else if (turn.startsWith('Dale:')) {
const text = turn.replace(/^Dale:\s*/, '').trim()
return text
? {
text,
speaker: 'D',
}
: null
}
return null
})
.filter((turn) => turn !== null && turn.text.length > 0)
Generate and synthesize speeches
All we need to do is to loop through each conversion, generate the audio and combine it into one.
// Process each turn separately
const audioBuffers = []
for (const turn of conversationTurns) {
console.log(`[PDF to Podcast] Processing turn:`, turn)
const [response] = await ttsClient.synthesizeSpeech({
input: { text: turn.text },
voice: {
languageCode: 'en-US',
// https://cloud.google.com/text-to-speech/docs/list-voices-and-types
name: turn.speaker === 'Y' ? 'en-US-Standard-F' : 'en-US-Standard-D',
ssmlGender: turn.speaker === 'Y' ? 'FEMALE' : 'MALE',
},
audioConfig: {
audioEncoding: 'MP3',
speakingRate: 1.0,
pitch: 0.0,
volumeGainDb: 2.0,
},
})
audioBuffers.push(response.audioContent)
}
After that, the audio is ready to served!
Now we a fully running solution for generate podcast from your PDF file on https://pdfun.xyz/pdf-to-podcast.
Questions?
If you have any questions or run into any issues, please don’t hesitate to create an issue on our GitHub repository. Alternatively, you can chat with me. I’m here to help and would love to hear your feedback.
Here is the PR on Github if you wanna see the implementation in detail.
Stay tuned for the next part. Until then, happy coding!
Top comments (0)