
Intro
If you're reading this post then you probably want to add audio versions to your blog posts. Perhaps you've noticed more sites offering "listen to this article" features, or maybe you just want to make your content more accessible.
Whatever your reason, I'll show you exactly how I built a complete text-to-speech pipeline that automatically generates high-quality audio for every post on this blog—including the one you're reading right now.
This post assumes the following of you:
- You have a Node.js-based blog or can integrate Node scripts into your build process
- You have an OpenAI API key (for their TTS service)
- You have an AWS account with S3 access
- You're comfortable with basic command-line tools
- You want professional-quality audio without manual recording
Alright, let's get to it.
The Architecture
Here's how the pipeline works end-to-end: extract and normalize the post's text, split it into chunks under the TTS character limit, generate audio for each chunk with OpenAI's API, concatenate the chunks with FFmpeg, and upload the result to S3 for the player to stream.
The beauty of this system is that it's fully automated. Write a post, run the build, and audio appears. No manual steps, no recording equipment, just code.
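In code terms, that flow can be sketched as a small chain of transforms. This is a toy sketch with illustrative names (`stages`, `runPipeline`) — not the real script's API:

```javascript
// Toy sketch of the pipeline stages (illustrative names, not the real script's API).
const FENCE = '`'.repeat(3); // avoids embedding a literal markdown fence in this snippet
const CODE_BLOCK = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');

const stages = [
  // 1. Extract: strip markdown constructs that shouldn't be narrated
  md => md.replace(CODE_BLOCK, ''),
  // 2. Normalize: rewrite text to sound natural (e.g. "$53k" -> "53 thousand dollars")
  text => text.replace(/\$(\d+)k/gi, '$1 thousand dollars'),
  // 3. Chunk: split for the TTS API (the real version enforces the 4096-char limit)
  text => [text],
];

function runPipeline(markdown) {
  return stages.reduce((value, step) => step(value), markdown);
}
```

Each stage is covered in detail below; the real pipeline then feeds the chunks to the TTS API, concatenates the audio, and uploads it.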
Text Processing: Making Markdown Sound Natural
The first challenge is that blog posts aren't written to be read aloud. They contain:
- Code blocks that shouldn't be narrated
- Abbreviations like "API" or "AWS"
- Special formatting like "$53k" or "Dec 2021"
- Emojis and special characters
- Links and images
Here's how I handle text extraction using the Compromise NLP library (full source):
```javascript
const nlp = require('compromise');

// Extract and normalize text content from markdown
function extractTextFromMarkdown(markdown) {
  // Remove frontmatter
  let content = markdown.replace(/^---[\s\S]*?---\n/, '');

  // Remove all emojis and special Unicode characters
  content = content.replace(/[\u{1F300}-\u{1F9FF}]|[\u{1F600}-\u{1F64F}]|[\u{1F680}-\u{1F6FF}]|[\u{2600}-\u{26FF}]|[\u{2700}-\u{27BF}]|[\u{1F900}-\u{1F9FF}]|[\u{1F1E0}-\u{1F1FF}]/gu, '');

  // Remove code blocks entirely
  content = content.replace(/```[\s\S]*?```/g, '');

  // Handle inline code - keep the word or phrase, drop the backticks
  content = content.replace(/`([^`]+)`/g, '$1');

  // Extract link text, removing the URL
  content = content.replace(/\[([^\]]+)\]\([^)]+\)/g, '$1');

  // Use compromise to process the text
  let doc = nlp(content);

  // Expand contractions ("I've" -> "I have")
  doc.contractions().expand();

  // Process money values like "$53k"
  const moneyMatches = doc.match('$#Value');
  moneyMatches.forEach(m => {
    const text = m.text();
    if (text.match(/\$\d+k/i)) {
      const num = text.match(/\d+/)[0];
      m.replaceWith(`${num} thousand dollars`);
    }
  });

  // Write the NLP edits back into the plain-text content
  content = doc.text();

  // Handle common abbreviations
  const abbreviations = {
    'API': 'A P I',
    'URL': 'U R L',
    'HTTP': 'H T T P',
    'HTTPS': 'H T T P S',
    'AWS': 'A W S',
    'GPU': 'G P U',
    // ... many more
  };
  for (const [abbr, spoken] of Object.entries(abbreviations)) {
    content = content.replace(new RegExp(`\\b${abbr}\\b`, 'g'), spoken);
  }

  return content.trim();
}
```
Example Processing Output
Here's what the normalization does to actual text:
Original:
```text
This year, I successfully paid off my private student loans by paying down the remaining $53k I had left.
I've been working on the API for NormConf using AWS.
```
Processed:
```text
This year, I successfully paid off my private student loans by paying down the remaining 53 thousand dollars I had left.
I have been working on the A P I for NormConf using A W S.
```
The difference is subtle but crucial for natural-sounding speech.
Chunking: Working Around OpenAI's 4096 Character Limit
OpenAI's TTS API has a hard limit of 4096 characters per request. For longer posts (like my student loans story at 43,138 characters), we need intelligent chunking (view on GitHub):
```javascript
function splitTextIntoChunks(text, maxChars) {
  if (text.length <= maxChars) {
    return [text];
  }

  const chunks = [];

  // First try to split by double newlines (paragraphs)
  const paragraphs = text.split(/\n\n+/);
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    const trimmedParagraph = paragraph.trim();
    if (!trimmedParagraph) continue;

    // If a single paragraph is too long, split by sentences
    if (trimmedParagraph.length > maxChars) {
      if (currentChunk.trim()) {
        chunks.push(currentChunk.trim());
        currentChunk = '';
      }

      // Use NLP to split by sentences
      const doc = nlp(trimmedParagraph);
      const sentences = doc.sentences().out('array');

      for (const sentence of sentences) {
        if ((currentChunk + ' ' + sentence).length > maxChars && currentChunk.length > 0) {
          chunks.push(currentChunk.trim());
          currentChunk = sentence;
        } else {
          currentChunk += (currentChunk ? ' ' : '') + sentence;
        }
      }
    } else {
      // Check if adding this paragraph would exceed the limit
      const separator = currentChunk ? '\n\n' : '';
      const combined = currentChunk + separator + trimmedParagraph;

      if (combined.length > maxChars && currentChunk.length > 0) {
        chunks.push(currentChunk.trim());
        currentChunk = trimmedParagraph;
      } else {
        currentChunk = combined;
      }
    }
  }

  // Flush whatever remains after the last paragraph
  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
```
This approach ensures we:
- Never break in the middle of a sentence
- Prefer paragraph boundaries when possible
- Handle edge cases like single paragraphs longer than 4096 chars
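As a quick sanity check, here's a stripped-down sketch of just the paragraph-level split. The name `chunkByParagraph` is illustrative, and unlike the real function above it has no sentence-level fallback:

```javascript
// Stripped-down sketch of the paragraph-level split (no sentence fallback).
function chunkByParagraph(text, maxChars) {
  const chunks = [];
  let current = '';

  for (const para of text.split(/\n\n+/)) {
    const p = para.trim();
    if (!p) continue;

    const combined = current ? current + '\n\n' + p : p;
    if (combined.length > maxChars && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = p;
    } else {
      current = combined;
    }
  }

  if (current) chunks.push(current); // flush the final chunk
  return chunks;
}
```

Every chunk stays under the limit, and paragraph boundaries are always respected.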
Audio Generation and Concatenation
Once we have our chunks, we generate audio for each and use FFmpeg to concatenate them seamlessly:
```javascript
// Generate audio for each chunk
const chunkPaths = [];
for (let i = 0; i < chunks.length; i++) {
  // "slug" here is the post's identifier (e.g. "2022_reflection")
  const chunkPath = path.join(AUDIO_OUTPUT_DIR, `${slug}_chunk_${i}.mp3`);
  console.log(`  Generating chunk ${i + 1}/${chunks.length} (${chunks[i].length} chars)...`);

  await generateAudio(chunks[i], chunkPath);
  chunkPaths.push(chunkPath);
}

// Concatenate with FFmpeg
if (hasFfmpeg) {
  console.log(`  Concatenating ${chunks.length} chunks with ffmpeg...`);
  await concatenateAudioFiles(chunkPaths, audioPath);

  // Clean up chunk files
  for (const chunkPath of chunkPaths) {
    await fs.unlink(chunkPath);
  }
}
```
The FFmpeg concatenation ensures there are no gaps or glitches between chunks—the audio flows naturally as if it were generated in one piece.
Caching: Don't Regenerate Unchanged Content
To avoid unnecessary API calls and costs, I implement content-based caching:
```javascript
// Calculate hash of processed text
const contentHash = calculateHash(textContent);

// Check if audio already exists and content hasn't changed
if (!forceRegenerate && cache[filename] && cache[filename].hash === contentHash) {
  try {
    await fs.access(audioPath);
    console.log(`  ✓ Audio already exists and is up to date`);
    return { filename, audioFilename, status: 'cached' };
  } catch {
    console.log(`  Audio file missing, regenerating...`);
  }
}
```
The cache tracks:
- Content hash (MD5 of processed text)
- Generation timestamp
- Character count
- Number of chunks
- Whether the file is complete (all chunks concatenated)
S3 Upload and Distribution
Once audio files are generated, they're uploaded to S3 for global distribution:
```javascript
// Upload to S3 with caching headers
const command = new PutObjectCommand({
  Bucket: BUCKET_NAME,
  Key: `audio/${audioFilename}`, // e.g. "audio/2022_reflection.mp3"
  Body: fileContent,
  ContentType: 'audio/mpeg',
  CacheControl: 'public, max-age=31536000', // Cache for 1 year
  Metadata: {
    'generated-by': 'blog-audio-generator',
    'source': 'openai-tts'
  }
});

await s3Client.send(command);
```
The upload script also generates a manifest file mapping post slugs to S3 URLs:
```json
{
  "2022_reflection": "https://tech-notes-blog.s3.us-west-2.amazonaws.com/audio/2022_reflection.mp3",
  "building_an_https_model_apI_for_cheap": "https://tech-notes-blog.s3.us-west-2.amazonaws.com/audio/building_an_https_model_apI_for_cheap.mp3"
}
```
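Assembling that manifest is just a slug-to-URL mapping. A sketch (`buildManifest` is an illustrative name; the bucket URL comes from the example manifest):

```javascript
// Sketch of manifest assembly: map each post slug to its S3 audio URL.
const BASE_URL = 'https://tech-notes-blog.s3.us-west-2.amazonaws.com/audio';

function buildManifest(slugs) {
  return Object.fromEntries(slugs.map(slug => [slug, `${BASE_URL}/${slug}.mp3`]));
}
```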
Frontend: The Audio Player Component
The React audio player provides a clean interface with all the controls readers expect (full component):
```jsx
const AudioPlayer = ({ audioUrl, title }) => {
  const [isPlaying, setIsPlaying] = useState(false);
  const [currentTime, setCurrentTime] = useState(0);
  const [duration, setDuration] = useState(0);
  const [playbackRate, setPlaybackRate] = useState(1);

  // Derived progress for the seek bar (guard against duration = 0 before metadata loads)
  const progressPercentage = duration > 0 ? (currentTime / duration) * 100 : 0;

  // ... audio event handlers

  return (
    <div className="audio-player">
      <div className="audio-player-header">
        <span className="audio-player-title">{title}</span>
      </div>

      <div className="audio-player-controls">
        <button onClick={togglePlayPause}>
          {isPlaying ? <PauseIcon /> : <PlayIcon />}
        </button>

        <div className="audio-player-time">
          {formatTime(currentTime)} / {formatTime(duration)}
        </div>

        <div className="audio-player-progress" onClick={handleProgressClick}>
          <div
            className="audio-player-progress-fill"
            style={{ width: `${progressPercentage}%` }}
          />
        </div>

        <button onClick={handleSpeedChange}>
          {playbackRate}x
        </button>
      </div>
    </div>
  );
};
```
Features include:
- Play/pause toggle
- Progress bar with seeking
- Time display (current/total)
- Playback speed control (1x, 1.25x, 1.5x, 1.75x, 2x)
- Loading states and error handling
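The time display boils down to a small helper. A sketch of what a `formatTime` implementation might look like, including a guard for the NaN duration the browser reports before audio metadata loads:

```javascript
// Sketch of the mm:ss formatter the time display relies on.
function formatTime(seconds) {
  // duration is NaN until the browser has loaded the audio metadata
  if (!Number.isFinite(seconds)) return '0:00';
  const m = Math.floor(seconds / 60);
  const s = Math.floor(seconds % 60);
  return `${m}:${String(s).padStart(2, '0')}`;
}
```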
Results and Performance
The complete pipeline processes all 14 posts on this blog in about 15 minutes:
- 11 posts required chunking (2-11 chunks each)
- Total of 33 audio chunks generated
- Longest post: 43,138 characters (11 chunks)
- All audio seamlessly concatenated with FFmpeg
- Zero manual intervention required
Cost Analysis
OpenAI TTS pricing:
- tts-1-hd: $0.030 per 1,000 characters
- Average blog post: ~10,000 characters = $0.30
- Total for 14 posts: ~$4.20
AWS S3 costs:
- Storage: ~100MB total = $0.0023/month
- Bandwidth: Depends on traffic, but audio files are cached for 1 year
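If you want to estimate your own bill, the arithmetic is trivial. A quick sanity check of the numbers above, using the tts-1-hd rate:

```javascript
// Sanity-checking the cost figures: tts-1-hd is $0.030 per 1,000 characters.
const PRICE_PER_1K_CHARS = 0.03;

function ttsCost(charCount) {
  return (charCount / 1000) * PRICE_PER_1K_CHARS;
}

console.log(ttsCost(10000).toFixed(2));      // average post
console.log(ttsCost(14 * 10000).toFixed(2)); // all 14 posts at the average length
```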
The Command Line Interface
Simple npm scripts make the whole process painless:
```bash
# Generate audio for all posts
npm run generate-audio

# Generate for a specific post
npm run generate-audio post-name

# Force regenerate (ignore cache); "--" passes the flag through npm to the script
npm run generate-audio -- post-name --force

# Upload to S3
npm run upload-audio

# Full pipeline
npm run process-audio
```
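For reference, the matching scripts section of package.json might look like this. A sketch: the script paths follow the Source Code list at the end of this post, and the `process-audio` chaining is an assumption:

```json
{
  "scripts": {
    "generate-audio": "node scripts/generate-audio.js",
    "upload-audio": "node scripts/upload-audio-s3.js",
    "process-audio": "npm run generate-audio && npm run upload-audio"
  }
}
```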
Lessons Learned
- Text normalization is crucial - Raw markdown sounds terrible when read aloud
- Smart chunking matters - Breaking at sentence boundaries maintains flow
- Caching saves money - Content-based hashing prevents unnecessary regeneration
- FFmpeg is your friend - Seamless audio concatenation with one command
- S3 + CloudFront works great - Fast global delivery with minimal configuration
Try It Yourself
If you want to implement this for your own blog, you'll need:
- OpenAI API key (get one at platform.openai.com)
- AWS account with S3 bucket
- Node.js environment
- FFmpeg installed locally
- About an hour to set everything up
The complete implementation is running on this blog—in fact, you can listen to this very post by clicking the audio player at the top.
Source Code
All the code for this TTS pipeline is available on GitHub:
- Audio Generation Script: scripts/generate-audio.js - Core logic for text extraction, NLP processing, chunking, and OpenAI API integration
- S3 Upload Script: scripts/upload-audio-s3.js - Handles uploading audio files to S3 and generating the manifest
- Audio Player Component: src/components/AudioPlayer.tsx - React component with full playback controls
- Post Detail Integration: src/components/PostDetail.tsx - Shows how the audio player is integrated into blog posts
- Audio Manifest: src/config/audioManifest.json - Maps post slugs to S3 audio URLs
- Setup Documentation: docs/AUDIO_SETUP.md - Detailed setup instructions
Happy listening!