Adding Text-to-Speech to Your Blog: Building an OpenAI TTS Pipeline with Smart Chunking and AWS S3



Intro

If you're reading this post, you probably want to add audio versions to your blog posts. Perhaps you've noticed more sites offering "listen to this article" features, or maybe you just want to make your content more accessible.

Whatever your reason, I'll show you exactly how I built a complete text-to-speech pipeline that automatically generates high-quality audio for every post on this blog—including the one you're reading right now.

This post assumes the following of you:

  • You have a Node.js-based blog or can integrate Node scripts into your build process
  • You have an OpenAI API key (for their TTS service)
  • You have an AWS account with S3 access
  • You're comfortable with basic command-line tools
  • You want professional-quality audio without manual recording

Alright, let's get to it.

The Architecture

Here's how the pipeline works end-to-end:

  1. 📄 Markdown post — the source .md file
  2. 📝 Text extraction — remove code, images, links
  3. 🧠 NLP processing — expand "I've" to "I have", etc.
  4. ✂️ Chunking logic — if the text exceeds 4,096 characters, split by paragraphs/sentences
  5. 🎵 OpenAI TTS API — generate an MP3 per chunk
  6. 🔗 FFmpeg concat — if there are multiple chunks, merge them into one file
  7. ☁️ S3 upload and manifest update
  8. ▶️ Audio player UI — React + HTML5

The beauty of this system is that it's fully automated. Write a post, run the build, and audio appears. No manual steps, no recording equipment, just code.
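At a high level, the orchestration is simple enough to sketch. Here's a minimal, hypothetical planner that decides which stages a given post needs; the stage names and threshold mirror the pipeline described above, but this is an illustration rather than the actual build script:

```javascript
// Sketch of a per-post pipeline plan (hypothetical helper, not the real script).
// Returns the ordered list of stages a post of the given length goes through.
function planPipeline(charCount, maxChars = 4096) {
  const stages = ['extract-text', 'nlp-normalize'];
  if (charCount > maxChars) {
    stages.push('chunk'); // split into <= maxChars pieces
  }
  stages.push('tts-generate');
  if (charCount > maxChars) {
    stages.push('ffmpeg-concat'); // merge the chunk MP3s back together
  }
  stages.push('s3-upload', 'manifest-update');
  return stages;
}

console.log(planPipeline(1200));  // short post: no chunking needed
console.log(planPipeline(43138)); // long post: chunk, then concatenate
```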

Text Processing: Making Markdown Sound Natural

The first challenge is that blog posts aren't written to be read aloud. They contain:

  • Code blocks that shouldn't be narrated
  • Abbreviations like "API" or "AWS"
  • Special formatting like $53k or "Dec 2021"
  • Emojis and special characters
  • Links and images

Here's how I handle text extraction using the Compromise NLP library (full source):

````javascript
const nlp = require('compromise');

// Extract and normalize text content from markdown
function extractTextFromMarkdown(markdown) {
  // Remove frontmatter
  let content = markdown.replace(/^---[\s\S]*?---\n/, '');

  // Remove all emojis and special Unicode characters
  content = content.replace(/[\u{1F300}-\u{1F9FF}]|[\u{1F600}-\u{1F64F}]|[\u{1F680}-\u{1F6FF}]|[\u{2600}-\u{26FF}]|[\u{2700}-\u{27BF}]|[\u{1F900}-\u{1F9FF}]|[\u{1F1E0}-\u{1F1FF}]/gu, '');

  // Remove code blocks entirely
  content = content.replace(/```[\s\S]*?```/g, '');

  // Handle inline code - keep the word or phrase, drop the backticks
  content = content.replace(/`([^`]+)`/g, '$1');

  // Extract link text, removing the URL
  content = content.replace(/\[([^\]]+)\]\([^)]+\)/g, '$1');

  // Use compromise to process the text
  let doc = nlp(content);

  // Expand contractions
  doc.contractions().expand();

  // Process money values like "$53k" -> "53 thousand dollars"
  const moneyMatches = doc.match('$#Value');
  moneyMatches.forEach(m => {
    const text = m.text();
    if (text.match(/\$\d+k/i)) {
      const num = text.match(/\d+/)[0];
      m.replaceWith(`${num} thousand dollars`);
    }
  });

  // Write the NLP edits back into the plain-text content
  content = doc.text();

  // Spell out common abbreviations so the voice reads them letter by letter
  const abbreviations = {
    'API': 'A P I',
    'URL': 'U R L',
    'HTTP': 'H T T P',
    'HTTPS': 'H T T P S',
    'AWS': 'A W S',
    'GPU': 'G P U',
    // ... many more
  };
  for (const [abbr, spoken] of Object.entries(abbreviations)) {
    content = content.replace(new RegExp(`\\b${abbr}\\b`, 'g'), spoken);
  }

  return content.trim();
}
````

Example Processing Output

Here's what the normalization does to actual text:

Original:

```text
This year, I successfully paid off my private student loans by paying down the remaining $53k I had left.
I've been working on the API for NormConf using AWS.
```

Processed:

```text
This year, I successfully paid off my private student loans by paying down the remaining 53 thousand dollars I had left.
I have been working on the A P I for NormConf using A W S.
```

The difference is subtle but crucial for natural-sounding speech.
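If you'd rather not pull in an NLP library, small lookup tables get you most of the way for the common cases. This is a simplified sketch, not the Compromise-based code above, and the tables would need to grow well beyond what's shown here:

```javascript
// Minimal normalizer sketch using plain lookup tables (an assumed, simplified
// stand-in for the compromise-based pipeline).
const CONTRACTIONS = { "I've": 'I have', "don't": 'do not', "it's": 'it is' };
const ABBREVIATIONS = { API: 'A P I', AWS: 'A W S', URL: 'U R L' };

function normalizeForSpeech(text) {
  for (const [from, to] of Object.entries(CONTRACTIONS)) {
    text = text.split(from).join(to);
  }
  for (const [abbr, spoken] of Object.entries(ABBREVIATIONS)) {
    text = text.replace(new RegExp(`\\b${abbr}\\b`, 'g'), spoken);
  }
  // "$53k" -> "53 thousand dollars"
  text = text.replace(/\$(\d+)k\b/gi, '$1 thousand dollars');
  return text;
}

console.log(normalizeForSpeech("I've been working on the API, paying down $53k."));
```

A lookup table is brittle compared to real NLP (it misses possessives like "it's" vs. "its" context, for example), which is why the actual pipeline leans on Compromise.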

Chunking: Working Around OpenAI's 4096 Character Limit

OpenAI's TTS API has a hard limit of 4096 characters per request. For longer posts (like my student loans story at 43,138 characters), we need intelligent chunking (view on GitHub):

```javascript
const nlp = require('compromise');

function splitTextIntoChunks(text, maxChars) {
  if (text.length <= maxChars) {
    return [text];
  }

  const chunks = [];

  // First try to split by double newlines (paragraphs)
  const paragraphs = text.split(/\n\n+/);
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    const trimmedParagraph = paragraph.trim();
    if (!trimmedParagraph) continue;

    // If a single paragraph is too long, split by sentences
    if (trimmedParagraph.length > maxChars) {
      if (currentChunk.trim()) {
        chunks.push(currentChunk.trim());
        currentChunk = '';
      }

      // Use NLP to split by sentences
      const doc = nlp(trimmedParagraph);
      const sentences = doc.sentences().out('array');

      for (const sentence of sentences) {
        if ((currentChunk + ' ' + sentence).length > maxChars && currentChunk.length > 0) {
          chunks.push(currentChunk.trim());
          currentChunk = sentence;
        } else {
          currentChunk += (currentChunk ? ' ' : '') + sentence;
        }
      }
    } else {
      // Check if adding this paragraph would exceed the limit
      const separator = currentChunk ? '\n\n' : '';
      const combined = currentChunk + separator + trimmedParagraph;

      if (combined.length > maxChars && currentChunk.length > 0) {
        chunks.push(currentChunk.trim());
        currentChunk = trimmedParagraph;
      } else {
        currentChunk = combined;
      }
    }
  }

  // Don't drop the final chunk
  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
```

This approach ensures we:

  1. Never break in the middle of a sentence
  2. Prefer paragraph boundaries when possible
  3. Handle edge cases like single paragraphs longer than 4096 chars
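Those invariants are easy to sanity-check against any chunker. Here's a simplified splitter (sentence-regex based, not the Compromise version above) along with the checks worth running on its output:

```javascript
// Simplified chunker sketch: split on sentence boundaries only (an assumed
// stand-in for the paragraph-then-sentence logic shown above).
function splitBySentences(text, maxChars) {
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const s = sentence.trim();
    if ((current + ' ' + s).length > maxChars && current) {
      chunks.push(current);
      current = s;
    } else {
      current = current ? current + ' ' + s : s;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Invariants: every chunk fits the limit, and no text is lost.
const sample = 'One sentence here. Another one follows. And a third for good measure.';
const pieces = splitBySentences(sample, 40);
console.log(pieces);
pieces.forEach(c => { if (c.length > 40) throw new Error('chunk too long'); });
```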

Audio Generation and Concatenation

Once we have our chunks, we generate audio for each and use FFmpeg to concatenate them seamlessly:

```javascript
// Generate audio for each chunk
const chunkPaths = [];
for (let i = 0; i < chunks.length; i++) {
  const chunkPath = path.join(AUDIO_OUTPUT_DIR, `${filename}_chunk_${i}.mp3`);
  console.log(`  Generating chunk ${i + 1}/${chunks.length} (${chunks[i].length} chars)...`);

  await generateAudio(chunks[i], chunkPath);
  chunkPaths.push(chunkPath);
}

// Concatenate with FFmpeg
if (hasFfmpeg) {
  console.log(`  Concatenating ${chunks.length} chunks with ffmpeg...`);
  await concatenateAudioFiles(chunkPaths, audioPath);

  // Clean up chunk files
  for (const chunkPath of chunkPaths) {
    await fs.unlink(chunkPath);
  }
}
```

The FFmpeg concatenation ensures there are no gaps or glitches between chunks—the audio flows naturally as if it were generated in one piece.

Caching: Don't Regenerate Unchanged Content

To avoid unnecessary API calls and costs, I implement content-based caching:

```javascript
// Calculate hash of processed text
const contentHash = calculateHash(textContent);

// Check if audio already exists and content hasn't changed
if (!forceRegenerate && cache[filename] && cache[filename].hash === contentHash) {
  try {
    await fs.access(audioPath);
    console.log(`  ✓ Audio already exists and is up to date`);
    return { filename, audioFilename, status: 'cached' };
  } catch {
    console.log(`  Audio file missing, regenerating...`);
  }
}
```

The cache tracks:

  • Content hash (MD5 of processed text)
  • Generation timestamp
  • Character count
  • Number of chunks
  • Whether the file is complete (all chunks concatenated)

S3 Upload and Distribution

Once audio files are generated, they're uploaded to S3 for global distribution:

```javascript
// Upload to S3 with caching headers
const command = new PutObjectCommand({
  Bucket: BUCKET_NAME,
  Key: `audio/${audioFilename}`,
  Body: fileContent,
  ContentType: 'audio/mpeg',
  CacheControl: 'public, max-age=31536000', // Cache for 1 year
  Metadata: {
    'generated-by': 'blog-audio-generator',
    'source': 'openai-tts'
  }
});

await s3Client.send(command);
```

The upload script also generates a manifest file mapping post slugs to S3 URLs:

```json
{
  "2022_reflection": "https://tech-notes-blog.s3.us-west-2.amazonaws.com/audio/2022_reflection.mp3",
  "building_an_https_model_apI_for_cheap": "https://tech-notes-blog.s3.us-west-2.amazonaws.com/audio/building_an_https_model_apI_for_cheap.mp3",
  // ... more posts
}
```
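Generating that manifest is just a mapping from slug to URL. A sketch (the bucket name and region are taken from the example URLs above; the function itself is assumed, not the actual upload script):

```javascript
// Build a slug -> S3 URL manifest for a set of generated audio files.
function buildManifest(slugs, bucket = 'tech-notes-blog', region = 'us-west-2') {
  const manifest = {};
  for (const slug of slugs) {
    manifest[slug] = `https://${bucket}.s3.${region}.amazonaws.com/audio/${slug}.mp3`;
  }
  return manifest;
}

console.log(JSON.stringify(buildManifest(['2022_reflection']), null, 2));
```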

Frontend: The Audio Player Component

The React audio player provides a clean interface with all the controls readers expect (full component):

```jsx
const AudioPlayer = ({ audioUrl, title }) => {
  const [isPlaying, setIsPlaying] = useState(false);
  const [currentTime, setCurrentTime] = useState(0);
  const [duration, setDuration] = useState(0);
  const [playbackRate, setPlaybackRate] = useState(1);

  // ... audio event handlers

  return (
    <div className="audio-player">
      <div className="audio-player-header">
        <span className="audio-player-title">{title}</span>
      </div>

      <div className="audio-player-controls">
        <button onClick={togglePlayPause}>
          {isPlaying ? <PauseIcon /> : <PlayIcon />}
        </button>

        <div className="audio-player-time">
          {formatTime(currentTime)} / {formatTime(duration)}
        </div>

        <div className="audio-player-progress" onClick={handleProgressClick}>
          <div
            className="audio-player-progress-fill"
            style={{ width: `${progressPercentage}%` }}
          />
        </div>

        <button onClick={handleSpeedChange}>
          {playbackRate}x
        </button>
      </div>
    </div>
  );
};
```

Features include:

  • Play/pause toggle
  • Progress bar with seeking
  • Time display (current/total)
  • Playback speed control (1x, 1.25x, 1.5x, 1.75x, 2x)
  • Loading states and error handling
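The component above leans on a couple of small helpers. Here's one plausible implementation of `formatTime` plus a speed cycler matching the listed speeds; these are sketches, not the component's actual helpers:

```javascript
// Format seconds as M:SS for the player's time display.
function formatTime(seconds) {
  if (!Number.isFinite(seconds)) return '0:00';
  const mins = Math.floor(seconds / 60);
  const secs = Math.floor(seconds % 60);
  return `${mins}:${String(secs).padStart(2, '0')}`;
}

// Cycle through the supported playback speeds, wrapping back to 1x.
const SPEEDS = [1, 1.25, 1.5, 1.75, 2];
function nextSpeed(current) {
  return SPEEDS[(SPEEDS.indexOf(current) + 1) % SPEEDS.length];
}

console.log(formatTime(75)); // "1:15"
console.log(nextSpeed(2));   // wraps back to 1
```

The `Number.isFinite` guard matters in practice: an HTML5 `<audio>` element reports `NaN` duration before metadata loads.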

Results and Performance

The complete pipeline processes all 14 posts on this blog in about 15 minutes:

  • 11 posts required chunking (2-11 chunks each)
  • Total of 33 audio chunks generated
  • Longest post: 43,138 characters (11 chunks)
  • All audio seamlessly concatenated with FFmpeg
  • Zero manual intervention required

Cost Analysis

OpenAI TTS pricing:

  • tts-1-hd: $0.030 per 1,000 characters
  • Average blog post: ~10,000 characters = $0.30
  • Total for 14 posts: ~$4.20

AWS S3 costs:

  • Storage: ~100MB total = $0.0023/month
  • Bandwidth: Depends on traffic, but audio files are cached for 1 year
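The OpenAI numbers fall straight out of the per-character rate; here's a quick sanity check of the arithmetic above:

```javascript
// OpenAI tts-1-hd rate quoted above: $0.030 per 1,000 characters.
const RATE_PER_1K_CHARS = 0.03;

function ttsCost(charCount) {
  return (charCount / 1000) * RATE_PER_1K_CHARS;
}

console.log(ttsCost(10000));      // average ~10k-char post
console.log(ttsCost(10000) * 14); // all 14 posts at that average
console.log(ttsCost(43138));      // the longest post on the blog
```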

The Command Line Interface

Simple npm scripts make the whole process painless:

```bash
# Generate audio for all posts
npm run generate-audio

# Generate for a specific post
npm run generate-audio -- post-name

# Force regenerate (ignore cache)
npm run generate-audio -- post-name --force

# Upload to S3
npm run upload-audio

# Full pipeline
npm run process-audio
```

(Note the `--` separator: npm only forwards arguments to the underlying script after it.)
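Under the hood, each script just reads its arguments from `process.argv`. A sketch of the flag handling (the argument names mirror the commands above, but this is an assumed implementation, not the actual script):

```javascript
// Parse the arguments the generate-audio script accepts:
// an optional post name plus an optional --force flag.
function parseArgs(argv) {
  const force = argv.includes('--force');
  const post = argv.find(arg => !arg.startsWith('--')) || null;
  return { post, force };
}

// argv slice skips the node binary and script path
console.log(parseArgs(process.argv.slice(2)));
console.log(parseArgs(['post-name', '--force'])); // { post: 'post-name', force: true }
```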

Lessons Learned

  1. Text normalization is crucial - Raw markdown sounds terrible when read aloud
  2. Smart chunking matters - Breaking at sentence boundaries maintains flow
  3. Caching saves money - Content-based hashing prevents unnecessary regeneration
  4. FFmpeg is your friend - Seamless audio concatenation with one command
  5. S3 + CloudFront works great - Fast global delivery with minimal configuration

Try It Yourself

If you want to implement this for your own blog, you'll need:

  1. OpenAI API key (get one at platform.openai.com)
  2. AWS account with S3 bucket
  3. Node.js environment
  4. FFmpeg installed locally
  5. About an hour to set everything up

The complete implementation is running on this blog—in fact, you can listen to this very post by clicking the audio player at the top.

Source Code

All the code for this TTS pipeline is available on GitHub:

Happy listening!