Post-production is where most podcast workflows stall. Recording is the easy part. What follows (transcription, metadata writing, thumbnail creation, publishing) is repetitive, time-consuming, and has nothing to do with why someone started a podcast. This project set out to rebuild that back end so creators could focus on the content itself.
Challenges Identified:
- Complex Podcast Production Workflow: Podcast production crosses too many stages and too many tools. Transcription sits in one place, metadata in another, visual assets somewhere else entirely. Each handoff introduces manual work and delays the path to publishing.
- Manual Metadata Preparation: Writing episode titles, show descriptions, and topic tags from scratch every week is a genuine time cost, especially for solo creators or small teams who are already stretched.
- Limited Automation in Audio Processing: Most podcast platforms treated audio as the final product. Converting spoken content into structured, searchable text and then using that text to drive downstream tasks wasn’t something they were built for.
- Difficulty in Creating Visual Assets: Thumbnails require design work most podcasters don’t have. The options were: use a static template forever, commission a designer for every episode, or skip it entirely. None of those are good.
Solution Features:
The platform was redesigned around a conversational workflow that removes the manual steps between recording and publishing:
- Conversational Podcast Workflow: The interaction layer uses RASA CALM combined with OpenAI GPT models. Creators define shows, set up episodes, and trigger publishing steps by talking to the system, describing what they want in plain language rather than navigating a dashboard. For users who found traditional CMSes intimidating, this made a real difference.
- Automated Audio Transcription: Audio is processed through GPT-4o mini Transcribe, which handles natural speech well, including the informal register and tangents that characterise real podcast conversations. Transcripts come back quickly and are accurate enough to serve as the foundation for everything downstream.
- AI-Based Metadata Extraction: Once a transcript exists, an OpenAI GPT model reads it and pulls the relevant metadata: title options, a show-notes summary, a directory-ready description, and topic tags. Creators choose what they want and edit it; they’re not writing from nothing.
- AI Generated Episode Artwork: Thumbnails are generated using the Gemini 2.5 Flash Image model, using the extracted metadata as the prompt source. The output needs some curation, but it gives creators a workable starting point in seconds rather than a design task that gets deferred.
- Voice Interaction Enhancements: The architecture supports ASR and TTS throughout. For this first phase, English language processing runs through Whisper for speech recognition and XTTS for voice synthesis. These components make the conversational interface genuinely hands-free when needed.
- Scalable Technology Architecture: The system runs on Next.js and Node.js, with PostgreSQL and MongoDB handling data. The architecture is set up to scale as creator volume grows, without requiring significant re-engineering.
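To make the shape of the pipeline concrete, here is a minimal sketch of how its stages compose: transcript drives metadata, metadata drives artwork. The interface and function names (`PipelineSteps`, `runPipeline`, and so on) are illustrative assumptions for this write-up, not the platform's actual API; the real implementations would wrap the transcription, GPT, and image-generation services described above.

```typescript
// Illustrative composition layer for the post-production pipeline.
// All names here are hypothetical, not the platform's real API.

interface EpisodeMetadata {
  titleOptions: string[];
  description: string;
  tags: string[];
}

interface PipelineSteps {
  // In production these would call the hosted models; here they are pluggable.
  transcribe: (audio: Uint8Array) => string;                 // e.g. GPT-4o mini Transcribe
  extractMetadata: (transcript: string) => EpisodeMetadata;  // e.g. an OpenAI GPT model
  generateArtwork: (meta: EpisodeMetadata) => Uint8Array;    // e.g. Gemini 2.5 Flash Image
}

interface EpisodePackage {
  transcript: string;
  metadata: EpisodeMetadata;
  artwork: Uint8Array;
}

// Each stage feeds the next, so the whole package comes out of one call.
function runPipeline(audio: Uint8Array, steps: PipelineSteps): EpisodePackage {
  const transcript = steps.transcribe(audio);
  const metadata = steps.extractMetadata(transcript);
  const artwork = steps.generateArtwork(metadata);
  return { transcript, metadata, artwork };
}
```

Because the stages are plain functions, any one of them can be swapped (a different transcription model, a different image model) without touching the rest of the pipeline.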
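The artwork step works by folding the extracted metadata into an image prompt. The sketch below shows one plausible way to do that; `buildArtworkPrompt` and its inputs are assumptions for illustration, not the platform's actual prompt template.

```typescript
// Hypothetical prompt builder: turns extracted metadata into a
// short, concrete prompt for the image model.

interface ArtworkInputs {
  showName: string;
  title: string;
  tags: string[];
}

function buildArtworkPrompt({ showName, title, tags }: ArtworkInputs): string {
  // A few strong cues tend to steer image models better than a long paragraph,
  // so this caps the themes at three and drops empty segments.
  return [
    `Podcast episode thumbnail for "${showName}"`,
    `Episode title: ${title}`,
    tags.length > 0 ? `Themes: ${tags.slice(0, 3).join(", ")}` : "",
    "Bold, legible title text, 1:1 aspect ratio",
  ]
    .filter((segment) => segment.length > 0)
    .join(". ");
}
```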
Advantages:
- Simplified Podcast Creation: The conversational interface walks creators through setup and episode management. No dashboard to learn, no form fields to fill in, just a guided back and forth that handles the structural work.
- Automated Content Processing: Transcription and metadata extraction remove the manual writing steps that previously sat between recording and publishing. Creators review and approve outputs rather than producing them from scratch.
- Integrated Asset Generation: Thumbnails come out of the same workflow as the metadata: no separate design step, no separate tool. The asset package comes together in one place.
- Improved Workflow Efficiency: Audio processing, metadata generation, and asset creation run as a single connected pipeline. What used to cross multiple tools and multiple calendar days now completes in a single session.
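The review-and-approve flow above depends on the model's metadata coming back in a predictable shape. A minimal sketch of how that untrusted response might be validated before it reaches a creator, assuming the model is asked to reply as JSON with `titleOptions`, `description`, and `tags` fields (those field names are assumptions for this example):

```typescript
// Hypothetical validator for a JSON metadata response from the GPT model.
// Model output is untrusted text, so shape is checked before use.

interface ParsedMetadata {
  titleOptions: string[];
  description: string;
  tags: string[];
}

function parseMetadata(raw: string): ParsedMetadata {
  const data = JSON.parse(raw);
  const isStringArray = (v: unknown): v is string[] =>
    Array.isArray(v) && v.every((x) => typeof x === "string");

  if (
    !isStringArray(data.titleOptions) ||
    typeof data.description !== "string" ||
    !isStringArray(data.tags)
  ) {
    throw new Error("model response is missing required metadata fields");
  }

  // Normalise tags: trim, lowercase, de-duplicate, drop empties.
  const tags = Array.from(
    new Set<string>(data.tags.map((t: string) => t.trim().toLowerCase()))
  ).filter((t) => t.length > 0);

  return { titleOptions: data.titleOptions, description: data.description, tags };
}
```

Failing loudly on a malformed response means a bad generation surfaces as a retry rather than as broken metadata in a published episode.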
Conclusion:
The platform was built around a straightforward premise: the hard parts of podcast post-production shouldn’t require specialised skills or a multi-tool workflow. By combining a conversational interface, automated transcription, AI-driven metadata extraction, and generated episode artwork, it delivers a single environment where creators can go from recording to publish-ready episode faster and with less friction. That’s what the platform actually does, and for the creators using it, the difference is visible in output volume and in the time spent on the content itself rather than on mechanical production steps.