YouTube to Podcast Automation with Gemini AI

An automated skill that converts YouTube videos into Xiaoyuzhou podcast episodes using AI-generated covers and browser automation.

February 23, 2026
By ClawList Team

YouTube to Podcast in One Click: Building an AI-Powered Automation Skill with Gemini and Playwright

Published on ClawList.io | Category: Automation | By ClawList Editorial Team


If you've ever wanted to turn a YouTube video into a fully published podcast episode, complete with an AI-generated cover image and almost no manual effort, this OpenClaw skill is worth a close look. Developer @wlzh recently shipped an automation skill that takes a YouTube URL as input and publishes the result as a podcast episode on Xiaoyuzhou (小宇宙), one of China's most popular podcast platforms. The pipeline combines the Gemini API, Playwright browser automation, and a modular skill architecture that's worth dissecting in detail.

Let's break down exactly how this works, why it matters, and how you can adapt the same pattern for your own automation workflows.


What This Skill Does: The Full Pipeline

At its core, this skill chains together four distinct capabilities into a single, seamless flow. Understanding each stage helps you appreciate not just the end result, but the engineering decisions that make it reliable and reusable.

Stage 1 — Audio Extraction via Skill Reuse

The first step is getting the audio out of YouTube. Rather than reinventing the wheel, @wlzh deliberately reuses an existing video-downloader skill to pull high-quality audio from the provided YouTube URL. This is a textbook example of modular skill composition — a core principle in OpenClaw development where individual skills act like building blocks that can be stacked together.

Input: YouTube URL
         ↓
[video-downloader skill]
         ↓
Output: High-quality audio file (.mp3 / .m4a)

By reusing an existing, battle-tested skill rather than embedding new download logic, the overall system stays lean and maintainable. If the download logic ever needs updating, you change it in one place and every skill that depends on it inherits the fix automatically.
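To make the reuse pattern concrete, here is a minimal sketch of a wrapper that validates the input and then delegates to the existing skill. The `Downloader` signature, `video_id`, and `extract_audio` are hypothetical names chosen for illustration; the real video-downloader skill's interface is not described in the source, and only standard watch and youtu.be links are recognized here.

```python
from typing import Callable, Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical signature for the reused skill: URL in, local audio path out.
Downloader = Callable[[str], str]

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube watch or short URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in YOUTUBE_HOSTS:
        return None
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if parsed.path == "/watch":
        # Standard links carry the ID in the "v" query parameter.
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


def extract_audio(url: str, downloader: Downloader) -> str:
    """Validate the URL, then delegate the actual download to the reused skill."""
    if video_id(url) is None:
        raise ValueError(f"not a recognized YouTube video URL: {url}")
    return downloader(url)
```

Because the downloader is passed in rather than imported, the wrapper can be exercised with a stub and the real skill can be swapped or upgraded without touching this code.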

Stage 2 — Intelligent Content Processing

Raw YouTube metadata is rarely podcast-ready. Video titles often contain channel names, promotional suffixes, or embedded URLs that look fine in a video description but feel out of place in a podcast listing. This skill automatically:

  • Cleans and reformats the title to match podcast conventions
  • Strips raw URLs from the description/details field so listeners don't encounter broken links in their podcast apps
  • Normalizes the detail text into a clean episode summary

This kind of lightweight NLP pre-processing is easy to overlook but dramatically improves the end-user experience. A podcast titled "My REACTION to This Video!! (GONE WRONG) 😱 | Subscribe Now → https://bit.ly/xyz" benefits enormously from a pass through intelligent formatting logic before it goes live.
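A rough sketch of what such a clean-up pass could look like with plain regular expressions is shown below. The rules, the promotional phrase list, and the function names `clean_title` and `clean_description` are assumptions for illustration, not the skill's actual logic, which may well use a language model instead.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Promotional fragments to drop from titles; this list is an illustrative guess.
PROMO_RE = re.compile(r"\bsubscribe now\b|\(gone wrong\)", re.IGNORECASE)
# Emoji and pictographs in the main Unicode symbol blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_title(raw: str) -> str:
    """Strip URLs, promo phrases, emoji, and stray separators from a title."""
    title = URL_RE.sub("", raw)
    title = PROMO_RE.sub("", title)
    title = EMOJI_RE.sub("", title)
    title = title.replace("→", " ")
    title = re.sub(r"!{2,}", "!", title)   # collapse repeated exclamation marks
    title = re.sub(r"\s+", " ", title)     # collapse runs of whitespace
    return title.strip(" |-")


def clean_description(raw: str) -> str:
    """Remove raw URLs and drop lines that are empty once the URL is gone."""
    lines = []
    for line in raw.splitlines():
        line = re.sub(r"\s+", " ", URL_RE.sub("", line)).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


print(clean_title(
    "My REACTION to This Video!! (GONE WRONG) 😱 | Subscribe Now → https://bit.ly/xyz"
))  # prints: My REACTION to This Video!
```

Fixed rules like these are predictable and cheap, but they only catch patterns someone thought to list; a model-based rewrite handles unseen clutter at the cost of less deterministic output.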

Stage 3 — AI-Generated Cover Art with Gemini API

This is arguably the most impressive piece of the pipeline. Generating a custom podcast cover for every episode manually is time-consuming and requires design skills most developers don't have on tap. This skill solves that by calling the Gemini API with the cleaned title and episode description as context, then generating a cover image that's visually relevant to the episode content.

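# Conceptual flow for Gemini cover generation (uses the google-genai SDK)
from google import genai
from google.genai import types

def generate_podcast_cover(title: str, description: str) -> bytes:
    prompt = f"""
    Create a professional podcast cover image for an episode titled:
    "{title}"

    Episode summary: "{description}"

    Style: Clean, modern, podcast-ready. 1:1 aspect ratio.
    """

    # The client picks up the API key from the environment.
    client = genai.Client()
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp-image-generation",
        contents=prompt,
        # Image output has to be requested explicitly alongside text.
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    # The reply can mix text and image parts; return the first image found.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            return part.inline_data.data
    raise RuntimeError("Gemini returned no image for this prompt")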

The result is a unique, context-aware cover image for every episode — no Canva templates, no stock photo subscriptions, no manual work. For high-volume creators or aggregators publishing dozens of episodes per week, this alone represents hours of saved effort.
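Since the upload step works with file paths, the generated bytes have to land on disk with a sensible extension first. The following is a small sketch of that hand-off, assuming the model returns raw PNG, JPEG, or WebP bytes; `image_extension` and `save_cover` are hypothetical helper names, not part of the skill.

```python
from pathlib import Path

# Leading "magic" bytes for the image formats a cover is likely to use.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}


def image_extension(data: bytes) -> str:
    """Guess the file extension from the first bytes of the image data."""
    for magic, ext in SIGNATURES.items():
        if data.startswith(magic):
            return ext
    # WebP layout: "RIFF", a 4-byte size field, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    raise ValueError("data is not a recognized image format")


def save_cover(data: bytes, out_dir: Path, stem: str = "cover") -> Path:
    """Write the cover into out_dir and return the path for the upload step."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stem}{image_extension(data)}"
    path.write_bytes(data)
    return path
```

Checking the signature before writing also catches the case where the model answers with text only, so a bad response fails here rather than halfway through the browser upload.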

Stage 4 — Full Browser Automation with Playwright

The final stage is where everything comes together. Rather than relying on an official API (Xiaoyuzhou doesn't expose a public publishing API), this skill uses Playwright to automate the browser-based publishing workflow entirely. Playwright navigates the Xiaoyuzhou creator dashboard, fills in all the required fields, uploads the audio file and cover image, and submits the episode for publication — all without human intervention.

// Conceptual Playwright automation snippet.
// The selectors and auth.json are illustrative placeholders, not the real dashboard markup.
const { chromium } = require('playwright');

async function publishToPodcast(audioPath, coverPath, title, description) {
  const browser = await chromium.launch({ headless: true });
  try {
    // Reuse a saved login session; publishing presumably requires being signed in
    const context = await browser.newContext({ storageState: 'auth.json' });
    const page = await context.newPage();

    // Navigate to creator dashboard
    await page.goto('https://www.xiaoyuzhoufm.com/creator');

    // Upload audio file
    await page.setInputFiles('input[type="file"][accept*="audio"]', audioPath);

    // Fill episode metadata
    await page.fill('#episode-title', title);
    await page.fill('#episode-description', description);

    // Upload AI-generated cover
    await page.setInputFiles('input[type="file"][accept*="image"]', coverPath);

    // Publish and wait for confirmation
    await page.click('button[data-action="publish"]');
    await page.waitForSelector('.publish-success-indicator');
  } finally {
    // Close the browser even if a step above throws
    await browser.close();
  }
}

Browser automation via Playwright is a powerful fallback for any platform that lacks a developer API. It's the same technique used in enterprise RPA (Robotic Process Automation) tools — applied here in a lightweight, developer-friendly skill format.
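Browser-driven workflows tend to fail intermittently (slow page loads, selectors that appear late), so a retry wrapper around individual steps is a common companion to this technique. The sketch below is one generic way to do that with exponential backoff; `with_retries` is a hypothetical helper, and nothing in the source says the skill retries at all.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    step: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run step(); after a failure wait base_delay * 2**n seconds and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for n in range(attempts):
        try:
            return step()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)
    raise AssertionError("unreachable")
```

Only wrap steps that are safe to repeat, such as navigation or waiting for an element. Retrying the final publish click blindly could create duplicate episodes if the first attempt actually succeeded.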


Why This Architecture Matters for AI Automation Developers

This skill isn't just a neat party trick. It demonstrates several principles that are directly applicable to building production-grade automation workflows:

1. Skill Modularity Scales. By reusing the video-downloader skill instead of embedding download logic, the author kept the skill focused and composable. When you design your own skills, always ask: "Is this capability something another skill might need?" If yes, extract it.

2. AI Fills Gaps APIs Can't. Not every service has a clean API. Gemini handles the creative gap (cover generation), and Playwright handles the integration gap (no public API). Together, they automate a workflow that would otherwise have to be done by hand.

3. Content Normalization Is Non-Negotiable. Garbage in, garbage out. The pre-processing step (stripping URLs, reformatting titles) is what separates a polished automation from a fragile script that breaks on messy real-world input. Build content normalization into every pipeline that touches user-facing output.

4. One Input, Full Outcome. The user experience here is simple: paste a YouTube URL, get a published podcast episode. Every layer of complexity is hidden behind that single input. This is the gold standard for automation design: maximum outcome from minimum user effort.
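The four principles above can be summarized in a short orchestration sketch: one entry point takes the URL, and each stage is an interchangeable callable. The `Episode` type, the function name `youtube_to_podcast`, and the stage signatures are all assumptions made for illustration; the source does not show how the skill wires its stages together.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Episode:
    title: str
    description: str
    audio_path: str
    cover_path: str


def youtube_to_podcast(
    url: str,
    fetch: Callable[[str], Tuple[str, str, str]],      # url -> (audio path, raw title, raw description)
    normalize: Callable[[str, str], Tuple[str, str]],  # raw title/description -> cleaned pair
    make_cover: Callable[[str, str], str],             # cleaned pair -> cover image path
    publish: Callable[[Episode], str],                 # episode -> publish result
) -> str:
    """Run the four stages in order and return whatever publish() reports."""
    audio_path, raw_title, raw_description = fetch(url)          # Stage 1
    title, description = normalize(raw_title, raw_description)   # Stage 2
    cover_path = make_cover(title, description)                  # Stage 3
    episode = Episode(title, description, audio_path, cover_path)
    return publish(episode)                                      # Stage 4
```

Keeping the stages as parameters means the same entry point could target a different platform by swapping only `publish`, or a different image model by swapping only `make_cover`.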


Practical Use Cases

This pattern opens up a range of compelling real-world applications:

  • Content repurposing agencies converting client YouTube content to podcast feeds automatically
  • Newsletter operators turning weekly YouTube roundups into podcast episodes for audio-first audiences
  • Language learners archiving educational YouTube channels as personal podcast libraries
  • Indie creators maintaining simultaneous video and audio presences without doubling their workflow
  • News aggregators auto-publishing daily briefing videos as on-demand podcast episodes

Conclusion

What @wlzh has built here is a clear illustration of how modern AI automation skills can eliminate entire categories of manual work. By combining skill reuse, Gemini-powered content generation, and Playwright browser automation into a single OpenClaw skill, @wlzh has produced a YouTube-to-podcast pipeline that would otherwise take a series of manual steps or paid tooling.

For developers building on OpenClaw, the key takeaway is this: the most powerful skills aren't necessarily the most complex ones. They're the ones that identify exactly where human effort is being wasted, and eliminate it ruthlessly. A YouTube URL in, a published podcast out — that's the dream, and this skill delivers it.

If you're inspired to build similar automation workflows, start by identifying a platform you use regularly that lacks a developer API. Combine Playwright for the browser layer, Gemini for any creative or AI-intensive steps, and OpenClaw's modular skill system to keep everything composable. The stack is there — go build something worth bookmarking.


Source: @wlzh on X/Twitter
Tags: OpenClaw, Playwright, Gemini API, Podcast Automation, YouTube, Browser Automation, AI Tools, Xiaoyuzhou
