Building Product Ads with Remotion and Gemini

Case study on creating a 30-second product advertisement using Remotion, Gemini AI for planning, and multiple skills for media download and TTS generation.

February 23, 2026
7 min read
By ClawList Team

Building a 30-Second Product Ad with Remotion, Gemini AI, and OpenClaw Skills

A developer's overnight case study in AI-powered video production automation


Introduction: When AI Meets Video Production

What if you could go from a product brief to a polished 30-second advertisement — complete with narration, synced footage, and professional pacing — without touching a video editor? That's exactly what developer @servasyy_ai pulled off in a single overnight session, using a stack of OpenClaw skills, Remotion, and Google's Gemini model.

The result is a repeatable, developer-friendly workflow that collapses what traditionally takes a video production team days into an automated pipeline anyone can run. This post breaks down how it works, why each component was chosen, and how you can adapt this approach for your own product marketing needs.


The Stack: Why Remotion, Gemini, and OpenClaw Skills?

Before diving into the workflow, it's worth understanding why this particular combination works so well together.

Remotion is a React-based framework for creating videos programmatically. Instead of dragging clips on a timeline, you write components — which means your video logic is version-controlled, composable, and automatable. For developers, this is a natural fit.

Gemini (Google's multimodal AI model) was chosen specifically for its creative and aesthetic judgment. As @servasyy_ai noted, Gemini's sense of visual composition and narrative structure is particularly strong for planning video content. When you ask it to design a 30-second video script with scene breakdowns, timing cues, and visual direction, the output is structured and production-ready.

OpenClaw Skills act as the glue — modular, callable units that handle discrete tasks like media retrieval, text-to-speech generation, and video assembly. Think of them as the API layer of your automation pipeline.

The combination is powerful because each tool operates in its own lane:

  • Gemini thinks creatively
  • OpenClaw skills execute operationally
  • Remotion renders programmatically

The Workflow: Step-by-Step Breakdown

Here's the full pipeline @servasyy_ai ran, reconstructed from the shared process:

Step 1 — AI Creative Direction with Gemini

The first call goes to Gemini with a prompt describing the product, its target audience, and the intended emotional tone of the ad. Gemini returns a structured 30-second video plan, including:

  • Scene-by-scene breakdown (e.g., 0–5s: product reveal, 5–12s: feature highlight, etc.)
  • Suggested visual style per scene
  • Voiceover copy for each segment
  • Pacing notes and transition suggestions

This isn't just a script — it's a production blueprint. Having timing and copy co-generated means every downstream step can reference a single source of truth.

{
  "scenes": [
    { "start": 0, "end": 5, "copy": "Introducing the tool that changes everything.", "visual": "hero product shot, slow zoom" },
    { "start": 5, "end": 14, "copy": "Built for speed. Designed for precision.", "visual": "feature demo, clean UI screen recording" },
    { "start": 14, "end": 24, "copy": "Trusted by teams who move fast.", "visual": "social proof montage" },
    { "start": 24, "end": 30, "copy": "Start free today.", "visual": "CTA card with logo" }
  ]
}
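
A plan in this shape is easy to sanity-check before any media is fetched. Here is a minimal validation pass in TypeScript — a sketch, not part of the original workflow; the `Scene` interface simply mirrors the JSON above:

```typescript
// Shape of one scene in the Gemini-generated plan (mirrors the JSON above).
interface Scene {
  start: number; // seconds
  end: number;   // seconds
  copy: string;
  visual: string;
}

// Verify the plan covers the full runtime with no gaps or overlaps.
function validatePlan(scenes: Scene[], totalSeconds: number): void {
  let cursor = 0;
  for (const scene of scenes) {
    if (scene.start !== cursor) {
      throw new Error(`Gap or overlap at ${scene.start}s (expected ${cursor}s)`);
    }
    if (scene.end <= scene.start) {
      throw new Error(`Scene ending at ${scene.end}s has non-positive duration`);
    }
    cursor = scene.end;
  }
  if (cursor !== totalSeconds) {
    throw new Error(`Plan runs ${cursor}s, expected ${totalSeconds}s`);
  }
}
```

Catching a malformed plan here is far cheaper than discovering it after clips have been downloaded and audio rendered.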

Step 2 — Targeted Media Retrieval with media-download Skill

With the scene plan in hand, the next step is sourcing footage. This is where keyword optimization becomes critical. Raw scene descriptions like "hero product shot" are too vague for stock media APIs to return precise results.

The workflow adds a copy-optimization pass — either a secondary Gemini prompt or a lightweight transform — that converts scene descriptions into tight, search-optimized keyword strings:

"hero product shot, slow zoom"  →  "product reveal close-up technology minimal"
"social proof montage"          →  "team collaboration office smiling professionals"

The media-download OpenClaw skill then fires against these refined keywords, pulling matching clips from configured stock video sources. Each clip is tagged with its scene index so the assembly step knows what goes where.

Key insight: The quality of your media matches is almost entirely determined by keyword precision at this stage. Investing a prompt call in keyword refinement pays off significantly in clip relevance.
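
The "lightweight transform" variant can be as simple as a stop-word filter plus a few style hints. The sketch below is illustrative only — the stop-word list and style keywords are assumptions, and the original workflow may use a second Gemini call instead:

```typescript
// Illustrative lightweight transform: strip filler words from a scene's
// visual description, then pad with style keywords stock APIs match well.
const STOP_WORDS = new Set(["shot", "slow", "zoom", "montage", "with", "the"]);
const STYLE_HINTS = ["minimal", "clean", "professional"];

function toSearchKeywords(visual: string, maxTerms = 5): string {
  const terms = visual
    .toLowerCase()
    .split(/[^a-z]+/) // split on punctuation and whitespace
    .filter((t) => t.length > 2 && !STOP_WORDS.has(t));
  // Pad with generic style hints so short descriptions still return results.
  return [...terms, ...STYLE_HINTS].slice(0, maxTerms).join(" ");
}
```

A deterministic transform like this is cheaper and reproducible; the Gemini-prompt variant tends to produce richer keywords at the cost of an extra API call.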

Step 3 — Voiceover Generation with the TTS Skill

In parallel with media retrieval, the voiceover copy from Step 1 is passed to the TTS (Text-to-Speech) OpenClaw skill. This generates an audio file for each scene segment.

Running TTS in parallel with the media download step is a meaningful efficiency gain — neither depends on the other, so there's no reason to sequence them.

The TTS skill returns:

  • Audio files per scene (or a single concatenated track)
  • Duration metadata per segment — this is the critical output for Step 4

# Conceptual skill invocation
openclaw run tts-skill \
  --input scenes.json \
  --voice "professional-neutral" \
  --output ./audio/

The duration data drives everything downstream. If Gemini planned 5 seconds for a scene but the TTS audio runs 6.2 seconds, the video must adapt — not the audio.
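
Making the audio authoritative comes down to one conversion per scene: take the measured TTS duration and derive a frame count at the composition's frame rate. A sketch, assuming 30 fps and hypothetical field names:

```typescript
// Convert measured TTS durations (seconds) into Remotion frame timings.
// Audio is authoritative: if narration runs long, the scene stretches.
const FPS = 30; // assumed composition frame rate

interface TimedScene {
  audioSeconds: number; // reported by the TTS skill
  startFrame: number;
  durationFrames: number;
}

function layoutScenes(audioSeconds: number[]): TimedScene[] {
  let cursor = 0;
  return audioSeconds.map((s) => {
    const durationFrames = Math.round(s * FPS);
    const scene = { audioSeconds: s, startFrame: cursor, durationFrames };
    cursor += durationFrames;
    return scene;
  });
}
```

A 6.2-second narration at 30 fps becomes 186 frames, and every later scene's start shifts accordingly — no scene ever cuts off its own voiceover.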

Step 4 — Audio-Driven Clip Editing and Assembly in Remotion

This is where the pipeline becomes genuinely impressive. Rather than cutting clips to a fixed timeline, the Remotion composition is built dynamically around the actual TTS audio durations.

Each scene component in Remotion receives:

  • The downloaded clip for that scene
  • The TTS audio segment
  • The audio duration as the authoritative length for that scene

Remotion's <Sequence> and <Audio> primitives make this straightforward:

// Simplified Remotion composition
import React from "react";
import { AbsoluteFill, Audio, Sequence } from "remotion";

// `scenes` is the assembled plan from Steps 1–3; `VideoClip` and
// `CaptionOverlay` are custom components defined elsewhere in the project.
const AdVideo: React.FC = () => {
  return (
    <AbsoluteFill>
      {scenes.map((scene, i) => (
        <Sequence key={i} from={scene.startFrame} durationInFrames={scene.durationFrames}>
          <VideoClip src={scene.clipPath} />
          <Audio src={scene.audioPath} />
          <CaptionOverlay text={scene.copy} />
        </Sequence>
      ))}
    </AbsoluteFill>
  );
};

The total video duration is the sum of all TTS segment durations, computed before the Remotion render begins. This means the video runtime is always perfectly synced to the narration — no manual trimming, no silent gaps.


Why This Workflow Matters for Developers

This isn't just a clever demo. It points toward a broader shift in how development teams can approach content production:

  • Repeatable: The entire pipeline is code. Run it again with a different product brief and you get a different ad.
  • Scalable: Need 10 ad variations for A/B testing? Parameterize the Gemini prompt and run it in a loop.
  • Cost-effective: Replacing even a portion of a video production workflow with this pipeline can dramatically reduce turnaround time and cost.
  • Auditable: Because every step is logged and the inputs/outputs are structured data, you can debug and improve individual stages without rebuilding from scratch.

The use of parallel skill execution — running media download and TTS simultaneously — is also worth noting as a design principle. In any multi-step AI pipeline, identifying which steps are independent and running them concurrently is a straightforward optimization that compounds across longer workflows.
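
In code, that concurrency is a single `Promise.all` over the two independent skill calls. The wrappers below are hypothetical stand-ins — the real OpenClaw invocation interface may differ:

```typescript
// Hypothetical async wrappers around the two independent OpenClaw skills.
// Both take the Step 1 plan; neither depends on the other's output.
async function downloadMedia(plan: object): Promise<string[]> {
  return ["clip0.mp4", "clip1.mp4"]; // stand-in for real clip paths
}
async function generateTts(plan: object): Promise<number[]> {
  return [5.0, 6.2]; // stand-in for per-scene audio durations (seconds)
}

async function runPipeline(plan: object) {
  // Fire both skills concurrently; wall time ≈ max of the two, not the sum.
  const [clips, durations] = await Promise.all([
    downloadMedia(plan),
    generateTts(plan),
  ]);
  return { clips, durations };
}
```

If either skill fails, `Promise.all` rejects immediately, so error handling stays in one place instead of being duplicated per step.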


Conclusion: The Overnight Build as a Template

@servasyy_ai built this in one night, which says less about working hours and more about how far the tooling has come. A year ago, this workflow would have required custom integrations at every seam. Today, with OpenClaw skills handling the operational layer, Gemini handling creative direction, and Remotion handling programmatic rendering, the architecture is clean enough to prototype quickly and robust enough to ship.

If you're building developer tools, SaaS products, or any software with a marketing surface, this pipeline is worth stealing. The components are modular — swap Gemini for another model, replace the stock media skill with your own asset library, or extend the Remotion composition with brand-specific templates.

The blueprint is there. The stack is available. The only input you need is a product worth showing off.


Reference: Original workflow by @servasyy_ai. Published on ClawList.io — developer resources for AI automation and OpenClaw skills.

Tags

#remotion #gemini #video-generation #ai-workflow #automation