APOSTLE
Google's AI Creative Suite
Module 03: Veo 3 Video Generation

Prompt Structure, Image-to-Video, and Dialogue Scenes

Learn to generate cinematic video with synchronized dialogue, sound effects, and music using Veo 3.1, including image-to-video workflows and the Ingredients system for multi-shot consistency.

15 min
Intermediate
Lesson 03 of 5

Learning Objectives

By the end of this module, you will be able to:

  • Generate video clips with synchronized dialogue and sound effects using Veo 3.1
  • Use image-to-video generation with Nano Banana Pro keyframes as inputs
  • Control camera movement, duration, and aspect ratio
  • Apply the Ingredients system for character and style consistency across clips
  • Direct dialogue scenes with natural lip synchronization
  • Choose between Veo 3.1 and Veo 3 based on project requirements

What Veo 3.1 Brings to the Table

Veo 3.1 is Google DeepMind's flagship video generation model, released in late 2025 as an evolution of Veo 3. It produces some of the most cinematic-looking AI video available, with a defining feature that changed the industry: native synchronized audio.

Before Veo 3, mainstream AI video models generated silent clips. You'd generate footage, then manually add sound in post-production, trying to match audio to visuals after the fact. Veo 3 introduced synchronized audio generation — dialogue, sound effects, and ambient sound created WITH the video, in sync. Veo 3.1 refined this with better voice quality, more precise lip sync, and richer environmental audio.

Veo 3.1's key capabilities:

  • Native audio-visual generation — dialogue with accurate lip sync, environment-appropriate SFX, and ambient soundscapes
  • Up to 8 seconds at high quality per clip
  • 1080p native resolution (4K via upscaling)
  • Ingredients system — upload character and style references for multi-shot consistency
  • Image-to-video — animate Nano Banana Pro keyframes
  • Camera control via natural language descriptions
  • Physics understanding — realistic water, fabric, smoke, and object interactions

Accessing Veo 3.1

Via Google Flow (Most Features)

Google Flow at flow.google.com is Veo 3.1's native environment, designed for multi-scene video production. Available to Google One AI Premium subscribers.

Flow provides a visual interface for managing multiple scenes, uploading ingredients, previewing clips, and assembling sequences — it's the closest thing to an AI video editing suite.

Via Gemini API (Programmatic)

from google import genai
from google.genai import types
import time

client = genai.Client(api_key="YOUR_KEY")

# Text-to-video generation
operation = client.models.generate_videos(
    model="veo-3.1",  # model IDs change; check the docs for the current Veo 3.1 ID
    prompt="""A slow dolly shot through a sunlit greenhouse filled with
    tropical plants. Warm humid atmosphere, water droplets on leaves
    catching light. Sound of gentle rainfall on glass roof, distant
    tropical birdsong. Cinematic, dreamy, 35mm film look.""",
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=6,
        number_of_videos=2,
        enhance_prompt=True,    # Let model expand your prompt
    )
)

# Wait for generation (async operation)
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download generated videos
for i, generated in enumerate(operation.response.generated_videos):
    client.files.download(file=generated.video)
    generated.video.save(f"clip_{i}.mp4")
    print(f"Saved clip_{i}.mp4")

Via Google AI Studio (Experimentation)

Available at aistudio.google.com under the "Generate Video" section. Good for testing prompts before committing to API calls.


The Four-Part Prompt Structure

Google's official guidance recommends structuring Veo 3.1 prompts with four elements, each adding a layer of direction:

[1. Camera & Framing] + [2. Scene & Environment] +
[3. Subject & Action] + [4. Atmosphere, Audio & Style]

Example: Café Dialogue Scene

1. CAMERA: A medium close-up, slowly pushing in on two people
   at a café table.

2. SCENE: A cozy Parisian café with warm wooden paneling, soft
   afternoon light filtering through lace curtains, espresso
   cups and a small vase of flowers on the marble table.

3. SUBJECT: A woman in her 30s with dark hair leans forward,
   speaking animatedly. She says: "I think we should go for it.
   What's the worst that can happen?" The man across from her
   smiles and replies: "Famous last words."

4. ATMOSPHERE: Warm, intimate lighting. Background café sounds —
   quiet conversation, clinking cups, soft jazz from unseen
   speakers. The tone is hopeful and slightly playful. Shot
   on 50mm lens, shallow depth of field, film grain.

The model processes all four layers to generate a cohesive clip with accurate lip sync on the dialogue, appropriate background audio, and the directed camera movement.
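The four layers above can be assembled programmatically before a prompt is sent to the API. A minimal sketch (the helper name and example text are illustrative, not part of the Veo API):

```python
def build_veo_prompt(camera: str, scene: str, subject: str, atmosphere: str) -> str:
    """Join the four prompt layers into one text prompt for Veo."""
    return " ".join(part.strip() for part in (camera, scene, subject, atmosphere))

prompt = build_veo_prompt(
    camera="A medium close-up, slowly pushing in on two people at a café table.",
    scene="A cozy Parisian café with warm afternoon light through lace curtains.",
    subject='A woman leans forward and says: "I think we should go for it."',
    atmosphere="Warm, intimate lighting. Soft jazz in the background. 50mm lens.",
)
```

Keeping the layers as separate arguments makes it easy to vary one (say, the camera move) while holding the rest of the shot constant across regenerations.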

Camera Movement Keywords That Work

Veo 3.1 responds to natural language camera direction. These are the movements that produce the most reliable results:

Reliable camera movements:
├── "Slow dolly in/out"         — smooth forward/back
├── "Tracking shot left/right"  — lateral following
├── "Crane up/down"             — vertical sweeping rise/fall
├── "Static locked-off shot"    — no movement (explicit!)
├── "Handheld with subtle movement" — organic, documentary feel
├── "Slow orbit around subject" — circular movement
├── "Push in to close-up"       — transition from wide to tight
├── "Pull back to reveal"       — start tight, reveal environment
└── "Slow zoom" (in or out)     — lens zoom vs physical movement

Less reliable (be cautious):
├── "Whip pan"                  — can cause artifacts
├── "Rack focus"                — inconsistent execution
└── "Steadicam following"       — sometimes misinterpreted

Pro tip: If you want NO camera movement, you MUST state it explicitly: "Static camera, tripod-locked, no movement." Otherwise Veo will add subtle drift by default.
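The pro tip above can be enforced in code: append the explicit static-camera instruction whenever a prompt directs no movement. A hedged sketch (the keyword list is illustrative, and the matching is deliberately naive):

```python
# Naive substring matching; "pan" would also match "paneling", so treat this
# as a sketch rather than a robust parser.
MOVEMENT_KEYWORDS = ("dolly", "tracking", "crane", "orbit",
                     "push in", "pull back", "zoom", "pan", "handheld")

def lock_camera(prompt: str) -> str:
    """Append an explicit static-camera instruction unless the prompt
    already directs a camera movement."""
    lowered = prompt.lower()
    if any(keyword in lowered for keyword in MOVEMENT_KEYWORDS):
        return prompt
    return prompt.rstrip(". ") + ". Static camera, tripod-locked, no movement."
```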


Image-to-Video: Animating Nano Banana Pro Keyframes

This is where the Google stack's interconnection shines. A Nano Banana Pro image becomes the first frame of a Veo 3.1 video.

Workflow

Step 1: Generate keyframe in Nano Banana Pro
        (character + environment + composition)
              ↓
Step 2: Feed keyframe to Veo 3.1 as start image
        + motion prompt describing what happens
              ↓
Step 3: Veo generates video starting from your exact frame

API Example

# Reuses the client and types imports from the earlier example.

# Your prepared keyframe from Nano Banana Pro
keyframe = types.Image.from_file(location="kitchen-scene-keyframe.png")

operation = client.models.generate_videos(
    model="veo-3.1",
    image=keyframe,
    prompt="""Starting from this exact frame: the woman slowly reaches
    for the coffee mug, lifting it gently. She brings it to her lips
    and takes a sip, eyes closing momentarily in satisfaction. The
    steam from the mug catches the morning light. Subtle camera push
    in. Sound of ceramic on marble as she lifts the mug, quiet morning
    ambiance, birdsong outside the window.""",
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=5,
        number_of_videos=3,
    )
)

The keyframe locks down the visual foundation — character appearance, environment, lighting, composition. Veo animates FROM that foundation rather than inventing one from scratch. This is dramatically more controllable than pure text-to-video.

Start Frame + End Frame (via Flow)

In Google Flow, you can provide BOTH a start frame and an end frame. Veo interpolates the motion between them.

This is incredibly powerful for specific transitions:

  • Start: Woman facing the window, looking out
  • End: Same woman, turned toward camera, smiling
  • Motion: Natural turn with emotional transition

The model fills the gap with physically plausible motion while maintaining the character from both reference frames.
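At the API level, the google-genai SDK exposes a `last_frame` field in its video-generation config that corresponds to this workflow, though support may vary by Veo model version. A hedged sketch (the function name and file paths are illustrative):

```python
def interpolate_between_frames(start_path: str, end_path: str, motion_prompt: str):
    """Request a clip that interpolates between a start and an end frame.
    `last_frame` support may vary by Veo model version; this is a sketch."""
    # Imported inside the function so the sketch reads without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_KEY")
    return client.models.generate_videos(
        model="veo-3.1",  # use the current Veo 3.1 model ID from the docs
        image=types.Image.from_file(location=start_path),
        prompt=motion_prompt,
        config=types.GenerateVideosConfig(
            last_frame=types.Image.from_file(location=end_path),
            aspect_ratio="16:9",
        ),
    )
```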


Directing Dialogue Scenes

Veo 3.1's native dialogue generation is its most distinctive capability. Here's how to maximize quality.

Writing Dialogue Prompts

Include spoken lines in quotation marks within the prompt:

A woman in a bright kitchen speaks directly to camera in a warm,
conversational tone. She holds up a jar of face cream and says:
"I've been using this for three weeks now, and honestly? My skin
has never felt this good." She smiles genuinely, glancing at the
product. The kitchen is bright and airy with natural light. Her
voice is clear, with light kitchen ambience in the background.

Dialogue Quality Tips

  • Keep dialogue under 15 words per clip for best clarity and lip sync accuracy
  • One speaker per clip — multi-person dialogue in a single generation can cause lip sync confusion
  • Specify the vocal tone — "warm and conversational," "professional and confident," "excited and energetic"
  • Include pauses — "She pauses, then says..." produces more natural delivery
  • For precise voice matching, pre-generate the dialogue in ElevenLabs and use it as an audio reference in Flow (the model will lip-sync to your provided audio track)
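A quick way to apply the word-count tip is to count the words inside quoted dialogue before submitting a prompt. A small sketch (the regex and limit are illustrative):

```python
import re

MAX_DIALOGUE_WORDS = 15  # the rule of thumb from the tips above

def dialogue_word_count(prompt: str) -> int:
    """Count words inside double-quoted dialogue lines in a prompt."""
    quoted = re.findall(r'"([^"]+)"', prompt)
    return sum(len(line.split()) for line in quoted)

line = 'She speaks to camera and says: "This is exactly how every morning should start."'
count = dialogue_word_count(line)  # 8 words, safely under the limit
```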

When to Use Native Dialogue vs. Post-Synced Audio

Scenario                                 Best Approach            Why
UGC-style talking head                   Native Veo audio         Authentic, conversational, matches casual visual style
Brand spokesperson with specific voice   ElevenLabs + post-sync   Voice brand consistency requires exact control
Short testimonial (under 10 words)       Native Veo audio         Quick, natural, sufficient quality
Long narration or voiceover              ElevenLabs               Native audio quality degrades over longer durations
Background conversation / crowd          Native Veo audio         Well suited to ambient speech
Multi-language adaptation                ElevenLabs               Same video, different language tracks
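For pipeline scripts, the decision table can be encoded as a simple lookup. A sketch with illustrative shorthand keys:

```python
# Keys are shorthand for the table rows; values name the recommended approach.
AUDIO_APPROACH = {
    "ugc_talking_head": "native",
    "brand_spokesperson": "elevenlabs_post_sync",
    "short_testimonial": "native",
    "long_narration": "elevenlabs",
    "background_conversation": "native",
    "multi_language": "elevenlabs",
}

def audio_approach(scenario: str) -> str:
    """Look up the recommended audio approach, defaulting to native Veo audio."""
    return AUDIO_APPROACH.get(scenario, "native")
```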

The Ingredients System: Multi-Shot Consistency

Veo 3.1's Ingredients system (accessed primarily through Google Flow) lets you upload reference materials that persist across multiple video generations — the key to maintaining visual consistency in multi-shot projects.

How Ingredients Work

Upload to Flow's Ingredients panel:
├── Character references (headshots, full body)
├── Style references (mood boards, color palettes)
├── Environment references (location photos)
├── Object references (products, props)
└── Audio references (voice samples, music samples)

Once uploaded, these ingredients are available for every scene in your Flow project. When you generate a new clip, you can drag specific ingredients into the generation prompt. Veo uses them to anchor character identity, visual style, and environmental continuity.

Character Ingredient Best Practices

  • Upload 3-5 images of the character from different angles
  • Include at least one neutral-expression front-facing headshot
  • Ensure consistent lighting across reference photos (avoid mixed indoor/outdoor)
  • Include a full-body reference if body proportions matter
  • For wardrobe changes, upload separate outfit references

Style Ingredient Best Practices

  • Upload 2-3 images that define your color palette and mood
  • Include one image that represents your ideal lighting setup
  • Add a film stock or color grading reference if you have a specific look
  • The model blends style ingredients, so consistency across references matters
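These counts can be checked locally before uploading to Flow. An illustrative pre-flight helper (not part of any Google API; the manifest shape is an assumption):

```python
def check_ingredients(manifest: dict) -> list:
    """Return warnings for reference sets that fall outside the
    recommended counts. Category names mirror the practices above."""
    warnings = []
    characters = manifest.get("character", [])
    if not 3 <= len(characters) <= 5:
        warnings.append("Upload 3-5 character references from different angles.")
    styles = manifest.get("style", [])
    if not 2 <= len(styles) <= 3:
        warnings.append("Use 2-3 style references so the blended look stays consistent.")
    return warnings
```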

Practical Exercise

Exercise: Create a 3-Shot Nano Banana Pro → Veo 3.1 Sequence

Build a simple 3-shot sequence using the complete Google stack:

Shot 1 — Establishing:

  1. In Nano Banana Pro: Generate a wide shot of a peaceful outdoor café terrace at morning time
  2. In Veo 3.1: Animate it with a slow push-in camera movement, ambient sounds of morning birds and distant traffic (5 seconds)

Shot 2 — Character Introduction:

  1. In Nano Banana Pro: Compose a medium shot of a character sitting at a table in the same café (upload the establishing shot as environment reference)
  2. In Veo 3.1: Animate — the character picks up a coffee cup and takes a sip, gentle smile (4 seconds)

Shot 3 — Dialogue:

  1. In Veo 3.1: Generate the same character speaking to camera: "This is exactly how every morning should start." Warm, conversational (4 seconds)

Evaluate the three clips for character consistency, visual continuity, and audio quality. Where does consistency hold? Where does it break?


Key Takeaways

  • Veo 3.1 generates video with native synchronized audio — dialogue, SFX, and ambient sound, all in sync with visuals.
  • Use the four-part prompt structure: Camera → Scene → Subject/Action → Atmosphere/Audio/Style.
  • Image-to-video with Nano Banana Pro keyframes gives you precise control over the starting composition, character, and environment.
  • Dialogue works best under 15 words per clip, one speaker at a time, with specified vocal tone.
  • The Ingredients system in Flow maintains character and style consistency across multi-shot projects.
  • Native audio is best for ambient/SFX and short dialogue. Use ElevenLabs for brand voices and long narration.
