
Kling AI Image-to-Video Prompts: 15 Copy-Paste Templates That Ship

Video Ads · 10 min read · Updated Apr 12, 2026

Image-to-video is the most reliable Kling AI workflow for production ads. Below are 15 tested prompt templates, plus Kling 3.0 multi-shot sequences, action-beat timing, and the reference-image rules that cut rerolls by roughly 70 percent.

Kling AI image-to-video prompt example showing reference frame transformed into animated video

Why Image-To-Video Is the Production Default

After generating over 3,000 Kling AI clips for paid ad campaigns, I can say this with confidence: image-to-video is the workflow that ships. Text-to-video is useful for exploration and b-roll, but when a client needs a specific actor holding a specific product in a specific setting, image-to-video cuts rerolls by roughly 70 percent.

The reason is simple. You hand the model the look. The model only has to add motion.

Wyzowl's 2024 State of Video Marketing report found that 91 percent of businesses use video as a marketing tool and 88 percent of marketers say video gives them positive ROI. When your production depends on consistent, shippable output, image-to-video is the Kling AI workflow that delivers.

This guide covers the prompt structure, 15 copy-paste templates, Kling 3.0 multi-shot image-conditioned sequences, and the reference image rules I have learned from production.

When To Use Image-To-Video Instead Of Text-To-Video

Use image-to-video when:

  • Character continuity matters. UGC ads with the same actor across multiple clips
  • Product identity matters. You need a specific real product, not an AI approximation
  • Composition matters. You have a specific frame in mind
  • Brand palette matters. You want lighting and color to match existing assets
  • Budget matters. Fewer rerolls means fewer credits burned

Use text-to-video when:

  • You want creative exploration and do not care about exact visual match
  • There is no specific subject identity to preserve
  • You are generating abstract b-roll or stock footage

For most production ad work, image-to-video wins. For deeper prompt fundamentals, see the Kling AI prompt guide.

The Image-To-Video Prompt Structure

Image-to-video prompts are structurally different from text-to-video. The reference image already encodes appearance, lighting, and composition. Your prompt handles only:

  1. Camera move. One move per clip. Push-in, dolly, tracking, locked-off, or handheld drift.
  2. Action beats. Counted moments with timestamps. 0-1.5s: taps the lid. 1.5-3s: turns the jar.
  3. Ambient motion. What moves in the environment. Curtains, steam, hair, fabric.
  4. Dialogue (optional). One short line if the subject speaks.
  5. Negative prompt. Glitches to suppress.

Do NOT describe the subject's appearance, clothing, hair color, lighting setup, or color palette. The reference image handles all of that. Repeating it in text creates competing instructions and causes drift.
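The five-part structure above is easy to keep honest with a small helper. This is a sketch in Python, not a Kling API: the class and field names are our own, and render() simply assembles the string you would paste into the prompt box. The negative prompt is kept in its own field, mirroring tools that take negatives as a separate input.

```python
from dataclasses import dataclass, field

@dataclass
class I2VPrompt:
    """One image-to-video clip: camera move, action beats, ambient
    motion, optional dialogue, and a separate negative prompt."""
    camera: str                                # one move per clip
    beats: list = field(default_factory=list)  # timestamped action beats
    ambient: str = ""                          # environmental motion
    dialogue: str = ""                         # one short line, optional
    negative: str = ""                         # glitches to suppress

    def render(self) -> str:
        # Assemble only motion instructions -- never appearance details.
        parts = [self.camera, *self.beats]
        if self.ambient:
            parts.append(self.ambient)
        if self.dialogue:
            parts.append(f'Says: "{self.dialogue}".')
        return " ".join(parts)

ugc = I2VPrompt(
    camera="Handheld vertical drift, slight natural motion.",
    beats=["0-1.5s: taps the product lid with index finger.",
           "1.5-3s: turns the product to show texture."],
    dialogue="this one actually works",
    negative="jittery eyes, frozen lips, warping fingers",
)
print(ugc.render())
print("Negative:", ugc.negative)
```

Notice what render() leaves out: there is no slot for appearance, clothing, or lighting, which makes the "do not re-describe the subject" rule structural rather than a matter of discipline.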

10 Image-To-Video Prompt Templates

1. UGC selfie with product.

Handheld vertical drift, slight natural motion. The subject from the reference image. 0-1.5s: taps the product lid with index finger. 1.5-3s: turns the product to show texture. 3-5s: looks at camera, says "this one actually works". Negative: jittery eyes, frozen lips, warping fingers.

2. Product hero rotation.

Locked-off, slow 35 degree rotation 0-5s. The product from the reference image, ambient light play across surface. Subtle shadow shift only. Negative: melted edges, mirrored text, deformed packaging.

3. Real estate interior drift.

Cinematic slow drift right 0-5s. The room from the reference image, gentle curtain motion and dust particles in light beam. No people. Negative: warping walls, floating furniture, bending doorframes.

4. Founder editorial.

Clean editorial 50mm, slow push-in over 5 seconds. The person from the reference image, in the same office setting. 0-2s: leans slightly forward. 2-4s: gestures with right hand. 4-5s: pauses, direct eye contact. Dialogue: "This is the system we built." Negative: jittery eyes, frozen lips, double face.

5. Character turn with hair motion.

Locked composition, slight handheld drift. The character from the reference image rotates 30 degrees clockwise over 5 seconds. Hair and fabric move with natural physics. Negative: warping body, jittery eyes, anatomy drift.

6. E-commerce lifestyle context.

Slow dolly-in, 50mm equivalent. The product from the reference image placed on a kitchen counter with morning light. Steam from a coffee cup drifts behind. 0-5s gentle approach. Negative: warping product, mirrored text, floating objects.

7. Fitness action beat.

Handheld vertical, slight shake. The athlete from the reference image in the same gym setting. 0-1.5s: completes a rep. 1.5-3s: stands, catches breath. 3-5s: looks at camera, says "try this". Negative: warping limbs, frozen lips, jittery motion.

8. Fashion walk cycle.

Tracking shot at walking pace, slight handheld. The model from the reference image walking toward camera on the same street. 0-5s continuous walk, fabric and hair moving naturally. Negative: frozen legs, warping body, jittery background.

9. Food plating macro.

Macro close-up, locked-off with shallow depth shift. The dish from the reference image. 0-2s: steam rises from surface. 2-5s: slow push-in revealing texture. Negative: melted food, warping plate, distortion.

10. Testimonial two-shot.

Documentary 35mm, slight handheld drift. The two people from the reference image in the same setting. Person A (0-2s): gestures while speaking. Person B (2-4s): nods. Person A (4-5s): turns to camera. Negative: frozen lips, jittery eyes, face swap.

Kling 3.0 Multi-Shot Image-Conditioned Sequences

Kling 3.0 introduced multi-shot mode with up to 6 shots and 15 seconds total. Combined with image-to-video, this is the most powerful workflow for building complete ad sequences with a consistent character.

Here is a 4-shot UGC ad sequence using a single reference image.

Master Prompt:

Vertical 9:16 UGC ad, handheld feel, natural daylight. A woman in her late 20s (from reference image) in a sunlit kitchen. Product: glass jar of face cream. Warm cream and walnut palette throughout.

Multi-Shot Prompt 1 (0-3s) - Hook:

Close-up handheld, slight drift. She picks up the jar from the counter and holds it to camera. Expression: curious, slightly excited. Natural kitchen ambient sounds.

Multi-Shot Prompt 2 (3-6s) - Demo:

Medium close-up, locked composition. She unscrews the lid, dips a finger in, shows the texture to camera. Expression: impressed.

Multi-Shot Prompt 3 (6-9s) - Application:

Close-up of her face. She applies a small amount to her cheek, blends with fingertips. Soft smile. Natural bathroom mirror feel.

Multi-Shot Prompt 4 (9-12s) - CTA:

Medium shot, slight push-in. She holds the jar next to her face, direct eye contact. Says: "Link in bio. Trust me." Small smile at the end.

This produces a complete 12-second UGC ad from a single reference image with consistent character identity across all four shots.
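Before spending credits, it is worth totalling a multi-shot plan against the limits stated above (up to 6 shots, 15 seconds). A minimal sketch; validate_sequence and the (duration, prompt) tuple format are our own convention, not a Kling API.

```python
# Limits from the section above: up to 6 shots, 15 seconds total.
MAX_SHOTS, MAX_SECONDS = 6, 15

def validate_sequence(shots):
    """shots: list of (duration_in_seconds, shot_prompt) tuples.
    Returns the total duration, or raises before any credits are spent."""
    if len(shots) > MAX_SHOTS:
        raise ValueError(f"{len(shots)} shots exceeds the {MAX_SHOTS}-shot limit")
    total = sum(duration for duration, _ in shots)
    if total > MAX_SECONDS:
        raise ValueError(f"{total}s total exceeds the {MAX_SECONDS}s cap")
    return total

# The four-shot UGC ad above: hook, demo, application, CTA.
ugc_ad = [
    (3, "Close-up handheld. She picks up the jar and holds it to camera."),
    (3, "Medium close-up. She unscrews the lid and shows the texture."),
    (3, "Close-up of her face. She applies a small amount to her cheek."),
    (3, "Medium shot, slight push-in. Direct eye contact, CTA line."),
]
print(validate_sequence(ugc_ad), "seconds total")  # 12 seconds total
```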

Reference Image Quality Rules

The quality of the reference image determines the ceiling of the output. I have learned these rules from thousands of generations:

  • Minimum 1024px on the long edge. Higher is always better. Below 720px you get soft, muddy output.
  • Match aspect ratios. 9:16 reference for vertical output, 16:9 for landscape. Mismatched ratios force awkward cropping.
  • Clear subject separation from background. Busy backgrounds confuse the model about what to animate.
  • Consistent, natural lighting. Harsh shadows or mixed color temperatures create unpredictable animation.
  • No watermarks or text overlays. These animate in bizarre ways: warping, stretching, dissolving.
  • No heavy filters or extreme color grading. The model interprets these literally and they amplify during animation.
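The resolution and aspect-ratio rules above can be enforced with a pre-flight check on the reference image's pixel dimensions. A minimal sketch: check_reference is our own helper, and the thresholds are this article's guidance, not limits Kling enforces.

```python
def check_reference(width, height, target_ratio=9 / 16,
                    min_long_edge=1024, tol=0.02):
    """Flag reference-image problems before uploading. Thresholds
    follow the rules above (editorial guidance, not Kling limits)."""
    issues = []
    if max(width, height) < min_long_edge:
        issues.append(f"long edge {max(width, height)}px is below "
                      f"{min_long_edge}px: expect soft, muddy output")
    ratio = width / height
    if abs(ratio - target_ratio) > tol:
        issues.append(f"aspect ratio {ratio:.3f} does not match target "
                      f"{target_ratio:.3f}: expect awkward cropping")
    return issues

print(check_reference(1080, 1920))        # clean 9:16 vertical reference: []
print(check_reference(640, 360, 16 / 9))  # too small: resolution warning
```

Watermarks, filters, and background clutter still need a human eye; this only catches the two failures that are purely numeric.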

Statistics That Matter

Image-to-video is not just technically better. It is economically better.

  • HubSpot's 2024 marketing report found that short-form video has the highest ROI of any media format. Image-to-video with locked character references is the fastest path to producing these at scale.
  • Bazaarvoice research shows UGC-style content generates 29 percent higher web conversions than brand-produced content. AI-generated UGC at scale requires character consistency, which image-to-video delivers.
  • Our internal data shows image-to-video reduces per-clip production time from 25 minutes (text-to-video with multiple rerolls) to 8 minutes (image-to-video, typically first or second attempt).

5 More Image-To-Video Templates: Advanced Use Cases

11. Before-and-after skin transformation.

Locked composition, soft bathroom daylight. The person from the reference image. 0-2s: touches left cheek with fingertips. 2-4s: turns face 30 degrees to show the other side. 4-5s: looks at camera with satisfied expression. Negative: warping skin, jittery eyes, frozen lips, anatomy drift.

12. Jewellery try-on.

Macro close-up, slight handheld drift. The hand from the reference image wearing the ring. 0-2s: fingers spread slightly. 2-4s: hand rotates to catch light on the stone. 4-5s: fingers close, hand rests. Negative: warping fingers, melted ring, deformed stone.

13. Pet product reaction.

Handheld vertical, natural living room daylight. The dog from the reference image sniffing a treat bag. 0-2s: sniffs the bag. 2-4s: tilts head. 4-5s: tail wags in background. Negative: warping snout, extra legs, frozen tail.

14. Architectural exterior golden hour.

Cinematic slow pan left 0-5s. The building from the reference image at golden hour. Warm light rakes across the facade. Birds pass in the distance. Negative: warping structure, jittery sky, bending lines, distorted windows.

15. Tech product unboxing.

Top-down medium shot, slight push-in. Hands from the reference image lifting a device from a white box. 0-2s: lid lifts. 2-4s: device rises from packaging. 4-5s: device placed on surface, packaging pushed aside. Negative: warping hands, floating product, deformed packaging.

The Image-To-Video Workflow in Practice

Here is the exact workflow I use for a typical UGC ad campaign:

  1. Generate the reference image. Use Flux, Midjourney, or a real photo. Crop to the output aspect ratio (9:16 for vertical). Ensure resolution is at least 1024px on the long edge.
  2. Write the motion-only prompt. Camera move, 2-3 action beats with timestamps, one dialogue line, negative prompt. No appearance details.
  3. Generate on Kling 2.6 Pro for testing. Cheaper per generation. Dial in the timing and motion.
  4. Switch to Kling 3.0 for hero output. Better quality, native audio, character consistency across multi-shot.
  5. Batch process variants. Same reference image, different scripts and action beats. One reference can produce 20-30 unique clips.

This workflow produces 15-20 shippable clips per hour at a cost of $0.30-$0.80 per clip. Compare that to hiring a human UGC creator at $150-$500 per video.
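Step 5, batching variants, is where the economics come from. Here is a sketch of the fan-out: the hooks, CTA lines, and naming scheme are illustrative, and actually submitting each prompt to Kling is left to whatever tool you use.

```python
import itertools

# Illustrative hooks and CTA lines -- swap in your own scripts.
hooks = ["taps the lid", "holds the jar to camera", "points at the label"]
ctas = ["Link in bio.", "Try it for a week.", "This one actually works."]

def build_variants(reference_path):
    """Yield (clip_name, motion_prompt) pairs: one reference image,
    every hook x CTA combination."""
    combos = itertools.product(hooks, ctas)
    for i, (hook, cta) in enumerate(combos, start=1):
        prompt = (f"Handheld vertical drift. 0-2s: {hook}. "
                  f'3-5s: looks at camera, says "{cta}" '
                  f"Negative: jittery eyes, frozen lips, warping fingers.")
        yield f"{reference_path}:variant-{i:02d}", prompt

variants = list(build_variants("jar_reference.png"))
print(len(variants), "variants from one reference image")  # 9 variants
```

Three hooks by three CTA lines already yields nine distinct clips from a single reference; adding a fourth list (settings, camera moves) multiplies again, which is how one reference produces the 20-30 clips mentioned above.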

Common Image-To-Video Mistakes

Re-describing the subject. If your reference shows a woman in a navy shirt, do not write "a woman in a navy shirt" in the prompt. The model gets confused by the redundancy.

Two camera moves. One move per clip. If you want a push-in and then a pan, that is two clips stitched in post, or two Kling 3.0 multi-shot segments.

Ignoring aspect ratio mismatch. A 16:9 reference image forced into 9:16 output produces cropped, awkward compositions.

Using low-resolution reference images. Anything under 720px produces noticeably soft output. Upscale your reference images before uploading.

Over-specifying lighting. If the reference image has warm window light, do not write "cool fluorescent lighting" in the prompt. The model tries to reconcile both and the result looks wrong.

Forgetting to include ambient motion. Static backgrounds read as frozen AI output. Always add one ambient element: curtain drift, steam, dust, hair movement.

For more image-to-video techniques, see Kling AI tips and tricks. For the complete prompt anatomy, start with the Kling AI prompt guide. To see these prompts in action across ad platforms, check Kling AI for TikTok ads and Kling AI for Meta ads.

Inside VIDEOAI.ME the image-to-video workflow is built into every generation mode. Upload a reference image, write a motion-only prompt, and ship.

Paul Grisel

Paul Grisel is the founder of VIDEOAI.ME, dedicated to empowering creators and entrepreneurs with innovative AI-powered video solutions.

@grsl_fr
