2026-05-19
AI Video
Tutorial
Character Consistency

How to Stop AI Video Characters From Changing Between Shots (Reference Image Guide)

Generate the same character five times in a row with text-to-video and you'll get five different people. Here's the reference image technique that actually fixes it, and why most AI video tools still get it wrong.

You have a script. You generate clip one with Veo. The character is a woman in her thirties, brown hair, navy blazer. You generate clip two with the same prompt. Now she has black hair, a different jacket, a slightly different face. By clip five you have five different people.

This is character drift. It is the single most visible tell of AI-generated video, and it is the reason a lot of perfectly good content gets dismissed as slop in May 2026.

The fix is not a better prompt. It is reference images.

Why AI video characters change between shots

Every text-to-video model in 2026 generates each clip independently. There is no memory of "the character we just made in the last clip." When you say "professional woman in a navy blazer," the model interprets that fresh every time, and there are millions of professional women in navy blazers in its training data.

Even within the same generation session, the model has no anchor. Veo 3.1, Kling 3.0, and Sora 2 (before its shutdown) all face the same problem when you generate sequential shots from text alone.

Some models added "multi-shot" features in early 2026 that try to maintain consistency within a single 15-second output. Kling 3.0 was the first to do this well. But even multi-shot only helps within one generation. The moment you start a new clip, drift is back.

The only reliable fix is to give the model a visual anchor. That anchor is a reference image.

How reference images actually work

A reference image tells the model "use this face, this body, this style as the starting point for whatever I describe next." Instead of inventing the character from text, the model adapts the reference to fit your prompt.

You can reference different things:

  • Character reference. A specific person, face, or avatar.
  • Environment reference. A location, room, neighborhood, or backdrop.
  • Style reference. A look, palette, lighting mood, or art direction.
  • Object or product reference. A specific item, logo, or piece of branding.

Most consumer AI video tools support character references in 2026. Few support all four. Compledio, Runway, and Veo's image-to-video flow handle multiple reference types simultaneously.

How to build a reference set that actually works

A reference image is not just any photo. A bad reference produces inconsistent output even when the technique is applied correctly.

What makes a good character reference

  • Front-facing, neutral expression. Three-quarter angles introduce ambiguity. The model has to guess what the rest of the face looks like.
  • Even lighting. Heavy shadows confuse the feature extractor. Aim for soft, frontal light or daylight.
  • Visible features. Hair fully shown, no hat or sunglasses unless that's part of the character permanently. The model locks onto whatever it can see.
  • 2K or higher resolution. Low-res references produce low-fidelity matches.
  • No other people in frame. The model can pick the wrong subject.

The fastest way to make a clean character reference is to generate one with an image model (Nano Banana Pro, Imagen, Midjourney) using a clean studio prompt, then use that as your reference for video generation.

What makes a good environment reference

  • A wide-angle shot showing the space, not a tight detail.
  • Consistent lighting (don't mix golden hour and overhead fluorescent in one set).
  • Empty of people if you want to drop your character in.

What makes a good style reference

  • Consistent palette across the reference (don't mix vibrant and muted).
  • Visible texture and grain quality.
  • Clear lighting direction.

The workflow inside Compledio

Here is how reference images integrate into the audio-to-video pipeline.

1. Prepare your references

Before you upload audio, gather your references. For a typical branded video you would want:

  • One character reference (the host or recurring person)
  • One or two environment references (office, location, etc.)
  • One style reference (mood, lighting, palette)

Store them somewhere accessible. JPEG or PNG, 2K resolution if possible.

2. Upload audio and references together

Drop your audio in. Then add the reference images in the project settings, tagging each one (character, environment, style).

Compledio applies the references during prompt generation. Every B-roll prompt now includes "character resembles reference 1, environment matches reference 2" instructions.
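Compledio's internal prompt format isn't published, but the tagging step can be sketched roughly like this. The tag names, phrasing, and function are invented for illustration only:

```python
# Illustrative sketch of reference-tagged prompt assembly.
# Tags and wording are invented for the example, not Compledio's real format.

REF_PHRASES = {
    "character": "character resembles reference {n}",
    "environment": "environment matches reference {n}",
    "style": "style follows reference {n}",
}

def build_broll_prompt(base_prompt: str, references: list) -> str:
    """Append one anchoring clause per tagged reference to a B-roll prompt."""
    clauses = [REF_PHRASES[tag].format(n=n) for tag, n in references]
    return base_prompt + ". " + ", ".join(clauses)

prompt = build_broll_prompt(
    "Woman in a navy blazer reviewing notes at a desk",
    [("character", 1), ("environment", 2)],
)
print(prompt)
# → "Woman in a navy blazer reviewing notes at a desk. character resembles
#    reference 1, environment matches reference 2"
```

The point is that the anchoring clauses are appended mechanically to every B-roll prompt, so no shot is generated from text alone.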

3. Generate and review

Image generation runs first. Each generated still uses the references. You see all 20 to 50 images before any video gets made. This is the cheap stage to catch drift, because regenerating a still costs almost nothing.

If a clip looks off, regenerate just that one. Often a small prompt tweak fixes it without changing the references.

4. Animate with reference locked

The image-to-video step uses your reference-locked stills as starting frames. The motion gets added on top, but the character and environment stay anchored.

The result is a video where clip 1 and clip 17 look like they were shot on the same day with the same person.

Tools that handle reference images well in May 2026

| Tool | Character | Environment | Style | Multiple refs |
|------|-----------|-------------|-------|---------------|
| Compledio | Yes | Yes | Yes | Yes (up to 5) |
| Runway Gen-4 | Yes | Yes | Limited | Yes (up to 3) |
| Veo 3.1 (via Flow) | Yes | Limited | No | One at a time |
| Kling 3.0 | Yes | No | No | One at a time |
| Pika 2.0 | Yes | No | Yes | Yes (up to 2) |
| HeyGen | Avatar only | No | No | No |

The frontier tools all support character references. The tools that handle character + environment + style simultaneously are still rare, which is why "the same character in different scenes" is still an edge most products fail at.

Common reference image mistakes

These are the patterns we see fail most often.

Using a low-resolution reference. A 512x512 source image gives you a blurry character. Use 1024x1024 minimum, 2048x2048 ideal.

Mixing reference styles. A photographic character reference plus an illustrated style reference confuses the model. Pick one direction and stay in it.

Too many references at once. Five character references make the model average them. You get a generic-looking person who matches none of them. One per role is the sweet spot.

Expecting the reference to handle pose changes. A reference image of someone sitting will bias every generation toward sitting. If you need varied poses, generate a few reference images in different poses or note pose explicitly in each prompt.

Forgetting to refresh references for major scene changes. If the script moves from "office" to "outdoors at night," update the environment reference. The character reference stays, the environment one changes.
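In practice that means keeping a small per-scene reference map where the character entry never changes. A minimal sketch; the scene names and file names are placeholders, not a real project structure:

```python
# Hypothetical per-scene reference map: the character reference is constant,
# only the environment reference swaps when the script changes location.

SCENE_REFS = {
    "office_interview": {"character": "host.png", "environment": "office.png"},
    "night_exterior":   {"character": "host.png", "environment": "street_night.png"},
}

def refs_for(scene: str) -> dict:
    """Look up the reference set for a given scene."""
    return SCENE_REFS[scene]
```

Any scene you add gets a fresh environment entry, but the character file stays identical across the whole map, which is what keeps the person recognizable from shot to shot.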

When reference images are not enough

Even with perfect references, AI video has limits. Reference images cannot fix:

  • Multi-person interactions. Two characters talking to each other will still cause one of them to drift unless you reference both and give the model a clear interaction prompt.
  • Long temporal sequences. A 30-second continuous shot of one person doing one thing is still hard. Break it into multiple shots, each with the same reference.
  • Specific micro-expressions. A reference can lock the face. It cannot reliably hit "subtle smirk in the third clip when the punchline lands."
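The long-sequence case has a mechanical workaround: chop the continuous beat into shorter shots that each carry the same character reference. A sketch, with an illustrative shot length and placeholder file name:

```python
# Sketch of breaking a long continuous beat into reference-locked shots.
# max_shot_s and the reference file name are illustrative values.

def split_into_shots(total_s: float, max_shot_s: float = 8.0,
                     reference: str = "host.png") -> list:
    """Split a duration into shots that each reuse the same character reference."""
    shots, t = [], 0.0
    while t < total_s:
        end = min(t + max_shot_s, total_s)
        shots.append({"start": t, "end": end, "character_ref": reference})
        t = end
    return shots

# A 30-second beat becomes four shots, all anchored to the same reference.
print(len(split_into_shots(30)))  # 4
```

Each shot is a separate generation, but because every one starts from the same reference, the seams read as cuts rather than as cast changes.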

For these, hybrid editing (AI for visuals, real footage for the moments that need precision) is still the right move.

TL;DR

  1. Character drift happens because text-to-video models generate each clip independently with no memory.
  2. Reference images give the model a visual anchor and dramatically reduce drift.
  3. A good character reference is front-facing, well-lit, high-resolution, and shows the full features you want to lock.
  4. Use separate references for character, environment, and style. Don't pile them all into one prompt.
  5. In May 2026, Compledio, Runway, and Veo handle the most reference types. Most other tools support character only.
  6. Reference images solve consistency. They don't solve every AI video limitation. Hybrid editing still has its place.

The line between "AI slop" and "professionally produced" is mostly visual coherence across shots. Reference images are how you get that coherence without filming a thing.
