Lance: ByteDance's 3B Unified Model for Image and Video Understanding, Generation, and Editing

ByteDance's Lance is a 3B-active-parameter unified multimodal model that handles image and video understanding, generation, and editing, with competitive benchmark results.

ByteDance has open-sourced Lance, a 3B-active-parameter native unified multimodal model that handles image and video understanding, generation, and editing within a single framework. Trained from scratch with a staged multi-task recipe on up to 128 A100 GPUs, Lance achieves competitive performance across multiple benchmarks despite its relatively small size.

Why Lance Matters

Most multimodal models today are either specialized (generation-only or understanding-only) or need large parameter counts (7B–20B+) to unify both capabilities. Lance's reported results suggest that a model with 3B active parameters can match or exceed larger models on image generation, image editing, video generation, and video understanding. This matters for:

  • Deployment efficiency: Lower VRAM requirements than larger unified models (a 40GB+ GPU is still needed) and faster inference
  • Research accessibility: Smaller compute budget for training and fine-tuning
  • Unified architecture: Single model for multiple tasks without task-specific heads

Architecture and Training

Lance is a native unified model, meaning it uses a single architecture for both understanding and generation tasks. Key details:

  • Active parameters: 3B (not total parameters, but those activated during inference)
  • Training: Staged multi-task recipe from scratch
  • Compute: Up to 128 A100 GPUs
  • Resolution: Up to 768×768 for images, 480p at 12 FPS for video
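To put the video figures in perspective, the short sketch below converts a frame count into playback time. It assumes generated clips play back at the 12 FPS quoted above, which the source states for training but does not confirm for output; `clip_duration_seconds` is a hypothetical helper, not part of the Lance repository.

```python
def clip_duration_seconds(num_frames: int, fps: float = 12.0) -> float:
    """Return the playback length in seconds of num_frames shown at fps."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# Frame counts used by the video examples in this article.
for frames in (61, 121):
    print(f"{frames} frames at 12 FPS = {clip_duration_seconds(frames):.2f} s")
```

Under that assumption, the 121-frame maximum corresponds to a clip of roughly ten seconds.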

Supported Tasks

Lance supports seven task types out of the box:

Task         Description
t2i          Text-to-Image generation
t2v          Text-to-Video generation
i2v          Image-to-Video generation
image_edit   Image editing
video_edit   Video editing
x2t_image    Image understanding (captioning, VQA)
x2t_video    Video understanding (captioning, QA)
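The inference examples in this article pair video tasks with the Lance_3B_Video checkpoint and image tasks with Lance_3B. The sketch below encodes that pairing as a lookup. Note the assumption: image_edit and x2t_video have no example in this article, so their checkpoint choice here is inferred by analogy, and `checkpoint_for` is a hypothetical helper rather than a Lance API.

```python
# Task identifiers and descriptions as listed in the table above.
TASKS = {
    "t2i": "Text-to-Image generation",
    "t2v": "Text-to-Video generation",
    "i2v": "Image-to-Video generation",
    "image_edit": "Image editing",
    "video_edit": "Video editing",
    "x2t_image": "Image understanding (captioning, VQA)",
    "x2t_video": "Video understanding (captioning, QA)",
}

# Assumption: any task that reads or writes video uses Lance_3B_Video.
# Only t2v, i2v, and video_edit are confirmed by the article's examples.
VIDEO_TASKS = {"t2v", "i2v", "video_edit", "x2t_video"}


def checkpoint_for(task: str, root: str = "downloads") -> str:
    """Return the checkpoint directory assumed for a given task name."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    name = "Lance_3B_Video" if task in VIDEO_TASKS else "Lance_3B"
    return f"{root}/{name}"


for task in TASKS:
    print(f"{task:<11} -> {checkpoint_for(task)}")
```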

Installation and Setup

Requirements

  • Python 3.10+
  • CUDA 12.4+
  • GPU with at least 40GB VRAM (tested on A100)
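Before installing, it can help to check a machine against these three requirements. The following is a minimal sketch, not part of the Lance repository: `check_requirements` and `probe_local_machine` are hypothetical names. It also assumes that the CUDA version PyTorch was built against (`torch.version.cuda`) is a reasonable stand-in for the "CUDA 12.4+" requirement, and it measures VRAM in decimal gigabytes.

```python
import sys

MIN_PYTHON = (3, 10)   # "Python 3.10+"
MIN_CUDA = (12, 4)     # "CUDA 12.4+"
MIN_VRAM_GB = 40       # "at least 40GB VRAM"


def parse_version(text):
    """Turn a version string such as '12.6' into the tuple (12, 6)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def check_requirements(python_version, cuda_version, vram_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append(f"Python {python_version} is older than {MIN_PYTHON}")
    if tuple(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than {MIN_CUDA}")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"{vram_gb:.1f} GB VRAM is below {MIN_VRAM_GB} GB")
    return problems


def probe_local_machine():
    """Read versions from the interpreter and PyTorch, or return None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return None
    # total_memory is reported in bytes; 1e9 converts to decimal gigabytes.
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    return sys.version_info[:2], parse_version(torch.version.cuda), vram_gb


probed = probe_local_machine()
if probed is None:
    print("PyTorch with CUDA not found; skipping the local check.")
else:
    for line in check_requirements(*probed) or ["All checks passed."]:
        print(line)
```

Keeping the comparison in a pure function makes it usable on machines without a GPU, for example to validate a cluster node's advertised specs.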

Quick Start

git clone https://github.com/bytedance/Lance.git
cd Lance

conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation

Download model weights from Hugging Face:

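import os

from huggingface_hub import snapshot_download

save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
# Build the cache path with os.path.join to avoid a doubled slash.
cache_dir = os.path.join(save_dir, "cache")

# Download only configs, weights, code, and docs into save_dir.
# local_dir_use_symlinks and resume_download are omitted: both are deprecated
# in huggingface_hub 0.23+, which resumes downloads and writes regular files
# to local_dir by default.
snapshot_download(
    repo_id=repo_id,
    local_dir=save_dir,
    cache_dir=cache_dir,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)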
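After the download finishes, a quick look at what landed on disk can catch an interrupted or partial transfer. The sketch below sums weight-file sizes per top-level folder. It assumes the download directory contains subfolders such as Lance_3B and Lance_3B_Video (inferred from the MODEL_PATH values used in this article, not confirmed), and `summarize_weights` is a hypothetical helper.

```python
from pathlib import Path

# Weight-file extensions taken from the allow_patterns list above.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pth"}


def summarize_weights(root: str) -> dict:
    """Map each top-level folder under root to the total bytes of its weight files."""
    root_path = Path(root)
    summary = {}
    for path in root_path.rglob("*"):
        if not path.is_file() or path.suffix not in WEIGHT_SUFFIXES:
            continue
        parts = path.relative_to(root_path).parts
        top = parts[0] if len(parts) > 1 else "."
        if top == "cache":
            continue  # skip the Hugging Face cache directory
        summary[top] = summary.get(top, 0) + path.stat().st_size
    return summary


for folder, size in sorted(summarize_weights("./downloads").items()):
    print(f"{folder}: {size / 1e9:.2f} GB of weights")
```

An empty result means no weight files were found under the directory, which usually points to a wrong path or a failed download.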

Inference Examples

Text-to-Video

bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v

Image-to-Video (New!)

bash inference_lance.sh \
  --TASK_NAME i2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 61 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/i2v

Text-to-Image

bash inference_lance.sh \
  --TASK_NAME t2i \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --VIDEO_HEIGHT 768 \
  --VIDEO_WIDTH 768 \
  --SAVE_PATH_GEN results/t2i

Video Editing

bash inference_lance.sh \
  --TASK_NAME video_edit \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --SAVE_PATH_GEN results/video_edit

Image Understanding

bash inference_lance.sh \
  --TASK_NAME x2t_image \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/x2t_image
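All of the commands above share the same shape: a task name, a checkpoint path, an output path, and a few upper-case flags. For batch runs it may be convenient to build them from Python. This is a sketch only: `build_command` and `run` are hypothetical wrappers, and they assume inference_lance.sh accepts exactly the `--FLAG value` style shown in the examples.

```python
import shlex
import subprocess


def build_command(task, model_path, save_path, **options):
    """Return the argv list for inference_lance.sh in the flag style shown above."""
    cmd = [
        "bash", "inference_lance.sh",
        "--TASK_NAME", task,
        "--MODEL_PATH", model_path,
        "--SAVE_PATH_GEN", save_path,
    ]
    # Extra keyword arguments become upper-case flags, e.g. num_frames -> --NUM_FRAMES.
    for name, value in options.items():
        cmd += [f"--{name.upper()}", str(value)]
    return cmd


def run(cmd):
    """Execute the command; call this from inside the Lance checkout."""
    subprocess.run(cmd, check=True)


cmd = build_command(
    "t2v", "downloads/Lance_3B_Video", "results/t2v",
    resolution="video_480p", num_frames=121,
    video_height=480, video_width=848,
)
print(shlex.join(cmd))  # prints the command without running it
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems if a path contains spaces.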

Key Parameters

Parameter                  Default              Description
MODEL_PATH                 downloads/Lance_3B   Path to weights (Lance_3B or Lance_3B_Video)
NUM_GPUS                   1                    Number of GPUs
VALIDATION_NUM_TIMESTEPS   30                   Denoising steps
CFG_TEXT_SCALE             4.0                  Classifier-Free Guidance scale
NUM_FRAMES                 50                   Max 121 for video
ENHANCE_PROMPT             false                Enable prompt rewriting (requires OpenAI API key)
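For readers unfamiliar with the CFG_TEXT_SCALE row, the sketch below shows the standard classifier-free guidance combination, in which the final prediction is pushed away from the unconditional output and toward the text-conditioned one. This is the textbook formula, not code from Lance: the source does not say how Lance applies the scale internally, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(cond, uncond, scale=4.0):
    """Combine conditional and unconditional predictions element-wise.

    Standard classifier-free guidance: uncond + scale * (cond - uncond).
    scale=1.0 returns the conditional prediction unchanged;
    scale=0.0 returns the unconditional prediction.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy two-element predictions, just to show the effect of the scale.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
for scale in (0.0, 1.0, 4.0):
    print(scale, apply_cfg(cond, uncond, scale))
```

Larger scales generally follow the prompt more closely at some cost in diversity, which is why the value is exposed as a tunable parameter.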

Benchmark Performance

Lance achieves competitive results across multiple benchmarks:

Image Generation (GenEval)

  • Overall: 0.90 (matches TUNA-7B, beats Janus-Pro-7B's 0.80)
  • Colors: 0.97 (best among all models)
  • Position: 0.87 (ties with TUNA-7B)

Image Editing (GEdit-Bench)

  • Average: 7.30 (beats InternVL-U 1.7B's 6.66 and BAGEL's 6.52)

Video Generation (VBench)

  • Total Score: 85.11 (beats TUNA 1.5B's 84.06 and Hunyuan Video's 83.43)

Gradio Demo

Lance includes a local Gradio interface for interactive use:

python lance_gradio.py --server-name 0.0.0.0 --server-port 7860

This provides a web UI for all supported tasks.

Roadmap

The team plans to release fine-tuning code, enabling customization for specific domains or tasks.

Limitations

  • Research project, not a polished product
  • Output quality varies across prompts, resolutions, and motion complexity
  • Trained up to 768×768 images and 480p video
  • Requires 40GB+ VRAM GPU

Conclusion

Lance represents a significant step toward efficient unified multimodal models. At 3B active parameters, it demonstrates that smaller models can compete with much larger ones when trained with a well-designed multi-task recipe. For developers and researchers looking to experiment with unified image/video understanding and generation, Lance provides a practical, open-source starting point.

Get started: GitHub Repository (https://github.com/bytedance/Lance) | Hugging Face Model (bytedance-research/Lance) | Technical Report

Source

bytedance/Lance: A 3B-active-parameter native unified multimodal model for image and video understanding, generation, and editing.