NeoBabel: A Multilingual Open Tower for Visual Generation

Mohammad Mahdi Derakhshani2, Dheeraj Varghese2, Marzieh Fadaee1,†, Cees G. M. Snoek2,†
1Cohere Labs, 2University of Amsterdam
†Principal senior advisors
Plot showing NeoBabel's performance compared to other models.

NeoBabel establishes a new Pareto frontier in multilingual image generation performance, efficiency, and inclusivity. It matches state-of-the-art models on English benchmarks while being 2-4x smaller, and significantly outperforms them on multilingual tasks.

Abstract

Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency, and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research.

NeoBabel Architecture

NeoBabel architecture diagram

Overview of the NeoBabel architecture.

NeoBabel builds upon a pretrained multilingual LLM (Gemma-2) and introduces the following key modifications for visual generation:

  • Unified Multimodal Embedding: We extend the LLM's vocabulary with 8,192 new embeddings for discrete image tokens. This allows the model to process both text and image tokens natively within a shared semantic space, eliminating the need for separate encoders.
  • Modality-Aware Attention: A hybrid attention mechanism is used. Text tokens use causal attention to preserve autoregressive language capabilities, while image tokens use full bidirectional attention for rich, high-fidelity image synthesis (see the mask-construction sketch after this list).
  • Unified Task Representation: All tasks, such as text-to-image generation and inpainting, are treated as a single autoregressive sequence prediction problem, simplifying the training pipeline.
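
To make these modifications concrete, the sketch below shows how a shared embedding table and a modality-aware attention mask might be assembled in PyTorch. The 8,192 image-token count comes from the text; the base vocabulary size, hidden size, and sequence layout are illustrative assumptions rather than NeoBabel's actual implementation.

import torch

# Illustrative sizes: only NUM_IMAGE_TOKENS (8,192) comes from the text.
TEXT_VOCAB = 256_000       # assumed Gemma-2 text vocabulary size
NUM_IMAGE_TOKENS = 8_192   # discrete image-token embeddings added
HIDDEN = 2048              # assumed hidden size

# Unified multimodal embedding: image tokens share one table with text,
# occupying ids [TEXT_VOCAB, TEXT_VOCAB + NUM_IMAGE_TOKENS).
embedding = torch.nn.Embedding(TEXT_VOCAB + NUM_IMAGE_TOKENS, HIDDEN)

def modality_aware_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask (True = position may be attended to).

    Text positions attend causally (self and earlier positions only);
    image positions additionally attend to every other image position,
    i.e. full bidirectional attention within the image block.
    """
    n = is_image.shape[0]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    image_to_image = is_image[:, None] & is_image[None, :]
    return causal | image_to_image

# Example layout: a 4-token text prompt followed by 6 image tokens.
is_image = torch.tensor([False] * 4 + [True] * 6)
mask = modality_aware_mask(is_image)  # shape (10, 10)

Under this view, the unified task representation falls out naturally: text-to-image generation and inpainting differ only in which positions of such a sequence are given and which are predicted.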

NeoBabel Multilingual Datasets

A primary challenge in multilingual generation is the scarcity of high-quality, culturally annotated visual-linguistic data. To address this, we developed a comprehensive data curation pipeline to expand existing English-only datasets to cover six languages: English, Chinese, Dutch, French, Hindi, and Persian. Our process involves:

  1. Recaptioning: Generating detailed, high-quality English captions for images using a powerful vision-language model (InternVL).
  2. Quality Filtering: Applying a multi-step filtering process to ensure captions are of appropriate length, are in the correct language, align with the visual content, and are free of toxic or NSFW material.
  3. Translation: Translating the high-quality English captions into the five other target languages using state-of-the-art translation models (NLLB and Gemini).

This pipeline expanded our total dataset from 39 million to 124 million multilingual text-image pairs.
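
As a minimal sketch of this curation loop, the function below chains the three steps. Because the paper does not publish this code, every model wrapper is passed in as a plain callable (recaption stands in for InternVL, translate for NLLB/Gemini), and the filter thresholds are illustrative assumptions, not the paper's actual values.

# Hedged sketch of the three-step curation pipeline: recaption,
# filter, translate. Every callable and threshold is a placeholder
# for the components named in the text.
from typing import Callable, Optional

TARGET_LANGS = ["zh", "nl", "fr", "hi", "fa"]  # beyond English

def curate(
    image,
    recaption: Callable,           # e.g., an InternVL captioner
    detect_language: Callable,     # e.g., a language-ID classifier
    image_text_score: Callable,    # e.g., a CLIP alignment score
    is_toxic_or_nsfw: Callable,    # e.g., a safety classifier
    translate: Callable,           # e.g., NLLB or Gemini
) -> Optional[dict]:
    caption = recaption(image)                      # 1. recaptioning
    # 2. multi-step quality filtering (thresholds are illustrative)
    if not (5 <= len(caption.split()) <= 128):      # length filter
        return None
    if detect_language(caption) != "en":            # correct-language filter
        return None
    if image_text_score(image, caption) < 0.25:     # visual-alignment filter
        return None
    if is_toxic_or_nsfw(caption):                   # safety filter
        return None
    # 3. translate the surviving English caption into the other five languages
    return {"en": caption,
            **{lang: translate(caption, lang) for lang in TARGET_LANGS}}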

Original Dataset   | Image Source    | Original Size | Expansion Method        | New Size (6 Langs)
ImageNet 1K        | Web             | 1M            | Translation             | 6M
CC12M              | Web             | 12M           | Recaptioning            | 12M (Eng only)
SA-1B              | Photography     | 10M           | Recaptioning            | 10M (Eng only)
LAION-Aesthetic    | Web             | 12M           | Recaption + Translation | 72M
JourneyDB          | Synthetic       | 4M            | Recaption + Translation | 24M
BLIP3-o Instruct   | Web + Synthetic | 60K           | Translation             | 360K
Total              | -               | 39M           | -                       | 124M

Expansion of datasets for multilingual training.

NeoBabel Training Stages

Graph showing performance improvement across training stages.

Performance on m-GenEval and m-DPG improves steadily across pretraining and instruction tuning stages.

NeoBabel is trained using a staged learning framework that progressively builds its capabilities:

Progressive Pretraining (at 256x256):

  • Stage 1 - Pixel Dependency Learning: Training on ImageNet with class labels to learn foundational visual representations.
  • Stage 2 - Scaling Alignment: Fine-tuning on 94M image-caption pairs to ground the model in natural language.
  • Stage 3 - Refined Multilingual Pretraining: Final pretraining on 96M high-quality pairs to improve aesthetic generation.

Progressive Instruction Tuning (at 512x512):

  • Stage 1 - Initial Alignment: Training on a mix of aesthetic and instruction-following data to build high-resolution capabilities.
  • Stage 2 - Instruction Refinement: Shifting the data mix to emphasize complex, instruction-rich samples to refine the model's ability to follow commands.
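
The staged schedule can also be summarized as plain configuration data, as in the sketch below. Only the resolutions, data mixes, and pair counts come from the stage lists above; the dict layout is just one illustrative way to drive a training loop, and hyperparameters such as learning rates are intentionally omitted.

# Staged training schedule as plain data (a sketch; NeoBabel's actual
# training configs ship with the released code).
STAGES = [
    # -- Progressive pretraining at 256x256 --
    {"phase": "pretrain", "stage": 1, "resolution": 256,
     "data": "ImageNet with class labels (pixel dependency learning)"},
    {"phase": "pretrain", "stage": 2, "resolution": 256,
     "data": "94M image-caption pairs (scaling alignment)"},
    {"phase": "pretrain", "stage": 3, "resolution": 256,
     "data": "96M high-quality pairs (refined multilingual pretraining)"},
    # -- Progressive instruction tuning at 512x512 --
    {"phase": "instruct", "stage": 1, "resolution": 512,
     "data": "aesthetic + instruction-following mix (initial alignment)"},
    {"phase": "instruct", "stage": 2, "resolution": 512,
     "data": "instruction-rich mix (instruction refinement)"},
]

for cfg in STAGES:
    print(f"{cfg['phase']} stage {cfg['stage']}: {cfg['resolution']}px, {cfg['data']}")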

Multilingual Evaluation

Existing image generation benchmarks are overwhelmingly English-centric. To properly assess NeoBabel, we created a comprehensive multilingual evaluation suite.

  • Multilingual Benchmarks: We extended two popular English-only benchmarks by translating their prompts into our five other target languages, creating m-GenEval (for compositional reasoning) and m-DPG (for general-purpose generation).
  • New Evaluation Metrics: We introduced two novel metrics to measure multilingual consistency:
    • Cross-Lingual Consistency (CLC): Measures how visually similar the generated images are for the same prompt across different languages. A high CLC score indicates the model has a consistent internal representation of concepts, regardless of language (a minimal computation sketch follows this list).
    • Code-Switching Similarity (CSS): Assesses the model's robustness to prompts that mix multiple languages within a single sentence, a common real-world scenario.
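
As an illustration of how CLC could be computed, the sketch below embeds the images generated for one prompt in every language and averages their pairwise cosine similarities. Using a CLIP image encoder here is an assumption on our part; the paper's exact metric definition and encoder choice may differ, and CSS can be computed analogously by comparing code-switched outputs against monolingual ones.

# Hedged sketch of Cross-Lingual Consistency (CLC): embed the images
# generated for one prompt in every language, then average pairwise
# cosine similarity. CLIP as the image encoder is an assumption.
import itertools
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clc(images) -> float:
    """images: one generated PIL image per language, same prompt."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    sims = [float(feats[i] @ feats[j])
            for i, j in itertools.combinations(range(len(images)), 2)]
    return sum(sims) / len(sims)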

This new suite allows for a more rigorous and realistic evaluation of a model's true multilingual capabilities.

Qualitative Results

Multilingual Generation

NeoBabel produces semantically accurate and visually cohesive outputs for the same concept across all six supported languages. Below are examples in which an English prompt was translated into each supported language and one image was generated per language.

Qualitative evaluation results for multilingual generation.

Multilingual Inpainting & Extrapolation

The model can edit and extend images using prompts in various languages without additional fine-tuning.

Inpainting a chair to look like a strawberry, prompted in Chinese.

Inpainting across languages.

Extrapolating a futuristic car in a city of mirrors using English and Hindi prompts.

Extrapolation across languages.

Cross-Lingual (Code-Switched) Prompt Generation

A challenging test where a single prompt contains phrases from multiple languages. NeoBabel successfully integrates these multilingual instructions into a single, coherent image.

Examples of images generated from code-switched prompts mixing three languages.

Generated from prompts mixing (top) English, Dutch, and French; (bottom) Hindi, Persian, and Chinese.

Quantitative Results

NeoBabel achieves state-of-the-art results on both English and multilingual benchmarks, despite being 2-4x smaller than competing models.

Multilingual m-GenEval Benchmark

Model Performance Heatmap on m-GenEval

Figure 3 from the paper: NeoBabel matches the SOTA English score and outperforms all baselines on the multilingual cases.

Multilingual m-DPG Benchmark

Method          | Params. | English | Chinese | Dutch | French | Hindi | Persian | Overall
Show-o          | 1.3B    | 0.67    | 0.10    | 0.22  | 0.32   | 0.04  | 0.04    | 0.23
BLIP3-o         | 4B      | 0.79    | 0.60    | 0.58  | 0.59   | 0.47  | 0.49    | 0.58
Janus Pro       | 7B      | 0.84    | 0.50    | 0.61  | 0.68   | 0.12  | 0.12    | 0.47
BLIP3-o         | 8B      | 0.80    | 0.56    | 0.59  | 0.61   | 0.50  | 0.53    | 0.59
NeoBabel (Ours) | 2B      | 0.75    | 0.70    | 0.69  | 0.70   | 0.63  | 0.65    | 0.68

Table 3 from the paper: NeoBabel significantly outperforms all baselines across non-English languages.

BibTeX

@misc{derakhshani2025neobabelmultilingualopentower,
      title={NeoBabel: A Multilingual Open Tower for Visual Generation}, 
      author={Mohammad Mahdi Derakhshani and Dheeraj Varghese and Marzieh Fadaee and Cees G. M. Snoek},
      year={2025},
}