More details will be released soon! 🕒🔥
NeoBabel establishes a new Pareto frontier in multilingual image generation performance, efficiency, and inclusivity. It matches state-of-the-art models on English benchmarks while being 2-4x smaller, and significantly outperforms them on multilingual tasks.
Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. Existing systems often rely on translation pipelines, which introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency, and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on the multilingual benchmarks. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research.
Overview of the NeoBabel architecture.
NeoBabel builds upon a pretrained multilingual LLM (Gemma-2) and introduces two key modifications for visual generation, shown in the architecture overview above.
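The specifics of these modifications are detailed in the paper and the overview figure. For intuition only, here is a minimal sketch of the general recipe for adding discrete visual generation to a text-only LLM, assuming a unified vocabulary over text and image tokens; the codebook size, marker tokens, and model id below are illustrative assumptions, not NeoBabel's actual configuration:

```python
# Illustrative sketch only: extending a pretrained multilingual LLM with
# discrete image tokens so text and image share one vocabulary. Codebook
# size and special tokens are assumptions, not NeoBabel's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

CODEBOOK_SIZE = 8192                   # image-tokenizer codebook size (assumed)
BOUNDARY_TOKENS = ["<soi>", "<eoi>"]   # hypothetical image start/end markers

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

# One vocabulary entry per visual codebook index, plus boundary markers,
# so prompts and image tokens can be modeled as a single sequence.
tokenizer.add_tokens([f"<img_{i}>" for i in range(CODEBOOK_SIZE)] + BOUNDARY_TOKENS)

# Grow the embedding matrix (and tied LM head) to cover the new entries.
model.resize_token_embeddings(len(tokenizer))
```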
A primary challenge in multilingual generation is the scarcity of high-quality, culturally annotated visual-linguistic data. To address this, we developed a comprehensive data curation pipeline to expand existing English-only datasets into six target languages: English, Chinese, Dutch, French, Hindi, and Persian. Our process combines machine translation of existing captions with model-based recaptioning of images, as summarized in the table below.
This pipeline expanded our total dataset from 39 million to 124 million multilingual text-image pairs.
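As a hedged sketch of the per-pair expansion logic, where `recaption` and `translate` are hypothetical stand-ins for the actual captioning and translation models:

```python
# Hedged sketch of the expansion step; `recaption` and `translate` are
# hypothetical callables standing in for the captioning/translation models.
LANGUAGES = ["en", "zh", "nl", "fr", "hi", "fa"]  # the six target languages

def expand_pair(image, caption_en, method, recaption, translate):
    """Expand one English text-image pair into multilingual pairs."""
    if "recaption" in method:
        caption_en = recaption(image)  # denser, higher-quality English caption
    pairs = [(image, caption_en, "en")]
    if "translation" in method:
        pairs += [(image, translate(caption_en, lang), lang)
                  for lang in LANGUAGES[1:]]
    return pairs
```

Under this scheme, `method="recaption+translation"` turns one pair into six, consistent with the 12M to 72M growth of LAION-Aesthetic in the table below, while recaption-only sources remain English, matching the "(Eng only)" rows.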
| Original Dataset | Image Source | Original Size | Expansion Method | New Size (6 Langs) |
|---|---|---|---|---|
| ImageNet 1K | Web | 1M | Translation | 6M |
| CC12M | Web | 12M | Recaptioning | 12M (Eng only) |
| SA-1B | Photography | 10M | Recaptioning | 10M (Eng only) |
| LAION-Aesthetic | Web | 12M | Recaption + Translation | 72M |
| JourneyDB | Synthetic | 4M | Recaption + Translation | 24M |
| BLIP3-o Instruct | Web + Synthetic | 60K | Translation | 360K |
| Total | - | 39M | - | 124M |
Expansion of datasets for multilingual training.
Performance on m-GenEval and m-DPG improves steadily across pretraining and instruction tuning stages.
NeoBabel is trained using a staged learning framework that progressively builds its capabilities:
- Progressive Pretraining (256×256 resolution)
- Progressive Instruction Tuning (512×512 resolution)
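A sketch of how such a progressive curriculum can be wired together, warm-starting each stage from the previous checkpoint; the stage list and its fields are placeholders, not the published training recipe (only the two resolutions come from above):

```python
# Hedged sketch of a staged curriculum. Stage contents are placeholders,
# not NeoBabel's published schedule; only the resolutions come from above.
STAGES = [
    {"phase": "pretrain", "resolution": 256, "data": "multilingual pairs"},
    {"phase": "instruct", "resolution": 512, "data": "instruction-tuning pairs"},
]

def run_curriculum(train_one_stage):
    """Run stages in order, warm-starting each from the previous checkpoint."""
    checkpoint = None
    for stage in STAGES:
        checkpoint = train_one_stage(checkpoint, **stage)
    return checkpoint
```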
Existing image generation benchmarks are overwhelmingly English-centric. To properly assess NeoBabel, we created a comprehensive multilingual evaluation suite by extending two English-only benchmarks into multilingual equivalents, m-GenEval and m-DPG.
This new suite allows for a more rigorous and realistic evaluation of a model's true multilingual capabilities.
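In that spirit, here is a minimal sketch of a per-language evaluation loop over a shared prompt set, where `generate` and `score` are hypothetical model and metric callables:

```python
# Minimal sketch of per-language scoring over a shared prompt set;
# `generate` and `score` are hypothetical model/metric callables.
from statistics import mean

LANGUAGES = ["en", "zh", "nl", "fr", "hi", "fa"]

def evaluate(prompts_by_lang, generate, score):
    """Score each language separately, then report the cross-language mean."""
    per_lang = {
        lang: mean(score(generate(p), p) for p in prompts_by_lang[lang])
        for lang in LANGUAGES
    }
    per_lang["overall"] = mean(per_lang[lang] for lang in LANGUAGES)
    return per_lang
```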
NeoBabel produces semantically accurate and visually cohesive outputs for the same concept across all six supported languages. Below are examples where the same English prompt was translated into each language and one image was generated per language.
The model can edit and extend images using prompts in various languages without additional fine-tuning.
Inpainting across languages.
Extrapolation across languages.
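For intuition on how language-agnostic editing can fall out of the design: assuming images are generated by iterative masked-token prediction over a discrete token grid (an assumption here; Show-o-style models work this way), inpainting reduces to re-masking the region to edit. A sketch with hypothetical names:

```python
# Hedged sketch: if an image is a grid of discrete tokens refined by masked-
# token prediction, inpainting just re-masks the edit region and lets the
# model refill it conditioned on the (any-language) prompt.
# `model.fill_masks` and MASK_ID are hypothetical.
import numpy as np

MASK_ID = -1  # placeholder id for a masked image token (assumed)

def inpaint(image_tokens, region_mask, prompt, model):
    """image_tokens: (H, W) int grid; region_mask: (H, W) bool, True = edit."""
    masked = np.where(region_mask, MASK_ID, image_tokens)
    return model.fill_masks(masked, prompt)  # iterative unmasking (hypothetical)
```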
A challenging test where a single prompt contains phrases from multiple languages. NeoBabel successfully integrates these multilingual instructions into a single, coherent image.
Generated from prompts mixing (top) English, Dutch, and French; (bottom) Hindi, Persian, and Chinese.
NeoBabel achieves state-of-the-art results on both English and multilingual benchmarks, despite being 2-4x smaller than competing models.
Figure 3 from the paper: NeoBabel matches the SOTA English score and outperforms all baselines in every multilingual setting.
| Method | Params. | English | Chinese | Dutch | French | Hindi | Persian | Overall |
|---|---|---|---|---|---|---|---|---|
| Show-o | 1.3B | 0.67 | 0.10 | 0.22 | 0.32 | 0.04 | 0.04 | 0.23 |
| BLIP3-o | 4B | 0.79 | 0.60 | 0.58 | 0.59 | 0.47 | 0.49 | 0.58 |
| Janus Pro | 7B | 0.84 | 0.50 | 0.61 | 0.68 | 0.12 | 0.12 | 0.47 |
| BLIP3-o | 8B | 0.80 | 0.56 | 0.59 | 0.61 | 0.50 | 0.53 | 0.59 |
| NeoBabel (Ours) | 2B | 0.75 | 0.70 | 0.69 | 0.70 | 0.63 | 0.65 | 0.68 |
Table 3 from the paper: NeoBabel significantly outperforms all baselines across non-English languages.
```bibtex
@misc{derakhshani2025neobabelmultilingualopentower,
      title={NeoBabel: A Multilingual Open Tower for Visual Generation},
      author={Mohammad Mahdi Derakhshani and Dheeraj Varghese and Marzieh Fadaee and Cees G. M. Snoek},
      year={2025},
}
```