Developing an LLM-Driven Multi-Agent Framework for Multimodal Translation

Feiyang Jiang *, Kristina Wu *, Nick Jumaoas *, Hao Zhang
University of California, San Diego

2025 Data Science Senior Capstone

Motivation

The increasing global demand for translated content, particularly in multimedia formats like manga, comic books, and literary works, has highlighted significant challenges in maintaining both efficiency and quality in translation processes. Traditional manual translation workflows, which rely heavily on human translators, editors, and typesetters, often struggle with scalability and consistency when handling large volumes of content that combine textual and visual elements. This challenge becomes particularly acute in manga translation, where cultural nuances, visual context, and textual accuracy must be carefully balanced by skilled professionals, leading to substantial time investments and increased production costs.

Our goal is to combine recent advances in multi-agent translation systems and context-aware multimodal translation, leveraging the increasing power of LLMs and VLMs to create a manga translation pipeline that offers both the convenience of machine translation and the quality of human translation. In doing so, we aim to demonstrate the potential of modular multi-agent frameworks to flexibly accommodate a wide variety of media, which we believe can make translated content globally available at affordable cost, enriching readers around the world.

Dataset

Our primary baseline dataset comes from OpenMantra (Hinami et al. 2021), an automatic manga translation project that provides 214 manually translated and annotated manga pages as a benchmark for future research.

OpenMantra text field annotations (excerpt), visualized

Notably, although we aim to match the general format for evaluation purposes, our identified text fields have a much tighter fit to facilitate typesetting. Additionally, while the English translation of the pictured text is used to evaluate the overall accuracy of our translation framework, we expect differences due to subjective measures of fluency and reader perception.

Methods

Our pipeline involves three collaborative stages: page processing, in which text is recognized and clustered; multi-agent translation; and typesetting, which returns a translated version of the page with the translations replacing the original text.

Method Pipeline Diagram

Manga-Specialized Preprocessing

Preprocessing stages: original image → text segmentation mask → filtered clusters (color-coded) → ordered text boxes

In the initial preprocessing stage, we implement a pipeline designed to identify the positions of text boxes and extract text from them.

  • Text Segmentation: Creates a mask using a text segmentation model specialized for manga, robust against cases such as oddly shaped text fields and non-text elements within speech bubbles.
  • Clustering: Groups potential text pixels into clusters using the OPTICS algorithm (see the sketch after this list).
  • Speech Bubble Filter: Filters out text outside speech bubbles to match baseline data by analyzing the background color of the text.
  • Page Element Ordering: Sorts clustered text boxes into reading order using the Magi model.
  • Text Extraction: Extracts text using MangaOCR, which is designed to be robust against scenarios specific to manga, such as text overlaid on images or accompanied by furigana, wide variation in font style and size, and different reading orientations.
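The sketch below shows how the clustering and extraction steps could fit together in Python. The binary segmentation mask, the OPTICS parameters, and the crop coordinates are illustrative assumptions rather than the exact values used in our pipeline; speech-bubble filtering and Magi-based ordering would run on the resulting boxes downstream.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import OPTICS
from manga_ocr import MangaOcr


def extract_text_boxes(page: Image.Image, mask: np.ndarray):
    """Cluster text pixels into boxes and run OCR on each crop.

    `mask` is assumed to be the binary (H x W) output of the manga text
    segmentation model; the OPTICS parameters below are illustrative only.
    """
    ys, xs = np.nonzero(mask)
    points = np.column_stack([xs, ys])

    # OPTICS groups nearby text pixels; label -1 marks noise (stray marks,
    # screentone speckles) that is dropped before filtering and ordering.
    labels = OPTICS(min_samples=50, max_eps=40).fit_predict(points)

    mocr = MangaOcr()  # manga-specialized OCR model
    results = []
    for label in sorted(set(labels) - {-1}):
        pts = points[labels == label]
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        crop = page.crop((int(x0), int(y0), int(x1) + 1, int(y1) + 1))
        results.append({"box": (int(x0), int(y0), int(x1), int(y1)),
                        "text": mocr(crop)})
    # Speech-bubble filtering and Magi-based reading-order sorting
    # operate on these boxes downstream.
    return results
```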

Multi-agent Context-aware Translation Design

In this stage, we introduce a multi-agent system for context-aware translation, focusing on visual-heavy content like manga and comic books. The system combines Visual Language Models (VLMs) and a multi-agent collaborative framework to enhance translation accuracy and coherence, ensuring linguistic precision and contextual relevance.

  • Visual Context Analysis: A Visual Language Model (VLM) analyzes the page's visual context, extracting key elements like characters and settings. This information informs the translation process, ensuring the translation is aligned with the visual nuances crucial to the narrative. Our demonstration pipeline uses LLaVA-7B as the VLM agent; as a relatively small model, its interpretation and reasoning abilities are limited, and we expect performance to improve if it is replaced with a stronger VLM.
  • Multi-agent Translation Framework: The translation is handled by a multi-agent system with three specialized agents: the Linguistic Specialist Agent for grammatical accuracy, the Cultural Context Specialist Agent for cultural relevance, and the Visual Context Specialist Agent for aligning the translation with visual elements. Each agent contributes to a well-rounded translation.
  • Incorporation of Historical Context: For series with continuous storylines, the system incorporates historical context by tracking narrative elements across multiple pages. This ensures consistency in character names, plot, and themes, maintaining coherence throughout the series.
  • Agent Collaboration and Verification: The three agents engage in a round-robin process, reviewing and critiquing each other's translations (a sketch of this loop follows the list). This collaborative approach resolves discrepancies and refines the final translation, ensuring it is linguistically accurate, culturally appropriate, and visually coherent.
  • Fast Translation for Demo: Although the regular translation process has the agents double-check the final translation to produce more natural output, this increases latency and cost, which is inconvenient for users who want instant access and are less concerned about translation quality. To accommodate these users, we enable a fast translation mode that stops the collaboration early based on the complexity of the text and context.
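Below is a minimal sketch of the round-robin collaboration and the fast-mode early stop. It assumes a generic call_llm(system, user) wrapper around whichever backend is configured (e.g. llava:7b for visual analysis, llama3.1:8b or gpt-4-turbo for the text agents); the prompts, round count, and complexity threshold are placeholders, not our exact production settings.

```python
AGENT_PROMPTS = {
    "linguistic": "You are a Japanese-to-English linguistic specialist...",
    "cultural":   "You are a cultural-context specialist for manga...",
    "visual":     "You are a visual-context specialist...",
}


def translate_page(lines, visual_context, history, call_llm,
                   rounds=2, fast_mode=False):
    """Draft a translation, then let each specialist critique and revise it."""
    draft = call_llm(
        AGENT_PROMPTS["linguistic"],
        f"Translate into English, keeping reading order:\n{lines}\n"
        f"Scene description: {visual_context}\nStory so far: {history}",
    )
    # Fast mode: skip the critique rounds for short, simple dialogue
    # (the length threshold is a placeholder heuristic).
    if fast_mode and sum(len(line) for line in lines) < 40:
        return draft

    for _ in range(rounds):
        for role in ("cultural", "visual", "linguistic"):
            draft = call_llm(
                AGENT_PROMPTS[role],
                f"Original lines: {lines}\nScene: {visual_context}\n"
                f"Current translation: {draft}\n"
                "Critique and return an improved translation only.",
            )
    return draft
```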

Typesetting

In the final stage of our pipeline, we integrate translated text back into manga pages while preserving visual aesthetics and readability. Our system automates text removal and adaptive reformatting to ensure a natural and high-quality presentation.

  • Text Detection and Removal: Based on the text regions identified during preprocessing, we apply a simple and effective masking approach, placing rectangular masks over precisely identified text regions. This targeted approach creates clean white spaces for translations while preserving bubble structures and surrounding artwork.
  • Adaptive Text Placement: The system symmetrically expands each text region horizontally by 16 pixels while maintaining its center point. Text is dynamically wrapped, sized (22 px base, 14 px minimum), and centered with white outlines for readability over varied background elements (see the sketch below).
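A minimal Pillow-based sketch of these rules is shown below; the font file, the 8 px-per-side expansion, the line-height estimate, and the characters-per-line heuristic are assumptions for illustration, not the exact implementation.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont


def typeset(page: Image.Image, box, text,
            font_path="NotoSans-Regular.ttf",   # assumed font file
            base_size=22, min_size=14, pad=8):
    """Mask the original text box and draw wrapped, centered English text."""
    x0, y0, x1, y1 = box
    draw = ImageDraw.Draw(page)
    draw.rectangle((x0, y0, x1, y1), fill="white")   # rectangular removal mask
    x0, x1 = x0 - pad, x1 + pad                      # symmetric 16 px widening

    # Shrink from 22 px toward 14 px until the wrapped text fits vertically.
    for size in range(base_size, min_size - 1, -2):
        font = ImageFont.truetype(font_path, size)
        chars_per_line = max(1, (x1 - x0) // max(1, size // 2))  # rough estimate
        lines = textwrap.wrap(text, width=chars_per_line)
        line_h = size + 2
        if line_h * len(lines) <= (y1 - y0):
            break

    # Center each line and add a white outline for readability over artwork.
    cy = (y0 + y1 - line_h * len(lines)) // 2
    for line in lines:
        w = draw.textlength(line, font=font)
        draw.text(((x0 + x1 - w) // 2, cy), line, font=font,
                  fill="black", stroke_width=2, stroke_fill="white")
        cy += line_h
    return page
```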

Evaluation

The evaluation of the pipeline is separated into two major parts: the first assesses the accuracy of the page processing stage in capturing the correct positions of text and extracting it accurately, and the second assesses translation quality in terms of context awareness and nuance.

Processing Stage Evaluation

Levenshtein distance, visualized
Green/blue boxes correspond to baseline/observed boxes, respectively

The preprocessing stage is evaluated on two primary elements:

  • Text Extraction:
    • Uses the Levenshtein distance ratio to penalize common errors (misidentified page elements) as deletions while still measuring the accuracy of the remaining extracted text (see the definitions after this list).
    • Formula used for Levenshtein Ratio
    • Also used by comparable projects (Lippmann et al.), allowing for comparison in performance.
  • Text Box Position:
    • Implements a box-center check to assess accuracy despite systematic discrepancies with the baseline text boxes.
    • Major discrepancies are tighter box fit and split text boxes in our model (see figure above), which facilitate typesetting later in the pipeline.
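For reference, a common way to normalize Levenshtein distance into a similarity ratio is shown below; the exact normalization in the formula figure above may differ, so this is the standard definition rather than a transcription. Here lev(a, b) is the minimum number of insertions, deletions, and substitutions needed to turn the extracted string a into the baseline string b:

$$\mathrm{ratio}(a, b) \;=\; 1 - \frac{\mathrm{lev}(a, b)}{\max(|a|, |b|)}$$

The box-center check counts a detected box as matching a baseline box when its center falls inside the baseline box (the precise direction of the containment test is an implementation detail), i.e. for a detected box with corners (x_0, y_0) and (x_1, y_1):

$$\left(\frac{x_0 + x_1}{2},\; \frac{y_0 + y_1}{2}\right) \in B_{\text{baseline}}$$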

BERTScore and Human Evaluation

To effectively assess the accuracy, coherence, and context-awareness of the multi-agent translation system, we use a combination of automatic metrics, such as BERTScore, and human evaluation to determine how well the system preserves emotional nuance and narrative consistency.

Formula used for BERTScore
  • BERTScore: Captures semantic similarity, focusing on contextual accuracy and nuance; translations are scored with the BERT F1 score (standard definitions follow this list).
  • Human Evaluation: Monolingual and bilingual experts rate translations for fluency, accuracy, proximity to the released human translation, and cultural appropriateness, choosing the single best overall translation from among our model's output and one randomly selected baseline's output.
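For reference, the standard BERTScore recall, precision, and F1 over contextual token embeddings of the reference x and candidate translation x̂ (with embeddings pre-normalized so that inner products are cosine similarities) are:

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j, \qquad P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j, \qquad F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}}\, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$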

Below is an example question from our human evaluation survey. We created four versions of the survey, each consisting of ten randomly selected pages, with one option being our model's output and the other a randomly selected baseline's output to reduce bias. The final preference percentage is used as the metric.

Example human evaluation survey question

Machine Evaluation

To systematically assess the quality of translations, we incorporate Machine Evaluation using the ChatGPT web application with the GPT-4o model as an automated evaluator. This approach allows us to efficiently compare different translation outputs at scale.
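Below is a hypothetical example of how such a pairwise judgment could be phrased; the wording is illustrative and not the exact instruction we gave GPT-4o.

```python
# Hypothetical pairwise-comparison prompt for an LLM judge; the actual
# instructions used with GPT-4o in our evaluation may differ.
JUDGE_PROMPT = """You are evaluating two English translations of the same manga page.

Source (Japanese): {source}
Translation A: {translation_a}
Translation B: {translation_b}

Considering fluency, accuracy, and how natural the dialogue reads in context,
answer with a single letter, "A" or "B", for the better overall translation."""


def build_judge_prompt(source, translation_a, translation_b):
    return JUDGE_PROMPT.format(source=source,
                               translation_a=translation_a,
                               translation_b=translation_b)
```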

Challenge of Evaluation on Typesetting Stage

Unlike translation accuracy, which can be measured with structured metrics, typesetting evaluation lacks automated methods. Assessing readability, artistic consistency, and text alignment remains largely subjective.

Existing tools can assist with text placement, but no standardized framework exists for systematically evaluating typesetting quality. Factors like font choice, text integration with artwork, and complex layouts pose significant challenges for automation.

To address this gap, our work explores approaches for structured typesetting evaluation, laying the groundwork for future research.

Results

Processing Stage Results

Distribution of accuracy metric results by page
Distribution of processing times by page

After processing the entire OpenMantra dataset and comparing our output to the baseline, we observed the following results:

  • Text Extraction:
    • Achieves an accuracy of 86.3%, with errors mostly coming from misidentified text fields that escape the filter.
    • Outperforms Lippmann et al. by almost 10%, primarily due to the implementation of the speech bubble filter.
    • Text in actual dialogue fields is almost always extracted perfectly by the OCR system.
  • Text Box Position:
    • 22% of pages contain at least one text box not included in the baseline data.
    • Almost all of these are misidentified sound effects, which do not appear to significantly affect translation quality.
  • Latency:
    • Model requires an average of about six seconds of processing time per stage, primarily allotted to clustering.
    • Pages requiring extraordinarily long processing times usually have large amounts of sound effects, which add significant computational load to the clustering algorithm.
  • Please note that these results come from processing using an A40 GPU; with increased computational power, the latency is expected to be significantly lower.
Latency breakdown by stage

Performance and BERTScore

Table of evaluation metrics

We evaluate our pipeline against three baselines that are effective in image-to-image (I2I) translation tasks:

  • Comparison to the mono-agent baseline: our multi-agent structure outperforms the baseline with both models (llama3.1:8b and gpt-4-turbo).
  • Comparison to Google Translate: our multi-agent structure using gpt-4-turbo outperforms the Google Translate baseline.

These findings are corroborated by human evaluation, with results shown below.

Human Evaluation Result

Human evaluation results

As shown in the plot, our model achieved a 67% win rate against the mono-agent structure and an 86% win rate against Google Translate.

Machine Evaluation

The results from the GPT-4o-based Machine Evaluation indicate that our model performed favorably in comparison to baseline models:

  • 90% preference over Google Translate, suggesting improvements in fluency and accuracy.
  • 65% preference over the mono-agent baseline, reflecting notable enhancements in translation quality.
LLM evaluation result

These findings suggest that GPT-4o's automated evaluation aligns well with human judgment and provides valuable insights into translation quality assessment.

Conclusion

Our multi-agent framework enhances manga translation by improving translation quality and aligning with human reading expectations. By distributing tasks—text detection, contextual translation, and typesetting—our system ensures more natural and readable translations.

We also developed a specialized typesetting system to seamlessly integrate translated text into manga, addressing formatting challenges often overlooked in automated approaches.

Challenges remain, including speech bubble resizing and processing latency. Future work will focus on real-time translation and expanding support for other media types.

This research sets the foundation for sophisticated multimodal translation systems that balance linguistic accuracy with visual presentation.

References

Lippmann, Philip, Konrad Skublicki, Joshua Tanner, Shonosuke Ishiwatari, and Jie Yang. 2024. “Context-Informed Machine Translation of Manga using Multimodal Large Language Models.” [Link]

Hinami, Ryota, Shonosuke Ishiwatari, Kazuhiko Yasuda, and Yusuke Matsui. 2021. “Towards Fully Automated Manga Translation.” [Link]