Translating large documents (> 40 MB) with preserved links and references #136

@holadaniv

Hi everyone,
First of all, thank you for taking the time to look into this. I really appreciate your help! I'm fairly new to programming, and I've set myself the challenge of translating very large documents (up to ~250 MB) without losing any of their hyperlinks or internal references. It's probably a bit ambitious for my current skill level, but I'm eager to learn and would love any guidance you can offer.

--

What I’m Trying to Do

I need to translate documents larger than 40 MB and keep all hyperlinks and internal cross-references intact.

What I’ve Tried So Far

  • DocumentTranslator-Legacy

    • ✅ Successfully translates files up to 250 MB
    • ❌ Unfortunately, all hyperlinks and cross-references are stripped out in the output
  • DocumentTranslation (new)

    • ❌ Fails immediately on any file over 40 MB (as documented)
    • ✅ Works perfectly on files under 40 MB and preserves every link and reference

Current Testing Status

I have built a small personal web project that replicates the capabilities of the new DocumentTranslation service (glossaries, custom translation, and more), and it works well for my purposes. I'm very satisfied with the results, but the 40 MB limit still prevents me from translating some of my larger documents. I'm now exploring whether there's a way to achieve the same or similar results with larger files.

What I’d Like to Happen

Translate documents above 40 MB without breaking any hyperlinks or internal references.

Background / Approach Comparison

If I understand correctly, the new DocumentTranslation service sends the entire file to Azure's document translation API, which according to the documentation is capped at 40 MB per file. The legacy translator, on the other hand, uses Open XML to extract the text and split it into chunks, translates those chunks as plain text, and then reinserts them. This preserves some formatting but loses links and references.
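To make the extract/chunk/reinsert flow concrete, here is a minimal Python sketch of the chunking step only. It is not the legacy tool's actual code: `make_batches`, `translate_segments`, the `translate_batch` callback, and the `max_chars` budget are all hypothetical names chosen for illustration, and the real per-request limit would have to come from the service documentation.

```python
from typing import Callable, List


def make_batches(segments: List[str], max_chars: int) -> List[List[int]]:
    """Group segment indices into batches whose combined length fits max_chars.

    A single segment longer than max_chars still gets a batch of its own;
    splitting inside a segment is left out of this sketch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    size = 0
    for i, text in enumerate(segments):
        # Start a new batch when adding this segment would exceed the budget.
        if current and size + len(text) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(i)
        size += len(text)
    if current:
        batches.append(current)
    return batches


def translate_segments(
    segments: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    max_chars: int = 5000,  # assumed budget, not a documented limit
) -> List[str]:
    """Translate batch by batch and put each result back at its original index."""
    out = list(segments)
    for batch in make_batches(segments, max_chars):
        translated = translate_batch([segments[i] for i in batch])
        for i, new_text in zip(batch, translated):
            out[i] = new_text
    return out
```

Because each translated string goes back to the index it came from, the order of the segments is kept. Whether links survive depends on what the reinsertion step does with the surrounding XML, which this sketch does not cover.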

My Question

Based on these two approaches, I see two possible paths forward:

  1. New service approach: Is there any way to raise or work around Azure’s 40 MB document limit?
  2. Legacy approach: Is there a better method to handle extraction/reinsertion so that all hyperlinks and internal references survive the translation?
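For the second path, one idea is to replace text in place rather than extracting it and rebuilding the content. The sketch below is an assumption about how that could look, not a tested solution for real documents: it uses only the Python standard library, works on a hand-written WordprocessingML fragment (`SAMPLE`), and the `translate_in_place` function and its `translate` callback are hypothetical. It rewrites only the text inside `w:t` elements, so the `w:hyperlink` wrapper and its `r:id` relationship are never touched.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Keep the usual w: and r: prefixes when the tree is serialized again.
ET.register_namespace("w", W)
ET.register_namespace("r", R)

# Hand-written fragment for illustration: plain text followed by a hyperlink.
SAMPLE = (
    f'<w:p xmlns:w="{W}" xmlns:r="{R}">'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:hyperlink r:id="rId5"><w:r><w:t>the manual</w:t></w:r></w:hyperlink>'
    '</w:p>'
)


def translate_in_place(xml_text, translate):
    """Replace the text of every w:t element, leaving all other markup as is."""
    root = ET.fromstring(xml_text)
    for node in root.iter(f"{{{W}}}t"):
        if node.text:
            node.text = translate(node.text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # str.upper stands in for a real translation call.
    print(translate_in_place(SAMPLE, str.upper))
```

Two caveats with this sketch: a sentence that Word has split across several runs would be translated piece by piece, which can hurt quality, and a real `document.xml` declares many more namespaces than the two registered here, so a full implementation would need to handle those carefully before writing the part back into the package.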
