mono/packages/ui/docs/i18n.md
2026-02-19 09:24:43 +01:00

i18n — Content Translation & Versioning

Proposal for translating pages, widgets, and other content types with version tracking.


Status Quo

| What exists | Where |
| --- | --- |
| i18n_translations: flat src_text → dst_text cache | db-i18n.ts |
| i18n_glossaries / i18n_glossary_terms: DeepL glossary sync | db-i18n.ts |
| DeepL server-side translate: translate + cache in one call | i18n-deepl.ts |
| @polymech/i18n: shared clean() helper etc. | monorepo package |

The existing system translates arbitrary text blobs. It has no awareness of:

  • Which page / widget a translation belongs to
  • Which version of the source content was translated
  • Structural identity — if a widget moves or is deleted, orphaned translations linger

Goals

  1. Page-level translations — a translated "snapshot" of an entire page
  2. Widget-level translations — translate individual widget text props independently
  3. Content versioning — track which source version a translation was produced from, detect drift
  4. Reuse existing infra: i18n_translations stays as the text cache, DeepL stays as the engine

Proposed Database Schema

1. content_versions

Tracks every published snapshot of any content entity (pages, posts, collections, …).

create table content_versions (
  id            uuid primary key default gen_random_uuid(),
  entity_type   text not null,              -- 'page' | 'post' | 'collection'
  entity_id     uuid not null,              -- pages.id / posts.id / …
  version       int  not null default 1,    -- monotonic per entity
  content_hash  text not null,              -- sha256 of JSON content
  content       jsonb,                      -- snapshot of content at this version (optional, for rollback)
  meta          jsonb default '{}',         -- { author, change_note, … }
  created_at    timestamptz default now(),
  created_by    uuid references auth.users(id),

  unique (entity_type, entity_id, version)
);

create index idx_cv_entity on content_versions (entity_type, entity_id);

Why a separate table?
The pages table stores the current working state.
content_versions stores immutable snapshots you can diff, rollback, or translate against.
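
As a rough illustration of how a snapshot row could be derived, the sketch below serializes content with sorted keys (so that key order alone never changes the hash input) and picks the next version number per entity. The names `stableStringify`, `publishSnapshot`, and the `ContentVersion` shape are hypothetical, and the actual sha256 step over the canonical string is left out.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Deterministic JSON: object keys are sorted, so equal content gives equal text.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => stableStringify(v)).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k])).join(",") + "}";
}

// Hypothetical in-memory mirror of a content_versions row.
interface ContentVersion {
  entityType: string;
  entityId: string;
  version: number;
  canonical: string; // input for content_hash (sha256 would be applied to this)
}

// Builds the next snapshot for one entity; versions are monotonic per entity.
function publishSnapshot(
  history: ContentVersion[],
  entityType: string,
  entityId: string,
  content: Json,
): ContentVersion {
  const latest = history
    .filter((v) => v.entityType === entityType && v.entityId === entityId)
    .reduce((max, v) => Math.max(max, v.version), 0);
  return { entityType, entityId, version: latest + 1, canonical: stableStringify(content) };
}
```

In the database the same rule would be enforced by the unique (entity_type, entity_id, version) constraint, so concurrent publishes of the same entity would need a retry or a lock.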


2. content_translations

Links a translated content blob to a specific source version + language.

create type translation_status as enum ('draft', 'machine', 'reviewed', 'published');

create table content_translations (
  id               uuid primary key default gen_random_uuid(),
  entity_type      text not null,
  entity_id        uuid not null,
  source_version   int  not null,            -- FK-like ref to content_versions.version
  source_lang      text not null default 'de',
  target_lang      text not null,
  status           translation_status default 'draft',

  -- Translated payload (same shape as source content)
  translated_content  jsonb,                 -- full page JSON with translated strings

  -- Drift detection
  source_hash      text,                     -- hash of source at translation time
  is_stale         boolean default false,    -- set true when source gets a newer version

  meta             jsonb default '{}',       -- { translator, provider, cost, … }
  created_at       timestamptz default now(),
  updated_at       timestamptz default now(),
  translated_by    uuid references auth.users(id),

  unique (entity_type, entity_id, source_version, target_lang)
);

create index idx_ct_entity on content_translations (entity_type, entity_id, target_lang);

3. widget_translations (optional — granular level)

For widget-by-widget translation without duplicating the whole page JSON.

create table widget_translations (
  id              uuid primary key default gen_random_uuid(),
  entity_type     text not null default 'page',
  entity_id       uuid not null,
  widget_id       text not null,              -- WidgetInstance.id from the JSON tree
  prop_path       text not null default 'content',  -- e.g. 'content', 'label', 'placeholder'
  source_lang     text not null,
  target_lang     text not null,
  source_text     text not null,
  translated_text text not null,
  source_version  int,                        -- which content_version this was derived from
  status          translation_status default 'machine',
  meta            jsonb default '{}',
  created_at      timestamptz default now(),
  updated_at      timestamptz default now(),

  unique (entity_type, entity_id, widget_id, prop_path, target_lang)
);

create index idx_wt_entity on widget_translations (entity_type, entity_id, target_lang);

Why both content_translations and widget_translations?

  • content_translations = "give me the whole page in French" (fast serve)
  • widget_translations = "give me just widget X in French" (granular edit, partial retranslation)

When serving, we prefer content_translations (single read). When editing, we use widget_translations for surgical updates.

Translatable Widget Props

Not every widget property needs translation. Here's the map of translatable text:

| Widget Type | Translatable Props |
| --- | --- |
| html-widget | content |
| markdown-text | content |
| tabs-widget | tabs[].label |
| layout-container-widget | nestedPageName |
| photo-card | (title/description from pictures table) |
| gallery-widget | |
| file-browser | |
| Container (settings) | settings.title |

The shared function iterateWidgets() from @polymech/shared can walk the full content tree to extract translatable strings per widget.
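
To make the extraction step concrete, here is a minimal sketch that walks a widget tree and collects translatable strings using the prop map above. It does not use the real iterateWidgets() (its signature is not shown in this document); the `Widget` shape, `TRANSLATABLE_PROPS`, and `extractStrings` are hypothetical stand-ins.

```typescript
// Hypothetical widget shape; the real WidgetInstance type may differ.
interface Widget {
  id: string;
  type: string;
  props: Record<string, unknown>;
  children?: Widget[];
}

interface ExtractedString {
  widgetId: string;
  propPath: string; // concrete path, e.g. "tabs.0.label"
  text: string;
}

// Path patterns from the table above; "[]" means "every array item".
const TRANSLATABLE_PROPS: Record<string, string[]> = {
  "html-widget": ["content"],
  "markdown-text": ["content"],
  "tabs-widget": ["tabs[].label"],
  "layout-container-widget": ["nestedPageName"],
};

// Resolves one pattern against a value and emits every non-empty string it reaches.
function collect(
  value: unknown,
  segments: string[],
  path: string[],
  emit: (propPath: string, text: string) => void,
): void {
  if (segments.length === 0) {
    if (typeof value === "string" && value.trim() !== "") emit(path.join("."), value);
    return;
  }
  if (typeof value !== "object" || value === null) return;
  const head = segments[0];
  const rest = segments.slice(1);
  const isList = head.slice(-2) === "[]";
  const key = isList ? head.slice(0, -2) : head;
  const child = (value as Record<string, unknown>)[key];
  if (!isList) {
    collect(child, rest, path.concat(key), emit);
  } else if (Array.isArray(child)) {
    child.forEach((item, i) => collect(item, rest, path.concat(key, String(i)), emit));
  }
}

// Depth-first walk: a widget's own strings first, then its children.
function extractStrings(widgets: Widget[]): ExtractedString[] {
  const out: ExtractedString[] = [];
  for (const widget of widgets) {
    for (const pattern of TRANSLATABLE_PROPS[widget.type] ?? []) {
      collect(widget.props, pattern.split("."), [], (propPath, text) =>
        out.push({ widgetId: widget.id, propPath, text }));
    }
    out.push(...extractStrings(widget.children ?? []));
  }
  return out;
}
```

Widget types missing from the map (gallery-widget, file-browser) simply yield nothing, which matches the empty rows in the table.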


Content Versioning Flow

flowchart TD
    A["Page Editor"] -->|save| B["pages.content — working draft"]
    B -->|publish / snapshot| C["content_versions — immutable v1, v2, ..."]
    C -->|translate via DeepL / manual| D["content_translations — per version + lang"]

Version Lifecycle

  1. Author saves → pages.content updated (working state, no version bump)
  2. Author publishes → new row in content_versions (hash of content JSON, version++)
  3. Translation triggered → walks content tree, translates per widget, stores widget_translations + assembles a full content_translations row
  4. Source changes → next publish creates version N+1, all content_translations for version N get is_stale = true
  5. Retranslation → only re-translates widgets whose source_text changed (compare hashes)
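
Steps 4 and 5 can be sketched as two small pure functions. This is an in-memory illustration only: the row shape and the names `markStale` and `changedKeys` are hypothetical, and source texts are compared directly instead of via hashes.

```typescript
// Hypothetical subset of a content_translations row.
interface ContentTranslation {
  entityId: string;
  sourceVersion: number;
  targetLang: string;
  isStale: boolean;
}

// Step 4: after publishing `newVersion`, every translation of an older version is stale.
function markStale(
  rows: ContentTranslation[],
  entityId: string,
  newVersion: number,
): ContentTranslation[] {
  return rows.map((row) =>
    row.entityId === entityId && row.sourceVersion < newVersion
      ? { ...row, isStale: true }
      : row);
}

// "<widget_id>.<prop_path>" -> source text at a given version.
type SourceTexts = Record<string, string>;

// Step 5: only strings that are new or whose source text changed need retranslation.
function changedKeys(previous: SourceTexts, current: SourceTexts): string[] {
  return Object.keys(current)
    .filter((key) => previous[key] !== current[key])
    .sort();
}
```

Keys that exist only in `previous` belong to deleted widgets; they need no translation and could be cleaned up from widget_translations separately.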

Serving Translated Pages

When a page is requested with ?lang=fr:

1. Look up content_translations WHERE entity_type = 'page' AND entity_id = ? AND target_lang = 'fr' AND status = 'published', newest source_version first
2. If found → serve translated_content directly (no extra processing)
3. If found but is_stale = true → still serve it, and add an X-Translation-Stale: true header
4. If not found → serve source content (fallback)

Add lang to the enrichment / cache key in getPagesState() or create a parallel getTranslatedPagesState().
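
The lookup and fallback rules could look roughly like this. It is a sketch over in-memory rows rather than a real query; `TranslationRow`, `Served`, and `servePage` are hypothetical names, and picking the highest published source_version is an assumption about how ties between versions should be resolved.

```typescript
type Status = "draft" | "machine" | "reviewed" | "published";

// Hypothetical subset of a content_translations row.
interface TranslationRow {
  entityId: string;
  targetLang: string;
  sourceVersion: number;
  status: Status;
  isStale: boolean;
  translatedContent: unknown;
}

interface Served {
  content: unknown;
  lang: string;
  headers: Record<string, string>;
}

function servePage(
  sourceContent: unknown,
  sourceLang: string,
  rows: TranslationRow[],
  entityId: string,
  lang: string,
): Served {
  // Only published translations are served; newest source version wins.
  const best = rows
    .filter((r) => r.entityId === entityId && r.targetLang === lang && r.status === "published")
    .sort((a, b) => b.sourceVersion - a.sourceVersion)[0];

  // No published translation: fall back to the source content.
  if (!best) return { content: sourceContent, lang: sourceLang, headers: {} };

  // A stale translation is still served, but flagged for the client.
  const headers: Record<string, string> = best.isStale ? { "X-Translation-Stale": "true" } : {};
  return { content: best.translatedContent, lang, headers };
}
```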


Integration with Existing i18n

The existing i18n_translations table continues to serve as the text-level translation cache (src → dst lookup). The new tables add structural awareness on top:

i18n_translations        → text cache (DeepL results, any text)
widget_translations      → maps widget+prop → translation pair
content_translations     → full translated content snapshot
content_versions         → immutable source snapshots

translateTextServer() (from i18n-deepl.ts) remains the engine. The new translation logic calls it per widget prop, then assembles results.


External Translation Services (Crowdin, Phrase, Lokalise)

The Problem

Our page content is deeply nested JSON (RootLayoutData → pages → containers → widgets → props). External TMS platforms don't understand this structure — they work with flat key→value files in standard formats.

We need an extract/inject pipeline that converts between our JSON tree and industry-standard formats.

Exchange Format Strategy

| Format | Best For | Crowdin | Phrase | Lokalise |
| --- | --- | --- | --- | --- |
| XLIFF 2.0 | Industry standard, rich metadata, tool support | | | |
| Flat JSON | Simple key→value, easy to diff | | | |
| ICU MessageFormat | Plurals, gender, variables | | | |

Recommended primary format: XLIFF 2.0 — it carries source + target in one file, supports notes/context for translators, and every TMS speaks it natively.

Secondary: Flat JSON — for scripting, quick diffs, and lightweight integrations.

Key Design — Stable Translation Keys

Every translatable string gets a stable key derived from its position in the content tree:

page.<page_id>.widget.<widget_id>.<prop_path>

Examples:

page.a1b2c3.widget.w-markdown-1.content
page.a1b2c3.widget.w-tabs-1.tabs.0.label
page.a1b2c3.widget.w-tabs-1.tabs.1.label
page.a1b2c3.container.c-hero.settings.title
page.a1b2c3.meta.title                        ← page title itself

These keys are widget-ID-based, not position-based. If a widget moves within the page, its key stays the same. If a widget is deleted, its key disappears from the next export.
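
A small sketch of building and parsing these keys, assuming that page, widget, and container IDs never contain a dot (otherwise the key would need escaping). `TranslationKey`, `buildKey`, and `parseKey` are hypothetical names.

```typescript
type KeyScope = "widget" | "container" | "meta";

interface TranslationKey {
  pageId: string;
  scope: KeyScope;
  nodeId?: string; // widget or container id; absent for "meta"
  propPath: string; // e.g. "content" or "tabs.0.label"
}

function buildKey(key: TranslationKey): string {
  const head = key.scope === "meta"
    ? ["page", key.pageId, "meta"]
    : ["page", key.pageId, key.scope, key.nodeId ?? ""];
  return head.concat(key.propPath).join(".");
}

// Returns null for anything that does not follow the page.<id>.<scope>... layout.
function parseKey(raw: string): TranslationKey | null {
  const parts = raw.split(".");
  if (parts.length < 4 || parts[0] !== "page") return null;
  const pageId = parts[1];
  const scope = parts[2];
  if (scope === "meta") {
    return { pageId, scope, propPath: parts.slice(3).join(".") };
  }
  if ((scope === "widget" || scope === "container") && parts.length >= 5) {
    return { pageId, scope, nodeId: parts[3], propPath: parts.slice(4).join(".") };
  }
  return null;
}
```

Because the prop path is everything after the node ID, array indices such as `tabs.0.label` round-trip without special handling.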

XLIFF Export Example

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="de" trgLang="en">
  <file id="page-a1b2c3" original="page/a1b2c3">
    <unit id="page.a1b2c3.meta.title">
      <notes>
        <note category="context">Page title</note>
        <note category="max-length">255</note>
      </notes>
      <segment>
        <source>Kunststoff-Recycling Übersicht</source>
        <target>Plastic Recycling Overview</target>
      </segment>
    </unit>
    <unit id="page.a1b2c3.widget.w-md-1.content">
      <notes>
        <note category="context">Markdown text widget — supports markdown formatting</note>
        <note category="widget-type">markdown-text</note>
      </notes>
      <segment>
        <source>## Einleitung\n\nDiese Seite beschreibt...</source>
        <target/>
      </segment>
    </unit>
    <unit id="page.a1b2c3.widget.w-tabs-1.tabs.0.label">
      <notes>
        <note category="context">Tab label</note>
        <note category="max-length">50</note>
      </notes>
      <segment>
        <source>Übersicht</source>
        <target/>
      </segment>
    </unit>
  </file>
</xliff>
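
A file like the one above could be produced by a plain string serializer along these lines. This is a sketch, not a full XLIFF writer: `XliffUnit`, `escapeXml`, and `toXliff` are hypothetical names, only a single context note per unit is supported, and validation against the XLIFF 2.0 schema is not attempted.

```typescript
interface XliffUnit {
  key: string; // stable translation key, used as the unit id
  source: string;
  target?: string; // omitted or empty -> <target/>
  context?: string; // translator hint
}

// Escapes the characters that are unsafe in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function toXliff(pageId: string, srcLang: string, trgLang: string, units: XliffUnit[]): string {
  const body = units.map((unit) => {
    const lines = [`    <unit id="${escapeXml(unit.key)}">`];
    if (unit.context) {
      lines.push("      <notes>");
      lines.push(`        <note category="context">${escapeXml(unit.context)}</note>`);
      lines.push("      </notes>");
    }
    lines.push("      <segment>");
    lines.push(`        <source>${escapeXml(unit.source)}</source>`);
    lines.push(unit.target ? `        <target>${escapeXml(unit.target)}</target>` : "        <target/>");
    lines.push("      </segment>");
    lines.push("    </unit>");
    return lines.join("\n");
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="${escapeXml(srcLang)}" trgLang="${escapeXml(trgLang)}">`,
    `  <file id="page-${escapeXml(pageId)}" original="page/${escapeXml(pageId)}">`,
    ...body,
    "  </file>",
    "</xliff>",
  ].join("\n");
}
```

Escaping matters here because widget content is markdown or HTML; an unescaped `<` or `&` in a source string would make the whole file unparseable for the TMS.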

Flat JSON Export Example

{
  "_meta": {
    "entity_type": "page",
    "entity_id": "a1b2c3",
    "source_version": 3,
    "source_lang": "de",
    "exported_at": "2026-02-17T10:00:00Z"
  },
  "page.a1b2c3.meta.title": "Kunststoff-Recycling Übersicht",
  "page.a1b2c3.widget.w-md-1.content": "## Einleitung\n\nDiese Seite beschreibt...",
  "page.a1b2c3.widget.w-tabs-1.tabs.0.label": "Übersicht",
  "page.a1b2c3.widget.w-tabs-1.tabs.1.label": "Details",
  "page.a1b2c3.container.c-hero.settings.title": "Willkommen"
}
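
On the import side, a file in this shape has to be split back into its `_meta` header and the key → string entries. The following is one possible sketch; `ExportMeta`, `FlatImport`, and `parseFlatJson` are hypothetical names, and rejecting non-string values outright is an assumption about how strict the importer should be.

```typescript
interface ExportMeta {
  entity_type: string;
  entity_id: string;
  source_version: number;
  source_lang: string;
  exported_at?: string;
}

interface FlatImport {
  meta: ExportMeta;
  strings: Record<string, string>; // translation key -> text
}

function parseFlatJson(raw: string): FlatImport {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("expected a JSON object at the top level");
  }
  const record = data as Record<string, unknown>;

  // The header tells us which source version the strings were exported from.
  const meta = record["_meta"] as Partial<ExportMeta> | undefined;
  if (!meta || typeof meta.entity_id !== "string" || typeof meta.source_version !== "number") {
    throw new Error("missing or incomplete _meta block");
  }

  const strings: Record<string, string> = {};
  for (const key of Object.keys(record)) {
    if (key === "_meta") continue;
    const value = record[key];
    if (typeof value !== "string") throw new Error(`value for "${key}" is not a string`);
    strings[key] = value;
  }
  return { meta: meta as ExportMeta, strings };
}
```

Comparing `meta.source_version` with the latest content_versions row at import time is what lets the importer decide whether the incoming translation is already stale.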

Extract → Export → Translate → Import → Inject Pipeline

flowchart LR
    subgraph OUR_SYSTEM["Our System"]
        CV["content_versions v3"] -->|"1 EXTRACT\niterateWidgets"| KV["Flat key-value map"]
        KV -->|"2 EXPORT\nserialize to XLIFF or JSON"| FILE_OUT[".xliff / .json file"]
        FILE_IN["Translated .xliff / .json"] -->|"3 IMPORT\nparse to key-value map"| KV_TR["Translated key-value map"]
        KV_TR -->|"4 INJECT\nwalk tree, replace strings"| CT["content_translations"]
        KV_TR -->|"4 INJECT"| WT["widget_translations"]
    end

    subgraph TMS["External TMS"]
        CROWDIN["Crowdin / Phrase / Lokalise"]
        HUMAN["Human translators + MT review"]
        CROWDIN --> HUMAN
        HUMAN --> CROWDIN
    end

    FILE_OUT --> CROWDIN
    CROWDIN --> FILE_IN
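
The INJECT step of the pipeline above (walk the tree, replace strings) might be sketched as follows. The `Widget` shape, `setString`, and `injectTranslations` are hypothetical; the sketch only handles `widget` keys, assumes IDs contain no dots, and only overwrites properties that already hold a string so that a stray key cannot create new props.

```typescript
// Hypothetical widget shape; the real WidgetInstance type may differ.
interface Widget {
  id: string;
  type: string;
  props: Record<string, unknown>;
  children?: Widget[];
}

// Overwrites the string at `path` inside `props`; never creates new properties.
function setString(props: Record<string, unknown>, path: string[], text: string): boolean {
  let node: unknown = props;
  for (const segment of path.slice(0, -1)) {
    if (typeof node !== "object" || node === null) return false;
    node = (node as Record<string, unknown>)[segment];
  }
  if (typeof node !== "object" || node === null) return false;
  const holder = node as Record<string, unknown>;
  const leaf = path[path.length - 1];
  if (typeof holder[leaf] !== "string") return false;
  holder[leaf] = text;
  return true;
}

function injectTranslations(
  pageId: string,
  widgets: Widget[],
  translated: Record<string, string>,
): { widgets: Widget[]; applied: number } {
  // Deep clone so the source snapshot stays immutable.
  const copy = JSON.parse(JSON.stringify(widgets)) as Widget[];
  let applied = 0;
  const visit = (list: Widget[]): void => {
    for (const widget of list) {
      const prefix = `page.${pageId}.widget.${widget.id}.`;
      for (const key of Object.keys(translated)) {
        if (key.indexOf(prefix) !== 0) continue;
        const path = key.slice(prefix.length).split(".");
        if (setString(widget.props, path, translated[key])) applied++;
      }
      visit(widget.children ?? []);
    }
  };
  visit(copy);
  return { widgets: copy, applied };
}
```

Comparing `applied` with the number of keys in the import gives a cheap signal for orphaned keys, for example strings that belong to widgets deleted since the export.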

How Human Translation Fits the Status Flow

flowchart TD
    A["Machine translate via DeepL"] --> B["status = machine"]
    B --> C["Export to TMS"]
    C --> D["Human review and edit"]
    D --> E["Import back"]
    E --> F["status = reviewed"]
    F --> G["Editor approves"]
    G --> H["status = published"]

  1. Machine pre-fill: DeepL translates all strings → stored with status = 'machine'
  2. Export to TMS: export the machine-translated file (with source + target pre-filled) so human translators only need to review and fix, not translate from scratch
  3. Import from TMS: translated file comes back → status = 'reviewed'
  4. Publish: editor approves → status = 'published', content_translations assembled
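
The status flow can be guarded with a small transition table. In this sketch the machine → reviewed → published path comes from the steps above, while the transitions out of `draft` (for manually started translations) and the names `ALLOWED` and `advance` are assumptions.

```typescript
// Mirrors the translation_status enum from the schema.
type TranslationStatus = "draft" | "machine" | "reviewed" | "published";

// Allowed next states per status. Transitions out of "draft" are an assumption.
const ALLOWED: Record<TranslationStatus, TranslationStatus[]> = {
  draft: ["machine", "reviewed"],
  machine: ["reviewed"],
  reviewed: ["published"],
  published: [],
};

// Returns the new status, or throws if the step would skip review or approval.
function advance(current: TranslationStatus, next: TranslationStatus): TranslationStatus {
  if (ALLOWED[current].indexOf(next) === -1) {
    throw new Error(`illegal status transition: ${current} -> ${next}`);
  }
  return next;
}
```

With this table a TMS import can only move a row to `reviewed`, and only an explicit editor approval can reach `published`.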

API Additions for TMS Interop

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/pages/:id/export/:lang?format=xliff | Export translatable strings as XLIFF or JSON |
| POST | /api/pages/:id/import/:lang | Import translated XLIFF or JSON file |
| GET | /api/pages/:id/export/:lang?format=json | Export as flat JSON |
| POST | /api/i18n/webhook/crowdin | Crowdin webhook for auto-import on completion |

Crowdin-Specific Integration Notes

  • Source files: upload the flat JSON export as a "source file" per page
  • File naming: page-{slug}-v{version}.json — Crowdin tracks versions by filename
  • Branches: use Crowdin branches to match content_versions — branch = version
  • Webhooks: Crowdin fires file.translated / file.approved → our webhook imports
  • In-Context: Crowdin's in-context editing can work via our ?lang=pseudo mode that renders keys instead of text

Glossary Sync

The existing i18n_glossaries / i18n_glossary_terms tables can be:

  • Exported as TBX (TermBase eXchange) or Crowdin-compatible CSV
  • Synced bidirectionally: terms added in Crowdin → imported to our DB → pushed to DeepL glossary

This keeps DeepL machine translations and human translations using the same terminology.


API Surface (Proposed)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/pages/:id/publish | Snapshot current content → content_versions |
| GET | /api/pages/:id/versions | List versions for a page |
| GET | /api/pages/:id/versions/:v | Get specific version snapshot |
| POST | /api/pages/:id/translate | Translate page to target lang(s) |
| GET | /api/pages/:id/translations | List available translations |
| GET | /api/pages/:id/translations/:lang | Get translated content for lang |
| PATCH | /api/pages/:id/translations/:lang/widgets/:wid | Update single widget translation |
| POST | /api/content/:type/:id/publish | Generic publish for any entity type |
| POST | /api/content/:type/:id/translate | Generic translate for any entity type |

Open Questions / Decisions Needed

  1. Publish-on-save vs explicit publish?
    Do we auto-version on every save, or require an explicit "Publish" action?
    Recommend: explicit publish to avoid version spam.

  2. Widget-level table — now or later?
    widget_translations adds complexity. We could start with page-level only (content_translations) and add widget-level later.
    Recommend: start with both — widget-level is needed for partial retranslation.

  3. Store full content in content_versions or just the hash?
    Storing full JSON enables rollback but costs storage.
    Recommend: store it — pages are small (< 100 KB each), rollback is high value.

  4. Which entity types beyond pages?
    Posts? Collections? Categories?
    Recommend: start with pages only, the schema is generic enough to extend.

  5. UI for translation management?
    A side-by-side translation editor? Or just an "auto-translate" button?
    This doc covers the backend schema only — UI TBD.


Migration Priority

| Phase | Scope | Tables |
| --- | --- | --- |
| Phase 1 | Content versioning for pages | content_versions |
| Phase 2 | Page-level translations | content_translations |
| Phase 3 | Widget-level translations | widget_translations |
| Phase 4 | Extend to posts / collections | Same tables, new entity_type values |