mono/packages/ui/docs/page-gen.md

Reference Image Integration in Page Generator

This document explains how user-selected reference images influence the generation process for both text and images within the AI Page Generator.

Overview

When a user selects reference images in the AI Page Generator, these images are passed to the AI model (LLM) as part of the conversation context. This enables multimodal generation, where the AI can "see" the selected images and use that visual understanding to guide its output.

Data Flow

  1. Selection: Users select images via the ImagePickerDialog. These are stored as referenceImages state in AIPageGenerator.

  2. Submission: When "Generate" is clicked, the image URLs are collected and passed through CreationWizardPopup -> usePageGenerator -> runTools.

  3. Context Injection: In src/lib/openai.ts, the runTools function detects the presence of images. It constructs a multimodal user message for the OpenAI API:

    {
      "role": "user",
      "content": [
        { "type": "text", "text": "User's text prompt..." },
        { "type": "image_url", "image_url": { "url": "https://..." } },
        { "type": "image_url", "image_url": { "url": "https://..." } }
      ]
    }
    

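The context-injection step can be sketched as a small pure helper. This is an illustrative sketch, not the actual runTools implementation; the buildMultimodalMessage name and the exact option shapes are assumptions.

```typescript
// Sketch of how runTools might assemble the multimodal user message.
// Names and types here are illustrative assumptions, not the real code.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface UserMessage {
  role: "user";
  content: string | ContentPart[];
}

function buildMultimodalMessage(prompt: string, images: string[]): UserMessage {
  // With no reference images, plain string content suffices.
  if (images.length === 0) {
    return { role: "user", content: prompt };
  }
  // Otherwise, pair the text prompt with one image_url part per image.
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      ...images.map((url) => ({
        type: "image_url" as const,
        image_url: { url },
      })),
    ],
  };
}
```

The resulting message matches the JSON shape shown above: one text part followed by one image_url part per selected reference image.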
Impact on Generation

1. Text Generation (Direct Visual Context)

The LLM (e.g., GPT-4o) directly processes the image data. This allows it to:

  • Describe the visible content of the reference images in the generated page.
  • Match the tone, style, and mood of the text to the visual aesthetics of the images.
  • Extract specific details (colors, objects, setting) from the images and incorporate them into the narrative.

2. Image Generation (Indirect Prompt Alignment)

Currently, reference images are NOT passed as direct inputs (img2img) to the underlying image generation tools (generate_image or generate_markdown_image).

Instead, the reference images influence image generation indirectly via the LLM:

  1. The LLM "sees" the reference images and understands their style, composition, and subject matter.
  2. When the LLM decides to generate new images for the page (using generate_text_with_images), it writes the image generation prompts based on this visual understanding.
  3. Result: The newly generated images are likely to be stylistically consistent with the reference images because the prompts used to generate them were crafted by an AI that "saw" the references.
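The text-only contract of the image tools is what makes this alignment indirect. A hedged sketch of what such a tool schema might look like follows; the parameter names beyond prompt, count, and model are assumptions, and the real schemas live in openai.ts and markdownImageTools.ts.

```typescript
// Illustrative tool definition: the parameters accept only text, a count,
// and a model name. There is no field for an input image, so reference
// images can only influence output through the prompt the LLM writes.
const generateImageTool = {
  type: "function" as const,
  function: {
    name: "generate_image",
    description: "Generate new images from a text prompt.",
    parameters: {
      type: "object",
      properties: {
        prompt: {
          type: "string",
          description: "Text prompt written by the LLM",
        },
        count: { type: "integer", minimum: 1 },
        model: { type: "string" },
      },
      required: ["prompt"],
    },
  },
};
```

Because the schema exposes no image field, adding true img2img support would require extending these tool parameters, not just the conversation context.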

Schema Reference

  • runTools (openai.ts): Accepts images: string[] and builds the multimodal message.
  • generate_text_with_images (markdownImageTools.ts): Accepts text prompts for new images, but does not accept input images.
  • generate_image (openai.ts): Accepts text prompts, count, and model, but does not accept input images.
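Taken together, the notes above imply signatures along these lines. This is a type-level sketch only; option names beyond prompt, images, count, and model are assumptions.

```typescript
// Hypothetical signature sketches matching the schema notes above.
interface RunToolsOptions {
  prompt: string;
  images?: string[]; // reference image URLs, injected as image_url parts
}

interface GenerateImageOptions {
  prompt: string; // text only: no input-image parameter
  count?: number;
  model?: string;
}

declare function runTools(options: RunToolsOptions): Promise<string>;
declare function generateImage(
  options: GenerateImageOptions
): Promise<string[]>;
```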