mono/packages/ui/docs/page-gen.md
2026-02-08 15:09:32 +01:00


# Reference Image Integration in Page Generator
This document explains how user-selected reference images influence the generation process for both text and images within the AI Page Generator.
## Overview
When a user selects reference images in the AI Page Generator, these images are passed to the AI model (LLM) as part of the conversation context. This enables **multimodal generation**, where the AI can "see" the selected images and use that visual understanding to guide its output.
## Data Flow
1. **Selection**: Users select images via the `ImagePickerDialog`. These are stored as `referenceImages` state in `AIPageGenerator`.
2. **Submission**: When "Generate" is clicked, the image URLs are collected and passed through `CreationWizardPopup` -> `usePageGenerator` -> `runTools`.
3. **Context Injection**: In `src/lib/openai.ts`, the `runTools` function detects the presence of images. It constructs a **multimodal user message** for the OpenAI API:
```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "User's text prompt..." },
    { "type": "image_url", "image_url": { "url": "https://..." } },
    { "type": "image_url", "image_url": { "url": "https://..." } }
  ]
}
```
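The message-building step can be sketched as follows. This is a minimal illustration of how `runTools` might assemble the multimodal user message; the helper name `buildUserMessage` and the exact shapes are assumptions, not the actual source code.

```typescript
// Sketch: assemble an OpenAI-style multimodal user message from a text
// prompt plus reference image URLs. `buildUserMessage` is hypothetical.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function buildUserMessage(prompt: string, images: string[]) {
  // The text prompt always comes first, followed by one part per image.
  const content: ContentPart[] = [{ type: "text", text: prompt }];
  for (const url of images) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return { role: "user" as const, content };
}
```

When `images` is empty, the result degrades gracefully to a single text part, so the same code path can serve text-only requests.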
## Impact on Generation
### 1. Text Generation (Direct Visual Context)
The LLM (e.g., GPT-4o) directly processes the image data. This allows it to:
* Describe the visible content of the reference images in the generated page.
* Match the tone, style, and mood of the text to the visual aesthetics of the images.
* Extract specific details (colors, objects, setting) from the images and incorporate them into the narrative.
### 2. Image Generation (Indirect Prompt Alignment)
Currently, **reference images are NOT passed as direct inputs** (img2img) to the underlying image generation tools (`generate_image` or `generate_markdown_image`).
Instead, the reference images influence image generation **indirectly via the LLM**:
1. The LLM "sees" the reference images and understands their style, composition, and subject matter.
2. When the LLM decides to generate *new* images for the page (using `generate_text_with_images`), it writes the **image generation prompts** based on this visual understanding.
3. **Result**: The newly generated images are likely to be stylistically consistent with the reference images because the prompts used to generate them were crafted by an AI that "saw" the references.
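The indirection above follows from the tool schema itself: the image tools accept only text parameters, so the LLM must encode what it "saw" into the prompt string. The sketch below is a hypothetical function-calling schema in the OpenAI tools format; the parameter names are assumptions based on this document, not the actual definition in the source.

```typescript
// Sketch: a text-only image tool schema. Note the absence of any
// input-image (img2img) parameter -- reference images can only reach
// the generated image through the prompt text the LLM writes.
const generateImageTool = {
  type: "function" as const,
  function: {
    name: "generate_image",
    description: "Generate new images from a text prompt.",
    parameters: {
      type: "object",
      properties: {
        prompt: { type: "string", description: "Text description of the desired image" },
        count: { type: "number", description: "How many images to generate" },
        model: { type: "string", description: "Image model to use" },
      },
      required: ["prompt"],
    },
  },
};
```

Adding true img2img support would mean extending this schema with an image input parameter and wiring the reference URLs through to the image backend, rather than relying on the LLM's textual description.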
## Schema Reference
* **`runTools` (`openai.ts`)**: Accepts `images: string[]` and builds the multimodal message.
* **`generate_text_with_images` (`markdownImageTools.ts`)**: Accepts text prompts for new images, but does not accept input images.
* **`generate_image` (`openai.ts`)**: Accepts text prompts, count, and model, but does not accept input images.