mono/packages/ui/docs/page-gen.md
2026-02-08 15:09:32 +01:00


# Reference Image Integration in Page Generator
This document explains how user-selected reference images influence the generation process for both text and images within the AI Page Generator.
## Overview
When a user selects reference images in the AI Page Generator, these images are passed to the AI model (LLM) as part of the conversation context. This enables **multimodal generation**, where the AI can "see" the selected images and use that visual understanding to guide its output.
## Data Flow
1. **Selection**: Users select images via the `ImagePickerDialog`. These are stored as `referenceImages` state in `AIPageGenerator`.
2. **Submission**: When "Generate" is clicked, the image URLs are collected and passed through `CreationWizardPopup` -> `usePageGenerator` -> `runTools`.
3. **Context Injection**: In `src/lib/openai.ts`, the `runTools` function detects the presence of images. It constructs a **multimodal user message** for the OpenAI API:
```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "User's text prompt..." },
    { "type": "image_url", "image_url": { "url": "https://..." } },
    { "type": "image_url", "image_url": { "url": "https://..." } }
  ]
}
```
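The message-building step can be sketched as follows. This is a minimal illustration of how `runTools` might assemble the multimodal user message; the helper name `buildUserMessage` and the exact shapes are assumptions, not the actual source code.

```typescript
// Sketch: assemble an OpenAI-style multimodal user message from a text
// prompt plus reference image URLs. `buildUserMessage` is hypothetical.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function buildUserMessage(prompt: string, images: string[]) {
  // The text prompt always comes first, followed by one part per image.
  const content: ContentPart[] = [{ type: "text", text: prompt }];
  for (const url of images) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return { role: "user" as const, content };
}
```

When `images` is empty, the result degrades gracefully to a single text part, so the same code path can serve text-only requests.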
## Impact on Generation
### 1. Text Generation (Direct Visual Context)
The LLM (e.g., GPT-4o) directly processes the image data. This allows it to:
* Describe the visible content of the reference images in the generated page.
* Match the tone, style, and mood of the text to the visual aesthetics of the images.
* Extract specific details (colors, objects, setting) from the images and incorporate them into the narrative.
### 2. Image Generation (Indirect Prompt Alignment)
Currently, **reference images are NOT passed as direct inputs** (img2img) to the underlying image generation tools (`generate_image` or `generate_markdown_image`).
Instead, the reference images influence image generation **indirectly via the LLM**:
1. The LLM "sees" the reference images and understands their style, composition, and subject matter.
2. When the LLM decides to generate *new* images for the page (using `generate_text_with_images`), it writes the **image generation prompts** based on this visual understanding.
3. **Result**: The newly generated images are likely to be stylistically consistent with the reference images because the prompts used to generate them were crafted by an AI that "saw" the references.
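The indirection above follows from the tool schema itself: the image tools accept only text parameters, so the LLM must encode what it "saw" into the prompt string. The sketch below is a hypothetical function-calling schema in the OpenAI tools format; the parameter names are assumptions based on this document, not the actual definition in the source.

```typescript
// Sketch: a text-only image tool schema. Note the absence of any
// input-image (img2img) parameter -- reference images can only reach
// the generated image through the prompt text the LLM writes.
const generateImageTool = {
  type: "function" as const,
  function: {
    name: "generate_image",
    description: "Generate new images from a text prompt.",
    parameters: {
      type: "object",
      properties: {
        prompt: { type: "string", description: "Text description of the desired image" },
        count: { type: "number", description: "How many images to generate" },
        model: { type: "string", description: "Image model to use" },
      },
      required: ["prompt"],
    },
  },
};
```

Adding true img2img support would mean extending this schema with an image input parameter and wiring the reference URLs through to the image backend, rather than relying on the LLM's textual description.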
## Schema Reference
* **`runTools` (`openai.ts`)**: Accepts `images: string[]` and builds the multimodal message.
* **`generate_text_with_images` (`markdownImageTools.ts`)**: Accepts text prompts for new images, but does not accept input images.
* **`generate_image` (`openai.ts`)**: Accepts text prompts, count, and model, but does not accept input images.