## Complete Markdown Guide — Training & Serving an Open-Source LLM on Your Own PDFs / Chat Logs
> **Goal:** Start with raw PDFs, Discord logs, forum dumps, or scraped HTML and end with a tuned, production-ready large language model (LLM).
> **Toolchain:** 100 % open-source; every major step is linked.
---
### 0. Quick bird's-eye view
| Phase | Core Tools | What happens |
|-------|------------|--------------|
| **Ingest & Clean** | `unstructured`, LangChain loaders | Parse PDFs / chats → structured text & metadata (sketch below) |
| **Dataset Build** | 🤗 `datasets`, dedup scripts | Combine, filter, split into training / eval |
| **Fine-Tuning (SFT / QLoRA)** | Axolotl, Unsloth, Torchtune, PEFT | Lightweight parameter-efficient updates |
| **Alignment (RLHF / DPO)** | TRL | Reward modelling & preference optimization |
| **Distributed Training** | DeepSpeed, Colossal-AI | Multi-GPU / multi-node scaling |
| **Evaluation** | `lm-eval-harness`, Ragas | Benchmarks + domain-specific tests |
| **Serving** | vLLM, Hugging Face TGI | Fast, OpenAI-compatible inference APIs |
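
The first two phases often fit in one short script. A minimal sketch of the ingest-and-build step is below, assuming a folder of PDFs at `./docs`; the filter threshold and split ratio are illustrative, and the output file `my_corpus.parquet` is the one consumed by the fine-tuning recipe later in this guide.

```python
# Ingest PDFs with `unstructured`, then build a 🤗 dataset -- a sketch.
from pathlib import Path

from datasets import Dataset
from unstructured.partition.pdf import partition_pdf

records = []
for pdf in Path("docs").glob("*.pdf"):
    elements = partition_pdf(filename=str(pdf))        # parse one PDF into elements
    text = "\n".join(el.text for el in elements if el.text)
    records.append({"source": pdf.name, "text": text})

ds = Dataset.from_list(records)
ds = ds.filter(lambda r: len(r["text"]) > 200)         # drop near-empty documents
splits = ds.train_test_split(test_size=0.05, seed=42)  # training / eval split
splits["train"].to_parquet("my_corpus.parquet")
splits["test"].to_parquet("my_eval.parquet")
```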
---
## Fine-tuning

### 1. Set up the environment
```bash
# CUDA 12-ready image (example)
docker run --gpus all -it --shm-size 64g nvcr.io/nvidia/pytorch:24.04-py3 bash
conda create -n llm python=3.11 && conda activate llm
pip install "torch>=2.2" "transformers>=4.40" accelerate bitsandbytes
```
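
Before burning GPU hours, it is worth confirming the stack actually sees your hardware; a minimal check, assuming the `llm` environment above is active:

```python
# Sanity-check the freshly installed stack.
import torch
import transformers

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print("torch", torch.__version__, "| transformers", transformers.__version__)
print("GPU:", torch.cuda.get_device_name(0))
```

---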
## Fine-Tuning-Only vs. Full Alignment Pipeline for Open-Source LLMs
| Aspect | Fine-Tuning-Only<br>(SFT / LoRA / QLoRA) | Full Pipeline<br>(SFT → DPO / RLHF) |
|--------|------------------------------------------|--------------------------------------|
| **Purpose** | Adapt model to new domain/tasks | Adapt **and** align answers with human preferences, safety rules |
| **Data needed** | 1–100k *single* instruction-response pairs | Same SFT set **plus** 5–100k *preference* pairs (chosen + rejected) |
| **Compute** | Fits on 1×A100 40 GB for an 8B model with QLoRA (−60 % VRAM, +39 % time) | Adds reward-model + alignment pass ⇒ 2–3× GPU hours (DPO cheapest) |
| **Training time** | Hours → LlamaFactory LoRA is 3.7× faster than P-Tuning | Hours to days; alignment stage can add 50–200 % wall-time |
| **Cash cost** | Example SFT-only: 72.5 h on 2×A100 ≈ €200 | Same run + 1 DPO epoch ⇒ €312 (+56 %) |
| **Quality gains** | ↑ task accuracy, but may hallucinate or be off-style | +18–30 % win-rate on preference evals; fewer toxic / off-policy replies |
| **Safety** | Relies on prompt guardrails | Alignment directly penalises unsafe outputs |
| **Complexity** | Single command; no extra data pipeline | Multi-stage: collect feedback, train reward model, run PPO/DPO, tune hyperparams |
| **When to choose** | Narrow internal apps, low-risk use, tight budget | Public-facing chatbots, regulated domains, brand-sensitive content |
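
For the right-hand column, the DPO stage in TRL is only a few lines once preference pairs exist. A sketch is below; the checkpoint and dataset paths are assumptions, and the exact keyword (`processing_class` vs. the older `tokenizer`) depends on your TRL version.

```python
# One DPO pass over (chosen, rejected) preference pairs -- a sketch.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("./llama3-domain-sft")
tokenizer = AutoTokenizer.from_pretrained("./llama3-domain-sft")

# DPOTrainer expects "prompt", "chosen" and "rejected" columns
prefs = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,  # a frozen reference copy is created internally
    args=DPOConfig(output_dir="llama3-domain-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=prefs,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```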
---
## Slimmed-Down “Fine-Tuning-Only” Recipe (QLoRA + Unsloth)
```bash
pip install "unsloth[colab-new]" datasets bitsandbytes accelerate
python -m unsloth.finetune \
--model meta-llama/Meta-Llama-3-8B \
--dataset ./my_corpus.parquet \
--lora_r 16 --lora_alpha 32 --lr 2e-5 --epochs 3 \
  --output_dir llama3-domain-qlora
```
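
If the CLI entry point above is not available in your Unsloth build, the same recipe can be expressed through Unsloth's Python API. A sketch follows; the dataset column name and batch size are assumptions, and `SFTTrainer` argument names shift between TRL versions.

```python
# Python-API equivalent of the QLoRA recipe above -- a sketch.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantised base weights
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)

ds = load_dataset("parquet", data_files="my_corpus.parquet", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",  # assumes a plain "text" column
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama3-domain-qlora",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```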
## Portability concerns

### 1. Artefact formats

| Form | Size (7B model) | Typical file | Pros | Cons |
|------|-----------------|--------------|------|------|
| LoRA / QLoRA adapter (only Δ-weights) | 25–200 MB | `adapter_model.safetensors` | Tiny; upload to HF Hub, stack several at once, hot-swap at inference | Needs identical base checkpoint & tokenizer; some runtimes must support LoRA |
| Merged FP16 weights | 8–16 GB | `pytorch_model-00001-of-00002.safetensors` | Single self-contained model; any engine that speaks Hugging Face can load it | Heavy; re-quantise for each HW target |
| Quantised GGUF | 2–8 GB | `model.Q4_K_M.gguf` | Runs on CPU / mobiles with llama.cpp, Ollama; LoRA can be loaded too | GPU engines (vLLM/TGI) ignore GGUF |
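
Moving between the first two rows is a one-off merge: load the adapter onto its base model, fold the Δ-weights in, and save. A minimal sketch with PEFT, reusing the output directory from the recipe above:

```python
# Merge a LoRA/QLoRA adapter into standalone FP16 weights -- a sketch.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "llama3-domain-qlora")  # attach adapter
merged = model.merge_and_unload()                               # fold Δ-weights into base

# Safetensors output, i.e. the "Merged FP16 weights" row above
merged.save_pretrained("llama3-domain-merged", safe_serialization=True)
```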
### 2. Runtimes & current LoRA support

| Runtime / Client | Load LoRA live? | Merge required? | Note |
|------------------|-----------------|-----------------|------|
| 🤗 Transformers | ✅ via `peft` auto-loading | ❌ | canonical reference |
| vLLM ≥ 0.4 | ✅ `--enable-lora` → pulls from HF Hub | ❌ | remote LoRA download at startup |
| Hugging Face TGI ≥ 1.3 | ✅ (`--lora-adapters`) | ❌ | hot-swap without restart |
| llama.cpp / GGUF | ✅ load LoRA GGUF side-file (or merge) | ❌ | convert PEFT LoRA → GGUF first |
| ONNX / TensorRT | ⚠️ must be merged first | ✅ | quantise after merge |
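
As an example of live loading, vLLM can attach the adapter per request through its Python API. A sketch, reusing the adapter directory from earlier; `LoRARequest` takes a name, an integer id, and a local path:

```python
# Serve a LoRA adapter without merging -- a vLLM sketch.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True)
outputs = llm.generate(
    ["Summarise the key points of our refund policy."],  # illustrative prompt
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("domain-adapter", 1, "./llama3-domain-qlora"),
)
print(outputs[0].outputs[0].text)
```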
### 3. Format conversions you can rely on

| Conversion | Command / Tool | Portability gain |
|------------|----------------|------------------|
| PEFT LoRA → GGUF | `llama.cpp/convert_lora_to_gguf.py` | lets CPU-only clients consume your fine-tune |
| Merged weights → GGUF | `python llama.cpp/convert.py --outtype q4_0` | shrink & run on laptops |
| PyTorch → Safetensors | `model.save_pretrained(..., safe_serialization=True)` | faster, pickle-free, HF-native |