generated from polymech/astro-shopify
There will be light :)
This commit is contained in:
parent 6c8e24cb68
commit af265e1118
README.md (+83)
@@ -0,0 +1,83 @@
## Complete Markdown Guide — Training & Serving an Open‑Source LLM on Your Own PDFs / Chat Logs

> **Goal:** Start with raw PDFs, Discord logs, forum dumps, or scraped HTML and end with a tuned, production‑ready large language model (LLM).

> **Toolchain:** 100 % open‑source; every major step is linked.

---

### 0. Quick bird’s‑eye view

| Phase | Core Tools | What happens |
|-------|------------|--------------|
| **Ingest & Clean** | `unstructured`, LangChain loaders | Parse PDFs / chats → structured text & metadata (sketched below) |
| **Dataset Build** | 🤗 `datasets`, dedup scripts | Combine, filter, split into training / eval |
| **Fine‑Tuning (SFT / QLoRA)** | Axolotl, Unsloth, Torchtune, PEFT | Lightweight parameter‑efficient updates |
| **Alignment (RLHF / DPO)** | TRL | Reward modelling & preference optimization |
| **Distributed Training** | DeepSpeed, Colossal‑AI | Multi‑GPU / multi‑node scaling |
| **Evaluation** | `lm‑eval‑harness`, Ragas | Benchmarks + domain‑specific tests |
| **Serving** | vLLM, Hugging Face TGI | Fast, OpenAI‑compatible inference APIs |
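
The first two phases of this table can be compressed into a few lines of Python. Below is a minimal sketch, assuming a local PDF and a flat `text` column; the file names, the `source` field, and the 5 % eval split are illustrative, not prescribed by this guide:

```python
# Minimal ingest sketch: PDF -> text elements -> 🤗 dataset.
# File names and the "source" field are illustrative.
from unstructured.partition.pdf import partition_pdf
from datasets import Dataset

elements = partition_pdf(filename="manuals/widget.pdf")  # hypothetical input
records = [
    {"text": el.text, "source": "widget.pdf"}
    for el in elements
    if el.text and el.text.strip()  # drop empty layout fragments
]

ds = Dataset.from_list(records).train_test_split(test_size=0.05, seed=42)
ds["train"].to_parquet("my_corpus.parquet")  # feeds the recipe below
ds["test"].to_parquet("my_corpus_eval.parquet")
```

The resulting `my_corpus.parquet` is the shape the fine‑tuning recipe further down expects.
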
---

### Fine‑Tuning

## 1. Set up the environment

```bash
# CUDA 12‑ready image (example)
docker run --gpus all -it --shm-size 64g nvcr.io/nvidia/pytorch:24.04-py3 bash
conda create -n llm python=3.11 && conda activate llm
pip install "torch>=2.2" "transformers>=4.40" accelerate bitsandbytes
```
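
Before kicking off a long run, a quick probe confirms the container actually sees the GPUs (plain PyTorch calls, nothing project‑specific):

```python
# Sanity-check the environment before training: verify the torch build
# and that CUDA devices are visible inside the container.
import torch

print(torch.__version__)                    # should satisfy the >=2.2 pin above
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```
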
---

## Fine‑Tuning‑Only vs. Full Alignment Pipeline for Open‑Source LLMs

| Aspect | Fine‑Tuning Only<br>(SFT / LoRA / QLoRA) | Full Pipeline<br>(SFT → DPO / RLHF) |
|--------|------------------------------------------|--------------------------------------|
| **Purpose** | Adapt model to new domain/tasks | Adapt **and** align answers with human preferences, safety rules |
| **Data needed** | 1‑100 k *single* instruction‑response pairs | Same SFT set **plus** 5‑100 k *preference* pairs (chosen + rejected) |
| **Compute** | Fits on 1 × A100‑40 GB for an 8 B model with QLoRA (−60 % VRAM, +39 % time) | Adds reward‑model + alignment pass ⇒ ≈ 2‑3× GPU hours (DPO cheapest) |
| **Training time** | Hours → Llama‑Factory LoRA = 3.7 × faster than P‑Tuning | Hours‑to‑days; alignment stage can add 50‑200 % wall‑time |
| **Cash cost** | Example SFT‑only: 72.5 h on 2 × A100 ≈ €200 | Same run + 1 DPO epoch ⇒ €312 (+56 %) |
| **Quality gains** | ↑ task accuracy, but may hallucinate or be off‑style | +18–30 % win‑rate on preference evals; fewer toxic / off‑policy replies |
| **Safety** | Relies on prompt guardrails | Alignment directly penalises unsafe outputs |
| **Complexity** | Single command; no extra data pipeline | Multi‑stage: collect feedback, train reward, run PPO / DPO, tune hyper‑params |
| **When to choose** | Narrow internal apps, low‑risk use, tight budget | Public‑facing chatbots, regulated domains, brand‑sensitive content |
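
To make the right‑hand column concrete, here is a minimal sketch of the DPO stage with TRL. The checkpoint name, preference file, and hyper‑parameters are placeholders, and the exact `DPOConfig`/`DPOTrainer` argument names vary between TRL releases:

```python
# DPO stage sketch. Expects a preference set with "prompt",
# "chosen" and "rejected" columns; all names are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "llama3-domain-qlora-merged"  # hypothetical SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prefs = load_dataset("parquet", data_files="preferences.parquet", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL keeps a frozen copy of the policy as reference
    args=DPOConfig(
        output_dir="llama3-domain-dpo",
        num_train_epochs=1,   # the single DPO epoch costed above
        beta=0.1,             # strength of the implicit KL penalty
    ),
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()
```

DPO skips the separate reward model that PPO‑style RLHF needs, which is why the table flags it as the cheapest alignment pass.
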
---
## Slimmed‑Down “Fine‑Tuning‑Only” Recipe (QLoRA + Unsloth)
```bash
pip install "unsloth[colab-new]" datasets bitsandbytes accelerate

python -m unsloth.finetune \
  --model meta-llama/Meta-Llama-3-8B \
  --dataset ./my_corpus.parquet \
  --lora_r 16 --lora_alpha 32 --lr 2e-5 --epochs 3 \
  --output_dir llama3-domain-qlora
```
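
If that CLI entry point is not available in your Unsloth version, the same recipe can be expressed through its Python API. A sketch, assuming the corpus has a flat `text` column; trainer argument names shift between TRL releases:

```python
# QLoRA fine-tune via Unsloth's Python API (sketch; names are assumptions).
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights: the "Q" in QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("parquet", data_files="my_corpus.parquet", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # assumes a flat "text" column
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama3-domain-qlora",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```
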
## Portability concerns

1. Artifact forms

| Form | Size (7 B model) | Typical file | Pros | Cons |
|------|------------------|--------------|------|------|
| LoRA / QLoRA adapter (only Δ‑weights) | 25‑200 MB | `adapter_model.safetensors` | Tiny; upload to HF Hub; stack several at once; hot‑swap at inference | Needs identical base checkpoint & tokenizer; some runtimes must support LoRA |
| Merged FP16 weights | 8‑16 GB | `pytorch_model-00001-of-00002.safetensors` | Single self‑contained model; any engine that speaks Hugging Face can load it | Heavy; re‑quantise for each HW target |
| Quantised GGUF | 2‑8 GB | `model.Q4_K_M.gguf` | Runs on CPU / mobile with llama.cpp, Ollama; LoRA can be loaded too | GPU engines (vLLM / TGI) ignore GGUF |

2. Runtimes & current LoRA support

| Runtime / Client | Load LoRA live? | Merge required? | Note |
|------------------|-----------------|-----------------|------|
| 🤗 Transformers | ✅ via `peft` | ❌ | canonical reference |
| vLLM ≥ 0.4 | ✅ `--enable-lora`, pulls adapters from the HF Hub | ❌ | remote LoRA download at startup (see the sketch after this table) |
| Hugging Face TGI 1.3 | ✅ (`--lora-adapter`) | ❌ | hot‑swap without restart |
| llama.cpp / GGUF | ✅ load a LoRA GGUF side‑file (or merge) | ❌ | convert PEFT LoRA → GGUF first |
| ONNX / TensorRT | ⚠️ must be merged first | ✅ | quantise after merge |
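
As an illustration of the vLLM row, a sketch that attaches the adapter per request with `LoRARequest`; the model name and adapter path are the placeholders used earlier in this guide:

```python
# Serve the base model and attach the adapter per request (sketch).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True)

outputs = llm.generate(
    ["Summarise the widget manual in two sentences."],
    SamplingParams(max_tokens=128),
    # LoRARequest(name, unique int id, adapter directory)
    lora_request=LoRARequest("domain-qlora", 1, "llama3-domain-qlora"),
)
print(outputs[0].outputs[0].text)
```
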
3. Format conversions you can rely on

| Conversion | Command / Tool | Portability gain |
|------------|----------------|------------------|
| PEFT LoRA → GGUF | `llama.cpp/convert_lora_to_gguf.py` | lets CPU‑only clients consume your fine‑tune |
| Merged weights → GGUF | `python llama.cpp/convert.py --outtype q4_0` | shrink & run on laptops |
| PyTorch → Safetensors | `model.save_pretrained(..., safe_serialization=True)` | faster, pickle‑free, HF‑native |
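
The last two rows combine naturally: merge the adapter into the base with PEFT's `merge_and_unload()`, then save the self‑contained safetensors checkpoint from the portability table above (paths are placeholders):

```python
# Fold the LoRA delta into the base weights and save a portable
# safetensors checkpoint (the "Merged FP16 weights" row above).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "llama3-domain-qlora")  # adapter dir

merged = model.merge_and_unload()  # bakes Δ-weights in; returns a plain model
merged.save_pretrained("llama3-domain-merged", safe_serialization=True)

# Ship the tokenizer alongside so any engine can load the folder directly.
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B") \
    .save_pretrained("llama3-domain-merged")
```
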
@@ -1 +1,3 @@
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --mode=completion --api_key=test --model=MFDoom/deepseek-r1-tool-calling:latest
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --api_key=test --model=MFDoom/deepseek-r1-tool-calling:latest --tools=fs
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --api_key=test --model=erwan2/DeepSeek-R1-Distill-Qwen-1.5B --tools=fs

@@ -2,3 +2,5 @@
#ollama run MFDoom/deepseek-r1-tool-calling
ollama run nezahatkorkmaz/deepseek-v3
ollama run MFDoom/deepseek-r1-tool-calling:7b
ollama run erwan2/DeepSeek-R1-Distill-Qwen-1.5B