There will be light :)

This commit is contained in:
lovebird 2025-04-20 13:31:18 +02:00
parent 6c8e24cb68
commit af265e1118
3 changed files with 89 additions and 2 deletions

View File

@ -0,0 +1,83 @@
## Complete Markdown Guide — Training & Serving an Open-Source LLM on Your Own PDFs / Chat Logs
> **Goal:** Start with raw PDFs, Discord logs, forum dumps, or scraped HTML and end with a tuned, production-ready large language model (LLM).
> **Toolchain:** 100% open-source; every major step is linked.
---
### 0. Quick bird's-eye view
| Phase | Core Tools | What happens |
|-------|------------|--------------|
| **Ingest & Clean** | `unstructured`, LangChain loaders | Parse PDFs / chats → structured text & metadata |
| **Dataset Build** | 🤗 `datasets`, dedup scripts | Combine, filter, split into training / eval |
| **Fine-Tuning (SFT / QLoRA)** | Axolotl, Unsloth, Torchtune, PEFT | Lightweight parameter-efficient updates |
| **Alignment (RLHF / DPO)** | TRL | Reward modelling & preference optimization |
| **Distributed Training** | DeepSpeed, Colossal-AI | Multi-GPU / multi-node scaling |
| **Evaluation** | `lm-eval-harness`, Ragas | Benchmarks + domain-specific tests |
| **Serving** | vLLM, Hugging Face TGI | Fast, OpenAI-compatible inference APIs |
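The "dedup scripts" step of the Dataset Build phase can be sketched in plain Python. This is a minimal exact-match pass, not any specific tool from the table; the normalisation rule (lowercase, collapsed whitespace) is an illustrative assumption:

```python
import hashlib

def dedup_exact(texts):
    """Drop exact duplicates after light normalisation (case/whitespace)."""
    seen, unique = set(), []
    for t in texts:
        key = hashlib.sha256(" ".join(t.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

docs = ["Hello  World", "hello world", "Goodbye"]
print(dedup_exact(docs))  # normalised duplicates collapse to the first copy
```

Real pipelines usually add near-duplicate detection (e.g. MinHash) on top of an exact pass like this.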
---
## Fine-Tuning
### 1. Set up the environment
```bash
# CUDA 12-ready image (example)
docker run --gpus all -it --shm-size 64g nvcr.io/nvidia/pytorch:24.04-py3 bash
conda create -n llm python=3.11 && conda activate llm
pip install "torch>=2.2" "transformers>=4.40" accelerate bitsandbytes
```
---
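Before launching a run, a quick stdlib-only sanity check confirms the core libraries resolve inside the container. This is a sketch; the package list simply mirrors the `pip install` line above:

```python
import importlib.util

def check_env(pkgs):
    """Return {package: bool} for importability of each dependency."""
    return {p: importlib.util.find_spec(p) is not None for p in pkgs}

# Packages mirror the pip install line above.
report = check_env(["torch", "transformers", "accelerate", "bitsandbytes"])
for pkg, ok in report.items():
    print(f"{pkg}: {'found' if ok else 'MISSING'}")
```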
# Fine-Tuning-Only vs. Full Alignment Pipeline for Open-Source LLMs
| Aspect | Fine-Tuning-Only<br>(SFT / LoRA / QLoRA) | Full Pipeline<br>(SFT→DPO / RLHF) |
|--------|------------------------------------------|--------------------------------------|
| **Purpose** | Adapt model to new domain/tasks | Adapt **and** align answers with human preferences, safety rules |
| **Data needed** | 1–100k *single* instruction-response pairs | Same SFT set **plus** 5–100k *preference* pairs (chosen + rejected) |
| **Compute** | Fits on 1×A100 40 GB for an 8B model with QLoRA (−60% VRAM, +39% time) | Adds reward-model + alignment pass ⇒ 2–3× GPU hours (DPO cheapest) |
| **Training time** | Hours → LLaMA-Factory LoRA ≈ 3.7× faster than P-Tuning | Hours to days; alignment stage can add 50–200% wall-time |
| **Cash cost** | Example SFT-only: 72.5 h on 2×A100 ≈ €200 | Same run + 1 DPO epoch ⇒ €312 (+56%) |
| **Quality gains** | ↑ task accuracy, but may hallucinate or be off-style | +18–30% win-rate on preference evals; fewer toxic / off-policy replies |
| **Safety** | Relies on prompt guardrails | Alignment directly penalises unsafe outputs |
| **Complexity** | Single command; no extra data pipeline | Multi-stage: collect feedback, train reward, run PPO/DPO, tune hyperparams |
| **When to choose** | Narrow internal apps, low-risk use, tight budget | Public-facing chatbots, regulated domains, brand-sensitive content |
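For orientation on the "preference pairs" data the full pipeline needs: each record is just a prompt plus a chosen and a rejected completion, one JSON object per line of a `.jsonl` file. A minimal sketch — the field names follow the common convention used by TRL-style DPO trainers, and the example texts are invented:

```python
import json

# One preference record: prompt + preferred ("chosen") and dispreferred
# ("rejected") completion, as DPO-style trainers commonly expect.
pair = {
    "prompt": "Summarise our refund policy in one sentence.",
    "chosen": "Refunds are available within 30 days with proof of purchase.",
    "rejected": "idk check the website lol",
}
line = json.dumps(pair)  # one JSON object per line in a .jsonl file
print(line)
```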
---
## Slimmed-Down “Fine-Tuning-Only” Recipe (QLoRA + Unsloth)
```bash
pip install "unsloth[colab-new]" datasets bitsandbytes accelerate
python -m unsloth.finetune \
--model meta-llama/Meta-Llama-3-8B \
--dataset ./my_corpus.parquet \
--lora_r 16 --lora_alpha 32 --lr 2e-5 --epochs 3 \
  --output_dir llama3-domain-qlora
```

## Portability concerns
| Form | Size (7B model) | Typical file | Pros | Cons |
|------|-----------------|--------------|------|------|
| LoRA / QLoRA adapter (only Δ-weights) | 25–200 MB | `adapter_model.safetensors` | Tiny, upload to HF Hub, stack several at once, hot-swap at inference | Needs identical base checkpoint & tokenizer; some runtimes must support LoRA |
| Merged FP16 weights | 8–16 GB | `pytorch_model-00001-of-00002.safetensors` | Single self-contained model; any engine that speaks Hugging Face can load it | Heavy; re-quantise for each HW target |
| Quantised GGUF | 2–8 GB | `model.Q4_K_M.gguf` | Runs on CPU / mobiles with llama.cpp, Ollama; LoRA can be loaded too | GPU engines (vLLM/TGI) ignore GGUF |
### 2. Runtimes & current LoRA support
| Runtime / Client | Load LoRA live? | Merge required? | Note |
|------------------|-----------------|-----------------|------|
| 🤗 Transformers | ✅ via `peft` | ❌ | canonical reference |
| vLLM ≥ 0.4 | ✅ `--enable-lora` → pulls from HF Hub | ❌ | remote LoRA download at startup |
| Hugging Face TGI 1.3 | ✅ (`--lora-adapter`) | ❌ | hot-swap without restart |
| llama.cpp / GGUF | ✅ load LoRA GGUF side-file (or merge) | ❌ | convert PEFT LoRA → GGUF first |
| ONNX / TensorRT | ⚠️ must be merged first | ✅ | quantise after merge |
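Since vLLM and TGI expose OpenAI-compatible endpoints, a client only has to build a standard chat-completion payload. A stdlib-only sketch — the URL, port, and model name are placeholders for your own deployment, and the request is constructed but not sent:

```python
import json
import urllib.request

# Standard OpenAI-style chat payload; vLLM serves this at /v1/chat/completions.
payload = {
    "model": "llama3-domain-qlora",  # placeholder: name of your served model
    "messages": [{"role": "user", "content": "What plastics are recyclable?"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it once the server is up.
print(req.full_url)
```

The same payload works against any of the runtimes above that speak the OpenAI API, which is what makes the adapter-vs-merged choice independent of your client code.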
### 3. Format conversions you can rely on
| Conversion | Command / Tool | Portability gain |
|------------|----------------|------------------|
| PEFT LoRA → GGUF | `llama.cpp/convert_lora_to_gguf.py` | lets CPU-only clients consume your fine-tune |
| Merged weights → GGUF | `python llama.cpp/convert.py --outtype q4_0` | shrink & run on laptops |
| PyTorch → Safetensors | `model.save_pretrained(..., safe_serialization=True)` | faster, pickle-free, HF-native |

View File

@ -1 +1,3 @@
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --mode=completion --api_key=test --model=MFDoom/deepseek-r1-tool-calling:latest
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --api_key=test --model=MFDoom/deepseek-r1-tool-calling:latest --tools=fs
kbot "list of plastic types, store in types.md, as table, send it to my second wife:)" --baseURL=http://localhost:11434/v1 --api_key=test --model=erwan2/DeepSeek-R1-Distill-Qwen-1.5B --tools=fs

View File

@ -1,4 +1,6 @@
# Deepseek models with tool-calling: https://ollama.com/search?c=tools&q=deepseek
#ollama run MFDoom/deepseek-r1-tool-calling
ollama run nezahatkorkmaz/deepseek-v3
ollama run MFDoom/deepseek-r1-tool-calling:7b
ollama run erwan2/DeepSeek-R1-Distill-Qwen-1.5B