mono-cpp/polymech.md
2026-03-28 13:11:29 +01:00

# Polymech C++ Gridsearch Worker — Design
## Goal
Port the [gridsearch-worker.ts](../src/products/locations/gridsearch-worker.ts) pipeline to native C++, running as a **CLI subcommand** (`polymech-cli gridsearch`) while keeping all logic in internal libraries under `packages/`. The worker communicates progress via the [IPC framing protocol](./packages/ipc/) and writes results to Supabase via the existing [postgres](./packages/postgres/) package.
---
## Status
| Package | Status | Tests | Assertions |
|---------|--------|-------|------------|
| `geo` | ✅ Done | 23 | 77 |
| `gadm_reader` | ✅ Done | 18 | 53 |
| `grid` | ✅ Done | 13 | 105 |
| `search` | ✅ Done | 8 | 13 |
| CLI `gridsearch` | ✅ Done | — | dry-run verified (3ms) |
| IPC `gridsearch` | ✅ Done | 1 | 30 |
| **Total** | | **63** | **278** |
---
## Existing C++ Inventory
| Package | Provides |
|---------|----------|
| `ipc` | Length-prefixed JSON over stdio |
| `postgres` | Supabase PostgREST: `query`, `insert` |
| `http` | libcurl `GET`/`POST` |
| `json` | RapidJSON validate/prettify |
| `logger` | spdlog (stdout or **stderr** in worker mode) |
| `html` | HTML parser |
---
## TypeScript Pipeline (Reference)
```
GADM Resolve → Grid Generate → SerpAPI Search → Enrich → Supabase Upsert
```
| Phase | Input | Output | Heavy work |
|-------|-------|--------|------------|
| **1. GADM Resolve** | GID list + target level | `GridFeature[]` (GeoJSON polygons with GHS props) | Read pre-cached JSON files from `cache/gadm/boundary_{GID}_{LEVEL}.json` |
| **2. Grid Generate** | `GridFeature[]` + settings | `GridSearchHop[]` (waypoints: lat/lng/radius) | Centroid, bbox, distance, area, point-in-polygon, cell sorting |
| **3. Search** | Waypoints + query + SerpAPI key | Place results (JSON) | HTTP calls to `serpapi.com`, per-waypoint caching |
| **4. Enrich** | Place results | Enriched data (emails, pages) | HTTP scraping |
| **5. Persist** | Enriched places | Supabase `places` + `grid_search_runs` | PostgREST upsert |
---
## Implemented Packages
### 1. `packages/geo` — Geometry primitives ✅
Header + `.cpp`, no external deps. Implements the **turf.js subset** used by the grid generator.
```cpp
namespace geo {
struct Coord { double lon, lat; };
struct BBox { double minLon, minLat, maxLon, maxLat; };
BBox bbox(const std::vector<Coord>& ring);
Coord centroid(const std::vector<Coord>& ring);
double area_sq_m(const std::vector<Coord>& ring);
double distance_km(Coord a, Coord b);
bool point_in_polygon(Coord pt, const std::vector<Coord>& ring);
std::vector<BBox> square_grid(BBox extent, double cellSizeKm);
std::vector<BBox> hex_grid(BBox extent, double cellSizeKm);
std::vector<Coord> buffer_circle(Coord center, double radiusKm, int steps = 6);
} // namespace geo
```
**Rationale**: ~200 lines of code avoid pulling in GEOS or Boost.Geometry. Adopts the `pip.h` ray-casting pattern from `packages/gadm/cpp/` without the GDAL/GEOS/PROJ dependency chain (~700 MB).
---
### 2. `packages/gadm_reader` — Boundary resolver ✅
Reads pre-cached GADM boundary JSON from disk. No network calls.
```cpp
namespace gadm {

struct Feature {
    std::string gid, name;
    int level;
    std::vector<std::vector<geo::Coord>> rings;
    double ghsPopulation, ghsBuiltWeight;
    geo::Coord ghsPopCenter, ghsBuiltCenter;
    std::vector<std::array<double, 3>> ghsPopCenters;  // [lon, lat, weight]
    std::vector<std::array<double, 3>> ghsBuiltCenters;
    double areaSqKm;
};

BoundaryResult load_boundary(const std::string& gid, int targetLevel,
                             const std::string& cacheDir = "cache/gadm");

} // namespace gadm
```
Handles `Polygon`/`MultiPolygon`, GHS enrichment fields, fallback resolution by country code prefix.
---
### 3. `packages/grid` — Grid generator ✅
Direct port of [grid-generator.ts](../../shared/src/products/places/grid-generator.ts).
```cpp
namespace grid {

struct Waypoint { int step; double lng, lat, radius_km; };

struct GridOptions {
    std::string gridMode;       // "hex", "square", "admin", "centers"
    double cellSize;            // km
    double cellOverlap, centroidOverlap;
    int maxCellsLimit;
    double maxElevation, minDensity, minGhsPop, minGhsBuilt;
    std::string ghsFilterMode;  // "AND" | "OR"
    bool allowMissingGhs, bypassFilters;
    std::string pathOrder;      // "zigzag", "snake", "spiral-out", "spiral-in", "shortest"
    bool groupByRegion;
};

struct GridResult {
    std::vector<Waypoint> waypoints;
    int validCells, skippedCells;
    std::string error;
};

GridResult generate(const std::vector<gadm::Feature>& features, const GridOptions& opts);

} // namespace grid
```
**4 modes**: `admin` (centroid + radius), `centers` (GHS deduplicated), `hex`, `square` (tessellation + PIP)
**5 sort algorithms**: `zigzag`, `snake`, `spiral-out`, `spiral-in`, `shortest` (greedy NN)
---
### 4. `packages/search` — SerpAPI client + config ✅
```cpp
namespace search {

struct Config {
    std::string serpapi_key, geocoder_key, bigdata_key;
    std::string postgres_url, supabase_url, supabase_service_key;
};

Config load_config(const std::string& path = "config/postgres.toml");

struct SearchOptions {
    std::string query;
    double lat, lng;
    int zoom = 13, limit = 20;
    std::string engine = "google_maps", hl = "en", google_domain = "google.com";
};

struct MapResult {
    std::string title, place_id, data_id, address, phone, website, type;
    std::vector<std::string> types;
    double rating;
    int reviews;
    GpsCoordinates gps;
};

SearchResult search_google_maps(const Config& cfg, const SearchOptions& opts);

} // namespace search
```
Reads `[services].SERPAPI_KEY`, `GEO_CODER_KEY`, `BIG_DATA_KEY` from `config/postgres.toml`. HTTP pagination via `http::get()`, JSON parsing with RapidJSON.
---
## CLI Subcommands ✅
### 1. `gridsearch` (One-shot execution)
```
polymech-cli gridsearch <GID> <QUERY> [OPTIONS]

Positionals:
  GID                        GADM GID (e.g. ESP.1.1_1) — ignored when --settings is used
  QUERY                      Search query — ignored when --settings is used

Options:
  -l, --level INT            Target GADM level (default: 0)
  -m, --mode TEXT            Grid mode: hex|square|admin|centers (default: hex)
  -s, --cell-size FLOAT      Cell size in km (default: 5.0)
      --limit INT            Max results per area (default: 20)
  -z, --zoom INT             Google Maps zoom (default: 13)
      --sort TEXT            Path order: snake|zigzag|spiral-out|spiral-in|shortest
  -c, --config TEXT          TOML config path (default: config/postgres.toml)
      --cache-dir TEXT       GADM cache directory (default: cache/gadm)
      --settings TEXT        JSON settings file (matches TypeScript GuidedPreset shape)
      --enrich               Run enrichment pipeline (meta + email) after search
      --persistence-postgres Persist run data natively via Postgres
  -o, --output TEXT          Output JSON file (default: gridsearch-HH-MM.json in cwd)
      --dry-run              Generate grid only, skip SerpAPI search
```
### 2. `worker` (IPC Daemon execution)
```
polymech-cli worker [OPTIONS]

Options:
      --daemon          Run persistent daemon pool (tier-based)
  -c, --config TEXT     TOML config path (default: config/postgres.toml)
      --user-uid TEXT   User ID to bind this daemon to (needed for place owner)
      --uds TEXT        Serve over a Unix Domain Socket (TCP on Windows) at the given path
```
### Execution flow
```
1. load_config(configPath) → Config (TOML)
2. gadm::load_boundary(gid, level) → features[]
3. grid::generate(features, opts) → waypoints[]
4. --dry-run → output JSON array and exit
5. For each waypoint → search::search_google_maps(cfg, sopts)
6. Stream JSON summary to stdout
```
### Example
```bash
polymech-cli gridsearch ABW "recycling" --dry-run
# → [{"step":1,"lat":12.588582,"lng":-70.040465,"radius_km":3.540}, ...]
# [info] Dry-run complete in 3ms
```
### IPC worker mode
The `worker` subcommand routes multiplexed, asynchronous `gridsearch` payloads. When launched with `--uds <path>`, it starts an Asio streaming server (AF_UNIX sockets on POSIX, TCP sockets on Windows). Event frames (`grid-ready`, `waypoint-start`, `location`, `node`, etc.) flow in both directions over the IPC framing protocol without cross-thread locking.
---
## Exposed Configuration / Tuning Parameters
As we integrate deeper with the core business logic, the Node orchestrator and internal services need to configure and enforce limits on the underlying C++ concurrency engine. The configuration surfaces to expose for the primary ecosystem libraries are:
### 1. Taskflow (`https://github.com/taskflow/taskflow`)
- **`executor_threads` (`num_workers`)**: The size of the `tf::Executor` thread pool. Since gridsearch is heavily network-I/O bound (HTTP calls for search and enrichment), setting this well above `std::thread::hardware_concurrency()` can significantly improve HTTP throughput.
- **`max_concurrent_jobs_per_user`**: A limit on how many gridsearch invocation graphs a single tenant/user can enqueue and run concurrently, to prevent monopolization.
- **`http_concurrency_throttle`**: Per-pipeline-graph limits on scraping and SerpAPI requests to avoid `429 Too Many Requests` bans.
### 2. Moodycamel ConcurrentQueue (`https://github.com/cameron314/concurrentqueue`)
- **`queue_depth_max` / `backpressure`**: The Moodycamel queue allocates memory dynamically and lock-free, with no capacity bound, so we must impose a hard ceiling/backpressure limit at the Node-to-C++ IPC layer. If Node streams jobs faster than Taskflow can execute them, the daemon will eventually OOM.
- **`bulk_dequeue_size`**: How many queued IPC tasks the dispatch thread dequeues per batch.
### 3. Boost.Asio (`https://github.com/chriskohlhoff/asio`)
- **`ipc_timeout_ms` (read/write)**: Mandatory timeouts for the IPC socket layer. If the orchestrator stalls, crashes, or goes silent, Asio must reap the connection and cancel its in-flight tasks to prevent zombie worker processes.
- **`max_ipc_connections`**: Absolute limit on simultaneous orchestration pipelines connecting to a single worker pod.
- **`buffer_size_max`**: A cap on async payload allocations, so that a malformed 200 MB JSON frame from Node.js cannot cause an unbounded allocation inside `asio::read`.
---
## Build Integration
### Dependency graph
```
             ┌──────────────┐
             │   polymech   │  (the lib)
             │ polymech-cli │  (the binary)
             └──────┬───────┘
        ┌───────────┼───────────┐
        ▼           ▼           ▼
   ┌────────┐   ┌──────┐    ┌─────┐
   │ search │   │ grid │    │ ipc │
   └───┬────┘   └──┬───┘    └─────┘
       ▼           ▼
   ┌──────┐  ┌─────────────┐
   │ http │  │ gadm_reader │
   └──────┘  └──────┬──────┘
                    ▼
                ┌─────┐
                │ geo │  ← no deps (math only)
                └─────┘

   ┌──────┐
   │ json │  ← RapidJSON
   └──────┘
```
All packages depend on `logger` and `json` implicitly.
---
## Testing
### Unit tests (Catch2) — 62 tests, 248 assertions ✅
| Test file | Tests | Assertions | Validates |
|-----------|-------|------------|-----------|
| `test_geo.cpp` | 23 | 77 | Haversine, area, centroid, PIP, hex/square grid |
| `test_gadm_reader.cpp` | 18 | 53 | JSON parsing, GHS props, fallback resolution |
| `test_grid.cpp` | 13 | 105 | All 4 modes × 5 sorts, GHS filtering, PIP clipping |
| `test_search.cpp` | 8 | 13 | Config loading, key validation, error handling |
### Integration test (Node.js)
- Existing `orchestrator/test-ipc.mjs` validates spawn/lifecycle/ping/job
- `orchestrator/test-gridsearch-ipc.mjs` validates full pipeline via IPC (8 event types + job result)
- `orchestrator/test-gridsearch-ipc-uds.mjs` validates the Unix Domain Socket transport under load, backpressure limits, and mid-flight cancellation via `action: cancel` frames
---
## IPC Cancellation & Dynamic Job Tuning
The UDS daemon tracks in-flight jobs and handles JSON `action: cancel` frames that reference a specific `jobId`, exiting the corresponding Taskflow job gracefully mid-flight.
Dynamic tuning limits, such as memory-buffering boundaries or thread-pool capacities, are validated against hard ceilings defined in the `[system]` block of `config/postgres.toml`.
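A hypothetical `[system]` block illustrating how such ceilings might look. The key names mirror the tuning parameters above but are illustrative only; the actual schema is whatever `config/postgres.toml` defines:

```toml
[system]
executor_threads = 32             # tf::Executor pool size
max_concurrent_jobs_per_user = 4  # per-tenant graph limit
queue_depth_max = 1024            # IPC backpressure ceiling
bulk_dequeue_size = 16            # tasks dequeued per dispatch batch
ipc_timeout_ms = 30000            # read/write socket timeout
max_ipc_connections = 16          # simultaneous orchestrator pipelines
buffer_size_max = 8388608         # 8 MiB max per IPC frame
```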
---
## Deferred (Phase 2)
| Item | Reason |
|------|--------|
| SerpAPI response caching | State store managed by orchestrator for now |
| Protobuf framing | JSON IPC sufficient for current throughput |
| Multi-threaded search | Sequential is fine for SerpAPI rate limits |
| GEOS integration | Custom geo is sufficient for grid math |