2026-03-28 13:11:29 +01:00


Polymech C++ Gridsearch Worker — Design

Goal

Port the gridsearch-worker.ts pipeline to native C++, running as a CLI subcommand (polymech-cli gridsearch) while keeping all logic in internal libraries under packages/. The worker communicates progress via the IPC framing protocol and writes results to Supabase via the existing postgres package.

Status

Package	Status	Tests	Assertions
`geo`	✅ Done	23	77
`gadm_reader`	✅ Done	18	53
`grid`	✅ Done	13	105
`search`	✅ Done	8	13
CLI `gridsearch`	✅ Done	—	dry-run verified (3ms)
IPC `gridsearch`	✅ Done	1	30
Total		63	278

Existing C++ Inventory

Package	Provides
`ipc`	Length-prefixed JSON over stdio
`postgres`	Supabase PostgREST: `query`, `insert`
`http`	libcurl `GET`/`POST`
`json`	RapidJSON validate/prettify
`logger`	spdlog (stdout or stderr in worker mode)
`html`	HTML parser
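The `ipc` row above describes length-prefixed JSON frames. The exact wire layout is not given here, so the following is only a minimal sketch that assumes a 4-byte little-endian length header followed by the UTF-8 JSON payload; `encode_frame` and `try_decode_frame` are hypothetical names, not the package's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed layout (not confirmed by the ipc package):
// [4-byte little-endian payload length][payload bytes]
std::string encode_frame(const std::string& json) {
    assert(json.size() <= 0xFFFFFFFFu);  // payload must fit the 32-bit header
    const std::uint32_t n = static_cast<std::uint32_t>(json.size());
    std::string out;
    out.reserve(4 + json.size());
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((n >> (8 * i)) & 0xFF));
    out += json;
    return out;
}

// Pops one complete frame from the front of `buf` into `payload`.
// Returns false and leaves `buf` untouched when more bytes are needed.
bool try_decode_frame(std::string& buf, std::string& payload) {
    if (buf.size() < 4) return false;
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<std::uint32_t>(static_cast<unsigned char>(buf[i])) << (8 * i);
    if (buf.size() - 4 < n) return false;
    payload = buf.substr(4, n);
    buf.erase(0, 4 + static_cast<std::size_t>(n));
    return true;
}
```

A reader would append incoming bytes to `buf` and call `try_decode_frame` in a loop until it returns false, which handles both partial reads and several frames arriving in one read.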

TypeScript Pipeline (Reference)

GADM Resolve → Grid Generate → SerpAPI Search → Enrich → Supabase Upsert

Phase	Input	Output	Heavy work
1. GADM Resolve	GID list + target level	`GridFeature[]` (GeoJSON polygons with GHS props)	Read pre-cached JSON files from `cache/gadm/boundary_{GID}_{LEVEL}.json`
2. Grid Generate	`GridFeature[]` + settings	`GridSearchHop[]` (waypoints: lat/lng/radius)	Centroid, bbox, distance, area, point-in-polygon, cell sorting
3. Search	Waypoints + query + SerpAPI key	Place results (JSON)	HTTP calls to `serpapi.com`, per-waypoint caching
4. Enrich	Place results	Enriched data (emails, pages)	HTTP scraping
5. Persist	Enriched places	Supabase `places` + `grid_search_runs`	PostgREST upsert

Implemented Packages

1. `packages/geo` — Geometry primitives ✅

Header + .cpp, no external deps. Implements the turf.js subset used by the grid generator.

namespace geo {

struct Coord { double lon, lat; };
struct BBox  { double minLon, minLat, maxLon, maxLat; };

BBox   bbox(const std::vector<Coord>& ring);
Coord  centroid(const std::vector<Coord>& ring);
double area_sq_m(const std::vector<Coord>& ring);
double distance_km(Coord a, Coord b);
bool   point_in_polygon(Coord pt, const std::vector<Coord>& ring);

std::vector<BBox> square_grid(BBox extent, double cellSizeKm);
std::vector<BBox> hex_grid(BBox extent, double cellSizeKm);
std::vector<Coord> buffer_circle(Coord center, double radiusKm, int steps = 6);
} // namespace geo

Rationale: about 200 lines of code avoid pulling in GEOS or Boost.Geometry. The package adopts the pip.h ray-casting pattern from packages/gadm/cpp/ without the GDAL/GEOS/PROJ dependency (~700MB).
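To illustrate why such a small footprint is plausible, here is a hedged sketch of two of the primitives: a haversine distance and an even-odd ray-casting test. It lives in a separate `geo_sketch` namespace because it is not the package's code; the Earth radius constant (6371.0088 km) is an assumption.

```cpp
#include <cassert>  // used by the sanity checks shown after the block
#include <cmath>
#include <cstddef>
#include <vector>

namespace geo_sketch {

struct Coord { double lon, lat; };

// Great-circle distance using the haversine formula.
// Assumes a spherical Earth with mean radius 6371.0088 km.
inline double distance_km(Coord a, Coord b) {
    const double kRad = 3.14159265358979323846 / 180.0;
    const double kEarthRadiusKm = 6371.0088;
    const double dLat = (b.lat - a.lat) * kRad;
    const double dLon = (b.lon - a.lon) * kRad;
    const double h = std::sin(dLat / 2) * std::sin(dLat / 2) +
                     std::cos(a.lat * kRad) * std::cos(b.lat * kRad) *
                     std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(h));
}

// Even-odd ray casting: toggle `inside` for every ring edge crossed by a
// ray running from `pt` towards increasing longitude.
inline bool point_in_polygon(Coord pt, const std::vector<Coord>& ring) {
    bool inside = false;
    const std::size_t n = ring.size();
    for (std::size_t i = 0, j = (n ? n - 1 : 0); i < n; j = i++) {
        const Coord& a = ring[i];
        const Coord& b = ring[j];
        if ((a.lat > pt.lat) != (b.lat > pt.lat)) {
            // Longitude at which edge a-b crosses the latitude of pt.
            const double lonAtLat =
                a.lon + (pt.lat - a.lat) * (b.lon - a.lon) / (b.lat - a.lat);
            if (pt.lon < lonAtLat) inside = !inside;
        }
    }
    return inside;
}

}  // namespace geo_sketch
```

As a quick check, `assert(geo_sketch::point_in_polygon({1, 1}, {{0,0},{2,0},{2,2},{0,2},{0,0}}))` holds. The test treats the ring in planar lon/lat space, which is reasonable for grid cells of a few kilometres but not for polygons crossing the antimeridian.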

2. `packages/gadm_reader` — Boundary resolver ✅

Reads pre-cached GADM boundary JSON from disk. No network calls.

namespace gadm {

struct Feature {
    std::string gid, name;
    int level;
    std::vector<std::vector<geo::Coord>> rings;
    double ghsPopulation, ghsBuiltWeight;
    geo::Coord ghsPopCenter, ghsBuiltCenter;
    std::vector<std::array<double, 3>> ghsPopCenters;   // [lon, lat, weight]
    std::vector<std::array<double, 3>> ghsBuiltCenters;
    double areaSqKm;
};

BoundaryResult load_boundary(const std::string& gid, int targetLevel,
                             const std::string& cacheDir = "cache/gadm");
} // namespace gadm

Handles Polygon/MultiPolygon, GHS enrichment fields, fallback resolution by country code prefix.
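The cache lookup can be pictured as follows. The file name pattern `boundary_{GID}_{LEVEL}.json` comes from the pipeline table above; the fallback order (exact GID first, then the country-code prefix) and the helper names are assumptions made for illustration only.

```cpp
#include <cassert>  // used by the sanity check shown after the block
#include <string>
#include <vector>

// Builds <cacheDir>/boundary_<GID>_<LEVEL>.json
std::string boundary_cache_path(const std::string& cacheDir,
                                const std::string& gid, int level) {
    return cacheDir + "/boundary_" + gid + "_" + std::to_string(level) + ".json";
}

// "ESP.1.1_1" -> "ESP"; a bare country code such as "ABW" is returned as-is.
std::string country_prefix(const std::string& gid) {
    return gid.substr(0, gid.find_first_of("._"));
}

// Hypothetical lookup order: the exact GID first, then the country-level file.
std::vector<std::string> candidate_paths(const std::string& cacheDir,
                                         const std::string& gid, int level) {
    std::vector<std::string> paths;
    paths.push_back(boundary_cache_path(cacheDir, gid, level));
    const std::string country = country_prefix(gid);
    if (country != gid)
        paths.push_back(boundary_cache_path(cacheDir, country, level));
    return paths;
}
```

For example, `assert(country_prefix("ESP.1.1_1") == "ESP")` holds, and `candidate_paths("cache/gadm", "ABW", 0)` yields a single path because the GID is already a country code.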

3. `packages/grid` — Grid generator ✅

Direct port of grid-generator.ts.

namespace grid {

struct Waypoint { int step; double lng, lat, radius_km; };
struct GridOptions {
    std::string gridMode;      // "hex", "square", "admin", "centers"
    double cellSize;           // km
    double cellOverlap, centroidOverlap;
    int maxCellsLimit;
    double maxElevation, minDensity, minGhsPop, minGhsBuilt;
    std::string ghsFilterMode; // "AND" | "OR"
    bool allowMissingGhs, bypassFilters;
    std::string pathOrder;     // "zigzag", "snake", "spiral-out", "spiral-in", "shortest"
    bool groupByRegion;
};
struct GridResult { std::vector<Waypoint> waypoints; int validCells, skippedCells; std::string error; };

GridResult generate(const std::vector<gadm::Feature>& features, const GridOptions& opts);
} // namespace grid

4 modes: admin (centroid + radius), centers (GHS deduplicated), hex, square (tessellation + PIP)

5 sort algorithms: zigzag, snake, spiral-out, spiral-in, shortest (greedy NN)
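The "shortest" order is described as greedy nearest-neighbour. A minimal sketch of that idea is shown below; it is not the ported code. It assumes the walk starts at the first waypoint, ranks candidates with an equirectangular approximation instead of a full haversine, and renumbers `step` from 1. `order_shortest` and `approx_dist2` are hypothetical names.

```cpp
#include <cassert>  // used by the sanity check shown after the block
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Waypoint { int step; double lng, lat, radius_km; };

// Squared equirectangular distance in degrees; only used to rank candidates,
// which is adequate when waypoints are close together.
inline double approx_dist2(const Waypoint& a, const Waypoint& b) {
    const double kRad = 3.14159265358979323846 / 180.0;
    const double x = (b.lng - a.lng) * std::cos(0.5 * (a.lat + b.lat) * kRad);
    const double y = b.lat - a.lat;
    return x * x + y * y;
}

// Greedy nearest-neighbour ordering, O(n^2): keep the first waypoint, then
// repeatedly move the closest unvisited waypoint into the next slot.
std::vector<Waypoint> order_shortest(std::vector<Waypoint> pts) {
    for (std::size_t i = 0; i + 1 < pts.size(); ++i) {
        std::size_t best = i + 1;
        for (std::size_t j = i + 2; j < pts.size(); ++j) {
            if (approx_dist2(pts[i], pts[j]) < approx_dist2(pts[i], pts[best]))
                best = j;
        }
        std::swap(pts[i + 1], pts[best]);
    }
    for (std::size_t i = 0; i < pts.size(); ++i)
        pts[i].step = static_cast<int>(i) + 1;  // steps are 1-based
    return pts;
}
```

Given waypoints at longitudes 0, 3, 1, 2 on the equator, the function returns them in the order 0, 1, 2, 3. Greedy ordering is not optimal in general, but it is cheap and keeps consecutive searches geographically close.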

4. `packages/search` — SerpAPI client + config ✅

namespace search {

struct Config {
    std::string serpapi_key, geocoder_key, bigdata_key;
    std::string postgres_url, supabase_url, supabase_service_key;
};

Config load_config(const std::string& path = "config/postgres.toml");

struct SearchOptions {
    std::string query;
    double lat, lng;
    int zoom = 13, limit = 20;
    std::string engine = "google_maps", hl = "en", google_domain = "google.com";
};

struct MapResult {
    std::string title, place_id, data_id, address, phone, website, type;
    std::vector<std::string> types;
    double rating; int reviews;
    GpsCoordinates gps;
};

SearchResult search_google_maps(const Config& cfg, const SearchOptions& opts);
} // namespace search

Reads [services].SERPAPI_KEY, GEO_CODER_KEY, BIG_DATA_KEY from config/postgres.toml. HTTP pagination via http::get(), JSON parsing with RapidJSON.

CLI Subcommands ✅

1. `gridsearch` (One-shot execution)

polymech-cli gridsearch <GID> <QUERY> [OPTIONS]

Positionals:
  GID                   GADM GID (e.g. ESP.1.1_1) — ignored when --settings is used
  QUERY                 Search query — ignored when --settings is used

Options:
  -l, --level INT       Target GADM level (default: 0)
  -m, --mode TEXT       Grid mode: hex|square|admin|centers (default: hex)
  -s, --cell-size FLOAT Cell size in km (default: 5.0)
  --limit INT           Max results per area (default: 20)
  -z, --zoom INT        Google Maps zoom (default: 13)
  --sort TEXT           Path order: snake|zigzag|spiral-out|spiral-in|shortest
  -c, --config TEXT     TOML config path (default: config/postgres.toml)
  --cache-dir TEXT      GADM cache directory (default: cache/gadm)
  --settings TEXT       JSON settings file (matches TypeScript GuidedPreset shape)
  --enrich              Run enrichment pipeline (meta + email) after search
  --persistence-postgres Persist run data natively via Postgres
  -o, --output TEXT     Output JSON file (default: gridsearch-HH-MM.json in cwd)
  --dry-run             Generate grid only, skip SerpAPI search

2. `worker` (IPC Daemon execution)

polymech-cli worker [OPTIONS]

Options:
  --daemon              Run persistent daemon pool (tier-based)
  -c, --config TEXT     TOML config path (default: config/postgres.toml)
  --user-uid TEXT       User ID to bind this daemon to (needed for place owner)
  --uds TEXT            Run over Unix Domain Socket / Named Pipe (TCP on Windows) at the given path

Execution flow

1. load_config(configPath)               → Config (TOML)
2. gadm::load_boundary(gid, level)       → features[]
3. grid::generate(features, opts)        → waypoints[]
4. --dry-run → output JSON array and exit
5. For each waypoint → search::search_google_maps(cfg, sopts)
6. Stream JSON summary to stdout

Example

polymech-cli gridsearch ABW "recycling" --dry-run
# → [{"step":1,"lat":12.588582,"lng":-70.040465,"radius_km":3.540}, ...]
# [info] Dry-run complete in 3ms

IPC worker mode

The worker subcommand routes multiplexed, asynchronous gridsearch payloads. When launched with --uds <path>, it starts an Asio streaming server (AF_UNIX sockets on POSIX, TCP sockets on Windows). Event frames (grid-ready, waypoint-start, location, node, etc.) travel in both directions over the IPC framing protocol, and the design avoids blocking locks.

Exposed Configuration / Tuning Parameters

As we integrate deeper with the core business logic, the Node orchestrator and internal services should configure and enforce limits on the underlying C++ concurrent engine. Relevant configuration surfaces we need to expose for the primary ecosystem libraries include:

1. Taskflow (`https://github.com/taskflow/taskflow`)

executor_threads (num_workers): The size of the tf::Executor thread pool. Because gridsearch is heavily network I/O bound (HTTP calls for search and enrichment), setting this well above std::thread::hardware_concurrency() may improve HTTP throughput.
max_concurrent_jobs_per_user: A limit on how many gridsearch task graphs a single tenant/user can have queued or running at once, so that one user cannot monopolize the executor.
http_concurrency_throttle: A per-pipeline cap on concurrent node-scraping and SerpAPI requests, to avoid 429 Too Many Requests responses.
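A per-user limit of this kind can be enforced before a graph is ever submitted to the executor. The sketch below uses only the standard library rather than Taskflow itself; `UserJobLimiter` is a hypothetical helper, and how the worker actually counts jobs is not specified in this document.

```cpp
#include <cassert>  // used by the sanity checks shown after the block
#include <mutex>
#include <string>
#include <unordered_map>

// Tracks active jobs per user and refuses new ones above a fixed limit.
class UserJobLimiter {
public:
    explicit UserJobLimiter(int maxPerUser) : max_(maxPerUser) {}

    // Returns true and counts the job if the user is below the limit.
    bool try_acquire(const std::string& userId) {
        std::lock_guard<std::mutex> lock(mu_);
        int& active = active_[userId];
        if (active >= max_) return false;
        ++active;
        return true;
    }

    // Call once for every successful try_acquire when the job ends
    // (finished, failed, or cancelled).
    void release(const std::string& userId) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = active_.find(userId);
        if (it == active_.end()) return;
        if (--it->second <= 0) active_.erase(it);
    }

private:
    std::mutex mu_;
    std::unordered_map<std::string, int> active_;
    int max_;
};
```

With `UserJobLimiter lim(2)`, the third `lim.try_acquire("u1")` returns false until one of the first two jobs calls `lim.release("u1")`. A rejected job can be reported back to the orchestrator instead of being queued.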

2. Moodycamel ConcurrentQueue (`https://github.com/cameron314/concurrentqueue`)

queue_depth_max / backpressure: The Moodycamel queue is lock-free and allocates memory dynamically, growing to any capacity, so we must enforce a hard software ceiling (backpressure) on the Node-to-C++ IPC layer. If Node streams jobs faster than Taskflow can execute them, the daemon will eventually run out of memory.
bulk_dequeue_size: How many IPC tasks the dispatch thread pulls from the queue in a single bulk dequeue.
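The two parameters above can be sketched with a small bounded queue. This stand-in uses a mutex and std::deque, not the lock-free Moodycamel queue, so it only illustrates the policy (reject when full, drain in batches); `BoundedJobQueue` and its method names are hypothetical.

```cpp
#include <cassert>  // used by the sanity checks shown after the block
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Fixed-depth FIFO of serialized jobs.
class BoundedJobQueue {
public:
    explicit BoundedJobQueue(std::size_t depthMax) : depthMax_(depthMax) {}

    // Rejects instead of growing; the caller can signal backpressure
    // to the orchestrator when this returns false.
    bool try_enqueue(std::string job) {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.size() >= depthMax_) return false;
        q_.push_back(std::move(job));
        return true;
    }

    // Moves up to maxItems jobs out of the queue in FIFO order.
    std::vector<std::string> dequeue_bulk(std::size_t maxItems) {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<std::string> out;
        while (!q_.empty() && out.size() < maxItems) {
            out.push_back(std::move(q_.front()));
            q_.pop_front();
        }
        return out;
    }

private:
    std::mutex mu_;
    std::deque<std::string> q_;
    std::size_t depthMax_;
};
```

With a depth of 2, a third `try_enqueue` fails until the dispatch thread drains the queue with `dequeue_bulk`. Rejecting at the enqueue point keeps memory bounded regardless of how fast the orchestrator sends jobs.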

3. Boost.Asio (`https://github.com/chriskohlhoff/asio`)

ipc_timeout_ms (Read/Write): Mandatory timeouts for the IPC socket layer. If the orchestrator stalls, crashes, or goes silent, Asio must close the connection and cancel its in-flight tasks to prevent zombie worker processes.
max_ipc_connections: Hard limit on simultaneous orchestration pipelines connecting to a single worker pod.
buffer_size_max: Soft limit on async payload allocations, so that a malformed 200MB JSON frame from Node.js does not cause a sudden memory spike in asio::read operations.

Build Integration

Dependency graph

                  ┌──────────┐
                  │ polymech │ (the lib)
                  │  -cli    │ (the binary)
                  └────┬─────┘
          ┌────────────┼────────────────┐
          ▼            ▼                ▼
    ┌──────────┐ ┌──────────┐    ┌──────────┐
    │  search  │ │   grid   │    │   ipc    │
    └────┬─────┘ └────┬─────┘    └──────────┘
         │            │
         ▼            ▼
    ┌──────────┐ ┌───────────────┐
    │   http   │ │  gadm_reader  │
    └──────────┘ └────┬──────────┘
                      ▼
                 ┌──────────┐
                 │   geo    │ ← no deps (math only)
                 └──────────┘
                 ┌──────────┐
                 │   json   │ ← RapidJSON
                 └──────────┘

All packages depend on logger and json implicitly.

Testing

Unit tests (Catch2) — 62 tests, 248 assertions ✅

Test file	Tests	Assertions	Validates
`test_geo.cpp`	23	77	Haversine, area, centroid, PIP, hex/square grid
`test_gadm_reader.cpp`	18	53	JSON parsing, GHS props, fallback resolution
`test_grid.cpp`	13	105	All 4 modes × 5 sorts, GHS filtering, PIP clipping
`test_search.cpp`	8	13	Config loading, key validation, error handling

Integration test (Node.js)

Existing orchestrator/test-ipc.mjs validates spawn/lifecycle/ping/job
orchestrator/test-gridsearch-ipc.mjs validates full pipeline via IPC (8 event types + job result)
orchestrator/test-gridsearch-ipc-uds.mjs validates the high-throughput Unix Domain Socket transport, backpressure limits, and soft cancellation via action: cancel frames sent mid-flight.

IPC Cancellation & Dynamic Job Tuning

The UDS daemon tracks JSON action: cancel frames that reference a specific jobId and uses them to stop the matching Taskflow job gracefully mid-flight. Dynamic tuning values, such as memory buffer limits or thread counts, are validated against hard ceilings defined in the [system] block of config/postgres.toml.
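One way to picture both mechanisms is a registry of per-job cancel flags plus a clamp for requested tuning values. This is a standard-library sketch under assumptions: `CancelRegistry`, `CancelFlag`, and `clamp_to_ceiling` are hypothetical names, and the real daemon may wire cancellation into Taskflow differently.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>  // used by the sanity checks shown after the block
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

using CancelFlag = std::shared_ptr<std::atomic<bool>>;

// Maps jobId -> cancel flag. Tasks poll their flag between waypoints.
class CancelRegistry {
public:
    CancelFlag register_job(const std::string& jobId) {
        CancelFlag flag = std::make_shared<std::atomic<bool>>(false);
        std::lock_guard<std::mutex> lock(mu_);
        flags_[jobId] = flag;
        return flag;
    }

    // Called when an `action: cancel` frame arrives.
    // Returns false if the job is unknown or already finished.
    bool cancel(const std::string& jobId) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = flags_.find(jobId);
        if (it == flags_.end()) return false;
        it->second->store(true);
        return true;
    }

    void unregister_job(const std::string& jobId) {
        std::lock_guard<std::mutex> lock(mu_);
        flags_.erase(jobId);
    }

private:
    std::mutex mu_;
    std::unordered_map<std::string, CancelFlag> flags_;
};

// Bounds a requested tuning value by a floor and a configured hard ceiling.
inline int clamp_to_ceiling(int requested, int floor, int ceiling) {
    return std::max(floor, std::min(requested, ceiling));
}
```

A task holding the flag returned by `register_job` would check `flag->load()` before each waypoint and exit early when it is set. For tuning, `clamp_to_ceiling(512, 1, 64)` yields 64, so a request above the ceiling is reduced rather than rejected; whether the daemon reduces or rejects such requests is not stated in this document.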

Deferred (Phase 2)

Item	Reason
SerpAPI response caching	State store managed by orchestrator for now
Protobuf framing	JSON IPC sufficient for current throughput
Multi-threaded search	Sequential is fine for SerpAPI rate limits
GEOS integration	Custom geo is sufficient for grid math
