gadm-ts/docs/cpp-port.md
2026-03-23 16:57:01 +01:00

8.2 KiB
Raw Blame History

C++ Port — Boundaries Batch Generator

Port the boundaries.ts pipeline to native C++ for 10100× speedup on geometry union + GHS raster enrichment.

Note

Scaffolding is complete — CLI entry point, module stubs, vcpkg manifest, and CMake build are in place under cpp/. The geo_merge and ghs_enrich modules are stubs pending implementation.


Why C++

Bottleneck (TS) Root cause C++ remedy
polygon-clipping.union Pure-JS, throws on complex polys (DEU) GEOS GEOSUnaryUnion — battle-tested C engine
enrichFeatureWithGHS PIP loop GC pressure, obj alloc per pixel Stack-allocated ray-cast, zero alloc
readRasters (GeoTIFF) JS decoder, no tiling GDAL RasterIO — native COG/tiled reads
proj4 per pixel JS re-init overhead PROJ C API, reuse PJ* transform handle

Dependencies

Library Role Notes
GDAL/OGR 3.x GeoPackage read, GeoTIFF read, coordinate transforms Single dep for 3 concerns
GEOS 3.12+ GEOSUnaryUnion, simplify, valid geometry repair Used internally by GDAL
PROJ Coordinate transforms (Mollweide → WGS84) proj_trans with cached PJ* handle
nlohmann/json JSON output (cache files) Header-only
CLI11 Argument parsing FetchContent (not vcpkg)
SQLite3 Already bundled with GDAL; direct access if needed

Install via vcpkg manifest (vcpkg.json is in cpp/):

cd cpp
vcpkg install

Architecture

cpp/
├── CMakeLists.txt          # Project root — GDAL/GEOS/PROJ via vcpkg
├── CMakePresets.json       # vcpkg, vcpkg-win, dev presets
├── vcpkg.json              # Dependency manifest
├── packages/
│   ├── logger/             # Logging utility (kept from boilerplate)
│   └── json/               # JSON utility (kept from boilerplate)
├── src/
│   ├── main.cpp            # CLI entry, country loop, cache skip logic
│   ├── gpkg_reader.h/cpp   # GeoPackage → features (OGR)
│   ├── geo_merge.h/cpp     # Geometry union (GEOS) — stub
│   ├── ghs_enrich.h/cpp    # GeoTIFF raster sampling + PIP — stub
│   ├── pip.h               # Inline ray-casting point-in-polygon
│   └── types.h             # BoundaryFeature, BoundaryResult structs
└── tests/
    └── unit/test_pip.cpp   # Catch2 PIP unit test

Module Mapping (TS → C++)

1. gpkg_reader — GeoPackage Reading

TS source: gpkg-reader.ts

  • Replace custom WKB parser + better-sqlite3 with OGR GDALOpenEx + OGR_L_GetNextFeature
  • OGR natively reads GeoPackage and returns OGRGeometry* objects — no manual WKB parsing
  • Group features by GID_{level} column, collect geometries per group
GDALDatasetH ds = GDALOpenEx("gadm41.gpkg", GDAL_OF_VECTOR, nullptr, nullptr, nullptr);
OGRLayerH layer = GDALDatasetGetLayer(ds, 0);
OGR_L_SetAttributeFilter(layer, "GID_0 = 'DEU'");
OGRFeatureH feat;
while ((feat = OGR_L_GetNextFeature(layer)) != nullptr) {
    OGRGeometryH geom = OGR_F_GetGeometryRef(feat);
    // group by GID_{level}, collect geom clones
}

2. geo_merge — Geometry Union

TS source: mergeGeometries() in gpkg-reader.ts

  • Replace polygon-clipping + pairwise fallback with GEOSUnaryUnion
  • GEOS handles all edge cases (self-intersections, degeneracies) that crash JS libs
  • For invalid inputs: GEOSMakeValid → then union
GEOSGeometry* collection = GEOSGeom_createCollection(
    GEOS_GEOMETRYCOLLECTION, geoms.data(), geoms.size());
GEOSGeometry* merged = GEOSUnaryUnion(collection);
if (!merged) {
    // MakeValid fallback
    for (auto& g : geoms) g = GEOSMakeValid(g);
    merged = GEOSUnaryUnion(collection);
}

3. ghs_enrich — GeoTIFF Raster Sampling

TS source: enrich-ghs.ts

  • Replace geotiff.js with GDAL GDALRasterIO — supports tiled/COG reads, much faster
  • Replace JS proj4() per-pixel with PROJ proj_trans using a cached PJ* handle
  • Replace Turf PIP with inline ray-casting (already done in TS, trivial in C++)
  • Keep stride-based sampling for large windows
// One-time setup
PJ* transform = proj_create_crs_to_crs(ctx, "EPSG:54009", "EPSG:4326", nullptr);

// Read raster window
GDALRasterBandH band = GDALGetRasterBand(ds, 1);
std::vector<float> buf(width * height);
GDALRasterIO(band, GF_Read, minPx, minPy, width, height,
             buf.data(), width, height, GDT_Float32, 0, 0);

// Fast pixel loop — no allocations
for (int y = 0; y < height; y += stride) {
    for (int x = 0; x < width; x += stride) {
        float val = buf[y * width + x];
        if (val <= 0 || val == 65535.f) continue;
        PJ_COORD in = proj_coord(originX + (minPx+x)*resX, originY + (minPy+y)*resY, 0, 0);
        PJ_COORD out = proj_trans(transform, PJ_FWD, in);
        if (point_in_geometry(out.xy.x, out.xy.y, geom)) {
            totalVal += val * strideFactor;
            // ... weighted sums, candidate tracking
        }
    }
}

4. main.cpp — CLI + Cache Logic

TS source: boundaries.ts

  • Read gadm_continent.json with nlohmann/json
  • Same skip-if-cached logic (check file existence)
  • Output identical JSON format to cache/gadm/boundary_{code}_{level}.json
  • Optional: OpenMP #pragma omp parallel for on the country loop (each country is independent)

Expected Performance

Operation TS (current) C++ (expected) Speedup
Geometry union (DEU L0, 151 polys) Crashes / >10min ~13s ∞ / ~100×
GHS enrich (NGA L0, ~6M px window) ~210s ~515s ~15×
Full batch (all countries, all levels) Hours, may hang ~2040 min ~10×

Build & Integration

Tip

gdal pulls in proj and geos transitively on most triplets, but listing them explicitly in vcpkg.json ensures the CMake targets are always available.

Build & Run

# 1. Install dependencies (one-time, from cpp/ directory)
cd cpp
vcpkg install

# 2. Configure + build (Windows MSVC)
cmake --preset vcpkg-win
cmake --build build --config Release

# 2b. Or Ninja (Linux/macOS)
cmake --preset vcpkg
cmake --build build --config Release

# 3. Run
./build/Release/boundaries --country=all
./build/Release/boundaries --country=DEU --level=0
./build/Release/boundaries --country=NGA --force

CLI Flags

Flag Default Description
--country all ISO3 country code, or all for batch
--level -1 (all) Admin level 05, or -1 for all levels
--cache-dir cache/gadm Output directory for boundary JSON files
--gpkg data/gadm_410-levels.gpkg Path to GADM GeoPackage
--force off Regenerate even if cache file exists

Output goes to the same cache/gadm/ directory — the Node.js server reads these cache files at runtime, so the C++ tool is a drop-in replacement for batch generation.


Implementation Status

Module Status Notes
main.cpp Done CLI harness, cache logic, JSON output, OpenMP
gpkg_reader Scaffolded OGR read + feature grouping — needs GEOS geometry collection
geo_merge Done GEOSUnaryUnion + MakeValid fallback, WKB lossless roundtrip
ghs_enrich Done GDAL windowed RasterIO + PROJ + OGR PIP + peak detection
pip.h Done Inline ray-casting, unit-tested
types.h Done BoundaryFeature, BoundaryResult structs

Next Steps

  1. geo_merge — Wire up GEOS: geometry collection → GEOSUnaryUnion → GeoJSON export
  2. gpkg_reader — Clone OGR geometries into GEOS handles (instead of direct GeoJSON export)
  3. ghs_enrich — GDAL raster read + PROJ transform + PIP loop
  4. Validation — Compare output against existing TS-generated cache files
  5. Parallelism — OpenMP on country loop (already wired, just needs #pragma testing)

Each module can be tested independently against the existing JSON cache files for correctness.