167 lines
7.9 KiB
Markdown
167 lines
7.9 KiB
Markdown
# gadm-boundaries (C++)
|
|
|
|
Native C++ boundaries batch generator for the GADM pipeline.
|
|
Uses GDAL/GEOS/PROJ for geometry union + GHS raster enrichment (population & built-up surface).
|
|
|
|
## Prerequisites
|
|
|
|
| Tool | Version |
|
|
|------|---------|
|
|
| CMake | >= 3.20 |
|
|
| C++ compiler | C++20 (MSVC, GCC, or Clang) |
|
|
| vcpkg | Latest (set `VCPKG_ROOT` env var) |
|
|
|
|
vcpkg dependencies (auto-installed via `vcpkg.json`): GDAL, GEOS, PROJ, nlohmann-json, CLI11, spdlog, Catch2.
|
|
|
|
## Build
|
|
|
|
### Windows (MSVC)
|
|
|
|
```bash
|
|
cd packages/gadm/cpp
|
|
cmake --preset vcpkg-win
|
|
cmake --build build --config Release
|
|
```
|
|
|
|
Output: `dist/win-x64/boundaries.exe` + runtime DLLs + `proj.db`
|
|
|
|
### Linux (Ubuntu)
|
|
|
|
```bash
|
|
# System dependencies (build tools)
|
|
sudo apt install cmake ninja-build pkg-config g++ git curl zip unzip tar
|
|
|
|
# Install vcpkg (if not already)
|
|
git clone https://github.com/microsoft/vcpkg.git ~/vcpkg
|
|
~/vcpkg/bootstrap-vcpkg.sh
|
|
export VCPKG_ROOT=~/vcpkg
|
|
|
|
# Build
|
|
cd packages/gadm/cpp
|
|
cmake --preset vcpkg-linux
|
|
cmake --build build --config Release
|
|
```
|
|
|
|
Output: `dist/linux-x64/boundaries` + `proj.db`
|
|
|
|
> **Note:** First build takes 10-20 min as vcpkg compiles GDAL/GEOS/PROJ from source. Subsequent builds are fast.
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
# From the gadm package root (not cpp/)
|
|
|
|
# All countries, all levels (263 countries x 6 levels)
|
|
.\dist\win-x64\boundaries.exe --country=all
|
|
|
|
# Single country, all levels
|
|
.\dist\win-x64\boundaries.exe --country=DEU
|
|
|
|
# Single country + level
|
|
.\dist\win-x64\boundaries.exe --country=DEU --level=0
|
|
|
|
# Sub-region specific (dotted GADM ID)
|
|
.\dist\win-x64\boundaries.exe --country=ESP.6_1 --level=4
|
|
|
|
# Batch sub-region generation for ALL countries
|
|
# Generates boundary_ESP.6_1_4.json, boundary_DEU.2_1_3.json, etc.
|
|
.\dist\win-x64\boundaries.exe --country=all --level=4 --split-levels=1
|
|
|
|
# Custom resolution (1=full detail, 10=max simplification, default=4)
|
|
.\dist\win-x64\boundaries.exe --country=DEU --resolution=6
|
|
|
|
# Force regeneration (ignore cached files)
|
|
.\dist\win-x64\boundaries.exe --country=NGA --force
|
|
```
|
|
|
|
### CLI Options
|
|
|
|
| Option | Default | Description |
|
|
|--------|---------|-------------|
|
|
| `--country` | `all` | ISO3 code, dotted GADM ID (e.g. `ESP.6_1`), or `all` for batch |
|
|
| `--level` | `-1` | Admin level 0-5, or -1 for all |
|
|
| `--split-levels` | `0` | Comma-separated list of levels to split output files by (e.g. `0,1`) |
|
|
| `--resolution` | `4` | Simplification resolution (1=full detail, 10=max simplification) |
|
|
| `--cache-dir` | `cache/gadm` | Output directory |
|
|
| `--gpkg` | `data/gadm_410-levels.gpkg` | GADM GeoPackage |
|
|
| `--continent-json` | `data/gadm_continent.json` | Continent mapping |
|
|
| `--pop-tiff` | `data/ghs/GHS_POP_...tif` | GHS population raster |
|
|
| `--built-tiff` | `data/ghs/GHS_BUILT_S_...tif` | GHS built-up raster |
|
|
| `--force` | `false` | Regenerate even if cached |
|
|
|
|
#### `--split-levels` explained
|
|
|
|
By default (`--split-levels=0`), the tool outputs one file per country per level: `boundary_ESP_4.json`.
|
|
|
|
With `--split-levels=1`, it dynamically queries all distinct level-1 region codes (e.g. `ESP.6_1`, `ESP.1_1`, …) and outputs individual files like `boundary_ESP.6_1_4.json`. This is useful for pre-splitting large countries into smaller, faster-loading cache files.
|
|
|
|
The TS wrapper automatically discovers these sub-region files before falling back to the full country file.
|
|
|
|
### npm Scripts
|
|
|
|
```bash
|
|
npm run build:cpp # rebuild the C++ binary
|
|
npm run boundaries:cpp # --country=all (full batch, Windows)
|
|
npm run boundaries:cpp -- --country=DEU # single country
|
|
```
|
|
|
|
On Linux, run the binary directly since the npm script points to the Windows path.
|
|
|
|
## Output
|
|
|
|
Files written to `cache/gadm/` as:
|
|
- `boundary_{CODE}_{LEVEL}.json` — full country file (e.g. `boundary_ESP_4.json`)
|
|
- `boundary_{GADM_ID}_{LEVEL}.json` — sub-region file when `--split-levels` is used (e.g. `boundary_ESP.6_1_4.json`)
|
|
|
|
The TS wrapper (`gpkg-reader.ts`) resolves cache files in this order:
|
|
1. Exact sub-region file: `boundary_{gadmId}_{level}.json`
|
|
2. Full country file: `boundary_{countryCode}_{level}.json` (prefix-filtered for sub-region queries)
|
|
3. Live GeoPackage query (fallback)
|
|
|
|
Each feature includes:
|
|
- `code` — admin region code
|
|
- `name` — admin region name
|
|
- `geometry` — GeoJSON (MultiPolygon via WKB-precision GEOS union)
|
|
- `ghsPopulation` — total population (GHS-POP 2030)
|
|
- `ghsPopMaxDensity` — peak population density
|
|
- `ghsPopCenter` — weighted population center `[lon, lat]`
|
|
- `ghsPopCenters` — top-N population peaks `[[lon, lat, density], ...]`
|
|
- `ghsBuiltWeight` — total built-up surface weight
|
|
- `ghsBuiltMax` — peak built-up value
|
|
- `ghsBuiltCenter` — weighted built-up center `[lon, lat]`
|
|
- `ghsBuiltCenters` — top-N built-up peaks `[[lon, lat, value], ...]`
|
|
|
|
## Architecture
|
|
|
|
See [docs/cpp-port.md](../docs/cpp-port.md) for the full spec.
|
|
|
|
```
|
|
src/
|
|
├── main.cpp # CLI entry, country loop, cache logic, PROJ_DATA setup
|
|
├── gpkg_reader.h/cpp # GeoPackage -> features (OGR C API)
|
|
├── geo_merge.h/cpp # Geometry union via WKB roundtrip (GEOS C API)
|
|
├── ghs_enrich.h/cpp # GeoTIFF raster sampling + PIP (GDAL + PROJ)
|
|
├── pip.h # Inline ray-casting point-in-polygon
|
|
└── types.h # BoundaryFeature, BoundaryResult structs
|
|
```
|
|
|
|
### Key Design Decisions
|
|
|
|
- **No OpenMP** — GDAL/PROJ/GEOS are not thread-safe for concurrent raster reads + transform creation. Sequential processing is I/O-bound anyway.
|
|
- **WKB precision** — geometry union uses WKB serialization to avoid floating-point drift from WKT roundtrips.
|
|
- **Mollweide projection** — uses `+proj=moll` string directly (not `EPSG:54009` which isn't in the PROJ database). Transforms are normalized via `proj_normalize_for_visualization` for correct lon/lat axis order.
|
|
- **Windowed raster I/O** — GDAL `GDALRasterIO` reads only the bbox-clipped window from multi-GB GeoTIFFs, keeping memory bounded.
|
|
- **PROJ_DATA auto-discovery** — `main.cpp` sets `PROJ_DATA` at startup pointing to the exe directory where `proj.db` is co-located.
|
|
- **Cross-platform** — builds on Windows (MSVC) and Linux (GCC/Clang) via vcpkg. Platform-specific code is guarded by `#ifdef _WIN32`.
|
|
|
|
### Recent Optimizations (40-50x Speedup)
|
|
- **Prepared GEOS Geometries**: `OGR_G_Contains` parsing was replaced with `GEOSPreparedContains_r`. Prepared geometries index the structure once, reducing millions of unprepared spatial lookups and point-instantiations inside the raster loop to microseconds.
|
|
- **Bounding Box Pre-filtering**: `BoundaryFeature` precalculates and stores `minX, minY, maxX, maxY` during GeoPackage ingestion, entirely skipping OGR bounding box and geometry checks per chunk.
|
|
- **GeoPackage Handle Caching**: Replaced sequential re-opening of the `GDALOpenEx` context across recursive boundary levels with a globally cached datastore handle, bypassing 1.8-second repeated disk-sweeping overhead.
|
|
- **Deferred GeoJSON Extraction**: GeoJSON serialization only happens for final merged features or un-merged singletons, circumventing string/parsing allocations overhead during intermediate GeoPackage reading.
|
|
|
|
### Suggestions for Further Optimization (I/O Bottleneck)
|
|
Right now, reading from the GeoPackage takes ~2.1 seconds per level per country because GDAL/OGR is performing full-layer scans (the 700MB SQLite files lack proper `B-Tree` spatial indexing on the `GID_0` column). Since the CPU/Geometry overhead was already optimized by ~50x, the remaining time is purely standard disk I/O.
|
|
To eliminate this I/O overhead entirely (`< 10ms` read times), consider:
|
|
1. **Offline Indexing:** Run a one-time indexing command before use (`ogrinfo data/gadm_410-levels.gpkg -sql "CREATE INDEX idx_gid0 ON ADM_ADM_1 (GID_0)"`) to bake B-Tree indices directly into your source `.gpkg`.
|
|
2. **Pre-partitioning / Sharding:** Split the massive monolithic GADM GeoPackage into individual `.gpkg` or `.geojson` chunks per country (e.g., `data/gadm/DEU.gpkg`), enabling true O(1) instantaneous access. |