Outpost 2 Programming & Development / Deepening the decompilation - Phase 3G: more RE steps (planned vs actual)
Posted by jonathangoorin on April 04, 2026, 07:33:19 AM
Update - Phase 3G finished (what we planned vs what we got)
Below is Outpost2.exe only (the main game binary). Subsystem counts are index rows (Ghidra can emit more than one `.c` row per VA).
Where we started vs where we landed
| Metric | Before Phase 3G | After Phase 3G (now) |
| --- | --- | --- |
| Named (not `FUN_*`) | 1,751 (57.9%) | 1,837 (60.8%) |
| Auto `FUN_*` | 1,271 | 1,185 |
| `unclassified` subsystem rows (index) | ~1,390 | 1,242 |
| New `crt` bucket (runtime tail tagged) | (none) | 46 rows on EXE |
So: +86 readable names on the EXE, -86 `FUN_*`, and ~148 fewer `unclassified` rows - not the original stretch goal of 80-90% named, but real progress with everything scripted and repeatable.
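As a quick sanity check on the table, the percentages and deltas can be recomputed from the raw counts. This sketch assumes the total function count is simply named + `FUN_*` (that assumption is mine; the table does not state the total):

```python
# Counts copied from the table above.
# Assumption: total functions = named + auto FUN_* (not stated in the post).
before = {"named": 1751, "fun": 1271}
after = {"named": 1837, "fun": 1185}


def named_share(census):
    """Return the named fraction of all functions as a percentage."""
    total = census["named"] + census["fun"]
    return 100.0 * census["named"] / total


delta_named = after["named"] - before["named"]
delta_fun = after["fun"] - before["fun"]

print(f"named: {named_share(before):.1f}% -> {named_share(after):.1f}%")
print(f"delta named {delta_named:+d}, delta FUN_* {delta_fun:+d}")
```

Under that assumption the total is 3,022 functions in both columns, which is why the two deltas mirror each other.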
Step by step (expectation vs reality)
3G-1 - CRT / compiler tail
- Done: FID pass on the CRT address tail only, then rename leftover tail `FUN_*` to `CRT::r_<hex>`; subsystem indexer gets a `crt` bucket (`crt_ranges.json`, `apply_fidb.py`, `identify_crt.py`).
- Expected: hundreds of CRT names/classifications.
- Actual: FID barely helps on this VC++5-era EXE; most of the win is the address-based tail + `CRT::r_*` renames. 46 EXE rows land in `crt`. DLLs pick up more FID hits in-range.
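The address-based part of 3G-1 can be sketched roughly as below. This is not the actual `identify_crt.py`: the JSON schema, the example addresses, and the zero-padded hex width in the new name are all assumptions made for illustration.

```python
import json

# Hypothetical schema for crt_ranges.json: a list of half-open [start, end)
# VA ranges. The addresses below are made up, not taken from Outpost2.exe.
CRT_RANGES_JSON = '[{"start": "0x4C0000", "end": "0x4D2000"}]'


def load_ranges(text):
    """Parse the JSON text into a list of (start, end) integer pairs."""
    return [(int(r["start"], 16), int(r["end"], 16)) for r in json.loads(text)]


def in_crt(va, ranges):
    """True if the VA falls inside any configured CRT range."""
    return any(start <= va < end for start, end in ranges)


def crt_name(name, va, ranges):
    """Rename an auto-named FUN_* inside a CRT range; leave all else alone."""
    if name.startswith("FUN_") and in_crt(va, ranges):
        return f"CRT::r_{va:08x}"  # hex width is an assumption
    return name


ranges = load_ranges(CRT_RANGES_JSON)
print(crt_name("FUN_004c1234", 0x4C1234, ranges))
print(crt_name("FUN_00401000", 0x401000, ranges))
```

Only leftover `FUN_*` names are touched, so anything FID or an earlier pass already named inside the range keeps its label.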
3G-2 - RTTI / vtables
- Done: PE-first RTTI parse (`recover_rtti.py`, `rtti_pe.py`), optional Ghidra apply, `rtti-classes.md` + JSON.
- Expected: 100-300 functions renamed from RTTI alone.
- Actual: Community Tethys VAs already named most stream/GFX vtable slots. RTTI confirms hierarchies and yields only 6 extra `FUN_*` renames on apply. OP2Shell / op2ext in the OPU tree: no MSVC RTTI anchor found.
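For readers unfamiliar with the "PE-first" idea: MSVC type descriptors embed decorated class names of the form `.?AVName@@` (class) or `.?AUName@@` (struct), so a first pass can find candidate classes by scanning raw bytes, with no Ghidra involved. The sketch below covers only that first step; it does not resolve complete object locators or vtables, it skips template names, and it is not the actual `rtti_pe.py`. The sample bytes and class names are invented.

```python
import re

# Matches simple MSVC type-descriptor names such as b".?AVFoo@@\x00".
# Template names (containing "?$") are deliberately not matched here.
TYPE_NAME_RE = re.compile(
    rb"\.\?A[VU]([A-Za-z_][A-Za-z0-9_]*(?:@[A-Za-z_][A-Za-z0-9_]*)*)@@\x00"
)


def find_rtti_class_names(image):
    """Return (offset, qualified_name) for each type-descriptor name found."""
    found = []
    for match in TYPE_NAME_RE.finditer(image):
        # Nested scopes are stored innermost-first, separated by '@'.
        parts = match.group(1).decode("ascii").split("@")
        found.append((match.start(), "::".join(reversed(parts))))
    return found


# Made-up bytes standing in for a PE data section.
sample = b"\x00\x00.?AVStreamIO@@\x00junk.?AUFoo@Bar@@\x00"
for offset, name in find_rtti_class_names(sample):
    print(hex(offset), name)
```

An image where this scan returns nothing is consistent with the "no MSVC RTTI anchor found" result reported for some modules, although the real check may be stricter.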
3G-3 - MSVC demangle (exports)
- Done: Vendored demangler, `demangled-symbols.json` for all 509 EXE exports, `apply_demangled.py` in Ghidra.
- Expected: 50-200 renames.
- Actual: 23 Ghidra renames - mostly normalizing decorated leftovers; 509/509 exports demangle in JSON. Most export entry points were already named; skips 115 existing `Tethys__*` labels.
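To illustrate the apply step, here is a deliberately tiny stand-in for the demangler plus the skip rule. It only recovers the qualified name from simple decorated symbols (`?name@Scope@@...`) and gives up on special names, templates and back-references; the vendored demangler handles far more. The function names and the example symbol are illustrative, not taken from `apply_demangled.py`.

```python
def qualified_name(decorated):
    """Return 'Scope::name' for a simple MSVC-decorated symbol, else None."""
    if not decorated.startswith("?") or decorated.startswith("??"):
        return None  # not decorated, or a special name (ctor, operator, ...)
    body, sep, _type_code = decorated[1:].partition("@@")
    parts = body.split("@")
    # Bail out on templates ("?$..."), back-references (digits) and empties.
    if not sep or any(not p or p[0].isdigit() or "?" in p for p in parts):
        return None
    # Scopes are stored innermost-first, so reverse them.
    return "::".join(reversed(parts))


def plan_rename(current_label, decorated):
    """Skip labels that already carry a community Tethys__ name."""
    if current_label.startswith("Tethys__"):
        return None
    return qualified_name(decorated)


print(plan_rename("?GetTick@TethysGame@@SIHXZ", "?GetTick@TethysGame@@SIHXZ"))
print(plan_rename("Tethys__TethysGame__GetTick", "?GetTick@TethysGame@@SIHXZ"))
```

The skip rule is what keeps the rename count low: an entry point that already has a curated label is never overwritten by a mechanically demangled one.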
3G-4 - Call-graph subsystem propagation
- Done: Parse pseudo-C call edges, iterative neighbor voting, update `subsystem-index.json` (`callgraph-classification.md`, edge JSON).
- Expected: 500-800 index rows moved off `unclassified`.
- Actual: 153 unique VAs newly classified; `unclassified` 1389 -> 1234 rows. Graph is sparse (only calls visible in decompiler output) and the 70% vote threshold avoids mis-labeling hubs - so yield is much lower than the sketch.
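The neighbor-voting idea can be sketched as follows. Assumptions that are mine, not confirmed by the post: edges are treated as undirected, only already-classified neighbors vote, and the 70% threshold applies to the winning subsystem's share of those votes. The VAs and subsystem names are made up.

```python
from collections import Counter


def propagate(edges, labels, threshold=0.7, max_rounds=10):
    """Spread subsystem labels over a call graph by iterative neighbor voting.

    edges:  iterable of (caller_va, callee_va) pairs
    labels: dict mapping VA -> subsystem for already-classified functions
    """
    neighbors = {}
    for caller, callee in edges:
        neighbors.setdefault(caller, set()).add(callee)
        neighbors.setdefault(callee, set()).add(caller)

    labels = dict(labels)  # do not mutate the caller's dict
    for _ in range(max_rounds):
        newly = {}
        for va, nbrs in neighbors.items():
            if va in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if not votes:
                continue
            subsystem, count = votes.most_common(1)[0]
            # Only accept a clear majority, so mixed-use hubs stay unlabeled.
            if count / sum(votes.values()) >= threshold:
                newly[va] = subsystem
        if not newly:
            break
        labels.update(newly)
    return labels


known = {2: "gfx", 3: "gfx", 6: "sound"}
edges = [(1, 2), (1, 3), (1, 6), (7, 2), (7, 3), (8, 7)]
result = propagate(edges, known)
print({va: result.get(va) for va in (1, 7, 8)})
```

In this toy graph, node 1 is a hub with a 2:1 split (about 67%), so it stays unclassified, while node 8 is only picked up in the second round via node 7. That is the same trade-off described above: the threshold protects hubs at the cost of yield.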
3G-5 - Singleton globals (`DAT_` + raw hex)
- Done: Extended `DAT_*` rules in `subsystem_index.py`, full-file scanner `tag_by_globals.py` against `singletons.json`.
- Expected: dozens to ~130 new classifications.
- Actual: Only 2 new `global_xref` VAs; almost everything touching known singletons was already classified. 171 `dat_global` rows after regen; `unclassified` down to ~1228 after the combined 3G-4 + 3G-5 pipeline.
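A rough sketch of the global-scanning idea: look for `DAT_<hex>` symbols and raw hex literals in a function's pseudo-C and match them against a table of known singletons. The table layout, the addresses, and the "classify only when exactly one subsystem is hit" rule are assumptions for illustration, not the real `tag_by_globals.py` or `singletons.json`.

```python
import re

# Hypothetical singletons table: global VA (8 hex digits) -> subsystem.
# Addresses are illustrative only.
SINGLETONS = {"0056ea98": "game", "00574458": "map"}

# Matches both Ghidra data labels (DAT_0056ea98) and raw literals (0x56ea98).
TOKEN_RE = re.compile(r"\b(?:DAT_|0x)([0-9a-fA-F]{6,8})\b")


def subsystems_touched(pseudo_c, singletons):
    """Return the set of subsystems whose singleton globals appear in the text."""
    hits = set()
    for match in TOKEN_RE.finditer(pseudo_c):
        key = match.group(1).lower().zfill(8)
        if key in singletons:
            hits.add(singletons[key])
    return hits


def classify(pseudo_c, singletons):
    """Classify only when exactly one subsystem is referenced (assumed rule)."""
    hits = subsystems_touched(pseudo_c, singletons)
    return next(iter(hits)) if len(hits) == 1 else None


print(classify("return DAT_0056ea98 + 1;", SINGLETONS))
print(classify("iVar1 = DAT_0056ea98; *(int *)0x574458 = 0;", SINGLETONS))
```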
3G-6 - String xrefs -> names
- Done: Ghidra `name_by_strings.py` - defined string data, xrefs into `FUN_*`, scored labels with `_msg` suffix, `--apply` + re-decompile.
- Expected: ~50-100 renames.
- Actual: 34 `FUN_*` renamed on EXE (census 1219 -> 1185 `FUN_*` at that snapshot). Many string refs sit in already-named functions; strict scoring drops format noise.
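The string-to-label step might look something like this. The scoring rule here (strip printf-style specifiers, require at least two real words, cap the label at four) is a guess at what "strict scoring" could mean, not the rule used in `name_by_strings.py`; only the `_msg` suffix comes from the description above.

```python
import re

# Rough printf-style specifier pattern; length modifiers are only partly covered.
FORMAT_RE = re.compile(r"%[-+0 #]*\d*(?:\.\d+)?[a-zA-Z]")


def label_from_string(text, min_words=2, max_words=4):
    """Return a label like 'could_not_load_msg', or None if the string is noise."""
    stripped = FORMAT_RE.sub(" ", text)
    # Count only alphabetic runs of 3+ letters as meaningful words.
    words = re.findall(r"[A-Za-z]{3,}", stripped)
    if len(words) < min_words:
        return None
    return "_".join(word.lower() for word in words[:max_words]) + "_msg"


print(label_from_string("Could not load %s"))
print(label_from_string("%d,%d"))
```

Strings that are mostly format specifiers produce no label at all, which is one way a strict scorer ends up with fewer renames than the raw xref count suggests.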
3G-7 - Scenario DLL import audit
- Done: `harvest_scenario_imports.py` - union every shipping DLL's imports from `Outpost2.exe`, diff vs exports + Tethys + index (PE only, no Ghidra).
- Expected: maybe 10-50 new names from gaps.
- Actual: 0 new Ghidra names - validation only: 279 unique imports, 0 orphan vs export table, 0 `FUN_*` at those import VAs. Proves mission API surface is already covered. 230 / 509 exports are never imported by any scanned DLL (engine-internal / unused).
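The audit itself is mostly set arithmetic. A minimal sketch, assuming the imports and exports have already been extracted from the PE files into plain dicts (the container shapes, DLL names and symbols below are hypothetical):

```python
def audit_imports(dll_imports, exe_exports, labels_by_va):
    """Cross-check scenario DLL imports against the EXE export table.

    dll_imports:  dict of DLL name -> set of symbols imported from the EXE
    exe_exports:  dict of exported symbol -> VA
    labels_by_va: dict of VA -> current label in the decompilation
    """
    imported = set().union(*dll_imports.values())
    exported = set(exe_exports)
    return {
        # Imported by a DLL but missing from the export table.
        "orphans": imported - exported,
        # Imported, exported, but still carrying an auto FUN_* label.
        "unnamed": {
            sym for sym in imported & exported
            if labels_by_va.get(exe_exports[sym], "FUN_").startswith("FUN_")
        },
        # Exported but never imported by any scanned DLL.
        "never_imported": exported - imported,
    }


# Made-up data for illustration.
dll_imports = {"mission_a.dll": {"A", "B"}, "mission_b.dll": {"B", "C"}}
exe_exports = {"A": 0x401000, "B": 0x402000, "C": 0x403000, "D": 0x404000}
labels_by_va = {0x401000: "Game::A", 0x402000: "Game::B", 0x403000: "Game::C"}

report = audit_imports(dll_imports, exe_exports, labels_by_va)
print({key: sorted(value) for key, value in report.items()})
```

In the real run the first two sets came back empty, and the third corresponds to the 230 of 509 exports that no scanned DLL imports (509 - 279 = 230).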
Bottom line
Phase 3G-1 through 3G-7 are done. The payoff is a reproducible pipeline (decompile, index, call graph, globals, string pass, PE cross-check) more than a single headline percentage. There is still a large `FUN_*` and `unclassified` tail on the EXE - what is left is mostly harder or riskier than these automated passes.
Cheers
Jonathan
