Probabilistic Machine Learning for Parcel Forecasting at National Scale

1. Problem

Most public-facing home valuation products collapse an uncertain path into one number. A single value is easy to display, but it does not communicate downside risk, upside potential, or forecast uncertainty.

Homecastr produces forecast distributions across U.S. residential properties: P10, P50, and P90 trajectories at 1- to 5-year horizons. Building this required solving several problems that shaped the architecture:

Per-horizon calibration. Most probabilistic forecasts are calibrated in aggregate. Making an 80% prediction interval contain ~80% of outcomes for each geography and each forecast year, across properties ranging from $50K rural lots to $10M urban condos, is a harder problem. Calibration is more demanding at longer horizons: the forecast target (cumulative price change over several years) grows in scale, and the model must correctly widen its uncertainty estimates to match. When it fails to do so, the error is difficult to detect without structured per-horizon evaluation infrastructure.
Data heterogeneity. Producing forecasts at national scale requires ingesting property records from heterogeneous county assessor sources, each with different schemas, identifiers, and valuation conventions. This challenge is not unique to this system; any national property data effort faces it. The architectural response was to encode cross-source heterogeneity implicitly through learned embeddings rather than conditioning on source identity, enabling one model to generalize across jurisdictions.
Distributional shift. The model learns historical price dynamics. When the macro environment diverges significantly from that history, as during the 2021–2022 rate shock, forecasts may degrade because the learned distribution no longer matches the current data-generating process. This divergence is not detectable in real time; it becomes visible only as realized values accumulate and the monitoring pipeline measures calibration failure after the fact.
Multi-year temporal coherence. Forecast paths need to look like realistic extensions of a property's historical curve. Independent per-horizon regressors produce paths that are not internally consistent across time: e.g., predicting 10% growth in year 2 but only 3% cumulative by year 3. A generative model produces paths that respect temporal dependencies by construction.
Cross-property spatial coherence. Adjacent properties in the same neighborhood tend to move together. A neighborhood-level event affects all parcels in an area simultaneously. Representing this requires a shared latent structure across properties that individual property features cannot provide, which motivated the shared spatial token mechanism.

2. Why Naive Approaches Fail

The challenges in Section 1 ruled out simpler approaches. Each limitation below directly motivated a component of the current architecture (Section 4).

Point Forecasts Miss the Distribution

A point forecast cannot tell a homeowner whether their downside risk is 5% or 25%. Post-hoc prediction intervals wrapped around a point estimate assume symmetric, fixed-width uncertainty, but real property value distributions are heavy-tailed and vary by geography and price segment. Calibrated intervals require learning the full predictive distribution directly, which motivated the generative architecture described in Section 4.

Independent Horizons Produce Incoherent Paths

Training separate regressors for each forecast horizon produces paths that are not internally consistent across time: a property might show 10% growth through year 2 but only 3% cumulative by year 3, implying a year-3 decline of roughly 6% that neither regressor explicitly predicted. A generative model produces paths that respect temporal dependencies and look like realistic extensions of observed price histories. This also enables the model to represent corrections deliberately: growth in years 1–2 followed by mean reversion in year 3, which independent horizon regressors cannot express.
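The consistency argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 10% figure is read as cumulative growth through year 2, and the year-1 value of 4% is an invented placeholder. It recovers the year-over-year change implied by independent cumulative forecasts and contrasts that with a path built by compounding one sequence of log-growth steps.

```python
import math

def cumulative_growth(yoy_log_growth):
    """Compound year-over-year log-growth steps into cumulative growth per horizon."""
    total = 0.0
    out = []
    for g in yoy_log_growth:
        total += g
        out.append(math.exp(total) - 1.0)
    return out

def implied_yoy(cumulative):
    """Recover the year-over-year growth implied by cumulative forecasts."""
    levels = [1.0] + [1.0 + c for c in cumulative]
    return [levels[i + 1] / levels[i] - 1.0 for i in range(len(cumulative))]

# Independent regressors (assumed values): 4%, 10%, and 3% cumulative at years 1-3.
independent = [0.04, 0.10, 0.03]
implied = implied_yoy(independent)
# implied[2] is about -0.064: a year-3 drop that no single regressor modeled.

# One sampled path: every horizon is derived from the same step sequence,
# so a year-3 decline appears only if the path itself contains one.
path = cumulative_growth([0.04, 0.055, -0.02])
```

In the second case a decline in year 3 is an explicit step of the trajectory rather than a by-product of two unrelated models disagreeing.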

3. Data System

The pipeline ingests county assessment rolls from multiple jurisdictions, including Texas statewide via TxGIO, Florida statewide via DOR, Harris County via HCAD, and New York City via DOF RPAD. These are joined with ACS census-tract demographics and FRED macroeconomic series covering mortgage rates, CPI, unemployment, and other indicators. All sources are standardized into a canonical parcel-year schema that absorbs differences in identifiers, vintages, and suppression rules at the data layer. Source-specific characteristics and schema details are documented in the technical note.

Transformed features are stored across PostGIS/Supabase (geospatially indexed), GCS (Parquet bulk storage), and Redis (low-latency feature retrieval).


Figure 1. End-to-end system architecture. Source data flows through a canonical panel into the model stack. The tabular transformer encoder produces a frozen trend estimate; the Schrödinger Bridge generative model samples calibrated residual trajectories conditioned on the encoder output and spatial token paths. Results stream to GCS as a write-ahead buffer before bulk upsert to Supabase.

4. Model Architecture

Why a Generative Architecture

The failure modes described in Section 2 each map to a specific architectural response. Temporal incoherence across horizons motivated a generative path model that samples complete trajectories rather than independent per-year values. Missing distributional shape motivated learning the full predictive distribution rather than wrapping a point estimate in post-hoc intervals. Unobserved neighborhood-level dynamics motivated a shared spatial token mechanism that correlates trajectories across proximate properties.

Current Production Architecture

The current system has three components. A Feature Tokenizer Transformer encoder [1], a self-attention encoder over heterogeneous tabular features, outputs a frozen deterministic trend prediction and a context embedding. Shared inducing tokens [2] with learned AR(1) persistence capture neighborhood-level dynamics that are not identifiable from property features alone. A simulation-free Schrödinger Bridge generative model [3] draws samples from the learned residual trajectory distribution via joint flow-and-score matching, conditioned on the encoder's trend estimate and the inducing token paths.

The model predicts year-over-year log-growth changes rather than absolute levels, which stabilizes learning across properties with different price scales. The generative model uses closed-form flow and score targets derived analytically from the bridge path, avoiding simulation during training, and is trained with per-horizon loss normalization to prevent forecast attenuation at longer lead times. Architecture details, equations, and training diagnostics are in the technical note.
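For readers unfamiliar with closed-form bridge targets, the following is a minimal sketch of the standard Brownian-bridge case, not the production implementation. It assumes a scalar state, a bridge between endpoints `x0` and `x1` with a hypothetical noise scale `sigma`, and the usual Gaussian marginal with mean (1 - t)·x0 + t·x1 and variance sigma²·t·(1 - t). Under those assumptions both regression targets can be written down without simulating any stochastic process.

```python
import math
import random

def bridge_sample_and_targets(x0, x1, t, sigma, rng):
    """Sample x_t from a Brownian bridge between x0 and x1 and return
    the closed-form conditional flow and score targets at (t, x_t)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    mean = (1.0 - t) * x0 + t * x1
    var = sigma ** 2 * t * (1.0 - t)
    x_t = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
    # Score of the Gaussian bridge marginal N(mean, var).
    score = (mean - x_t) / var
    # Velocity field that transports this Gaussian path deterministically:
    # (d/dt log std) * (x_t - mean) + d(mean)/dt.
    flow = (1.0 - 2.0 * t) / (2.0 * t * (1.0 - t)) * (x_t - mean) + (x1 - x0)
    return x_t, flow, score

rng = random.Random(0)
x_t, flow, score = bridge_sample_and_targets(0.0, 1.0, 0.5, 0.3, rng)
```

At the midpoint t = 0.5 the first term of the flow vanishes, so the flow target reduces to x1 - x0 regardless of the sampled x_t. A network would then be regressed onto `flow` and `score` at randomly drawn times.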

AI-Native Interaction Layer

Traditional real estate platforms rely on rigid filters and map bounds. Homecastr surfaces its probabilistic outputs via an AI-Native Interaction Layer designed for both human and agentic consumption:

Semantic Omnibar: A natural language gateway that parses complex queries (e.g., “Show me areas in Austin with >15% upside and highly-rated schools”), maps them to neighborhood geometries, and instantly retrieves the relevant forecast distributions.
Model Context Protocol (MCP) API: Homecastr operates as an MCP server, exposing forecast distributions, appreciation outlooks, and comparable market data as standardized tools. This allows external reasoning models (like Claude or custom agentic swarms) to pull raw probabilistic context directly into their context windows for advanced downstream analysis.

5. Evaluation

We evaluate each model candidate on genuinely held-out origin years using an expanding-window protocol. Macro features are lagged by one year relative to the origin to prevent look-ahead contamination.
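A minimal sketch of how such splits might be generated is shown below. It assumes that training uses all years strictly before each origin and that macro features are read from the year preceding the origin; the function name, the year range, and the `first_origin` parameter are hypothetical.

```python
def expanding_window_splits(years, first_origin, macro_lag=1):
    """Yield (train_years, origin, macro_year) for each held-out origin.

    Training uses every year strictly before the origin, so the window
    expands as the origin advances. Macro features are taken from
    `macro_lag` years before the origin to avoid look-ahead.
    """
    years = sorted(years)
    for origin in years:
        if origin < first_origin:
            continue
        train = [y for y in years if y < origin]
        if not train:
            continue
        yield train, origin, origin - macro_lag

splits = list(expanding_window_splits(range(2015, 2024), first_origin=2019))
for train, origin, macro_year in splits:
    print(f"origin {origin}: train {train[0]}-{train[-1]}, macro from {macro_year}")
```

Each later origin sees a strictly larger training window, and no split ever reads a macro value dated at or after its own origin.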

Baseline Comparison

Each model candidate is compared against a persistence baseline [6]: a forecast that simply repeats the last known value. The model must beat persistence on median absolute error for each origin-horizon combination.
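The comparison can be sketched as follows, assuming absolute percentage error as the per-parcel metric and a strict "lower error wins" rule. The function name and the sample values are illustrative, not taken from the backtest.

```python
from statistics import median

def baseline_comparison(last_known, model_p50, realized):
    """Compare the P50 forecast against a persistence forecast per parcel."""
    model_err = [abs(m - r) / r for m, r in zip(model_p50, realized)]
    naive_err = [abs(v - r) / r for v, r in zip(last_known, realized)]
    wins = sum(m < n for m, n in zip(model_err, naive_err))
    return {
        "model_mdae": median(model_err),
        "persistence_mdae": median(naive_err),
        "win_rate": wins / len(realized),
    }

# Three hypothetical parcels: last observed value, model P50, realized value.
result = baseline_comparison(
    last_known=[100.0, 200.0, 300.0],
    model_p50=[108.0, 205.0, 345.0],
    realized=[110.0, 210.0, 320.0],
)
```

Here the model has the lower median error and wins on two of the three parcels, which shows why median error and win rate are reported separately: they can diverge.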

Training Diagnostics

The following are tracked per origin: final loss, effective token count, learned coherence scale, and token persistence. Details are in the downloadable PDF.

Calibration Diagnostics

Before a model candidate enters production, it is evaluated against the diagnostic targets below [6]. These thresholds are not strict pass/fail gates. Actual results routinely fall outside these ranges, particularly during regime breaks. They serve as reference points that guide development decisions rather than automated acceptance criteria. A model that improves point accuracy but severely degrades calibration does not ship. These same diagnostics are re-run continuously as described in Section 7.

Test	What It Checks	Target	Threshold
Anchor Integrity	Forecast starting value matches the last observed price	Median log-ratio is small	< 0.10
Interval Coverage	Realized outcomes fall inside the 80% prediction band	Coverage in the expected range	65–95%
Horizon Scaling	Uncertainty grows at the expected rate over longer horizons	Variance ratio near √horizon	±30%
Point Accuracy	Median absolute error of P50 vs. realized value	MdAE within acceptable range	< 10% at h=1
Baseline Beat	Model outperforms naive persistence forecast	Wins on more parcels than loses	> 50%
Tail Accuracy	Extreme outcomes occur at the expected rate	Each tail near nominal rate	≈ 5%
Variance Ratio	Forecast spread proportional to historical spread	Ratio in plausible range	0.3–3.0
Distribution Match	Forecast growth resembles historical growth	KS test does not reject similarity	p > 0.01

We classify calibration failures into three common cases. Forecast attenuation: step deltas drop to near-zero beyond the first year, producing flat trajectories. Miscalibrated dispersion: prediction intervals are too narrow (overconfident) or too wide (uninformative) to match realized outcome rates. Mean bias: systematic over- or under-prediction, typically caused by regime drift. Each case has a separate diagnostic and a corresponding remediation path.
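Two of these cases can be checked directly from forecast quantiles and realized values. The sketch below is a simplified illustration: it assumes the 65–95% coverage range from the table as the dispersion bounds and uses the median signed log error of the P50 as a bias indicator. The function name and inputs are hypothetical.

```python
import math
from statistics import median

def diagnose(p10, p50, p90, realized, low=0.65, high=0.95):
    """Classify dispersion from 80%-band coverage and estimate mean bias."""
    n = len(realized)
    inside = sum(lo <= r <= hi for lo, r, hi in zip(p10, realized, p90))
    coverage = inside / n
    if coverage < low:
        dispersion = "too narrow (overconfident)"
    elif coverage > high:
        dispersion = "too wide (uninformative)"
    else:
        dispersion = "in range"
    # Median signed log error: positive means the P50 over-predicts.
    bias = median([math.log(f / r) for f, r in zip(p50, realized)])
    return {"coverage": coverage, "dispersion": dispersion, "median_log_bias": bias}

report = diagnose(
    p10=[90.0] * 4, p50=[100.0] * 4, p90=[110.0] * 4,
    realized=[95.0, 105.0, 120.0, 130.0],
)
```

In this toy case only half of the outcomes land inside the band and the P50 sits below most realized values, so the report flags both narrow intervals and under-prediction.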

Backtest Results

A note on P10, P50, and P90: these are the 10th, 50th, and 90th percentiles of the sampled forecast distribution at each horizon. Specifically, the model draws 50 independent sample trajectories at inference, and P10 and P90 are computed as marginal quantiles, so the 10th-percentile value at horizon 3 and the 10th-percentile value at horizon 4 may come from different sample trajectories. The 80% prediction band (P10 to P90 at each horizon) is not the boundary of a single coherent worst-case or best-case scenario; it is the envelope of the marginal distributions, horizon by horizon.
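The marginal-quantile construction can be sketched in a few lines. This example assumes linear interpolation between order statistics, which the source does not specify, and uses three toy trajectories instead of 50 so the result is easy to follow. All names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def marginal_bands(trajectories, qs=(0.10, 0.50, 0.90)):
    """Compute quantiles independently at each horizon across sample paths."""
    horizons = len(trajectories[0])
    bands = {q: [] for q in qs}
    for h in range(horizons):
        column = sorted(path[h] for path in trajectories)
        for q in qs:
            bands[q].append(quantile(column, q))
    return bands

# Three crossing sample paths over two horizons.
samples = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
bands = marginal_bands(samples)
```

The lowest values at horizon 1 and horizon 2 come from different paths (the first and the third), so the resulting P10 line does not coincide with any single sampled trajectory.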

Each row below reports the longest verifiable horizon for that origin year, the furthest point at which we can compare the model's forecast against realized values. MdAE is the median absolute percentage error of the P50 forecast versus realized values. Coverage 80% is the fraction of realized values falling inside the P10–P90 band at that horizon. Model Wins shows the fraction of properties where the model beats a naive persistence baseline. Bold values meet horizon-adjusted targets (MdAE < 10% × √h, Coverage 65–95%, Wins > 50%). Full per-horizon breakdowns are in the technical note.

Jurisdiction	Origin	h	MdAE	Coverage 80%	Model Wins
NYC RPAD	2025	1	7.0%	87.2%	56.7%
NYC RPAD	2024	2	8.5%	96.7%	51.8%
NYC RPAD	2023	3	9.8%	94.2%	63.6%
NYC RPAD	2022	4	12.6%	94.0%	73.6%
NYC RPAD	2021	5	16.1%	87.9%	59.1%
NYC RPAD	2020	5	12.6%	90.2%	72.0%
NYC RPAD	2019	5	48.6%	89.1%	12.4%
ACS Nationwide	2023	1	7.3%	71.2%	41.7%
ACS Nationwide	2022	2	12.7%	68.6%	87.1%
ACS Nationwide	2021	3	30.2%	40.4%	88.8%
ACS Nationwide	2020	4	35.6%	46.8%	93.2%
ACS Nationwide	2019	5	42.8%	52.9%	87.9%

The 2019 origins span the 2021–2022 rate shock; elevated MdAE at h=5 for those vintages is expected given the regime break.
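The horizon-adjusted targets (MdAE < 10% × √h, Coverage 65–95%, Wins > 50%) are simple to apply to a row. The sketch below encodes them in a hypothetical helper and checks three rows copied from the table above.

```python
import math

def meets_targets(h, mdae, coverage, wins):
    """Check one backtest row against the horizon-adjusted targets."""
    return {
        "mdae": mdae < 0.10 * math.sqrt(h),
        "coverage": 0.65 <= coverage <= 0.95,
        "wins": wins > 0.50,
    }

# (jurisdiction, origin, h, MdAE, coverage, model wins), values from the table.
rows = [
    ("NYC RPAD", 2025, 1, 0.070, 0.872, 0.567),
    ("NYC RPAD", 2024, 2, 0.085, 0.967, 0.518),
    ("ACS Nationwide", 2021, 3, 0.302, 0.404, 0.888),
]
checks = {(name, origin): meets_targets(h, mdae, cov, wins)
          for name, origin, h, mdae, cov, wins in rows}
```

The NYC 2025 row passes all three targets. The NYC 2024 row passes on error and win rate but its 96.7% coverage exceeds the upper bound, and the ACS 2021 row passes only on win rate.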

Backtest Coverage: NYC Upper West Side (ZCTA 10025)

Each colored band shows what the model predicted (P10 to P90) from a given origin year. The solid line shows what actually happened. Well-calibrated predictions should consistently cover the actuals.

6. Serving

The serving layer separates three concerns:

Batch Inference (Modal, A100, sharded): Deterministic shards, watchdog streaming to GCS, database-level resume on failure.
Cached Aggregations (Supabase, PostGIS): Pre-materialized geographic rollups (ZCTA, Tract, Neighborhood) for sub-second map tooltips.
Live Queries (Redis, LRU caches): Feature retrieval for search and comparisons, with tiered caching for images.
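One common way to get deterministic shards with resumable progress is stable hashing of the record identifier. The sketch below is an assumption about how this could work, not a description of the production job: the function names, the choice of SHA-256, and the idea of tracking completed shard numbers are all hypothetical.

```python
import hashlib

def shard_of(parcel_id, n_shards):
    """Map a parcel identifier to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep assignments reproducible.
    """
    digest = hashlib.sha256(parcel_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def pending_shards(n_shards, completed):
    """Return the shards still to be processed after a restart."""
    return [s for s in range(n_shards) if s not in completed]

# After a failure, only shards not recorded as complete are re-run.
todo = pending_shards(4, completed={0, 2})
```

Because the assignment depends only on the identifier and the shard count, a restarted job reproduces the same partitioning and can skip finished shards safely.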

Infrastructure powered by Modal, Google Cloud, Supabase, and Redis.

7. Monitoring

Continuous Calibration

The calibration diagnostics from Section 5 are not one-time checks. As new actuals arrive, the same tests run against the production model to detect calibration drift. If interval coverage, tail accuracy, or horizon scaling moves outside its target range, the model is flagged for retraining.

The pattern of which tests fail is diagnostic. Coverage dropping while anchor integrity holds suggests miscalibrated dispersion. Anchor integrity failing suggests mean bias from regime drift. Horizon scaling failing while short-term metrics hold suggests forecast attenuation. Each pattern maps to a different remediation path.
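The mapping from failure pattern to diagnosis can be written as a small rule set. The sketch below encodes only the three rules stated above; the function name and the boolean inputs are hypothetical, and a real pipeline would presumably attach thresholds and remediation steps to each finding.

```python
def diagnose_pattern(coverage_ok, anchor_ok, horizon_scaling_ok, short_term_ok):
    """Translate which diagnostics failed into likely calibration problems."""
    findings = []
    if not anchor_ok:
        findings.append("mean bias from regime drift")
    if not coverage_ok and anchor_ok:
        findings.append("miscalibrated dispersion")
    if not horizon_scaling_ok and short_term_ok:
        findings.append("forecast attenuation")
    return findings

# Coverage has dropped, but the anchor and horizon scaling still hold.
print(diagnose_pattern(coverage_ok=False, anchor_ok=True,
                       horizon_scaling_ok=True, short_term_ok=True))
```

An empty list means none of the three named patterns matched, which is not the same as a clean bill of health: combinations outside these rules would still need manual review.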

8. Limitations

Appraisal vs. transaction prices: The model trains on county-appraised values, not recorded sale prices. Appraised values can lag market movements, smooth over short-term volatility, and diverge from actual transaction prices in rapidly changing markets.

ACS self-reporting: Census tract demographics from the American Community Survey are self-reported survey estimates, not administrative records. They carry sampling error and non-response bias, particularly in small geographies and hard-to-reach populations.

Data coverage: Not all U.S. jurisdictions publish property assessment data in machine-readable formats. Prediction intervals are wider in data-sparse regions because the input signal is weaker.

Distributional shift: The model learns price dynamics from historical data. When the macro environment diverges significantly from that history, forecasts may degrade because the learned distribution no longer fits the current conditions. This divergence is not detectable in real time; it becomes visible only as realized values accumulate and the monitoring pipeline measures calibration failure after the fact.

Attribution gap: Feature attributions are computed from the deterministic backbone, not the full stochastic pipeline. There is a gap between the sum of explained drivers and the calibrated trajectory.

9. Roadmap

Broader Jurisdiction Coverage

Onboarding additional state-level assessment databases beyond the four jurisdictions currently in production.

Real-Time Macro Conditioning

Shifting from annual macro snapshots to higher-frequency conditioning to capture intra-year rate movements.

Causal Feature Integration

Incorporating structured indicators of local policy changes (zoning, permitting) to improve regime-change sensitivity.

Interactive Scenario Analysis

Enabling users to condition forecasts on hypothetical macro scenarios ("What if rates drop 200bp?").

Explainable Forecasts (Coming Soon)

Post-hoc feature attribution from the deterministic backbone, surfacing directional driver summaries (rate expectations, supply-demand, regime shifts) in the UI.

References

[1] Y. Gorishniy et al. Revisiting Deep Learning Models for Tabular Data. NeurIPS 2021. arXiv:2106.11959

[2] J. Lee et al. Set Transformer: A Framework for Attention-based Permutation-Invariant Input. ICML 2019. arXiv:1810.00825

[3] J. Ho, A. Jain, P. Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239

[4] J. Song, C. Meng, S. Ermon. Denoising Diffusion Implicit Models. ICLR 2021. arXiv:2010.02502

[5] T. Hang et al. Efficient Diffusion Training via Min-SNR Weighting Strategy. ICCV 2023. arXiv:2303.09556

[6] T. Gneiting, A. E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. JASA, 102(477):359–378, 2007. doi:10.1198/016214506000001437

Data source references and full citation details are in the technical note (PDF).
