Engineering Blog · Machine Learning

Architecting Nationwide Probabilistic Forecasts at the Parcel Level

How we evaluated against a Zillow-based baseline using spatial generative models, calibration testing, and faster inference.


David Hardesty Lewis

ML Engineer & Founder

Many consumer real-estate products present a single forecast value. We instead estimate a distribution over future parcel values, because local housing markets are noisy, heterogeneous, and exposed to shocks that point forecasts compress into one number. Homecastr's pipeline produces probabilistic forecasts for over 150 million parcels.

1. Data and MLOps

At national scale, manual ETL does not hold up. Our pipeline ingests county assessment rolls, parcel geometries, and macro features from multiple public sources, including the Florida Department of Revenue (DOR) rolls and NYC RPAD files.

Transformed features are stored in PostGIS, Supabase, and Redis to support both batch training and low-latency serving.

Pipeline stages:

Data Ingestion: Supabase / PostGIS, fed by Florida DOR & NYC RPAD
Feature Store: structural + macro features, Redis batch retrieval
Model Inference: Schrödinger Bridge + FT-Transformer
Serving Tier: explainable SHAP attributions, P10/P50/P90 arrays
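The dual role of the feature store (batch training reads and low-latency serving reads) can be sketched as a small key-value interface. This is a toy stand-in: the in-memory dict takes the place of the Redis client, and the `parcel:` key prefix and parcel IDs are hypothetical, not taken from our production schema.

```python
import json

class FeatureStore:
    """Toy stand-in for the serving tier: in production the dict would be
    a Redis client, and batch reads would fall back to PostGIS."""

    def __init__(self):
        self._kv = {}

    def put(self, parcel_id, features):
        # Features are serialized once at write time so reads stay cheap.
        self._kv[f"parcel:{parcel_id}"] = json.dumps(features)

    def get(self, parcel_id):
        raw = self._kv.get(f"parcel:{parcel_id}")
        return json.loads(raw) if raw is not None else None

    def get_batch(self, parcel_ids):
        # Batch retrieval path used by training jobs.
        return {pid: self.get(pid) for pid in parcel_ids}

store = FeatureStore()
store.put("FL-12345", {"living_area": 1850, "year_built": 1987})
```

The same interface serves single-parcel lookups during inference and bulk reads during training, which is the property the Redis/PostGIS split is meant to preserve.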

2. Model Architecture

In our experiments, gradient-boosted baselines produced competitive point forecasts but did not capture multi-horizon uncertainty as well as a generative approach. Our current model combines an FT-Transformer tabular encoder (a self-attention model over heterogeneous features) with learned spatial tokens that summarize nearby parcel context, and a Schrödinger Bridge diffusion decoder for multi-horizon uncertainty.

The feature set spans several categories: structural attributes (living area, land area, year built, bedrooms, bathrooms, stories), locational coordinates (latitude, longitude), and macroeconomic indicators (mortgage rates, federal funds rate, CPI, oil prices, unemployment, VIX, 10-year treasury, global economic policy uncertainty). We use DDIM (Denoising Diffusion Implicit Models) sampling to generate percentile paths (P10, P50, P90) directly, with loss normalized independently at each forecast horizon, instead of fitting uncertainty after the point forecast is produced.
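The percentile-path idea can be illustrated with a minimal sketch: draw many sampled trajectories, then take per-horizon quantiles. The random-walk sampler below is a toy placeholder for the actual Schrödinger Bridge decoder with DDIM sampling; the drift and volatility numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_paths(n_samples, horizons):
    # Toy stand-in for the diffusion decoder: each sample is a
    # multi-horizon cumulative log-return path. The real model
    # conditions on the FT-Transformer encoding and denoises via DDIM.
    drift = 0.03
    shocks = rng.normal(0.0, 0.08, size=(n_samples, horizons))
    return np.cumsum(drift + shocks, axis=1)

paths = sample_paths(n_samples=500, horizons=4)
# Per-horizon quantiles over the sampled trajectories, i.e. the
# P10/P50/P90 arrays that get served.
p10, p50, p90 = np.percentile(paths, [10, 50, 90], axis=0)
```

Because the quantiles are taken independently at each horizon, the band width is free to grow or shrink with horizon rather than being fixed by a post-hoc uncertainty model.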

3. Evaluation

Skipping offline evaluation raises the risk of unnoticed performance drift. We evaluate each model candidate on held-out years across two axes: point-forecast accuracy against an industry baseline, and probabilistic calibration.

1-Year Error (vs. Zillow Baseline): 8.0% vs. 8.4%

Lower 1-year held-out error than the Zillow baseline in our test set.

Long-Horizon Stability: ~25% MdAE

Median Absolute Error remained roughly stable over a 4-year forecast horizon.

Beyond standard error metrics, all model candidates pass through a calibration suite before promotion to production: PIT (Probability Integral Transform) histograms to verify that forecast quantiles are uniformly distributed, empirical interval coverage checks (does the 80% band actually contain ~80% of outcomes?), and tail calibration tests to ensure the P10 and P90 bands are not systematically too narrow or too wide.
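The first two checks are straightforward to compute from forecast samples. Below is a minimal sketch of both, using synthetic data where the forecaster is correct by construction; function names are ours, not a library API.

```python
import numpy as np

rng = np.random.default_rng(1)

def pit_values(actuals, samples):
    """PIT: fraction of forecast samples at or below each realized value.
    For a calibrated forecaster these should look ~Uniform(0, 1)."""
    return (samples <= actuals[:, None]).mean(axis=1)

def interval_coverage(actuals, lo, hi):
    """Share of outcomes falling inside the [lo, hi] band.
    A calibrated 80% band should cover roughly 80% of outcomes."""
    return float(((actuals >= lo) & (actuals <= hi)).mean())

# Synthetic sanity check: samples drawn from the same distribution
# as the actuals, so both diagnostics should pass.
actuals = rng.normal(size=2000)
samples = rng.normal(size=(2000, 400))

pit = pit_values(actuals, samples)
lo, hi = np.percentile(samples, [10, 90], axis=1)
cov = interval_coverage(actuals, lo, hi)
```

A histogram of `pit` that sags in the middle signals over-wide bands; one that piles up at 0 and 1 signals bands that are too narrow, which is exactly what the tail-calibration test guards against.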

Backtest Coverage — NYC Upper West Side (ZCTA 10025)

Each colored band shows what the model predicted (P10–P90) from a given origin year. The solid line shows what actually happened. Well-calibrated predictions should consistently cover the actuals.

4. Explainable Forecasts

Accuracy alone is not enough if users cannot understand why a forecast changed. We run a post-hoc attribution step we call Surrogate SHAP: a gradient-based method that extracts approximate local feature attributions from the FT-Transformer layers and maps them to readable variables (e.g., interest rates, local zoning changes, demographic shifts).

These attributions feed the "Explainable Forecasts" UI on the platform, giving users per-forecast breakdowns of which inputs contributed most to the predicted trajectory.
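The flavor of a local, gradient-times-input attribution can be shown with a small sketch. This is not our Surrogate SHAP implementation (which works on the FT-Transformer's internal layers); it uses finite differences as a stand-in for the gradient, and the two-feature toy model and its coefficients are invented for illustration.

```python
import numpy as np

def local_attributions(f, x, baseline, eps=1e-4):
    """Gradient-times-input style score: grad_i(x) * (x_i - baseline_i).
    Finite differences stand in for autodiff in this sketch."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    grads = np.empty_like(x)
    for i in range(x.size):
        bump = np.zeros_like(x)
        bump[i] = eps
        grads[i] = (f(x + bump) - f(x - bump)) / (2 * eps)
    return grads * (x - baseline)

# Hypothetical forecast head: value responds negatively to mortgage
# rates and positively to living area (in thousands of sq ft).
def toy_model(x):
    return -40.0 * x[0] + 120.0 * x[1]

attr = local_attributions(toy_model, x=[0.07, 1.8], baseline=[0.05, 1.5])
```

Scores like these, mapped back to named inputs, are what populate the per-forecast breakdowns in the UI.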

5. Serving and Cost

A model is less useful if inference cost or latency is too high for production. Once the offline pipeline completes, we split the workload into deterministic shards and run them in parallel to produce the full set of probabilistic arrays.
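Deterministic sharding can be as simple as a stable hash of the parcel ID, so that re-running a failed worker reproduces exactly the same shard. A minimal sketch (the ID format and shard count are illustrative):

```python
import hashlib

def shard_of(parcel_id, n_shards=64):
    # A cryptographic hash gives a stable, uniform assignment:
    # the same parcel always lands in the same shard, so workers
    # can be retried idempotently.
    digest = hashlib.sha256(parcel_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

def split_workload(parcel_ids, n_shards=64):
    shards = [[] for _ in range(n_shards)]
    for pid in parcel_ids:
        shards[shard_of(pid, n_shards)].append(pid)
    return shards

shards = split_workload([f"FL-{i:07d}" for i in range(1000)], n_shards=8)
```

Each shard is then processed independently, and the resulting P10/P50/P90 arrays are written back without any cross-shard coordination.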

On the frontend, we use a three-tier server-side cache (Supabase, GCS, and Google APIs) with localized LRU caches for Street View images. This reduces our effective image cost to $0.007 per image and keeps typical render times below one second on tested mobile devices.
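The localized LRU layer can be sketched with an ordered map: hits are promoted to most-recently-used, and the oldest entry is evicted on overflow, with misses falling through to the slower (and paid) remote tier. This is a toy single-tier version; the production stack layers Supabase, GCS, and the Google APIs behind it.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # promote to most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

def fetch_image(key, local, remote_fetch):
    """Check the local LRU first; only a miss pays the remote-tier cost."""
    hit = local.get(key)
    if hit is not None:
        return hit
    value = remote_fetch(key)
    local.put(key, value)
    return value
```

Every local hit avoids a remote fetch, which is how the effective per-image cost is driven down while keeping render latency low.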


Interested in the engineering challenges behind Homecastr? Feel free to connect on LinkedIn.