Maria Castillo spent $18,400 on a whole-home energy retrofit in 2024. New insulation in the attic and crawlspace, a heat pump replacing a twenty-year-old gas furnace. Air sealing around every window and door, measured by a blower door test that confirmed her contractor had reduced infiltration by 42 percent. An energy model, generated by her utility's rebate program software, projected annual savings of $1,920, which meant a payback period of under ten years and a significant reduction in her natural gas consumption through the Colorado winters.
Her actual savings after twelve months of utility bills: $408. Not $1,920. Not even close. A 3.4 percent reduction in total energy costs where the software had promised twenty-five. And she is not unusual, not unlucky, not a statistical outlier living in a poorly oriented house with bad ductwork and teenagers who leave every window open from October to March.
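The arithmetic behind those two numbers is worth spelling out. The sketch below uses only the three figures already given (retrofit cost, projected savings, measured savings) and assumes simple, undiscounted payback with constant energy prices and no rebates; the function and variable names are illustrative.

```python
def simple_payback_years(upfront_cost: float, annual_savings: float) -> float:
    """Simple (undiscounted) payback: upfront cost divided by annual savings."""
    if annual_savings <= 0:
        raise ValueError("annual savings must be positive")
    return upfront_cost / annual_savings


RETROFIT_COST = 18_400      # dollars, from the article
PROJECTED_SAVINGS = 1_920   # dollars per year, utility software projection
MEASURED_SAVINGS = 408      # dollars per year, twelve months of actual bills

projected_payback = simple_payback_years(RETROFIT_COST, PROJECTED_SAVINGS)
measured_payback = simple_payback_years(RETROFIT_COST, MEASURED_SAVINGS)

# Realization rate: the share of predicted savings that actually showed up.
realization_rate = MEASURED_SAVINGS / PROJECTED_SAVINGS

print(f"Projected payback: {projected_payback:.1f} years")   # 9.6 years
print(f"Measured payback:  {measured_payback:.1f} years")    # 45.1 years
print(f"Realization rate:  {realization_rate:.0%}")          # 21%
```

At the measured rate, the payback period is longer than the service life of most of the equipment installed.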
A systematic review published in the Annual Review of Resource Economics examined 39 evaluations of 23 residential retrofit programs spanning 1984 to 2021, covering 140,977 retrofitted households whose post-retrofit energy consumption was measured with actual billing data. Not modeled. Not simulated. Measured. Average reduction in electricity or fuel consumption across the entire sample: 7.2 percent. None of the programs in the sample achieved deep energy savings of 50 percent or greater.
That number understates the problem. Many households use both electricity and natural gas, so total household energy savings were almost certainly smaller than 7.2 percent because the studies measured individual fuel types, not combined consumption. And the review found something that should worry anyone commissioning an energy model: reported savings decreased as study rigor increased. Programs evaluated with weak methodologies claimed higher savings. Programs evaluated with randomized controlled trials and matched comparison groups found less.
What 11 Million Utility Bills Actually Show
Researchers at UCLA conducted what may be the largest empirical analysis of residential energy upgrades ever attempted, examining billing records for 11 million households in Southern California across multiple subsidy programs from 2010 to 2015. Overall electricity reduction from adopting efficiency upgrades: 4 percent.
Four percent. Pool pump replacements delivered the best returns at 13 percent. Refrigerator upgrades managed 6 percent. But HVAC retrofits, the single largest line item in most whole-home energy upgrades and the category that energy models are most confident about predicting, produced savings of less than 1 percent in the measured data. Less than one percent. Lighting upgrades performed similarly poorly. Building envelope improvements, the insulation and air sealing work that forms the foundation of nearly every retrofit recommendation, showed rebound effects in some cases: households actually increased their energy consumption after the upgrade because they felt free to set their thermostats higher in a better-insulated home. That behavioral response makes perfect economic sense, and it renders thermodynamic modeling alone useless as a savings predictor.
The UCLA authors were blunt: "Energy savings are inconsistent with the engineering estimates."
A Randomized Trial Put a Number on the Gap
Engineering estimates are not guesses. They are calculations based on thermodynamic models, R-values, SEER ratings, heating degree days, and building envelope specifications that have been refined over decades by competent engineers working with genuine physics. They should be accurate, and the fact that they are systematically wrong by factors of two to four demands an explanation beyond "occupant behavior varies."
Economists Meredith Fowlie, Michael Greenstone, and Catherine Wolfram provided one. Their study of Michigan's Weatherization Assistance Program, conducted through the University of Chicago's Energy Policy Institute, used a randomized controlled trial to compare homes that received weatherization to comparable homes that did not. Randomization eliminates self-selection bias, weather confounds, and the other methodological weaknesses that inflate savings estimates in observational studies.
Model projections were 2.5 times the actual measured savings. Upfront costs were about twice the value of the energy saved. Annual return on the efficiency investment: negative 7.8 percent.
Negative.
The Department of Energy contested these findings, pointing to its own Oak Ridge National Laboratory analysis that estimated a 4-to-1 benefit-cost ratio when including health, safety, and non-energy benefits. Fowlie and colleagues published a rebuttal challenging the DOE's valuation methods and statistical assumptions. The debate is instructive because it reveals how badly the field wants these programs to work, and how willing institutional actors are to broaden the definition of "benefits" when the energy savings alone cannot justify the expense.
AI Makes the Wrong Number Faster
Into this gap between promise and measurement walks a new generation of AI-powered energy modeling tools. AutoML platforms now achieve R-squared values of 0.993 on building energy prediction benchmarks, according to a 2026 study published in MDPI's Buildings journal. SHAP interpretability analysis identifies solar heat gain coefficient and U-values as the dominant predictive features. Rocky Mountain Institute found that algorithm-based energy estimates for 8,000 homes across 27 states showed 20 to 30 percent average absolute difference from on-site assessments, which RMI characterized as "accurate enough" to be useful.
Accurate enough for what, exactly? For telling a homeowner she will save $1,920 a year when her bills show $408, and when the measured record across a hundred and forty thousand homes, compiled by researchers with no financial stake in the retrofit industry, says her result is typical?
An R-squared of 0.993 sounds extraordinary until you understand what it measures. These models are validated against other models, not against twelve months of post-retrofit utility bills. They are trained on simulated building energy data, which means they learn the patterns embedded in the same engineering assumptions that the Michigan RCT found to overstate savings by a factor of 2.5. A model trained on biased data does not produce unbiased predictions. It produces biased predictions with narrow confidence intervals and three decimal places of false precision.
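A toy example makes the distinction concrete. Everything below is synthetic and invented for illustration: the 500 homes, the noise levels, and the assumption that billed savings equal simulated savings divided by 2.5 (the Michigan ratio) plus household-level scatter. It is a sketch of the validation problem, not a reconstruction of any published model.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# SYNTHETIC: percent savings that a physics simulation assigns to 500 homes.
simulated = [rng.uniform(10, 40) for _ in range(500)]

# An ML model trained to imitate the simulator: tiny error against it.
ml_prediction = [s + rng.gauss(0, 0.5) for s in simulated]

# ASSUMED billed savings: the simulation overstates by 2.5x, plus noise
# standing in for occupant behavior and installation quality.
billed = [s / 2.5 + rng.gauss(0, 4) for s in simulated]

r2_vs_simulation = r_squared(simulated, ml_prediction)
r2_vs_bills = r_squared(billed, ml_prediction)
mean_overprediction = sum(
    p - b for p, b in zip(ml_prediction, billed)
) / len(billed)

print(f"R^2 against simulated targets: {r2_vs_simulation:.3f}")
print(f"R^2 against billed savings:    {r2_vs_bills:.3f}")
print(f"Mean overprediction:           {mean_overprediction:.1f} points")
```

Under these assumptions the same predictions score above 0.99 against the simulator and below zero against the bills, which is worse than guessing the average. The headline metric depends entirely on what the model is graded against.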
A manual energy model tells you: "You'll save 25 percent on heating." An AI energy model tells you: "You'll save 24.7 percent on heating, R-squared equals 0.993." Both are probably off by 15 to 18 percentage points. But the AI version discourages the skepticism that the manual version at least left room for. Three decimal places read like certainty to a homeowner writing a $15,000 check.
Where the Models Break
A review of the energy performance gap published in Frontiers in Mechanical Engineering analyzed 62 case study buildings and found that actual energy use deviated from predictions by an average of 34 percent, with a standard deviation of 55 percent. Some buildings used twice their modeled consumption. The dominant factors driving the gap, with the review's estimated effect for each: specification uncertainty in modeling (20 to 60 percent), occupant behavior (10 to 80 percent), and poor operational practices (15 to 80 percent).
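To get a feel for what a 34 percent mean with a 55 percent standard deviation implies, here is a rough sketch. It assumes the deviations are normally distributed, which the review does not claim and which real performance-gap data probably violate (they are likely skewed), so treat the outputs as order-of-magnitude only.

```python
from statistics import NormalDist

# Reported in the review: mean deviation +34%, standard deviation 55%.
# ASSUMPTION: a normal distribution, chosen only because it is easy to reason
# about. The underlying 62-building sample is not reproduced here.
gap = NormalDist(mu=0.34, sigma=0.55)

# Share of buildings using more energy than predicted (deviation > 0).
share_over_predicted = 1 - gap.cdf(0.0)

# Share using at least twice the modeled consumption (deviation > 100%).
share_double = 1 - gap.cdf(1.0)

# Range that would contain the middle 90% of buildings.
low, high = gap.inv_cdf(0.05), gap.inv_cdf(0.95)

print(f"Use more than predicted:   {share_over_predicted:.0%}")
print(f"Use double the prediction: {share_double:.0%}")
print(f"Middle 90% of deviations:  {low:+.0%} to {high:+.0%}")
```

Even under this simplifying assumption, roughly one building in nine would use double its modeled energy, and the spread is wide enough that a single-point prediction says very little about any individual building.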
Occupant behavior is the elephant that energy models pretend is a mouse. A TU Delft analysis of over one million Dutch homes found that in some dwellings predicted and actual consumption differed by a factor of three to four. Not 20 percent off, not 50 percent. Three hundred percent off, driven not by insulation quality or window specifications or any variable that appears in a building information model, but by the humans who live inside the building, who open windows in January because they like fresh air, who run space heaters in rooms the central system already conditions, who leave for vacation and forget to adjust the thermostat, who cook elaborate meals six nights a week or never cook at all, and whose aggregate behavior varies so enormously from household to household that no standardized schedule assumption can represent even a plurality of them accurately.
AI models handle occupant behavior the same way traditional models do: they assume standardized schedules, thermostat set points, and occupancy patterns derived from survey averages. SHAP analysis on the best-performing AutoML models highlights building envelope variables as the dominant features. Occupant behavior does not appear as a top feature because it is not in the training data in any useful form. This means the single largest source of prediction error is systematically excluded from the models that claim to predict most accurately.
What You Should Actually Expect
If you are considering a whole-home energy retrofit and an energy model says you will save 20 to 30 percent, divide that number by three. You will land closer to the measured reality across 140,000 homes.
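As a back-of-envelope helper, the rule can be written down directly. The default divisor of 3 is this article's rule of thumb; 2.5 is the overprediction ratio from the Michigan trial. Neither is a calibrated correction for any particular home, and the function name is made up for illustration.

```python
def derate_modeled_savings(modeled_pct: float,
                           overprediction_factor: float = 3.0) -> float:
    """Shrink a modeled savings percentage by an assumed overprediction factor."""
    if overprediction_factor <= 0:
        raise ValueError("overprediction_factor must be positive")
    return modeled_pct / overprediction_factor


for modeled in (20, 25, 30):
    rule_of_three = derate_modeled_savings(modeled)       # article's rule
    rct_ratio = derate_modeled_savings(modeled, 2.5)      # Michigan RCT ratio
    print(f"Modeled {modeled}% -> expect roughly "
          f"{rule_of_three:.1f}% to {rct_ratio:.1f}%")
```

A modeled 20 to 30 percent becomes roughly 7 to 12 percent, which brackets the 7.2 percent average from the systematic review.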
Specific guidance based on the empirical evidence, not the models:
Insulation and air sealing are the most consistently effective retrofits in the systematic review data. Budget $3,000 to $6,000 for attic and crawlspace work. Expect 5 to 10 percent reduction in heating costs, not the 25 percent that models project. Still worth doing if your home was built before 1980 and has less than R-19 in the attic.
Heat pump replacement of a gas furnace makes thermodynamic sense in climate zones 3 through 5 where heating degree days are moderate and electricity rates are below $0.15 per kilowatt-hour. Budget $8,000 to $14,000 installed. But the UCLA data showed HVAC retrofits delivering less than 1 percent measured electricity savings in Southern California, a region where heat pumps should perform at their theoretical best. In cold climates, expect the gap between modeled and actual savings to widen, not narrow, because heat pump efficiency degrades at low temperatures. Models account for that degradation mathematically, but real-world installations often underperform further because of duct losses, refrigerant charge issues, and auxiliary heat strip activation.
Smart thermostats delivered some of the best cost-per-saved-kWh ratios in the systematic review. At $150 to $300 installed, even a 3 percent reduction in HVAC runtime can pay back within a few years in a home with high heating and cooling bills. Modest savings. Modest investment. That combination pencils out.
Pool pumps and refrigerators quietly outperform sexier upgrades. If you have a single-speed pool pump, a variable-speed replacement was the best performer in the UCLA billing data at 13 percent measured electricity savings, and the appliance often qualifies for utility rebates that cover 30 to 50 percent of cost.
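Whether any of these measures pays back depends on how large the bill is that the percentage applies to. The sketch below turns the question around: given a cost and a savings fraction from the guidance above, how much would a household need to be spending each year on the affected end use for the measure to repay itself within a target period? The target periods (two years for thermostats, ten for insulation) and the simple undiscounted payback are assumptions for illustration.

```python
def breakeven_annual_spend(installed_cost: float, savings_fraction: float,
                           payback_years: float) -> float:
    """Annual spend on the affected end use needed for savings to repay
    installed_cost within payback_years (simple payback, no discounting)."""
    if savings_fraction <= 0 or payback_years <= 0:
        raise ValueError("savings_fraction and payback_years must be positive")
    return installed_cost / (savings_fraction * payback_years)


# (label, installed cost in dollars, savings fraction, target payback in years)
# Costs and savings fractions come from the guidance above; the payback
# targets are assumptions chosen for illustration.
scenarios = [
    ("Smart thermostat, low cost", 150, 0.03, 2),
    ("Smart thermostat, high cost", 300, 0.03, 2),
    ("Insulation, best case", 3_000, 0.10, 10),
    ("Insulation, worst case", 6_000, 0.05, 10),
]

for label, cost, fraction, years in scenarios:
    spend = breakeven_annual_spend(cost, fraction, years)
    print(f"{label:<28} {years:>2}-year payback needs "
          f"about ${spend:,.0f}/year in affected energy spend")
```

A two-year thermostat payback at 3 percent requires $2,500 to $5,000 a year in heating and cooling costs. A ten-year insulation payback requires $3,000 to $12,000 a year in heating costs. Compare those thresholds to your own bills before trusting any payback claim.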
What AI Could Do Right
None of this means AI energy modeling is worthless. Far from it. It means the current generation of models is solving the wrong problem, optimizing predictions of simulated building performance when they should be predicting actual utility bills, which are influenced by a completely different set of variables that no physics engine captures.
ACEEE found that calibrating energy models against historical utility billing data improved realization rates from 61 percent to 91 percent. That is a meaningful improvement, achieved not by building a better thermodynamic model but by checking the model's output against the only number that matters: what the homeowner actually pays.
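For readers unfamiliar with the term, a realization rate is measured savings divided by predicted savings. The sketch below computes one, then shows a deliberately crude form of billing calibration: scaling a prediction by the ratio of billed to modeled baseline consumption. Real calibration procedures adjust model inputs until simulated monthly use matches the bills; this single scale factor is a simplified stand-in, and the therm figures are hypothetical.

```python
def realization_rate(measured_savings: float, predicted_savings: float) -> float:
    """Share of predicted savings that appeared in measured data."""
    if predicted_savings <= 0:
        raise ValueError("predicted_savings must be positive")
    return measured_savings / predicted_savings


def calibrate_prediction(predicted_savings: float,
                         modeled_baseline_use: float,
                         billed_baseline_use: float) -> float:
    """Scale predicted savings by how far the model's pre-retrofit baseline
    sits from what the household was actually billed for.

    A crude stand-in for real calibration, which tunes model inputs instead.
    """
    if modeled_baseline_use <= 0:
        raise ValueError("modeled_baseline_use must be positive")
    calibration_factor = billed_baseline_use / modeled_baseline_use
    return predicted_savings * calibration_factor


# HYPOTHETICAL home: the model assumes 1,200 therms/year before the retrofit,
# but twelve months of bills show 800. The model predicts 300 therms saved.
uncalibrated = 300.0
calibrated = calibrate_prediction(uncalibrated,
                                  modeled_baseline_use=1_200,
                                  billed_baseline_use=800)

print(f"Uncalibrated prediction: {uncalibrated:.0f} therms/year")
print(f"Calibrated prediction:   {calibrated:.0f} therms/year")
print(f"Realization rate for $408 measured vs $1,920 predicted: "
      f"{realization_rate(408, 1_920):.0%}")
```

A model that overstates how much energy a home uses to begin with will overstate how much a retrofit can save, and the bills reveal that error before any work is done.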
The first company to train a model on ten million paired pre-retrofit and post-retrofit utility billing records, matched to specific upgrade packages and climate zones, will produce genuinely useful predictions. That dataset exists in fragments across utility rebate programs, weatherization agencies, and ENERGY STAR portfolios. Nobody has assembled it at scale and made it available for machine learning. When someone does, the predictions will probably be less impressive on paper, with R-squared values closer to 0.7 than 0.99, and far more useful in practice, because they will reflect the messy reality of occupant behavior, contractor quality, and equipment performance degradation rather than the clean physics of a simulated building that no one actually lives in.
What This Analysis Did Not Prove
I should be honest about the limits of this argument. The 7.2 percent average includes retrofit programs dating to 1984, and modern materials, installation techniques, and building science have improved meaningfully since then. Deep retrofits, combining envelope work with electrification and renewable generation, can deliver 58 to 79 percent reductions according to ACEEE, though these cost $40,000 to $80,000 and are not what most homeowners mean when they say "retrofit." The performance gap studies from Frontiers analyzed non-domestic buildings, so their figures may overstate the gap for single-family homes. Climate zone matters enormously, and national averages obscure regional variation. And some homeowners do save more than predicted, because they are more careful with energy than the standardized assumptions give them credit for.
But the core finding holds across every rigorous study I reviewed: models systematically overpredict savings, the gap is large, and increasing model sophistication has not closed it. If anything, the overconfidence that comes with machine learning precision makes the problem worse by suppressing the healthy skepticism that a rougher estimate would preserve.
Write that on the spreadsheet, then write the check.