Abstract

Machine learning (ML) represents a potential paradigm shift in weather forecasting, but critical questions remain about its ability to predict high‐impact weather events. This study evaluates five state‐of‐the‐art ML models (Aurora, GraphCast, PanguWeather, FourCastNetV2, and FourCastNet) in forecasting U.S. West Coast atmospheric rivers (ARs), compared with the high‐performing, physics‐based high‐resolution forecast system (HRES) of the European Centre for Medium‐Range Weather Forecasts (ECMWF). Analysis of 152 daily forecast cycles (November 2023–March 2024) reveals significant performance differences among the systems. Although the ML models often achieve lower variable‐specific root mean square error (RMSE), HRES has superior AR detection skill for the first four forecast days. Beyond day four, PanguWeather matches HRES skill, while the other ML models lag slightly. Aurora consistently exhibits the lowest AR detection performance despite strong variable‐specific RMSE scores, highlighting a disconnect between RMSE and the ability to predict AR events. These findings underscore the need for phenomenon‐specific metrics when assessing ML‐based numerical weather prediction models and guiding their operational implementation.
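The gap between RMSE and event detection can be illustrated with a toy calculation. The sketch below is a hypothetical example, not the study's method or data. It assumes a one‐dimensional transect of integrated vapor transport (IVT), a fixed AR threshold of 250 kg m⁻¹ s⁻¹ (a value common in the AR literature, assumed here), and a deliberately crude detector that flags an AR whenever IVT exceeds the threshold anywhere. The names TRUTH, SMOOTH, SHIFTED, rmse and detects_ar are illustrative only. A damped, spread‐out forecast scores a lower RMSE than a sharp but slightly displaced one, yet only the displaced forecast detects the event.

```python
import math

# Hypothetical 1-D transect of IVT in kg m^-1 s^-1.
# Illustrative numbers only; these are NOT data from the study.
TRUTH = [100, 100, 100, 300, 400, 300, 100, 100, 100, 100]

# Hypothetical blurred forecast: the AR peak is damped and spread out.
SMOOTH = [120, 130, 160, 200, 230, 200, 160, 130, 120, 110]

# Hypothetical sharp forecast: correct intensity, displaced by two grid points.
SHIFTED = [100, 100, 100, 100, 100, 300, 400, 300, 100, 100]

# Assumed AR threshold (a common literature value, not taken from the study).
IVT_THRESHOLD = 250.0


def rmse(forecast, truth):
    """Root mean square error over all grid points of the transect."""
    squared = [(f - t) ** 2 for f, t in zip(forecast, truth)]
    return math.sqrt(sum(squared) / len(truth))


def detects_ar(field, threshold=IVT_THRESHOLD):
    """Toy detector: an AR is 'present' if IVT reaches the threshold anywhere."""
    return max(field) >= threshold


if __name__ == "__main__":
    for name, field in [("smooth", SMOOTH), ("shifted", SHIFTED)]:
        print(f"{name}: RMSE={rmse(field, TRUTH):.1f}, AR detected={detects_ar(field)}")
```

Running the script prints an RMSE of about 76.7 for SMOOTH (AR missed) and about 161.2 for SHIFTED (AR detected). RMSE penalizes a displacement error twice, once where the peak is missing and once where it is misplaced, which is one reason grid‐point error scores can favor blurred forecasts over event‐aware ones.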
