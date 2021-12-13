



Posted by Sercan O. Arik, Research Scientist, and Tomas Pfister, Engineering Manager, Google Cloud

Multi-horizon forecasting, that is, predicting variables of interest at multiple future time steps, is an important challenge in time-series machine learning. Most real-world datasets have a time component, and forecasting the future can unlock great value. For example, retailers can use future sales to optimize supply chains and promotions, investment managers are interested in predicting future prices of financial assets to maximize performance, and healthcare institutions can use the number of future patient admissions to secure sufficient personnel and equipment.

Deep neural networks (DNNs) are increasingly being used in multi-horizon forecasting, demonstrating significant performance improvements over traditional time series models. Many models (DeepAR, MQRNN, etc.) focus on variants of recurrent neural networks (RNNs), but recent improvements, including Transformer-based models, use attention-based layers to enhance the selection of relevant past time steps beyond the inductive bias of RNNs, which is the sequential, ordered processing of information. However, these models often do not take into account the different types of inputs commonly present in multi-horizon forecasting: they either assume that all exogenous inputs are known into the future or neglect important static covariates.

Multi-horizon prediction using static covariates and various time-dependent inputs.

Moreover, conventional time series models are controlled by complex non-linear interactions between many parameters, making it difficult to explain how such models arrive at their predictions. Unfortunately, common methods for explaining the behavior of DNNs have limitations. For example, post-hoc methods (such as LIME and SHAP) do not consider the order of input features. Some attention-based models have been proposed with inherent interpretability for sequential data, primarily language or speech, but multi-horizon forecasting involves many different types of inputs, not just language or speech. Attention-based models can provide insight into the relevant time steps, but cannot distinguish the importance of different features at a particular time step. New methods are needed to address the heterogeneity of data in multi-horizon forecasting for high performance, and to make these forecasts interpretable.

To that end, we are announcing “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”, published in the International Journal of Forecasting, in which we propose the Temporal Fusion Transformer (TFT), an attention-based DNN model for multi-horizon forecasting. TFT is designed to explicitly align the model with the general multi-horizon forecasting task for both excellent accuracy and interpretability, which we demonstrate across various use cases.

Temporal Fusion Transformer

We design TFT to efficiently build feature representations for each input type (that is, static, known, or observed inputs) for high predictive performance. The main components of TFT (shown below) are:

Gating mechanisms skip over any unused components of the model (as learned from the data), providing adaptive depth and network complexity to accommodate a wide range of datasets.

Variable selection networks select relevant input variables at each time step. While conventional DNNs may overfit to irrelevant features, attention-based variable selection can improve generalization by encouraging the model to anchor most of its learning capacity on the most salient features.

Static covariate encoders integrate static features to control how temporal dynamics are modeled. Static features can have a significant impact on forecasts. For example, a store's location can lead to different temporal dynamics in sales (a rural store may see more weekend traffic, while a downtown store may see daily peaks after working hours).

Temporal processing learns both long-term and short-term temporal relationships from both observed and known time-varying inputs. A sequence-to-sequence layer is used for local processing because its inductive bias for ordered information processing is beneficial, while long-term dependencies are captured using a new interpretable multi-head attention block. This can shorten the effective path length of information: any past time step that contains relevant information (such as last year's sales) can be attended to directly.

Prediction intervals show quantile forecasts to determine the range of target values at each prediction horizon. This helps the user understand the distribution of the output, not just the point forecast.

TFT takes as input static metadata, time-varying past inputs, and time-varying future inputs that are known in advance. Variable selection is used to judiciously select the most salient features based on the input. Gated information is added as a residual input and then normalized. Gated Residual Network (GRN) blocks enable efficient information flow using skip connections and gating layers. Time-dependent processing is based on LSTMs for local processing and on multi-head attention for integrating information from any time step.
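To make the gating and variable selection ideas more concrete, here is a minimal NumPy sketch. It assumes one common form of a gated residual block (dense layer with ELU, a second dense layer, a sigmoid gate, a residual connection, then layer normalization) and a softmax over per-variable scores for variable selection. The function names, weight names, and sizes are hypothetical, biases are omitted, and the weights are random rather than trained, so this illustrates the data flow only and is not the reference implementation.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_grn(rng, d_in, d_hidden, d_out):
    """Random, untrained weights for one hypothetical GRN (no biases)."""
    w = lambda m, n: rng.normal(scale=0.1, size=(m, n))
    return {"w1": w(d_in, d_hidden), "w2": w(d_hidden, d_out),
            "w_gate": w(d_out, d_out), "w_val": w(d_out, d_out),
            "w_skip": w(d_in, d_out)}

def grn(x, p):
    """Gated residual block: x has shape (batch, d_in)."""
    hidden = elu(x @ p["w1"])
    candidate = hidden @ p["w2"]
    gate = sigmoid(candidate @ p["w_gate"])   # values in (0, 1)
    gated = gate * (candidate @ p["w_val"])   # a gate near 0 skips the block
    # Residual (skip) connection, then normalization.
    return layer_norm(x @ p["w_skip"] + gated)

def select_variables(embeddings, weight_grn, var_grns):
    """embeddings: (num_vars, d_model) for a single time step."""
    num_vars, d_model = embeddings.shape
    flat = embeddings.reshape(1, num_vars * d_model)
    weights = softmax(grn(flat, weight_grn))[0]          # (num_vars,)
    processed = np.stack([grn(embeddings[i:i + 1], var_grns[i])[0]
                          for i in range(num_vars)])     # (num_vars, d_model)
    combined = (weights[:, None] * processed).sum(axis=0)
    return combined, weights

rng = np.random.default_rng(0)
num_vars, d_model = 3, 8   # hypothetical sizes
embeddings = rng.normal(size=(num_vars, d_model))
weight_grn = init_grn(rng, num_vars * d_model, 16, num_vars)
var_grns = [init_grn(rng, d_model, 16, d_model) for _ in range(num_vars)]
combined, weights = select_variables(embeddings, weight_grn, var_grns)
print(weights)   # one non-negative weight per variable, summing to 1
```

The selection weights are what make this component inspectable: because they are non-negative and sum to one at each time step, they can be read as the relative importance of each input variable.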

Predictive Performance

We compare TFT with a wide range of models for multi-horizon forecasting, including various deep learning models that use iterative methods (DeepAR, DeepSSM, ConvTrans, etc.) and direct methods (LSTM Seq2Seq, MQRNN, etc.), as well as traditional models such as ARIMA, ETS, and TRMF. Below is a comparison with a truncated list of models.

Model     Electricity      Traffic          Volatility      Retail
ARIMA     0.154 (+180%)    0.223 (+135%)    -               -
ETS       0.102 (+85%)     0.236 (+148%)    -               -
DeepAR    0.075 (+36%)     0.161 (+69%)     0.050 (+28%)    0.574 (+62%)
Seq2Seq   0.067 (+22%)     0.105 (+11%)     0.042 (+7%)     0.411 (+16%)
MQRNN     0.077 (+40%)     0.117 (+23%)     0.042 (+7%)     0.379 (+7%)
TFT       0.055            0.095            0.039           0.354
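The percentages in parentheses appear to be each model's loss relative to TFT's loss on the same dataset. This is an inference from the numbers rather than something the table states. The short check below reproduces the DeepAR row under that assumption; the dictionary keys and the helper name are hypothetical.

```python
# Losses copied from the table above, in column order.
tft = {"electricity": 0.055, "traffic": 0.095, "volatility": 0.039, "retail": 0.354}
deepar = {"electricity": 0.075, "traffic": 0.161, "volatility": 0.050, "retail": 0.574}

def relative_increase(loss, reference):
    """Percentage by which `loss` exceeds `reference`."""
    return 100.0 * (loss - reference) / reference

deepar_vs_tft = {name: round(relative_increase(deepar[name], tft[name]))
                 for name in tft}
print(deepar_vs_tft)
# {'electricity': 36, 'traffic': 69, 'volatility': 28, 'retail': 62}
```

Recomputing from the displayed values can differ from the table by a percentage point in places, presumably because the displayed losses are themselves rounded.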

As shown above, TFT outperforms all benchmarks across a variety of datasets. This holds for both point forecasts and uncertainty estimates: on average, TFT yields 7% lower P50 and 9% lower P90 losses, respectively, compared to the next-best model.
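For readers unfamiliar with P50 and P90 losses, the sketch below shows the standard quantile (pinball) loss, summed over all forecasts and normalized by the total magnitude of the targets. The normalization and the factor of 2 are assumptions about how such losses are commonly reported, not a detail stated in this post, and the toy numbers are made up for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile loss per element for quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return np.maximum(q * diff, (q - 1.0) * diff)

def normalized_quantile_loss(y_true, y_pred, q):
    """Assumed normalization: 2 * sum of pinball losses / sum of |targets|."""
    return 2.0 * pinball_loss(y_true, y_pred, q).sum() / np.abs(y_true).sum()

# Toy data (hypothetical): four targets with P50 and P90 forecasts.
y = np.array([10.0, 12.0, 9.0, 11.0])
p50 = np.array([11.0, 11.0, 10.0, 10.0])
p90 = np.array([13.0, 14.0, 12.0, 13.0])

p50_loss = normalized_quantile_loss(y, p50, 0.5)
p90_loss = normalized_quantile_loss(y, p90, 0.9)
print(p50_loss, p90_loss)
```

At q = 0.5 the pinball loss is half the absolute error, so the P50 loss measures point-forecast accuracy, while the P90 loss penalizes a 90th-percentile forecast nine times more heavily for falling below the target than for exceeding it.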

Interpretability Use Cases

We use three use cases to demonstrate how the design of TFT allows its individual components to be analyzed for improved interpretability.

Variable Importance

By observing the model's variable selection weights, one can see how different variables affect retail sales. For example, the largest weights for static variables were the specific store and item, while the largest weights for future variables were the promotion period and national holidays (see below).

Variable importance for the retail dataset. The 10th, 50th, and 90th percentiles of the variable selection weights are shown, with values greater than 0.1 displayed in bold purple.

Persistent Time Patterns

Visualizing persistent time patterns can help in understanding the time-dependent relationships present in a particular dataset. We identify such persistent patterns by measuring the contribution of features at fixed lags in the past to forecasts at various horizons. As shown below, attention weights reveal the most important past time steps on which TFT bases its decisions.

Persistent time patterns for the traffic dataset at the 10%, 50%, and 90% quantile levels (the forecast horizon is indicated in the plot). A clear periodicity is observed, with peaks separated by about 24 hours. That is, the model attends most to the time steps that fall at the same time of day on past days. This matches the expected daily traffic pattern.

The above shows the attention weight patterns over time, indicating how TFT learns persistent time patterns without any hard-coding. This capability helps build trust with users, because the output confirms known, expected patterns. Model developers can also use these patterns to improve their models, for example through specific feature engineering or data collection.
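One way to summarize such patterns is to collect the attention weights that many forecasts assign to each past time step and then take percentiles per lag. The sketch below does this on synthetic attention weights that are constructed to peak at multiples of 24 hours; the array shapes, the helper name, and the data are hypothetical stand-ins for weights extracted from a trained model.

```python
import numpy as np

def attention_percentiles(attn, levels=(10, 50, 90)):
    """attn: (num_forecasts, num_past_steps), each row summing to 1.

    Returns, for each percentile level, one value per past time step.
    """
    return {level: np.percentile(attn, level, axis=0) for level in levels}

rng = np.random.default_rng(0)
num_forecasts, lookback = 200, 72          # hypothetical: 72 hourly steps
lags = np.arange(lookback, 0, -1)          # column j looks back lags[j] hours

# Synthetic attention scores: higher at lags of 24, 48, and 72 hours.
scores = np.where(lags % 24 == 0, 3.0, 0.0) + rng.normal(
    scale=0.3, size=(num_forecasts, lookback))
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)    # normalize each row like a softmax

bands = attention_percentiles(attn)
median = bands[50]
top_lags = sorted(int(lag) for lag in lags[np.argsort(median)[-3:]])
print(top_lags)   # lags (in hours) with the highest median attention
```

Plotting the three percentile curves against the lag would give a picture in the spirit of the figure described above, with peaks at the lags the model attends to most consistently.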

Identifying Important Events

Identifying sudden changes can be useful, as temporary shifts can occur due to the presence of important events. TFT uses the distance between the attention pattern at each point and the average pattern to identify significant deviations. The figures below show that TFT can alter its attention between events: it places equal attention across past inputs when volatility is low, and attends more to sharp trend changes during periods of high volatility.

Event identification for S&P 500 realized volatility from 2002 through 2014.

Significant deviations in attention patterns can be observed before and after a period of high volatility, corresponding to the peaks observed at dist

Focusing on the periods around the 2008 financial crisis, the lower plot zooms in on the middle of an important event (evident from the increased attention on sharp trend changes), compared to the normal period in the upper plot (where attention is equal across the low-volatility period).

Event identification for S&P 500 realized volatility: a zoom of the above on the 2004 and 2005 period.

Event identification for S&P 500 realized volatility: a zoom of the above on the 2008 and 2009 period.
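The distance-based event identification described in this section can be sketched as follows. The post does not name the distance metric here, so this example assumes a distance derived from the Bhattacharyya coefficient between each attention pattern and the average pattern. The threshold (mean plus two standard deviations) and the synthetic data are likewise hypothetical choices for illustration.

```python
import numpy as np

def attention_distance(attn):
    """attn: (num_time_points, num_past_steps), each row summing to 1.

    Returns one distance per time point between that point's attention
    pattern and the average pattern (assumed Bhattacharyya-based metric).
    """
    mean_pattern = attn.mean(axis=0)
    overlap = np.sqrt(attn * mean_pattern).sum(axis=1)   # 1.0 means identical
    return np.sqrt(np.clip(1.0 - overlap, 0.0, None))

num_points, num_past_steps = 50, 10
# Calm regime: equal attention over all past steps.
attn = np.full((num_points, num_past_steps), 1.0 / num_past_steps)
# Synthetic "event" at points 20-22: attention concentrates on the latest step.
attn[20:23] = 0.01
attn[20:23, -1] = 0.91

dist = attention_distance(attn)
threshold = dist.mean() + 2.0 * dist.std()   # hypothetical threshold
events = np.flatnonzero(dist > threshold).tolist()
print(events)   # [20, 21, 22]
```

Time points whose distance exceeds the threshold are the ones where the model's attention departs most from its usual pattern, which is the signal used above to flag important events.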

Real-World Impact

Finally, TFT has been used to help retail and logistics companies with demand forecasting, both by improving forecast accuracy and by providing interpretability.

In addition, TFT has potential uses for climate-related challenges, for example reducing greenhouse gas emissions by balancing the supply and demand of electricity in real time, and improving the accuracy and interpretability of rainfall forecasts.

Conclusion

We present a new attention-based model for high-performance multi-horizon forecasting. In addition to improving performance across a range of datasets, TFT also contains specialized components for inherent interpretability, namely variable selection networks and interpretable multi-head attention. Using three interpretability use cases, we also show how these components can be used to extract insights into feature importance and temporal dynamics.

Acknowledgments

Thanks to Bryan Lim, Nicolas Loeff, Minho Jin, Yaguang Li, and Andrew Moore for their contributions.

