Date: 2024-08-13 | Sports
Momentum prediction models of tennis matches based on CatBoost regression and random forest algorithms
In this section, we are tasked with developing a model that captures momentum and describes the probability of scoring in a match. For this purpose, we adopted a stacking strategy and built the CBRF (CatBoost regression and random forest) prediction model to describe the match process. The flowchart of this model is shown in Fig. 4.
CatBoost regression prediction model based on decision trees
Decision tree regression model
Generating a decision tree involves repeatedly partitioning the training sample set, and the branches of the tree grow as the data are further segmented. The core technique in growing a decision tree is the selection of test attributes [12]. We use the data after dimensionality reduction as the independent variables and point victor as the dependent variable to train the decision tree regression prediction model and obtain its predictions.
When using machine learning algorithms for predictions, it is common to assess the accuracy of these predictions using various statistical metrics. These metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the Coefficient of Determination \(R^{2}\). The formulas for calculating these statistics are as follows:
$$\begin{aligned} MSE= & {} \frac{1}{N}{\textstyle \sum _{i=1}^{N}}(y_{i}-{\hat{y}}_{i})^{2},\\ RMSE= & {} \sqrt{\frac{1}{N}{\textstyle \sum _{i=1}^{N}}(y_{i}-{\hat{y}}_{i})^{2}},\\ MAE= & {} \frac{1}{N}{\textstyle \sum _{i=1}^{N}}\left| y_{i}-{\hat{y}}_{i}\right| ,\\ MAPE= & {} \frac{100\%}{N}{\textstyle \sum _{i=1}^{N}}\left| \frac{y_{i}-{\hat{y}}_{i}}{y_{i}}\right| ,\\ R^{2}= & {} 1-\frac{{\textstyle \sum _{i=1}^{N}}(y_{i}-{\hat{y}}_{i})^{2}}{{\textstyle \sum _{i=1}^{N}}(y_{i}-{\bar{y}})^{2}}. \end{aligned}$$
Here, \(y_{i}\) is the actual value of the \(i\)th sample, \({\hat{y}}_{i}\) is the predicted value of the \(i\)th sample, \(N\) is the total number of samples, and \({\bar{y}}\) is the mean of all actual values.
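As an illustration of these definitions, the following minimal Python sketch computes the five metrics from a list of actual values and a list of predictions. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study; MAPE is undefined when any actual value is zero, which does not arise if point victor is coded as 1 or 2.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R^2 for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual value, so every y_i must be non-zero.
    mape = 100.0 * sum(abs(e / y) for e, y in zip(errors, y_true)) / n

    y_bar = sum(y_true) / n                          # mean of the actual values
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}


# Made-up example values on the 1-to-2 point-victor scale.
metrics = regression_metrics([1, 2, 2, 1], [1.5, 2.0, 1.5, 1.0])
print(metrics)
```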
Using the formulas given above, the results can be calculated and presented as shown in Table 4:
The evaluation results in Table 4 show that the \(R^{2}\) value was too small. Therefore, we decided to optimize the decision tree regression prediction model and use the CatBoost regression prediction model for re-prediction.
CatBoost regression prediction model based on decision tree regression model
CatBoost is a framework that relies on symmetric decision trees as base learners, characterized by fewer parameters and support for multivariate analysis. Its primary advantage lies in efficiently and reasonably addressing prediction bias, thereby minimizing overfitting and improving the accuracy and generalizability of the models [13]. The expression for this model is:
$$\begin{aligned} x_{i,k} =\frac{ \sum _{j=1}^{p-1} [x_{\sigma _{j},k}=x_{\sigma _{p},k}]\cdot Y_{j}+ a\cdot P }{\sum _{j=1}^{p-1} [x_{\sigma _{j},k}=x_{\sigma _{p},k}]+a}. \end{aligned}$$
In the formula, \(\sigma_{j}\) denotes the \(j\)th sample in a random permutation \(\sigma\) of the training data; \(x_{i,k}\) is the discrete feature in the \(k\)th column of the \(i\)th row of the training dataset; \([\cdot ]\) equals 1 when its condition holds and 0 otherwise; \(a\) is the prior weight; and \(P\) is the prior term. The predictions of this model are evaluated, with the results presented in Table 5; the formulas for the evaluation metrics are given in the Decision tree regression model section.
Although this model fits the data significantly better than the decision tree regression model, its predictive performance is still not ideal. It performs well on the training data, but its performance deteriorates significantly on the test data, with larger prediction errors and lower accuracy. In particular, the coefficient of determination is very high on the training set but drops markedly on the test set, indicating overfitting.
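To make the encoding formula concrete, the sketch below applies it to one categorical column in plain Python: each sample is encoded using only the samples that precede it in the given order, plus a prior weighted by \(a\). The function name `ordered_target_statistic`, the default values of `a` and `prior`, and the example data are assumptions for illustration; this is not CatBoost's internal implementation.

```python
def ordered_target_statistic(categories, targets, order, a=1.0, prior=0.5):
    """Encode a categorical column with an ordered target statistic.

    categories: category value of each sample
    targets:    target value Y of each sample
    order:      a permutation of the sample indices (the sigma in the formula)
    a:          prior weight (assumed default)
    prior:      prior term P (assumed default)
    """
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of samples seen so far, per category
    encoded = [0.0] * len(categories)

    for idx in order:
        c = categories[idx]
        # Only samples earlier in the permutation contribute to this value.
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        # Update the running statistics after encoding the current sample.
        sums[c] = sums.get(c, 0.0) + targets[idx]
        counts[c] = counts.get(c, 0) + 1

    return encoded


# Made-up example: a two-level categorical feature and a binary target.
values = ordered_target_statistic(
    categories=["A", "B", "A", "A"],
    targets=[1, 0, 0, 1],
    order=[0, 1, 2, 3],
)
print(values)
```

Because a sample's own target never enters its encoded value, this construction limits the target leakage that a plain per-category mean would introduce.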
Building the CBRF prediction model based on the CatBoost regression model and the random forest regression model
To make the model more accurate, and taking into account that the serving side in a tennis match often has a higher chance of scoring [22], we include the weight of the serving side \(S_{i}\) as an additional variable in the CatBoost regression model. In this way, the prediction function of the model depends on both the original data \(X_{i}\) and the serving-side weight \(S_{i}\):
$$\begin{aligned} {\hat{y}}_{i}^{(CB)}=f_{CB}(X_{i},S_{i}).\end{aligned}$$
In this context, \({\hat{y}}_{i}^{(CB )}\) represents the predicted probability of winning, while \(f_{CB}\) is the prediction function constructed using the CatBoost regression model.
In addition, Random Forest is a supervised machine learning method that uses an ensemble of decision trees as base learners. It introduces randomness into the decision tree training process, which gives it strong resistance to overfitting and noise [16].
In order to make machine learning prediction more accurate, we stack the CatBoost regression model on the Random Forest regression model to construct the CBRF prediction model. In this model, we use the prediction results of the CatBoost regression model as input data to build the Random Forest regression model:
$$\begin{aligned} {\hat{y}}_{i}^{(CBRF)} =f_{CBRF} \left( {\hat{y}}_{i}^{(CB )},S_{i},X_{i}^{'} \right) . \end{aligned}$$
In this expression, \({\hat{y}}_{i}^{(CBRF )}\) represents the predicted value of the CBRF prediction model, and \(f_{CBRF}\) is the prediction function of the CBRF prediction model, which takes \({\hat{y}}_{i}^{(CB)}\), \(S_{i}\), and \(X_{i}^{'}\) as inputs.
In addition, we use three-fold cross-validation to improve the accuracy of model training and to assess the stability of the model during training. The ratio of the training set to the test set is 4:1.
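The two-stage structure can be sketched in a library-agnostic way: a first-stage model is fitted on \((X_{i},S_{i})\), and its predictions are then passed, together with \(S_{i}\) and the features, to a second-stage model. The sketch assumes only that each model exposes `fit` and `predict`; `fit_stacked` and `MeanRegressor` are hypothetical names, `MeanRegressor` is a toy placeholder rather than CatBoost or a random forest, and \(X_{i}^{'}\) is taken to be the same features as \(X_{i}\) because the text does not specify it.

```python
def fit_stacked(base_model, meta_model, X, S, y):
    """Fit base_model on [X_i, S_i], then meta_model on [base prediction, S_i, X_i].

    Returns a function predict(X_new, S_new) for the stacked model.
    """
    def base_rows(X_rows, S_vals):
        return [list(x) + [s] for x, s in zip(X_rows, S_vals)]

    def meta_rows(preds, X_rows, S_vals):
        return [[p, s] + list(x) for p, s, x in zip(preds, S_vals, X_rows)]

    # Stage 1: the base learner (CatBoost regression in the paper).
    base_model.fit(base_rows(X, S), y)
    base_pred = base_model.predict(base_rows(X, S))

    # Stage 2: the meta learner (random forest regression in the paper).
    meta_model.fit(meta_rows(base_pred, X, S), y)

    def predict(X_new, S_new):
        first = base_model.predict(base_rows(X_new, S_new))
        return meta_model.predict(meta_rows(first, X_new, S_new))

    return predict


class MeanRegressor:
    """Toy placeholder model: always predicts the mean training target."""

    def fit(self, rows, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, rows):
        return [self.mean_ for _ in rows]


# Made-up data: two features per point, a serve indicator, and point victor (1 or 2).
X = [[0.2, 1.0], [0.7, 0.0], [0.4, 1.0]]
S = [1, 0, 1]
y = [1, 2, 1]
predict = fit_stacked(MeanRegressor(), MeanRegressor(), X, S, y)
print(predict([[0.5, 1.0]], [1]))
```

With the relevant libraries installed, a CatBoost regressor and a random forest regressor could be passed in place of the two placeholders. A common variant, not described in the text, feeds the second stage with out-of-fold predictions from the first stage to limit leakage.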
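The data partitioning described here can be sketched as follows: the samples are split 4:1 into training and test sets, and the training set is then divided into three folds for cross-validation. The function names, the random shuffle, and the seed are assumptions; the text does not state how the split was drawn.

```python
import random


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: a random, seeded shuffle
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=3):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


train_idx, test_idx = train_test_indices(100)  # 80 training and 20 test samples
for fold_train, fold_val in k_fold(train_idx, k=3):
    # A model would be fitted on fold_train and scored on fold_val here.
    print(len(fold_train), len(fold_val))
```

Comparing the validation scores across the three folds gives a rough indication of how stable the model is during training.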
After training, we obtain the predicted results for point victor and compare them with the original values. Because the volume of original data is large, we select the predicted results of one match for visualization, as shown in Fig. 5.
The figure above shows that the predicted values are very close to the actual values.
Predictive evaluation analytics
The predicted model results are evaluated. The evaluation results are shown in Table 6. The formulas for calculating the parameters of the evaluation results are given in the Decision Tree Regression Model section.
The data in the table indicate that the values of MSE, RMSE, MAE and MAPE are all low, and \(R^{2}\) is close to 1. Therefore, the CBRF prediction model shows very good predictive performance on the dataset of men's singles matches from the second round of the 2023 Wimbledon Tennis Open. This indicates that the model can predict the direction of matches based on the momentum of the athletes, which supports the effectiveness of the CBRF prediction model in this task.
We use Python for visualization, and Fig. 6 shows that the prediction results for the training and test sets are similar. This indicates that the CBRF prediction model generalizes well and exhibits neither overfitting nor underfitting.
Visualization of score probability
By setting up the CBRF prediction model, we were able to determine who is likely to be the scorer as a result of momentum shifts. However, we are not yet able to visually represent which player is performing better at a specific moment in the match and to what extent. Therefore, we propose to construct the following model to describe this:
$$\begin{aligned} \theta _{1}= & {} \left( 2-{\hat{y}}^{(CBRF)} \right) \times 100\%,\\ \theta _{2}= & {} \left( {\hat{y}}^{(CBRF)}-1 \right) \times 100\%, \end{aligned}$$
where \(\theta_{1}\) represents the probability that player 1 wins and \(\theta_{2}\) represents the probability that player 2 wins. With point victor coded as 1 for player 1 and 2 for player 2, \({\hat{y}}^{(CBRF)}\) lies between 1 and 2, so the two probabilities are non-negative and sum to 100%.
We used Python to visualize the probabilities \(\theta_{1}\) and \(\theta_{2}\), as shown in Fig. 7.
Fig. 7 shows the probability of each player scoring at each moment of the match. For example, for player 1, the probability of winning is 3.72% at 1301 s, while at 4806 s it jumps to 95.74%.
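As a small worked example, the Python sketch below maps a prediction on the 1-to-2 point-victor scale to the two percentages. It takes \(\theta_{2}\) as \(({\hat{y}}^{(CBRF)}-1)\times 100\%\), so that both values are non-negative and sum to 100%. The function name `score_probabilities` and the clamping step are assumptions added for illustration.

```python
def score_probabilities(y_hat):
    """Map a prediction on the 1-to-2 scale to (theta1, theta2) in percent."""
    # Assumed safeguard: a regressor can stray slightly outside [1, 2].
    y = min(2.0, max(1.0, y_hat))
    theta1 = (2.0 - y) * 100.0  # probability that player 1 wins the point
    theta2 = (y - 1.0) * 100.0  # probability that player 2 wins the point
    return theta1, theta2


# A prediction of 1.25 is closer to 1, so it favours player 1.
theta1, theta2 = score_probabilities(1.25)
print(f"player 1: {theta1:.2f}%, player 2: {theta2:.2f}%")
```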
