Date: 2023-04-27

More recent applications in the field of complex control include reinforcement learning (RL) [16,17,18]. RL is one of the three basic paradigms of machine learning, alongside supervised learning and unsupervised learning. Machine learning applications have recently developed rapidly in the upstream field, transforming our ability to describe complex systems from observational data rather than from first-principles-based modeling.

Reinforcement learning differs from unsupervised learning: the latter focuses on extracting patterns and useful information hidden in unlabeled data, whereas reinforcement learning maps inputs to outputs. Supervised learning also maps inputs to outputs, but in reinforcement learning a learner and decision maker, called an agent, is trained to find the optimal policy through a reward signal that acts as feedback for the agent, which gives RL an advantage in optimization problems. Supervised learning and reinforcement learning are therefore applied for different purposes. While supervised learning can be used to solve nonlinear prediction and classification problems [19,20,21,22,23,24], reinforcement learning is used for data-driven optimization [25,26,27,28].

The application of machine learning techniques, especially reinforcement learning, has shown great promise for dynamic optimization tasks in recent years. The primary goals of these approaches are typically twofold: (1) to develop control policies that allow efficient mapping of sensor inputs to actuation outputs and (2) to support real-time decision making, even in complex and changing environments [29]. Machine learning is therefore well suited to high-dimensional nonlinear optimization problems that are difficult to model explicitly. Reinforcement learning, in particular, allows agents to learn from their interactions with the environment and adapt their behavior accordingly, making it a powerful tool for control and decision-making in complex systems.

For data-driven optimization, evolutionary algorithms could also be used. As mentioned above, however, these are meta-heuristic search algorithms that mirror the process of natural selection, in which the fittest individuals are selected for reproduction to produce the next generation of offspring, whereas reinforcement learning is structured as an agent interacting with an environment.

RL addresses a specific type of optimization problem in which the goal is to find policies (strategies) that maximize the cumulative reward (the return) while an agent interacts with an environment over time steps. Evolutionary algorithms, on the other hand, are self-learning algorithms that can be applied to any optimization problem in which solutions can be encoded, a fitness function can be defined to compare solutions, and those solutions can be modified stochastically.
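The three ingredients of an evolutionary algorithm named above (an encoding of solutions, a fitness function, and stochastic modification) can be illustrated with a minimal sketch. The toy objective, the population size, and the names `fitness`, `mutate`, and `evolve` are assumptions introduced purely for illustration; they do not come from the cited works.

```python
import random


def fitness(x):
    # Hypothetical toy objective with its maximum at x = 3.0.
    return -(x - 3.0) ** 2


def mutate(x, rng, sigma=0.5):
    # Stochastic modification: add Gaussian noise to a candidate solution.
    return x + rng.gauss(0.0, sigma)


def evolve(generations=200, pop_size=20, seed=0):
    rng = random.Random(seed)
    # Encoding: each candidate solution is a single real number.
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Reproduction: each parent yields one mutated offspring.
        offspring = [mutate(p, rng) for p in parents]
        population = parents + offspring
    return max(population, key=fitness)


best = evolve()
```

Note that nothing in this loop involves states, actions, or time steps: the search operates directly on candidate solutions, which is the structural difference from the agent-environment setting of RL.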

In this study, we focus on reinforcement learning applications in closed-loop reservoir management. In general, RL applications in engineering processes are still few and limited [18, 30, 31], as the respective algorithms are still in their development phases. Additionally, the research community is still working on validation and sharing issues [32]. In the field of energy specifically, the systems are too complex to be simulated, and it is also too risky to let the algorithms interact directly with the production systems [33]. However, RL algorithms may have a welcome role in the further development of artificial intelligence, and interest in these algorithms has therefore continued over the past decade [34]. Some researchers have looked into the application of reinforcement learning in the field of energy. These applications mainly had decision management or flow control objectives. Field applications fall mainly into two groups: control of autonomous drilling and production optimization.

In autonomous drilling control [35,36,37], the objective is to follow certain setpoints; the objective function is therefore often a penalty function, either on downhole pressure in the case of pressure-controlled drilling, or on landing position, final inclination, and maximum curvature in the case of controlled directional drilling positioning.
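A setpoint-tracking penalty of the kind described above might be sketched as follows. The quadratic form, the `weight` parameter, and the function name `pressure_tracking_reward` are illustrative assumptions; the cited studies may define their penalties differently.

```python
def pressure_tracking_reward(pressure, setpoint, weight=1.0):
    """Negative quadratic penalty: zero at the setpoint, more negative further away.

    The quadratic shape and the weight are assumptions made for illustration.
    """
    return -weight * (pressure - setpoint) ** 2
```

With this shape, an agent that maximizes the reward is driven to keep the downhole pressure as close to the setpoint as possible, and larger deviations are penalized disproportionately.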

In the context of production optimization, the general setup of closed-loop reservoir simulation consists of two main parts: (1) model-based data assimilation, which acts as an estimator of reservoir parameters and states, and (2) a model-based optimizer. As mentioned above, the task of the optimizer is to maximize the oil recovery factor or other desired economic criteria such as the net present value (NPV). The inputs required by the optimizer can be total field metrics, such as injection and production data, or volumetric data from grids, such as pressure and water saturation in well windows. The goals are often to maximize production, minimize waste, increase NPV, and reduce wasted time. Accordingly, all these applications used the NPV as the reward function for their agent to find the optimal policy. However, they differ in their objectives with respect to reservoir simulation forecasting.
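To make the use of NPV as a reward concrete, the following minimal sketch computes a discounted cash flow per control step and sums it over an episode. The prices, costs, discount rate, and the names `step_reward` and `episode_npv` are hypothetical placeholders, not values or definitions taken from the cited studies.

```python
def step_reward(oil_volume, water_volume, injected_volume, t_years,
                oil_price=60.0, water_cost=5.0, injection_cost=8.0,
                discount_rate=0.10):
    """Discounted cash flow of one control step (all economics are assumed values)."""
    cash_flow = (oil_price * oil_volume            # revenue from produced oil
                 - water_cost * water_volume       # cost of handling produced water
                 - injection_cost * injected_volume)  # cost of the injected fluid
    # Discount the cash flow back to time zero.
    return cash_flow / (1.0 + discount_rate) ** t_years


def episode_npv(steps):
    """NPV of an episode: the sum of the discounted per-step cash flows."""
    return sum(step_reward(*step) for step in steps)


# Each tuple: (oil_volume, water_volume, injected_volume, t_years)
steps = [(100.0, 20.0, 50.0, 0.0), (90.0, 30.0, 50.0, 1.0)]
total_npv = episode_npv(steps)
```

Defining the per-step reward as a discounted cash flow means that the undiscounted sum of rewards over an episode equals the NPV, so an agent maximizing its cumulative reward maximizes the NPV directly.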

For flood optimization projects, there have been three attempts [38,39,40]. These attempts differ in the definition of the state, in the complexity of the model, or in the reinforcement learning algorithm applied. Hourfar et al. and Ma et al. [38, 39] defined the state using the estimated residual oil, in percent, together with the water cut of the total produced fluid. In contrast, Miftakhov et al. [40] used pixels from the reservoir simulation model to define the state. This worked in their case, possibly because the model was so simple, but we expect such a state definition to suffer from the curse of dimensionality because of the large number of inputs involved.

For steam-assisted gravity drainage, unlike the SARSA algorithm used by Guevara et al. [15], we propose the use of the actor-critic reinforcement learning algorithm, which combines the advantages of value-based and policy-based approaches. In this algorithm, actions are generated by a policy neural network (the actor). This network is evaluated based on the corresponding change in state value. The state values are given by another neural network, called the critic network, which approximates the cumulative reward expected from the given state. The actor-critic algorithm is also recommended for non-Markovian competitive environments, where stochastic actions may be preferable to deterministic ones. It is also used for continuous action spaces, as in continuous robotic control [41].
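In its standard one-step textbook form, the interplay between the two networks described above can be written as follows. The notation is an assumption introduced here: $\theta$ and $w$ denote the actor and critic parameters, $\alpha$ and $\beta$ their learning rates, and $\gamma$ the discount factor. This is a sketch of the generic algorithm, not necessarily the exact variant used in this study.

```latex
\begin{align}
V_w(s) &\approx \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right], \\
\delta_t &= r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t), \\
w &\leftarrow w + \beta\, \delta_t \nabla_w V_w(s_t), \\
\theta &\leftarrow \theta + \alpha\, \delta_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{align}
```

The temporal difference error $\delta_t$ plays the role of the change in state value: a positive $\delta_t$ means the action turned out better than the critic expected, so the actor increases its probability, and a negative $\delta_t$ decreases it.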

There are various other applications, such as the use of reinforcement learning to optimize injection rates in carbon dioxide storage [42]. Unlike the traditional methods and evolutionary algorithms used in the literature [43, 44], the agent could maximize a reward constructed from risk and economic factors. There are also other applications in optimal well placement and selection of well types, and in the field of pressure transient test interpretation [45,46,47].

In this study, we present a novel approach to production optimization in reservoir simulation through the application of reinforcement learning. We focus specifically on the challenges associated with data-driven optimization, which differ from those of data-driven modeling. By prioritizing this focus, we emphasize the key message of our work: optimizing production through data-driven techniques requires a distinct set of considerations and approaches. We believe this focus on the unique challenges of data-driven optimization is crucial to advancing autonomous drilling and production optimization. Our work represents a significant contribution in this field.

In this work, model-free reinforcement learning (RL) is applied to steam injection rate optimization. It was chosen because it overcomes the shortcomings of the above methods in two ways. First, it does not require a complete description of the process to be optimized. Second, it takes advantage of past experiences or interactions with the environment to find an optimal injection rate policy that maximizes the net present value without human interference. Moreover, among reinforcement learning algorithms, we explore the use of advantage actor-critic (A2C). The advantages of such an algorithm are [48]:

1. Sample efficiency: Actor-critic is more sample efficient than some other reinforcement learning algorithms, such as Q-learning and SARSA. This is because it uses both the value function and the policy to learn, which can help reduce the number of interactions with the environment that are needed.

2. Convergence to the optimal policy: Actor-critic has a solid theoretical basis for convergence to the optimal policy, and it guarantees improvement with each iteration. This means that the performance of the policy will improve over time.

3. Handling non-stationary environments: Actor-critic can handle non-stationary environments, in which the distribution of the data changes over time. This is because the policy is updated using its own experience and the critic's value function is updated using the temporal difference error.

4. Ability to balance exploration and exploitation: Actor-critic can balance exploration and exploitation using the policy and the value function. The policy is used to explore the environment and discover new solutions, while the value function is used to exploit current knowledge of the environment and make decisions based on the expected return of each action.
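The roles of the policy and the value function listed above can be illustrated with a minimal tabular sketch. It assumes a hypothetical toy task with a single state, two actions, and episodes that end after one step, so the temporal difference error reduces to the reward minus the critic's estimate. The task, the learning rates, and the names `softmax` and `train_actor_critic` are illustrative assumptions and do not represent the reservoir setup or the A2C implementation used in this work.

```python
import math
import random


def softmax(prefs):
    # Numerically stable softmax over action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(steps=2000, actor_lr=0.1, critic_lr=0.1,
                       gamma=0.99, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 1.0]  # hypothetical toy task: action 1 is the better one
    prefs = [0.0, 0.0]    # actor: action preferences for the single state
    value = 0.0           # critic: estimated value of the single state
    for _ in range(steps):
        probs = softmax(prefs)
        # Exploration: the action is sampled from the stochastic policy.
        action = 0 if rng.random() < probs[0] else 1
        reward = rewards[action]
        next_value = 0.0  # the episode terminates after one step
        # Temporal difference error computed from the critic's estimates.
        td_error = reward + gamma * next_value - value
        # Critic update: move the value estimate toward the observed target.
        value += critic_lr * td_error
        # Actor update: gradient of log softmax, scaled by the TD error.
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += actor_lr * td_error * grad
    return softmax(prefs), value


probs, value = train_actor_critic()
```

Early in training both actions are sampled with similar probability, which corresponds to exploration; as the critic's estimate improves, the temporal difference error shifts probability toward the better action, which corresponds to exploitation.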