The goal of the agent training is to adjust the parameters of the ANN so that the ANN ideally predicts the consequences of a potential action up to the end of the game. The ideal ANN should output a value that correctly predicts the chances of the current player to win the game. The second output should ideally be a sharp probability distribution over the possible actions, with its maximum at the action that maximizes the chances to win the game. The training procedure for approaching such behavior is outlined in the following. At the start of the training procedure, the ANN is initialized with random weights.
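A minimal sketch of such a two-headed ANN (one value output and one distribution over actions) is shown below. It assumes a PyTorch-style model; the names (`PolicyValueNet`, `state_dim`, `n_actions`) and the layer sizes are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Illustrative two-headed ANN: shared body, action-distribution head, value head."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared body that encodes the state vector (weights are randomly initialized).
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Head for the distribution over possible actions.
        self.action_head = nn.Linear(hidden, n_actions)
        # Head for the scalar value (estimated chance of the current player to win).
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        p = torch.softmax(self.action_head(h), dim=-1)  # ideally a sharp distribution
        v = torch.tanh(self.value_head(h))              # value bounded to [-1, 1] (assumed scaling)
        return p, v
```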
In training, the agent plays a large number of games against itself. The given feed stream(s) in the games can be varied randomly to obtain an agent that is able to solve a broad class of problems. For example, if an agent is desired that can separate a quaternary mixture for all possible feed compositions, then the feed compositions should be sampled broadly over the entire composition space during training.
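One way to sample feed compositions broadly is to draw the mole fractions uniformly from the composition simplex; the sketch below does this for a quaternary mixture with a flat Dirichlet distribution (the helper name `sample_feed_composition` is hypothetical and not part of the original method):

```python
import numpy as np

rng = np.random.default_rng()

def sample_feed_composition(n_components: int = 4) -> np.ndarray:
    """Draw mole fractions uniformly from the composition simplex (they sum to 1)."""
    # A Dirichlet distribution with all concentration parameters equal to 1
    # is the uniform distribution over the simplex.
    return rng.dirichlet(np.ones(n_components))

# Example: a random quaternary feed composition for one self-play game.
z_feed = sample_feed_composition()
```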
At the beginning of every game, the search tree is initialized with the given feed(s). The agent then plays the game until the end, i.e., until both players have terminated their flowsheets. In doing so, every decision made in step 4 of the tree search is stored, namely the state vector at the root and the decision vector. After the game is finished, this decision data is augmented by the final reward obtained at the end of the game. The resulting tuples (state vector, decision vector, final reward) are stored in a memory of fixed size; if the memory is full, the oldest tuples are replaced.
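A simplified sketch of such a memory is given below; the fixed size `MEMORY_SIZE` and the helper `store_game` are illustrative assumptions, not the authors' implementation:

```python
from collections import deque

MEMORY_SIZE = 100_000                # assumed value; the paper's memory size is not reproduced here
memory = deque(maxlen=MEMORY_SIZE)   # when full, appending drops the oldest tuples automatically

def store_game(game_records, final_reward):
    """Append one finished game: each stored decision becomes a (state, decision, reward) tuple."""
    for state_vector, decision_vector in game_records:
        memory.append((state_vector, decision_vector, final_reward))
```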
After every game, a batch of tuples is sampled randomly from the memory. With this batch, the parameters of the ANN are optimized using stochastic gradient descent (SGD) [31]. Two optimization steps are performed, one with respect to each of the following two loss functions: