Reinforcement learning with game

Jiaxin Tong
Dec 3, 2020

Reinforcement learning is on the rise. One of its challenges is that before we can train an agent, we need an environment for it to interact with. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms, and it provides such environments. Here is an implementation of reinforcement learning training on Gym's LunarLander-v2 environment. This is an episode before training.

There are four actions in the game: fire the left orientation engine, fire the main engine, fire the right orientation engine, or do nothing.
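As a small sketch, the four actions can be given names in code. The numbering below follows what we understand to be the discrete action indices of Gym's LunarLander-v2; the class name `LunarLanderAction` is our own and not part of Gym.

```python
from enum import IntEnum


class LunarLanderAction(IntEnum):
    """Hypothetical names for the four discrete LunarLander-v2 actions."""
    NOTHING = 0        # do nothing
    LEFT_ENGINE = 1    # fire the left orientation engine
    MAIN_ENGINE = 2    # fire the main engine
    RIGHT_ENGINE = 3   # fire the right orientation engine


# An agent's integer output can be turned back into a readable name.
print(LunarLanderAction(2).name)
```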

RL algorithms are based on the Markov Decision Process (MDP). The main components of an RL problem are the agent, the environment, a set of states (S), a set of actions (A), a state transition model P(s'|s, a), a reward (r = R(s, a)), and the episode. The value function V(s) is the expected long-term return, accumulated until the end of the episode, starting from state s. The Q-value or action-value function Q(s, a) is the expected long-term return, accumulated until the end of the episode, starting from state s at the current timestep and performing action a.

The Bellman equation is the theoretical core of most RL algorithms. It states that the value function at the current state equals the current reward plus the value function evaluated at the next state, discounted by γ.
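One common way to write this down is sketched here. It assumes a discount factor γ between 0 and 1, writes s' for the next state, and introduces a policy symbol π and an optimal action-value Q* that the text above does not name.

```latex
V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\; s' \sim P(\cdot \mid s, a)}
             \left[ R(s, a) + \gamma \, V^{\pi}(s') \right]

Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q^{*}(s', a')
```

The first line is the recursive statement for the value function; the second is the corresponding form for the best achievable action-value, which is the quantity Q-learning tries to estimate.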

The main approaches to RL are value-based methods, policy-based methods, and model-based methods. We use DQN (Deep Q-Network), which is value-based. In the Q-learning algorithm, the formula for updating the Q-value from experience is as follows:
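For reference, the standard tabular Q-learning update moves Q(s, a) toward the target r + γ max Q(s', a') by a step of size α. The sketch below shows that textbook rule, not necessarily the exact formula from the original figure; the function name, the learning rate `alpha`, and the default values are our own assumptions.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    alpha (learning rate) and gamma (discount) are illustrative defaults.
    """
    # Best value reachable from the next state; zero if the episode ended.
    best_next = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state, action]
    # Move the current estimate a fraction alpha toward the target.
    q_table[state, action] += alpha * td_error
    return td_error


# Tiny example: 3 states, 4 actions, all estimates start at zero.
q = np.zeros((3, 4))
q_learning_update(q, state=0, action=2, reward=1.0, next_state=1, done=False)
```

DQN keeps the same target but replaces the table with a neural network and a gradient step.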

DQN uses a neural network as a function approximator for Q(s, a). In one form, we feed the state and an action into the network and get back the Q-value of that action. This way we do not need to store Q-values in a table; the network generates them directly. In another form, we feed in only the state, the network outputs the values of all actions, and we then select the action with the maximum value as the next action, following the principle of Q-learning. The second form is the one generally used. The parameters of our DQN training function are as follows.
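The second form (state in, one Q-value per action out) can be sketched with a tiny NumPy network. This is only an illustration of the shapes involved, not our trained model: the hidden size, the random weights, and the names `q_values` and `greedy_action` are assumptions, and the state size of 8 with 4 actions matches LunarLander-v2 as we understand it.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HIDDEN = 8, 4, 64  # hidden size is an arbitrary choice

# Randomly initialised weights of a one-hidden-layer network (untrained).
params = {
    "w1": rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "w2": rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}


def q_values(params, state):
    """Map a state vector to one Q-value per action."""
    hidden = np.maximum(0.0, state @ params["w1"] + params["b1"])  # ReLU
    return hidden @ params["w2"] + params["b2"]


def greedy_action(params, state):
    """Pick the action with the largest predicted Q-value."""
    return int(np.argmax(q_values(params, state)))


state = rng.normal(size=STATE_DIM)  # stand-in for an environment observation
print(q_values(params, state).shape, greedy_action(params, state))
```

A single forward pass gives all action values at once, which is why this form is cheaper than querying the network once per action.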

n_episodes: maximum number of training episodes
max_t: maximum number of time steps per episode
eps_start: starting value of epsilon, for epsilon-greedy action selection
eps_end: minimum value of epsilon
eps_decay: multiplicative factor (per episode) for decreasing epsilon
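To make the three epsilon parameters concrete, here is a minimal sketch of a per-episode multiplicative decay together with epsilon-greedy selection. The helper names and the default values (1.0, 0.01, 0.995) are assumptions for illustration, not necessarily the settings we trained with.

```python
import random


def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps = eps_start
    schedule = []
    for _ in range(n_episodes):
        schedule.append(eps)
        # Shrink epsilon after every episode, but never below eps_end.
        eps = max(eps_end, eps_decay * eps)
    return schedule


def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


schedule = epsilon_schedule(n_episodes=2000)
print(schedule[0], schedule[-1])
```

Early episodes explore almost entirely at random; as epsilon decays toward eps_end, the agent relies more and more on its learned Q-values.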

We changed the hyperparameters to decrease the number of episodes needed. The environment was solved in 790 episodes in total, with an average score of 201.42. We also improve on this using Double DQN, which was introduced by DeepMind. Standard DQN tends to overestimate Q-values, because the target model both selects and evaluates the next action by always taking the maximum Q-value, which tends to be a little above the mean. To address this, Double DQN uses the online model instead of the target model to pick the best action in the next state, and uses the target model only to estimate the Q-value of that action. The training steps are as follows:
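The difference between the two targets can be sketched on a small batch. The function names, the discount `gamma=0.99`, and the example numbers are made up for illustration; `q_next_online` and `q_next_target` stand for the two networks' outputs on the next states.

```python
import numpy as np


def dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)


def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best_actions = q_next_online.argmax(axis=1)
    chosen = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * chosen


# Illustrative batch of two transitions; the second one ends the episode.
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_next_online = np.array([[1.0, 2.0], [0.0, 5.0]])
q_next_target = np.array([[3.0, 0.5], [4.0, 1.0]])

print(dqn_targets(rewards, dones, q_next_target))
print(double_dqn_targets(rewards, dones, q_next_online, q_next_target))
```

In the first transition the target network's largest value (3.0) belongs to an action the online network does not prefer, so the Double DQN target comes out lower than the standard one. This is the overestimation correction at work.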

With Double DQN, the environment was solved in 261 episodes in total, with an average score of 200.68.

Above is an episode after training.

We used DQN and Double DQN (DDQN) to train the agent. There are also other popular RL algorithms, such as Actor-Critic, Asynchronous Advantage Actor-Critic (A3C), Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO). Reinforcement learning is a huge and exciting area, and we look forward to exploring more of it.


Jiaxin Tong | LinkedIn

Abhishek Maheshwarappa | LinkedIn