Reinforcement Learning: Applying AI in an Open Environment

Reinforcement learning is a family of machine learning methods, often built on neural networks, that handles open environments, a problem that other learning methods cannot solve well. Let’s review basic machine learning and then learn how to apply reinforcement learning through the Flappy Bird and Mario games.


Artificial Intelligence.
Feed forward, fully-connected neural network.
Feed forward and back propagation.
Supervised and Unsupervised Learning.
Types of Machine Learning.

Apply Reinforcement Learning in Flappy Bird

Requirements of core game:

  • python
  • numpy, pandas
  • pygame
import numpy as np

# activation functions and their derivatives
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def sigmoidDer(x):
    return sigmoid(x) * (1. - sigmoid(x))

def reluDer(x):
    return 1 * (x > 0)
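class net:
    # The class header and the first operand of each matrix product were lost
    # from the original listing; they are reconstructed here as a standard
    # 3-5-1 network, so treat the exact shapes as an assumption.
    def __init__(self, w1=None, w2=None):
        self.inputNode = 3
        self.outputNode = 1
        self.hiddenLayerNode = 5
        # fall back to random weights when none are supplied (assumption)
        self.w1 = np.random.rand(self.inputNode, self.hiddenLayerNode) if w1 is None else w1
        self.w2 = np.random.rand(self.hiddenLayerNode, self.outputNode) if w2 is None else w2
        self.hOut = None
        self.fOut = None

    def forward(self, X):
        # inputs to hidden layer
        hOut = sigmoid(np.dot(np.transpose(self.w1), X))
        self.hOut = hOut
        # final output
        fOut = sigmoid(np.dot(np.transpose(self.w2), hOut))
        self.fOut = fOut
        return fOut

    def backward(self, X, error):
        # propagate the output error back through both layers
        deltaError = error * sigmoidDer(self.fOut)
        hiddenError = np.dot(deltaError, np.transpose(self.w2))
        deltaHidden = hiddenError * sigmoidDer(self.hOut)
        self.w1 += np.outer(X, deltaHidden)
        self.w2 += np.outer(self.hOut, deltaError)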
Flappy Bird.
  • The vertical position of the next pipe
  • The distance to the next pipe
  • The height of the bird
import json

class Bot:  # class name assumed; the original header is missing
    def __init__(self):
        self.bestSOFar = 0
        self.score = 0
        self.gen = 0
        self.position = None
        self.brain = net()
        self.distance = None
        self.expectedPosition = None
        self.nextHole = None
        self.weight = np.random.rand(20, 1)

    def initialize(self, position, distance, nextHole, expectedPosition):
        self.position = position
        self.distance = distance
        self.nextHole = nextHole
        self.expectedPosition = expectedPosition

    def move(self):
        self.score += 1/30
        jump = self.think()
        # jump only when the network output reaches the threshold
        if jump >= 0.705:
            return True
        return False

    def think(self):
        inputLayer = [self.position, self.distance, self.nextHole]
        # scale the inputs by the largest absolute value
        maxValue = abs(max(max(inputLayer), min(inputLayer), key=abs))
        inputLayer = np.divide(inputLayer, maxValue)
        jump = self.brain.forward(inputLayer)
        return jump

    def learn(self):
        self.brain.backward([self.position, self.distance, self.nextHole], self.position - self.expectedPosition)
        # encode() is not shown in this article; it is expected to return the network weights
        self.weight = self.brain.encode()

    def printBotStat(self):
        # speed and upperPipe are not set in __init__; they are assumed to be set by the game loop
        print('stat:\n{}\n{}\n{}'.format(self.distance, self.speed, self.upperPipe))

    def dump(self):
        # convert the NumPy array to a list so that json can serialize it
        with open('weight.json', 'w') as fil:
            json.dump(np.asarray(self.weight).tolist(), fil)

    def increaseGen(self):
        self.gen += 1
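from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
# legacy Keras argument names; newer releases use units= and learning_rate=
for i in range(population):
    model = Sequential()
    model.add(Dense(output_dim=7, input_dim=3))
    sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss="mse", optimizer=sgd, metrics=["accuracy"])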
  • Step 1: Create 50 bots, each with its own neural network, and let them play. This changes the neural network setup a little: we also need an array of fitness values to know which bird achieved the best result.
  • Step 2: Choose the 2 bots with the best results, as ranked by the fitness function. For Flappy Bird, we use the survival time.
  • Step 3: Use the weights of these 2 bots to generate the next generation through crossover and mutation.
  • Repeat these steps again and again to find the best weights, without back propagation.
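
The loop described in these steps can be sketched roughly as follows. This is a minimal illustration, not the article's actual implementation: the helper names (`crossover`, `mutate`, `next_generation`), the mutation rate and scale, and the flat 20-element weight vector are assumptions, and the fitness values are random stand-ins rather than survival times measured in the game.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(parent_a, parent_b):
    """Uniform crossover: each gene is copied from one of the two parents."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(weights, rate=0.1, scale=0.5):
    """Add Gaussian noise to a random subset of genes (rate and scale are assumed)."""
    mask = rng.random(weights.shape) < rate
    return weights + mask * rng.normal(0.0, scale, weights.shape)

def next_generation(population, fitness):
    """Keep the two fittest bots and breed the rest of the population from them."""
    best_two = np.argsort(fitness)[-2:]
    parent_a, parent_b = population[best_two[0]], population[best_two[1]]
    children = [parent_a.copy(), parent_b.copy()]  # the parents survive unchanged
    while len(children) < len(population):
        children.append(mutate(crossover(parent_a, parent_b)))
    return np.array(children)

# 50 bots, each described by a flat vector of 20 weights (3*5 + 5*1)
population = rng.random((50, 20))
fitness = rng.random(50)  # stand-in for the survival time of each bot
new_population = next_generation(population, fitness)
```

Keeping the two parents unchanged in the new generation is one possible design choice; it guarantees that the best score found so far is never lost to an unlucky mutation.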
Deep Q Learning.
"-30_20_10": [124.90673597144435, 0]
  • -30: distance
  • 20: vertical difference = vertical coordinate of the pipe - vertical coordinate of the bird
  • 10: drop velocity
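
A table keyed this way can be read and updated with the standard tabular Q-learning rule. The sketch below is an assumption about how such a table might be used, not the article's code: the function names, the learning rate `alpha`, the discount `gamma`, the reward, the next state, and the reading of the two list entries as the values of "do nothing" (index 0) and "flap" (index 1) are all hypothetical.

```python
def state_key(distance, vertical_diff, velocity):
    """Build a key such as "-30_20_10" from the three state components."""
    return "{}_{}_{}".format(distance, vertical_diff, velocity)

def best_action(q_table, key):
    """Return the index of the action with the highest Q-value (ties favor index 0)."""
    values = q_table.setdefault(key, [0.0, 0.0])
    return 0 if values[0] >= values[1] else 1

def update(q_table, key, action, reward, next_key, alpha=0.7, gamma=0.95):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    values = q_table.setdefault(key, [0.0, 0.0])
    next_values = q_table.setdefault(next_key, [0.0, 0.0])
    target = reward + gamma * max(next_values)
    values[action] += alpha * (target - values[action])

q_table = {"-30_20_10": [124.90673597144435, 0]}
key = state_key(-30, 20, 10)
action = best_action(q_table, key)
update(q_table, key, action, reward=1.0, next_key=state_key(-26, 18, 9))
```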

Q-Learning characteristics

  • All past experience is stored in memory
  • The next action is determined by the maximum output of the Q-network
  • The loss function here is the mean squared error between the predicted Q-value and the target Q-value, Q*. This is basically a regression problem. However, we do not know the target or actual value here, as we are dealing with a reinforcement learning problem, so the target has to be estimated from the reward and the network's own prediction for the next state.
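
Because the true Q* is unknown, a common workaround is to build the target from the observed reward plus the discounted best Q-value predicted for the next state. The following is a minimal NumPy sketch of that target and the resulting mean squared error. The function names, `gamma`, and the example numbers are illustrative assumptions, and a real Deep Q-Network would obtain the Q-values from a neural network rather than from hard-coded arrays.

```python
import numpy as np

def td_targets(rewards, next_q, dones, gamma=0.99):
    """Target = r + gamma * max_a' Q(s', a'); terminal steps use the reward only."""
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)

def mse_loss(predicted_q, actions, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    chosen = predicted_q[np.arange(len(actions)), actions]
    return np.mean((chosen - targets) ** 2)

# A batch of two transitions with two actions each (made-up numbers)
predicted_q = np.array([[1.0, 2.0], [0.5, 0.0]])
next_q = np.array([[3.0, 1.0], [4.0, 5.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])  # the second transition ended the episode

targets = td_targets(rewards, next_q, dones, gamma=0.5)
loss = mse_loss(predicted_q, actions, targets)
```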

Conclusion on Q Learning

  • Deep Q Learning is a solution for a finite set of input states.
  • The reward is for surviving as long as possible, so the bird's goal is to stay alive. Given the current state, the bird itself evaluates the Q-value of each action based on the possible next states.
  • The genetic network performs quite well; however, many random factors affect its performance.
  • Deep reinforcement learning works well in an open environment because it explores the environment.


Apply Reinforcement Learning in Super Mario

Asynchronous Advantage Actor-Critic (A3C) Algorithm Idea

Explanation in layman’s terms (quoted from Viet Nguyen’s explanation)


Advantage Actor-Critic

Asynchronous Advantage Actor-Critic

  • Create 6 workers, each with its own network
  • Create 6 evaluators for the 6 workers, plus 1 for the global network, to see how well the bot is doing
  • We hope that each bot will “learn” a little from its environment and contribute to our global network
  • Use a convolutional neural network to process game frames as the input of A3C; we call each frame a state
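
The asynchronous part of this setup, several workers updating one shared global network, can be illustrated with a toy example. This is only a sketch under strong assumptions: the "network" is a 3-element weight vector, the "environment" is random noise, the loss is a simple quadratic, and the names `GlobalNet`, `worker`, and `TARGET` are hypothetical. It leaves out the actor, the critic, and the convolutional layers entirely.

```python
import threading
import numpy as np

TARGET = np.array([1.0, -2.0, 0.5])  # optimum of the toy loss (made up)

class GlobalNet:
    """Shared parameters that every worker reads from and writes to."""
    def __init__(self, size):
        self.weights = np.zeros(size)
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return self.weights.copy()

    def apply_gradient(self, grad, lr=0.05):
        with self.lock:
            self.weights -= lr * grad

def worker(global_net, seed, steps=200):
    """Each worker has its own random 'environment' and pushes its own gradients."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        local = global_net.snapshot()              # sync the local copy with the global net
        noise = rng.normal(0.0, 0.1, local.shape)  # stands in for environment randomness
        grad = (local - TARGET) + noise            # gradient of 0.5 * ||w - TARGET||^2
        global_net.apply_gradient(grad)            # update without waiting for other workers

global_net = GlobalNet(size=3)
threads = [threading.Thread(target=worker, args=(global_net, seed)) for seed in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all six workers finish, `global_net.weights` should end up close to `TARGET`, even though no worker ever waited for another one.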

Conclusion on A3C

  • A3C is a CPU-based algorithm that uses far fewer resources than many heavier methods.
  • By asynchronously launching more workers, you collect more training data in the same amount of time, which makes data collection faster.
  • Since every worker instance also has its own environment, you get more diverse data, which is known to make the network more robust and to generate better results. There is no experience replay, so it does not need much memory.


Extend the problem

  • In robotics: to efficiently find a combination of electrical signals to steer robotic arms (to perform an action) or legs (to walk)
  • In manufacturing: robots for package transportation or for assembling specific parts of cars
  • In military: among others, for logistics and to provide automatic assistance for humans in analyzing the environment before actions
  • In inventory management: to reduce transit time and space utilization, or to optimize dispatching rules
  • Power systems: to predict and minimize transmission losses
  • In the financial sector: for instance, in trading systems to generate alpha or to serve as an assistant that saves traders and analysts time
  • Steering an autonomous helicopter
  • For complex simulations: e.g. robo soccer
  • In marketing: for cross-channel campaign optimisation and to maximize the long-term profit of a campaign
  • AdTech: optimizing a strategy of pricing ads in order to maximize returns within a given budget
  • Energy management optimization: reducing data center energy cost by 40%!



Neurond AI is a transformation business.
