# Deceptive Path Planning with Reinforcement Learning

22 March 2019

Table of Contents

Applied Q-learning and deep Q-learning to solve deceptive path planning problems. A model-free solution, but reached above 90% performance of the model-based approach. Implemented with Python, NumPy and TensorFlow.

The deceptive path planning problem is to generate a path that the likelihood of an observer successfully identifying the destination is minimised.

Along the path, there is one point of particular interest, which is called the last deceptive point (LDP). After this point, all the sequent steps are no longer deceptive. The main objective of deceptive path planning is to make the LDP as late as possible.

## Related Work

Masters and Sardina ^{1} formalised the deceptive path planning problem and introduced the concepts of last deceptive point (LDP) and radius of maximum probability (RMP). They also proposed a model-based solution using the A* algorithm to find a path and the cost-difference algorithm to recognise the goal with maximised LDP and minimised path cost. This solution is effective but it is model-based thus can only work in known maps and environment.

### LDP and RMP

An observer makes a sequence of observations $O$ and produces a posterior probabilities $P(G|O)$ of goals. When the agent takes a step, if the probability of the real goal $P(g_r |O)$ is less than or equal to the probability of any other possible goal $P(g|O)$, $g \in G \setminus \{g_r\}$, this step is called a deceptive step. Otherwise, this is a truthful step.

For a sequence of steps in a path, there is one deceptive step of particular interest: after this step, all subsequent steps are truthful. This particular step is called last deceptive point (LDP). The primary goal of deceptive path planning is to delay the LDP as late as possible.

Apparently, the LDP in a certain problem $\Pi$ cannot be delayed until infinity. For the real goal $g_r$ , if the agent steps in to a range of certain radius, it will certainly be exposed to the observer. The radius can be formally quantified as the radius of maximum probability ($\text{RMP}_{g_r}$).

### Goal Recognition in Path Planning

The theoretical basis of the observer is goal recognition in path planning problems: a rational agent is always trying to find an optimal or sub-optimal path to the goal. One of the simple but effective approaches is to calculate the cost difference between the cheapest plan that incorporates the observations and the cheapest plan that avoids them. ^{2}
A probability distribution can be generated by comparing cost differences across all possible goals.

## Method

With reinforcement learning methods, a deceptive path planning solution consists of a agent and an observer.

### Q-Learning Based Agent

Q-Learning is a value-based reinforcement learning algorithm. The method aims to get good Q-values for state-action pairs through updating Q-values iteratively in the process of exploring the environment and receiving rewards from the environment.

The agent will receive three kinds of reward from the environment: a positive reward for reaching the real goal, a major negative reward for exposure of the real goal, and a minor negative reward for time cost. The exposure of the real goal is determined by comparing the real goal and the predicted real goal using cost-difference algorithm. The value of the reward scales in proportion to the size of the map.

Only the coordinates $(x,y)$ are taken into consideration for each state $s$. The path traversed from the start to the current position (moving history) and other factors are not included in the state, thus the state space could be kept in a feasible size.

```
def getAction(self, state):
if self.episodesSoFar<self.numTraining:
min_epsilon = 0.05
self.epsilon = min_epsilon + (1.0 - min_epsilon) * math.exp(-0.05*self.episodesSoFar/200)
else:
self.epsilon = 0.0
self.alpha = 0.0
legalActions = self.getLegalActions(state)
action = None
if not len(legalActions):
return action
randomAction = util.flipCoin(self.epsilon)
if randomAction:
action = random.choice(legalActions)
else:
action = self.getPolicy(state)
return action
def update(self, state, action, nextState, reward):
position = state.getPacmanPosition()
currentQValue = self.getQValue(state, action)
self.qValues[(position, action)] = \
(1 - self.alpha) * currentQValue + \
self.alpha * (reward + self.discount * self.getValue(nextState) )
```

With a fixed small exploration rate $\varepsilon$, the agent can easily find an optimal path rather than a deceptive path. Therefore, in the $\varepsilon$-greedy policy, $\varepsilon$ is progressively reduced starting from a big initial value $1.0$ so that the agent will explore the environment enough to learn a path both optimal and deceptive.

The deep Q-learning agent is an attempt of improving the Q-learning agent, by replacing the Q-tables with neural networks. The network consists of four layers: two convolutional layers then two fully connected layers. The input is map information represented by matrices (1 for existence and 0 for none), including the walls, obstacles, goals and agent state. The reward settings and the training process are similar to the Q-learning agent.

```
class DeepQNetwork:
def __init__(self, width, height, depth, action_size, alpha, gamma, ckpt_file=None, name='DQN'):
self.width = width
self.height = height
self.depth = depth
self.action_size = action_size
self.alpha = alpha
self.gamma = gamma
self.ckpt_file = ckpt_file
with tf.variable_scope(name):
self.sess = tf.Session()
self.x = tf.placeholder('float', [None, width, height, depth],name='x')
self.q_t = tf.placeholder('float', [None], name='q_t')
self.actions = tf.placeholder("float", [None, action_size], name='actions')
self.rewards = tf.placeholder("float", [None], name='rewards')
self.terminals = tf.placeholder("float", [None], name='terminals')
"""
Layer 1 (Convolutional)
"""
layer_name = 'conv1'
size = 3
channels = depth
filters = 16
stride = 1
self.w1 = tf.Variable(tf.random_normal([size,size,channels,filters], stddev=0.01),name=layer_name+'_weights')
self.b1 = tf.Variable(tf.constant(0.1, shape=[filters]),name=layer_name+'_biases')
self.c1 = tf.nn.conv2d(self.x, self.w1, strides=[1, stride, stride, 1], padding='SAME',name=layer_name+'_convs')
self.o1 = tf.nn.relu(tf.add(self.c1,self.b1),name=layer_name+'_activations')
"""
Layer 2 (Convolutional)
"""
layer_name = 'conv2'
size = 3
channels = 16
filters = 32
stride = 1
self.w2 = tf.Variable(tf.random_normal([size,size,channels,filters], stddev=0.01),name=layer_name+'_weights')
self.b2 = tf.Variable(tf.constant(0.1, shape=[filters]),name=layer_name+'_biases')
self.c2 = tf.nn.conv2d(self.o1, self.w2, strides=[1, stride, stride, 1], padding='SAME',name=layer_name+'_convs')
self.o2 = tf.nn.relu(tf.add(self.c2,self.b2),name=layer_name+'_activations')
o2_shape = self.o2.get_shape().as_list()
"""
Layer 3 (Fully connected)
"""
layer_name = 'fc3'
hiddens = 256
dim = o2_shape[1] * o2_shape[2] * o2_shape[3]
self.o2_flat = tf.reshape(self.o2, [-1,dim],name=layer_name+'_input_flat')
self.w3 = tf.Variable(tf.random_normal([dim,hiddens], stddev=0.01),name=layer_name+'_weights')
self.b3 = tf.Variable(tf.constant(0.1, shape=[hiddens]),name=layer_name+'_biases')
self.ip3 = tf.add(tf.matmul(self.o2_flat,self.w3),self.b3,name=layer_name+'_ips')
self.o3 = tf.nn.relu(self.ip3,name=layer_name+'_activations')
"""
Layer 4
"""
layer_name = 'fc4'
hiddens = action_size
dim = 256
self.w4 = tf.Variable(tf.random_normal([dim,hiddens], stddev=0.01),name=layer_name+'_weights')
self.b4 = tf.Variable(tf.constant(0.1, shape=[hiddens]),name=layer_name+'_biases')
self.y = tf.add(tf.matmul(self.o3,self.w4),self.b4,name=layer_name+'_outputs')
```

### Q-Learning Based Observer

Inspired by the model-based cost-difference observer, we build a Q-learning based observer which can also give a prediction of the real goal for each step the Pac-Man agent taken.

The observer should be trained for each map before the training of planning agents. Basically, the Q-learning based observer is a set of reinforcement learning agents that try to learn optimal but not deceptive paths to each possible goal. In the $\varepsilon$-greedy policy, the epsilon should be set to a value higher than usual, in order to enforce the exploration. This training process could be running in parallel so the observer could be trained in constant time despite the number of goals in the map.

For a map containing $n$ goals, the observer should have $n$ Q-tables corresponding to the goals in the map after training. Each Q-table contains the information about finding an optimal path to the corresponding goal. When the observer observe a move of agent from state $s_0$ to $s_0'$, it can calculate the coordinate change and infer the action $a_0$ taken by the agent. Then the observer will search through all $n$ Q-tables and find out in which Q-table $\arg\max_{a} Q_i(s,a)=a_0$. The returned index $i$ is also the index of goals that the observer predicts to be the real goal.

## Evaluation

The experimental environment is a Pac-Man Game based on the UC Berkeley CS188 AI Project. A Pac-Man is acting as the planning agent following a deceptive path to eat the food dots. Six $25\times25$ game maps are generated manually, each containing 2-4 goals.

The main goal of this experiment is to test the deceptivity that the paths each strategies produced and compare the performance of those strategies. There are two aspects of metrics for evaluation, the relative efficiency (time / cost) of the each strategy and the effectiveness (deceptivity) of paths they can produce.

Path cost | Gen. time (s) | 10% | 25% | 50% | 75% | 90% | 99% | |
---|---|---|---|---|---|---|---|---|

No-observer | 24.0 | 346.90 | 55 | 51 | 48 | 44 | 42 | 42 |

Q-learning based observer | 34.8 | 313.01 | 100 | 100 | 100 | 99 | 96 | 93 |

PG based observer | 36.4 | 478.05 | 100 | 100 | 100 | 98 | 96 | 91 |

NN Classifier Observer | 30.9 | 376.54 | 100 | 97 | 96 | 93 | 90 | 87 |

# Episodes | Time (min) | |
---|---|---|

Q-learning Observer | 2000 | 1.42 |

Policy Gradient Observer | 10000 | 104.63 |

NN-Observer | 3000 | 2.54 |

The Q-learning agent works well with the Q-learning based observer. This combination achieves a good balance between path deceptivity and computational complexity. The space complexity of Q-table based agent is in proportional to the area of the map, which is feasible for all test cases. While a step limit is set for each episode, the training time is stable for all maps in similar sizes.

For all test cases, all predictions from the Q-learning based observer are compared with a model-based cost-difference observer which is running simultaneously in background. (The predictions from model-based observer are recorded for comparison but not for agent training.) To make a quantitative comparison, we count the number of different predictions in each path and or correct percentage above 100% for more accurate predictions than the model-based observer, percentage below 100% for less accurate predictions.

The Q-learning based observer achieved similar performance as the model-based observer. One of the advantages of Q-learning based observers is that it makes predictions not only based on the position of the Pac-Man, but also based on the last action that the Pac-Man taken. Compared to the model-based cost-difference observers, Q-learning based observer is more likely to detect deceptive behaviours earlier by combining the information of Pac-Man’s moving direction. However, because it learns this information by stochastic simulation of Pac-Man’s behaviours, the observer makes predictions with more uncertainty and inconsistency. From a case by case analysis of each map, we can observe some unexpected isolated correct predictions made by the Q-learning observer, which are rarely seen in the results of model-based observers.

## Conclusion & Future Work

In this project, we have achieved a model-free approach solving the deceptive path planning problem with reinforcement learning. We first developed a non-observer solution with Q-learning, which can produce paths of around 50% deceptivity. Then we improved the deceptivity by introducing a separate reinforcement learning observer. Three different algorithms have been implemented and evaluated, including Q-learning, policy gradient and neural network classifier, and each of them can increase the overall deceptivity to about 90%.

In future work, we will explored the potential of neural network based approach, by improving the structure of networks and tuning the parameters with more experiments. We also wish to adopt the idea from GAN and train the agent and observer simultaneously.

*This project was finished with J. Li, C. Yu, and Y. Wang, under supervision of Prof. Tim Miller. The source code related to this project is hosted on GitHub, but is not publicly available yet.*