This thesis proposes a novel algorithm for use in reinforcement learning problems
where a stochastic time delay is present in an agent's reinforcement signal. In these
problems, the agent does not necessarily receive a reinforcement immediately after the
action that caused it. We relax constraints imposed in the previous literature by allowing
rewards to reach the agent out of order, or even to overlap with one another.
The algorithm combines Q-learning and hypothesis testing to enable the agent to
learn about the delay itself. A proof of convergence is provided. The algorithm is
tested in a grid-world simulator in MATLAB, in the Webots mobile-robot simulator,
and in an experiment with a real e-Puck mobile robot. In each of these test beds,
the algorithm is compared to Watkins' Q-learning, of which it is an extension. In all
cases, the novel algorithm outperforms Q-learning in situations where reinforcements
are variably delayed.