Kamil Ciosek


MSR Cambridge
21 Station Rd,
Cambridge CB1 2FB, UK

About Me

I am a machine learning (ML) researcher with a focus on reinforcement learning (RL). My work on solving MDPs has led me to believe that the most promising way of achieving a more powerful artificial intelligence is to combine insights from previously separate fields. For example, RL can be thought of as a sample-based variant of classical control. Similarly, deep learning, originally developed in the context of supervised learning, has become virtually ubiquitous throughout ML. More generally, I also have an interest in optimization, probabilistic modelling and statistics.

Expected Policy Gradients

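My most significant result to date (AAAI-18 paper with Prof. Whiteson) has been the development of expected policy gradients (EPG), a method which unifies stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. To see how it works, denote the policy pdf as \( \pi \) (with parameters \( \theta \)), the critic as \( \hat{Q} \), the baseline as \( b(s) \) and the discounted state distribution as \( \rho \), and consider the integral for the policy gradient update:

\[ \nabla_\theta J = \int_s \rho(s) \int_a \nabla_\theta \pi(a \mid s) \left( \hat{Q}(s, a) - b(s) \right) da \, ds. \]

Now, classical stochastic policy gradients (SPG) estimate this integral by sampling both states and actions:

\[ \nabla_\theta J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi(a_i \mid s_i) \left( \hat{Q}(s_i, a_i) - b(s_i) \right), \qquad s_i \sim \rho, \; a_i \sim \pi(\cdot \mid s_i). \]

Inspired by Expected Sarsa, the main idea behind EPG is to notice that the inner integral

\[ I(s) = \int_a \nabla_\theta \pi(a \mid s) \left( \hat{Q}(s, a) - b(s) \right) da \]

is given entirely in terms of known quantities and hence can be solved. EPG then uses policy gradients of the form:

\[ \nabla_\theta J \approx \frac{1}{N} \sum_{i=1}^{N} I(s_i), \qquad s_i \sim \rho. \]

A completely analytic solution for \( I(s) \) is possible for a Gaussian policy and a quadric critic (see paper for approximations if that is not the case). Since the Monte-Carlo estimator over actions is no longer required, we obtain a reduction in variance.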
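The variance reduction can be illustrated with a minimal sketch for a single fixed state. It assumes a one-dimensional Gaussian policy \( \mathcal{N}(\mu, \sigma^2) \) whose mean \( \mu \) is the only learned parameter, a zero baseline, and a critic that is quadratic in the action. The coefficients `q2`, `q1`, `q0` and all function names are hypothetical and chosen for illustration only; they are not taken from the paper. Under these assumptions the integral over actions has the closed form \( 2 q_2 \mu + q_1 \), which the code compares against a score-function Monte Carlo estimate of the same quantity.

```python
import numpy as np


def critic(action, q2, q1, q0):
    """Hypothetical quadratic critic Q(s, a) for one fixed state s."""
    return q2 * action**2 + q1 * action + q0


def analytic_integral(mu, sigma, q2, q1, q0):
    """Closed-form gradient of E_{a ~ N(mu, sigma^2)}[Q(s, a)] with respect to mu."""
    # E[Q] = q2 * (mu^2 + sigma^2) + q1 * mu + q0, so d/dmu E[Q] = 2 * q2 * mu + q1.
    return 2.0 * q2 * mu + q1


def sampled_integral(mu, sigma, q2, q1, q0, n_samples, rng):
    """Score-function (SPG-style) Monte Carlo estimate of the same gradient."""
    actions = rng.normal(mu, sigma, size=n_samples)
    score = (actions - mu) / sigma**2  # d/dmu log N(a; mu, sigma^2)
    return float(np.mean(score * critic(actions, q2, q1, q0)))


rng = np.random.default_rng(0)
params = dict(mu=0.5, sigma=1.0, q2=-1.0, q1=2.0, q0=0.0)

exact = analytic_integral(**params)
# Repeat the 10-sample estimator many times to see how much it fluctuates.
estimates = [sampled_integral(**params, n_samples=10, rng=rng) for _ in range(1000)]

print("analytic value:", exact)
print("mean of sampled estimates:", np.mean(estimates))
print("std of sampled estimates:", np.std(estimates))
```

The analytic value is deterministic for a given state, whereas the sampled estimates scatter around it; that scatter is the variance that the analytic solution removes.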

EPG also leads to an interesting result on exploration: it turns out that it is a good idea to explore with a Gaussian policy whose covariance is proportional to \(e^H\), where \(H\) is the scaled Hessian of the critic with respect to the actions.
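As a rough sketch of what such a covariance looks like, the snippet below computes the matrix exponential of a symmetric Hessian through its eigendecomposition. The function name `exploration_covariance` and the parameters `scale` and `sigma0_sq` (the scaling of the Hessian and the proportionality constant) are assumptions made for this illustration, not names from the paper.

```python
import numpy as np


def exploration_covariance(hessian, scale=1.0, sigma0_sq=1.0):
    """Return sigma0_sq * exp(scale * H) for a symmetric Hessian H.

    `scale` and `sigma0_sq` are hypothetical tuning constants.
    """
    h = 0.5 * (hessian + hessian.T)  # symmetrise against numerical noise
    eigvals, eigvecs = np.linalg.eigh(scale * h)
    # exp(V diag(l) V^T) = V diag(exp(l)) V^T for a symmetric matrix.
    return sigma0_sq * (eigvecs * np.exp(eigvals)) @ eigvecs.T


# Hypothetical critic Hessian: sharply curved in the first action
# dimension, almost flat in the second.
hessian = np.diag([-2.0, -0.1])
covariance = exploration_covariance(hessian)
print(covariance)
```

Because the exponential of a symmetric matrix is always positive definite, the result is a valid covariance whatever the sign of the curvature. In this example the policy explores less along the action dimension in which the critic is sharply peaked and more along the dimension in which it is nearly flat.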

On the theoretical side, EPG also provides a general framework for reasoning about policy gradient methods, which I used to establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases.

An extended version of this result has been submitted for review for publication in JMLR.

My AAAI-18 talk about EPG can be found on YouTube.

Publications

  1. Fellows, M., Ciosek, K., & Whiteson, S. (2018). Fourier Policy Gradients. In ICML. arXiv:1802.06891.
  2. Paul, S., Chatzilygeroudis, K., Ciosek, K., Mouret, J.-B., Osborne, M. A., & Whiteson, S. (2018). Alternating Optimisation and Quadrature for Robust Control. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).
  3. Ciosek, K., & Whiteson, S. (2018). Expected Policy Gradients. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). (Featured above.)
  4. Ciosek, K., & Whiteson, S. (2018). Expected Policy Gradients for Reinforcement Learning. Submitted for review for JMLR. arXiv:1801.03326.
  5. Ciosek, K., & Whiteson, S. (2017). OFFER: Off-Environment Reinforcement Learning. The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17).
  6. Ciosek, K., & Whiteson, S. (2016). Off-Environment RL with Rare Events. NIPS Workshop on Optimizing the Optimizers.
  7. Ciosek, K. (2015). Linear Reinforcement Learning with Options. (Ph. D. thesis). University College London.
  8. Ciosek, K., & Silver, D. (2015). Value Iteration with Options and State Aggregation. In Proceedings of the 5th Workshop on Planning and Learning, ICAPS.
  9. Silver, D., & Ciosek, K. (2012). Compositional Planning Using Optimal Option Models. In ICML.
  10. Ciosek, K., & Kotowski, P. (2009). Generating 3D Plants using Lindenmayer System. In GRAPP (pp. 76–81).

Research for my Ph.D.

An overarching idea of my Ph.D. research (pdf) was the study of model-based RL with linear transition models. I used the concept of the joint spectral radius to develop a condition for when RL with compressed models is stable in a certain sense. I also gave a formula that quantifies the situation where a model compressed with a linear projection is optimal, which turns out to be an instance of the algebraic Riccati equation. Moreover, I described a range of well-known RL algorithms in a common matrix framework.

Second, I provided a detailed analysis of the Least Squares Temporal Differences (LSTD) algorithm. I also provided some new insights as to why LSTD can be considered to be an oblique projection. Next, my work assembled in one place the several different optimization objectives equivalent to LSTD and provided an analysis of the differences between LSTD and Bellman Residual Minimization. Finally, I discussed the episodic version of LSTD.
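For readers unfamiliar with LSTD, the standard batch form of the algorithm can be sketched in a few lines. This is the textbook formulation rather than anything specific to the thesis; the function name `lstd`, the argument names and the optional `ridge` regulariser are assumptions made for this example. Given feature matrices for the visited states and their successors, LSTD solves the linear system \( \Phi^\top (\Phi - \gamma \Phi') \, w = \Phi^\top r \) for the weight vector \( w \).

```python
import numpy as np


def lstd(features, next_features, rewards, gamma, ridge=0.0):
    """Batch LSTD: solve Phi^T (Phi - gamma * Phi') w = Phi^T r for w.

    `ridge` is an optional, hypothetical regulariser for ill-conditioned data.
    """
    a_matrix = features.T @ (features - gamma * next_features)
    b_vector = features.T @ rewards
    a_matrix = a_matrix + ridge * np.eye(a_matrix.shape[0])
    return np.linalg.solve(a_matrix, b_vector)


# Toy two-state cycle with one-hot (tabular) features:
# state 0 -> state 1 with reward 1, state 1 -> state 0 with reward 0.
features = np.eye(2)
next_features = np.array([[0.0, 1.0], [1.0, 0.0]])
rewards = np.array([1.0, 0.0])

weights = lstd(features, next_features, rewards, gamma=0.5)
print(weights)  # with tabular features these are the state values
```

With tabular features the solution coincides with the exact value function of the toy chain, here \( V(0) = 4/3 \) and \( V(1) = 2/3 \); with fewer features than states the same system yields the projected fixed point.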

Furthermore, I combined approximate value iteration and learning with options, for the first time in a convergent way. Using a new version of the Bellman optimality equation, option models can be composed with one another. By using them as soon as they are constructed, large jumps in the state-space become possible. If combined with state aggregation, this produces faster convergence, helping scale options to medium-sized MDPs.
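The composition of option models can be sketched as follows, under the assumption that an option model is stored as a pair: a vector of expected discounted rewards and a matrix of discounted transition probabilities. This representation, the function names and the toy numbers are illustrative assumptions and may differ from the formulation used in the thesis. Applying a model to a value vector \( v \) gives \( r + P v \), so executing one option and then another corresponds to the model \( (r_1 + P_1 r_2, \; P_1 P_2) \).

```python
import numpy as np


def apply_model(model, values):
    """Back up a value vector through an option model (reward, transition)."""
    reward, transition = model
    return reward + transition @ values


def compose(first, second):
    """Model of executing `first` and then `second`."""
    r1, p1 = first
    r2, p2 = second
    return r1 + p1 @ r2, p1 @ p2


gamma = 0.9
# Hypothetical one-step models on a three-state chain; state 2 is absorbing.
step = (
    np.array([0.0, 1.0, 0.0]),
    gamma * np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),
)
two_steps = compose(step, step)  # a longer "jump" built from shorter ones

values = np.array([0.5, -1.0, 2.0])
print(apply_model(two_steps, values))
print(apply_model(step, apply_model(step, values)))  # same backup, done in two stages
```

A composed model performs in one backup what would otherwise take several, which is the sense in which larger jumps in the state space become available as soon as a model has been built.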

Earlier Work

Before starting my Ph.D., I did some early work on 3D graphics for my B. Sc. thesis (pdf). My first experience with Machine Learning was writing a Master’s thesis on the semi-supervised paradigm in datasets containing graphs as elements (pdf in German).