Controlling Agents by Constrained Policy Updates

Authors

  • Mónika Farsang, Budapest University of Technology and Economics
  • Luca Szegletes, Budapest University of Technology and Economics

DOI:

https://doi.org/10.52846/stccj.2021.1.2.24

Keywords:

Control, Constrained policy, Proximal Policy Optimization, Reinforcement learning

Abstract

Learning the optimal behavior is the ultimate goal in reinforcement learning. This can be achieved by many different approaches, among the most successful of which are policy gradient methods. However, they can suffer from undesirably large policy updates, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper examines different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also investigate whether the analyzed methods can adapt not only to low-dimensional tasks but also to complex, high-dimensional problems in control and robotic domains. The analysis of the learned behavior shows that these methods can lead to better performance than the original PPO-Clip algorithm; moreover, they are also able to achieve complex behavior and policies in high-dimensional environments.
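For background, the clipping constraint referenced above comes from the PPO-Clip surrogate objective introduced by Schulman et al. (2017), which limits how far the new policy may move from the old one in a single update:

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}.

The sketch below is a minimal, illustrative implementation of this standard clipped loss only, not of the specific restriction strategies evaluated in the paper. It assumes PyTorch; the function name ppo_clip_loss and the fixed clipping range clip_eps = 0.2 are illustrative choices, not taken from the paper.

    import torch

    def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
        # ratio: pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) for each sampled transition
        # advantage: estimated advantage A_t for the same transitions
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        # PPO-Clip maximizes the pessimistic (minimum) of the two terms;
        # negate it to obtain a loss suitable for gradient descent.
        return -torch.min(unclipped, clipped).mean()

The restriction strategies examined in the paper adjust how tightly such updates are constrained; the exact variants are described in the full text.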

Published

2021-12-31

How to Cite

[1] M. Farsang and L. Szegletes, “Controlling Agents by Constrained Policy Updates”, Syst. Theor. Control Comput. J., vol. 1, no. 2, pp. 33–39, Dec. 2021, doi: 10.52846/stccj.2021.1.2.24.
Received 2021-11-28
Accepted 2021-12-30
Published 2021-12-31