Paweł Wawrzyński

P.Wawrzynski, "ASD+M: Automatic parameter tuning in stochastic optimization and on-line learning", *Neural Networks,* vol. 96, pp. 1-10, 2017.

P.Wawrzynski, "Robot’s Velocity and Tilt Estimation Through Computationally Efficient Fusion of Proprioceptive Sensors Readouts", *Proceedings of the 10th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 738-743, 2015.

M.Majczak, P.Wawrzynski, "Comparison of two efficient control strategies for two-wheeled balancing robot", *Proceedings of the 10th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 744-749, 2015.

P.Wawrzynski, "Control policy with autocorrelated noise in reinforcement learning for robotics", *International Journal of Machine Learning and Computing,* Vol. 5, No. 2, pp. 91-95, IACSIT Press, 2015.

P.Wawrzynski, J.Mozaryn, J.Klimaszewski, "Robust estimation of walking robots velocity and tilt using proprioceptive sensors data fusion", *Robotics and Autonomous Systems,* Elsevier, Vol. 66, pp. 44-54, 2015.

J.Mozaryn, J.Klimaszewski, D.Swieczkowski-Feiz, P.Kolodziejczyk, P.Wawrzynski, "Design process and experimental verification of the quadruped robot wave gait", *Proceedings of the 9th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 206-211, 2014.

P.Wawrzynski, "Reinforcement Learning with Experience Replay for Model-Free Humanoid Walking Optimization," *International Journal of Humanoid Robotics,* Vol. 11, No. 3, pp. 1450024, 2014.

P.Wawrzynski, *Podstawy sztucznej inteligencji,* Oficyna Wydawnicza Politechniki Warszawskiej, 2014. (Eng. *Fundamentals of artificial intelligence,* Publishing House of Warsaw University of Technology, 2014.)

B.Papis, P.Wawrzynski, "dotRL: A platform for rapid Reinforcement Learning methods development and validation," *Proceedings of the Federated Conference on Computer Science and Information Systems,* pp. 129-136, 2013.

P.Wawrzynski, J.Mozaryn, J.Klimaszewski, "Robust velocity estimation for legged robot using on-board sensors data fusion," *Proceedings of the International Conference on Methods and Models in Automation and Robotics (MMAR), August 26-29, 2013, Międzyzdroje, Poland,* pp. 717-722, IEEE, 2013.

P.Suszynski, P.Wawrzynski, "Learning population of spiking neural networks with perturbation of conductances," *Proceedings of the International Joint Conference on Neural Networks, August 4-9, 2013, Dallas TX, USA,* pp. 332-337, IEEE, 2013.

P.Wawrzynski, A.K.Tanwani, "Autonomous Reinforcement Learning with Experience Replay," *Neural Networks 41,* pp. 156-167, Elsevier, 2013.

P.Wawrzynski, *Sterowanie Adaptacyjne i Uczenie Maszynowe - preskrypt wykładu,* Politechnika Warszawska, 2012. (Eng. *Adaptive Control and Machine Learning - printed series of course lectures,* Warsaw University of Technology, 2012.)

P.Wawrzynski, "Autonomous Reinforcement Learning with Experience Replay for Humanoid Gait Optimization," Proceedings of the International Neural Network Society Winter Conference (INNS-WC2012), pp. 205-211, Elsevier, 2012.

P.Wawrzynski, B.Papis, "Fixed point method for autonomous on-line neural network training," *Neurocomputing 74,* pp. 2893-2905, Elsevier, 2011.

P.Wawrzynski, "Fixed Point Method of Step-size estimation for on-line neural network training," *Proceedings of WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010, Barcelona, Spain,* IEEE, pp. 2012-2017.

P.Wawrzynski, *Systemy adaptacyjne i uczące się - preskrypt wykładu,* Oficyna Wydawnicza Politechniki Warszawskiej, 2009. (Eng. *Adaptive and learning systems - printed series of course lectures,* Publishing House of Warsaw University of Technology, 2009.)

P.Wawrzynski, "Real-Time Reinforcement Learning by Sequential Actor-Critics and Experience Replay," *Neural Networks 22,* pp. 1484-1497, Elsevier, 2009.

P.Wawrzynski, "A Cat-Like Robot Real-Time Learning to Run," *Lecture Notes in Computer Science 5495,* pp. 380-390, Springer-Verlag, 2009.

P.Wawrzynski, J. Arabas, P. Cichosz, "Predictive Control for Artificial Intelligence in Computer Games," *Lecture Notes in Artificial Intelligence 5097,* pp. 1137-1148, Springer-Verlag, 2008.

P.Wawrzynski, A.Pacut, "Truncated Importance Sampling for Reinforcement Learning with Experience Replay," *Proceedings of the International Multiconference on Computer Science and Information Technology,* pp. 305-315, 2007.

P.Wawrzynski, "Learning to Control a 6-Degree-of-Freedom Walking Robot," *Proceedings of EUROCON 2007 The International Conference on Computer as a Tool,* pp. 698-705, 2007.

P.Wawrzynski, "Reinforcement Learning in Fine Time Discretization," *Lecture Notes in Computer Science 4431,* pp. 470-479, 2007.

P.Wawrzynski, A.Pacut, "Balanced Importance Sampling Estimation," Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU), Paris, July 2-7, 2006, pp. 66-73.

P.Wawrzynski, "Symulacja Płaskich łańcuchów Kinematycznych," Raport nr 05-06 Instytutu Automatyki i Informatyki Stosowanej, Listopad 2005. (Eng. "*Planar Kinematic Chain Simulation,*" Report no 05-06 of the Institute of Control and Computation Engineering, November 2005.)

P.Wawrzynski, A.Pacut, "Reinforcement Learning in Quasi-Continuous Time," *Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation, November 2005, Vienna, Austria,* pp. 1031-1036.

P.Wawrzynski, "Intensive Reinforcement Learning," Ph.D. dissertation, Institute of Control and Computation Engineering, Warsaw University of Technology, May 2005.

P.Wawrzynski, A.Pacut, "Model-free off-policy reinforcement learning in continuous environment," *Proceedings of the International Joint Conference on Neural Networks, Budapest, July 2004,* pp. 1091-1096.

P.Wawrzynski, A.Pacut, "Intensive versus nonintensive actor-critic algorithms of reinforcement learning," *Lecture Notes in Artificial Intelligence 3070,* pp. 934-941, Springer-Verlag, 2004.

P.Wawrzynski, A.Pacut, "A simple actor-critic algorithm for continuous environments," *Proceedings of the 10th IEEE Int. Conf. on Methods and Models in Automation and Robotics, August 2004*, pp. 1143-1149.

P.Wawrzynski, P.Podsiadly, G.Lehmann, "IOT Methodology of Frequency Assignment in Cellular Network," *Proceedings of the MOST International Conference, October 2002,* pp. 313-324.

P.Wawrzynski, A.Pacut, "Modeling of distributions with neural approximation of conditional quantiles," *Proceedings of the 2nd IASTED Int. Conf. Artificial Intelligence and Applications, Malaga, Spain, September 2002,* pp. 539-543.

P.Wawrzynski, "ASD+M: Automatic parameter tuning in stochastic optimization and on-line learning", *Neural Networks,* vol. 96, pp. 1-10, 2017.

ABSTRACT: In this paper the classic
momentum algorithm for stochastic optimization is considered.
A method is introduced that adjusts coefficients for this
algorithm during its operation. The method does not depend
on any preliminary knowledge of the optimization problem.
In the experimental study, the method is applied to on-line
learning in feed-forward neural networks, including deep
auto-encoders, and outperforms any fixed coefficients. The
method eliminates coefficients that are difficult to determine,
with profound influence on performance. While the method itself
has some coefficients, they are easy to determine and sensitivity
of performance to them is low. Consequently, the method makes on-line
learning a practically parameter-free process and broadens the area
of potential application of this technology.

Keywords - Stochastic gradient descent, classic momentum, step-size, learning rate, on-line learning, deep learning.
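
For orientation, the classic momentum update that the abstract refers to can be sketched as below. This is a minimal sketch with fixed coefficients, i.e. the baseline whose coefficients the paper's method adjusts automatically; it is not the ASD+M method itself. The names `momentum_sgd`, `step_size`, `momentum`, and the toy noisy quadratic objective are assumptions made for illustration.

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, step_size=0.05, momentum=0.9, n_steps=500, seed=0):
    """Classic momentum with FIXED coefficients (hypothetical baseline, not ASD+M)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta, rng)                       # noisy gradient estimate
        velocity = momentum * velocity - step_size * g
        theta = theta + velocity
    return theta

def noisy_quadratic_grad(theta, rng):
    # Gradient of 0.5 * ||theta||^2 plus Gaussian noise (toy problem, assumed here).
    return theta + 0.1 * rng.standard_normal(theta.shape)

theta_final = momentum_sgd(noisy_quadratic_grad, theta0=[5.0, -3.0])
print(np.linalg.norm(theta_final))  # small: the iterate settles near the minimum at 0
```

Both `step_size` and `momentum` have to be chosen by hand in this sketch, which is the tuning burden the abstract says the method removes.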

P.Wawrzynski, "Robot’s Velocity and Tilt Estimation Through Computationally Efficient Fusion of Proprioceptive Sensors Readouts", *Proceedings of the 10th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 738-743, 2015.

ABSTRACT: In this paper a method is introduced that combines
Inertial Measurement Unit (IMU) readouts with low accuracy
and temporarily unavailable velocity measurements (e.g., based
on kinematics or GPS) to produce high accuracy estimates of
velocity and orientation with respect to gravity. The method is
computationally cheap enough to be readily implementable in
sensors. The main area of application of the introduced method
is mobile robotics.

Keywords - velocity estimation, Kalman filter, mobile robotics.

M.Majczak, P.Wawrzynski, "Comparison of two efficient control strategies for two-wheeled balancing robot", *Proceedings of the 10th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 744-749, 2015.

ABSTRACT: The subject of this paper is a two-wheeled balancing
robot with the center of mass above its wheels. Two control
strategies for this robot are analyzed. The first one combines a
kinematic model of the robot and a PI controller. The second
one is a cascade of two PIDs. These strategies are compared
experimentally.

Keywords - mobile robots, inverted pendulum, cost-effective
robots.

P.Wawrzynski, "Control policy with autocorrelated noise in reinforcement learning for robotics", *International Journal of Machine Learning and Computing,* Vol. 5, No. 2, pp. 91-95, IACSIT Press, 2015. doi:10.7763/IJMLC.2015.V5.489.

ABSTRACT: Direct application of reinforcement learning in
robotics raises the issue of discontinuity of the control signal.
Consecutive actions are selected independently at random,
which often makes them excessively far from one another. Such
control is hardly ever appropriate in robots; it may even lead to
their destruction. This paper considers a control policy in which
consecutive actions are modified by autocorrelated noise. That
policy generally solves the aforementioned problems and it is
readily applicable in robots. In the experimental study it is
applied to three robotic learning control tasks: Cart-Pole
SwingUp, Half-Cheetah, and a walking humanoid.

Index Terms - Machine learning, reinforcement learning,
actor-critics, robotics.
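
To illustrate why autocorrelated exploration noise gives smoother control than independent noise, here is a minimal sketch using a first-order autoregressive (AR(1)) process. The AR(1) form, the function names, and the parameters `alpha` and `sigma` are assumptions chosen for illustration; the paper's exact noise model may differ.

```python
import numpy as np

def ar1_noise(n_steps, dim, alpha=0.9, sigma=1.0, seed=0):
    """AR(1) noise sequence; alpha=0 gives independent (white) noise."""
    rng = np.random.default_rng(seed)
    noise = np.zeros((n_steps, dim))
    noise[0] = sigma * rng.standard_normal(dim)
    scale = sigma * np.sqrt(1.0 - alpha ** 2)   # keeps the stationary std at sigma
    for t in range(1, n_steps):
        noise[t] = alpha * noise[t - 1] + scale * rng.standard_normal(dim)
    return noise

def mean_jump(noise):
    """Average absolute change between consecutive noise values."""
    return float(np.mean(np.abs(np.diff(noise, axis=0))))

white = ar1_noise(5000, 1, alpha=0.0)
smooth = ar1_noise(5000, 1, alpha=0.9)
print(mean_jump(white), mean_jump(smooth))  # the autocorrelated sequence jumps less
```

Both sequences have the same overall spread, but consecutive values of the autocorrelated one stay close together, so actions perturbed by it change gradually between time steps.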

P.Wawrzynski, J.Mozaryn, J.Klimaszewski, "Robust estimation of walking robots velocity and tilt using proprioceptive sensors data fusion", *Robotics and Autonomous Systems,* Elsevier, Vol. 66, pp. 44-54, 2015. doi:10.1016/j.robot.2014.12.012.

ABSTRACT: Availability of the instantaneous velocity of a legged robot is usually required for
its efficient control. However, estimation of velocity only on the basis of robot kinematics
has a significant drawback: the robot is not in touch with the ground all the time, or its feet
may twist. In this paper we introduce a method for velocity and tilt estimation in a walking robot. This method combines a kinematic model of the supporting leg and readouts from an inertial sensor. It can be used in any terrain, regardless of the robot's body design or the control strategy applied, and it is robust in regard to foot twist. It is also immune to limited foot slide and temporary lack of foot contact.

J.Mozaryn, J.Klimaszewski, D.Swieczkowski-Feiz, P.Kolodziejczyk, P.Wawrzynski, "Design process and experimental verification of the quadruped robot wave gait", *Proceedings of the 9th International Conference on Methods and Models in Automation and Robotics, MMAR,* pp. 206-211, 2014. doi:10.1109/MMAR.2014.6957352.

ABSTRACT: This paper presents the design process and experimental verification of
a quadruped robot wave gait. The mathematical model of the robot's movement results from combining
the derived leg movement equations with a locomotion scheme. The gait is designed
and analysed with a two-step procedure: simulations in the MSC Adams
and Matlab environments, followed by experimental verification on a real quadruped robot.

P.Wawrzynski, "Reinforcement Learning with Experience Replay for Model-Free Humanoid Walking Optimization," *International Journal of Humanoid Robotics,* Vol. 11, No. 3, pp. 1450024, 2014. doi:10.1142/S0219843614500248.

ABSTRACT: In this paper a control system for humanoid robot walking is approximately optimized
by means of reinforcement learning. Given is an 18-DOF humanoid whose gait is based on replaying
a simple trajectory. This trajectory is translated into a reactive policy. A neural network
whose input represents the robot state learns to produce appropriate output that additively
modifies the initial control. The learning algorithm applied is Actor-Critic with experience
replay. In 50 minutes of learning, the slow initial gait changes to a dexterous and fast walking. No model of the robot dynamics is engaged. The methodology in use is generic and can be applied to optimize control systems for diverse robots of comparable complexity.

Keywords: reinforcement learning, learning in robots, humanoids, bipedal walking.

P.Wawrzynski, *Podstawy sztucznej inteligencji,* Oficyna Wydawnicza Politechniki Warszawskiej, 2014. (Eng. *Fundamentals of artificial intelligence,* Publishing House of Warsaw University of Technology, 2014.)

These lecture notes contain introductory material on the field of artificial intelligence. They are divided into three parts corresponding to the field's main branches: inference, search, and machine learning. Artificial intelligence is presented as a collection of methods that contribute to the toolkit of modern computer science. The presented techniques are accompanied by numerous examples illustrating their application.

B.Papis, P.Wawrzynski, "dotRL: A platform for rapid Reinforcement Learning methods development and validation," *Proceedings of the Federated Conference on Computer Science and Information Systems,* pp. 129-136, 2013.

ABSTRACT: This paper introduces dotRL, a platform that enables fast implementation and testing of Reinforcement Learning algorithms against diverse environments. dotRL has been written for the .NET framework and its main characteristics include: (i) adding a new learning algorithm or environment to the platform only requires implementing a simple interface, after which it is ready to be coupled with other environments and algorithms, (ii) a set of tools is included that aid running and reporting experiments, (iii) a set of benchmark environments is included, with ones as demanding as Octopus-Arm and Half-Cheetah, (iv) the platform is available for instantaneous download, compilation, and execution, without libraries from different sources.

Index Terms - Reinforcement learning, evaluation platform, software engineering.

P.Wawrzynski, J.Mozaryn, J.Klimaszewski, "Robust velocity estimation for legged robot using on-board sensors data fusion," *Proceedings of the International Conference on Methods and Models in Automation and Robotics (MMAR), August 26-29, 2013, Międzyzdroje, Poland,* pp. 717-722, IEEE, 2013.

ABSTRACT: Availability of momentary velocity of a legged robot
is essential for its efficient control. However, estimation of
the velocity is difficult, because the robot does not need to touch
the ground all the time or its feet may twist. In this paper we
introduce a method for velocity estimation in a legged robot that
combines a kinematic model of the supporting leg, readouts from an
inertial sensor, and a Kalman filter. The method alleviates all
the above-mentioned difficulties.

Index Terms - legged locomotion, velocity estimation, Kalman Filter.

P.Suszynski, P.Wawrzynski, "Learning population of spiking neural networks with perturbation of conductances," *Proceedings of the International Joint Conference on Neural Networks, August 4-9, 2013, Dallas TX, USA,* pp. 332-337, IEEE, 2013.

ABSTRACT: In this paper a method is presented for learning of
spiking neural networks. It is based on perturbation of synaptic
conductances. While this approach is known to be model-free, it
is also known to be slow, because it applies improvement direction
estimates with large variance. Two ideas are analysed to alleviate
this problem: First, learning of many networks at the same
time instead of one. Second, autocorrelation of perturbations in
time. In the experimental study the method is validated on three
learning tasks in which information is conveyed with frequency
and spike timing.

Index terms - Spiking neural networks, learning.

P.Wawrzynski, A.K.Tanwani, "Autonomous Reinforcement Learning with Experience Replay," *Neural Networks 41,* pp. 156-167, Elsevier, 2013. doi:10.1016/j.neunet.2012.11.007.

ABSTRACT: This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples, and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay whose step-sizes are determined on-line by an enhanced fixed point algorithm for on-line neural network training. An experimental study with simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within reasonably short time.

Keywords: reinforcement learning, autonomous learning, step-size estimation, actor-critic

P.Wawrzynski, *Sterowanie Adaptacyjne i Uczenie Maszynowe - preskrypt wykładu,* Politechnika Warszawska, 2012. (Eng. *Adaptive Control and Machine Learning - printed series of course lectures,* Warsaw University of Technology, 2012.)

ABSTRACT: These lecture notes discuss various approaches to adaptation as applied to optimizing the operation of control systems: reinforcement learning, model reference adaptive control, and self-tuning regulators. In addition, the notes review other forms of adaptation that can be used in technical systems; for example, the Kalman filter is discussed.

P.Wawrzynski, "Autonomous Reinforcement Learning with Experience Replay for Humanoid Gait Optimization," Proceedings of the International Neural Network Society Winter Conference (INNS-WC2012), pp. 205-211, Elsevier, 2012. doi:10.1016/j.procs.2012.09.130.

ABSTRACT: This paper demonstrates application of Reinforcement Learning to optimization of control of a complex system in a realistic setting that requires efficiency and autonomy of the learning algorithm. Namely, Actor-Critic with experience replay (which addresses efficiency) and the Fixed Point method for step-size estimation (which addresses autonomy) are applied here to approximately optimize humanoid robot gait. With complex dynamics and tens of continuous state and action variables, humanoid gait optimization represents a challenge for analytical synthesis of control. The presented algorithm learns a nimble gait within 80 minutes of training.

Keywords: Reinforcement learning; Autonomous learning; Learning in robots

P.Wawrzynski, B.Papis, "Fixed point method for autonomous on-line neural network training," *Neurocomputing 74,* pp. 2893-2905, Elsevier, 2011. doi:10.1016/j.neucom.2011.03.029.

ABSTRACT: This paper considers on-line training of feedforward neural networks. Training examples are only available through sampling from a certain, possibly infinite, distribution. In order to make the learning process autonomous, one can employ Extended Kalman Filter or stochastic steepest descent with adaptively adjusted step-sizes. Here the latter is considered. A scheme of determining step-sizes is introduced that satisfies the following requirements: (i) it does not need any auxiliary problem-dependent parameters, (ii) it does not assume any particular loss function that the training process is intended to minimize, (iii) it makes the learning process stable and efficient. An experimental study with several approximation problems is presented. Within this study the presented approach is compared with Extended Kalman Filter and LFI, with satisfactory results.

Keywords: On-line learning; Autonomous learning; Step-size adaptation; Extended Kalman Filter

P.Wawrzynski, "Fixed point method of step-size estimation for on-line neural network training," *Proceedings of WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010, Barcelona, Spain,* IEEE, pp. 2012-2017.

ABSTRACT: This paper considers on-line training of feedforward neural networks. Training examples are available only as random samples from a given generator. This setting raises the problem of adapting step-sizes, or learning rates. A scheme of determining step-sizes is introduced here that satisfies the following requirements: (i) it does not need any auxiliary problem-dependent parameters, (ii) it does not assume any particular loss function that the training process is intended to minimize, (iii) it makes the learning process stable and efficient. An experimental study with the 2D Gabor function approximation is presented.

Keywords: neural networks, on-line learning, step-size adaptation, reinforcement learning.

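P.Wawrzynski, *Systemy adaptacyjne i uczące się - preskrypt wykładu,* Oficyna Wydawnicza Politechniki Warszawskiej, 2009. (Eng. *Adaptive and learning systems - printed series of course lectures,* Publishing House of Warsaw University of Technology, 2009.)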

These lecture notes discuss adaptation mechanisms that can be applied in man-made systems. The goal of adaptation is to improve the system's performance while it operates. The behavior of a designed system is not always satisfactory, so the system must *learn* how to act optimally. The notes present the methods and algorithms needed to design adaptive and learning systems.

P.Wawrzynski, "Real-Time Reinforcement Learning by Sequential Actor-Critics and Experience Replay," *Neural Networks 22,* pp. 1484-1497, Elsevier, 2009. doi:10.1016/j.neunet.2009.05.011.

ABSTRACT: Actor-Critics constitute an important class of reinforcement learning algorithms that can deal with continuous actions and states in an easy and natural way. This paper shows how these algorithms can be augmented by the technique of experience replay without degrading their convergence properties, by appropriately estimating the policy change direction. This is achieved by truncated importance sampling applied to the recorded past experiences. It is formally shown that the resulting estimation bias is bounded and asymptotically vanishes, which allows the experience replay-augmented algorithm to preserve the convergence properties of the original algorithm. The technique of experience replay makes it possible to utilize the available computational power to reduce the required number of interactions with the environment considerably, which is essential for real-world applications. Experimental results are presented that demonstrate that the combination of experience replay and Actor-Critics yields extremely fast learning algorithms that achieve successful policies for nontrivial control tasks in considerably short time. Namely, the policies for the cart-pole swing-up (Doya, 2000) are obtained after as little as 20 minutes of the cart-pole time and the policy for Half-Cheetah (a walking 6-degree-of-freedom robot) is obtained after four hours of Half-Cheetah time.
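
The idea of truncated importance sampling can be illustrated on a simpler problem than the policy change direction treated in the paper: estimating an expectation under a current distribution from samples drawn under an older one, with the density ratios capped to limit variance. This is a minimal sketch only; the function names, the cap `bound`, and the Gaussian toy distributions are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def truncated_is_estimate(values, p_target, p_behavior, bound=5.0):
    """Mean of values reweighted by density ratios capped at `bound`."""
    ratios = np.asarray(p_target, dtype=float) / np.asarray(p_behavior, dtype=float)
    weights = np.minimum(ratios, bound)        # truncation limits the variance
    return float(np.mean(weights * np.asarray(values, dtype=float)))

rng = np.random.default_rng(0)
actions = rng.normal(0.0, 1.0, size=20000)     # recorded under the old policy N(0, 1)
p_old = gaussian_pdf(actions, 0.0, 1.0)
p_new = gaussian_pdf(actions, 0.5, 1.0)        # current policy N(0.5, 1)

# Estimates the mean action under the current policy (true value 0.5).
estimate = truncated_is_estimate(actions, p_new, p_old, bound=5.0)
print(estimate)
```

Capping the ratios introduces a bias, which in this toy case is small because few samples have a ratio above the cap; the abstract states that for the paper's estimator the bias is bounded and vanishes asymptotically.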

P.Wawrzynski, "A Cat-Like Robot Real-Time Learning to Run," *Lecture Notes in Computer Science 5495,* pp. 380-390, Springer-Verlag, 2009. doi:10.1007/978-3-642-04921-7_39.

ABSTRACT: Actor-Critics constitute an important class of reinforcement learning algorithms that can deal with continuous actions and states in an easy and natural way. In their original, sequential form, these algorithms are usually too slow to be applicable to real-life problems. However, they can be augmented by the technique of experience replay to obtain a satisfying speed of learning without degrading their convergence properties. In this paper experimental results are presented that show that the combination of experience replay and Actor-Critics yields very fast learning algorithms that achieve successful policies for nontrivial control tasks in considerably short time. Namely, a policy for a model of 6-degree-of-freedom walking robot is obtained after 4 hours of the robot's time.

P.Wawrzynski, J. Arabas, P. Cichosz, "Predictive Control for Artificial Intelligence in Computer Games," *Lecture Notes in Artificial Intelligence 5097,* pp. 1137-1148, Springer-Verlag, 2008. doi:10.1007/978-3-540-69731-2_107.

ABSTRACT: The subject of this paper is artificial intelligence (AI) of non-player characters in computer games, i.e. bots. We develop an idea of game AI based on predictive control. A bot's activity is defined by the plan it is currently executing. This plan results from an optimization process in which random plans are continuously generated and reselected. We apply our idea to implement a bot for the game Half-Life. Our bot, Randomly Planning Fighter (RPF), defeats the bot earlier designed for Half-Life with the use of behavior-based techniques. The experiments prove that on-line planning can be feasible in the rapidly changing environment of modern computer games.

P.Wawrzynski, A.Pacut, "Truncated Importance Sampling for Reinforcement Learning with Experience Replay," *Proceedings of the International Multiconference on Computer Science and Information Technology,* pp. 305-315, 2007.

ABSTRACT: Reinforcement Learning (RL) is considered here as an adaptation technique of neural controllers of machines. The goal is to make Actor-Critic algorithms require less agent-environment interaction to obtain policies of the same quality, at the cost of additional background computations. We propose to achieve this goal in the spirit of *experience replay*. An estimation method of improvement direction of a changing policy, based on preceding experience, is essential here. We propose one that uses truncated importance sampling. We derive bounds of bias of that type of estimators and prove that this bias asymptotically vanishes. In the experimental study we apply our approach to the classic Actor-Critic and obtain a 20-fold increase in speed of learning.

P.Wawrzynski, "Learning to Control a 6-Degree-of-Freedom Walking Robot," *Proceedings of EUROCON 2007 The International Conference on Computer as a Tool,* pp. 698-705, 2007.

ABSTRACT: We analyze the issue of optimizing a control policy for a complex system in a simulated trial-and-error learning process. The approach to this problem we consider is Reinforcement Learning (RL). Stationary policies, applied by most RL methods, may be improper in control applications, since for time discretization fine enough they do not exhibit exploration capabilities and define policy gradient estimators of very large variance. As a remedy to those difficulties, we proposed earlier the use of piecewise non-Markov policies. In the experimental study presented here we apply our approach to a 6-degree-of-freedom walking robot and obtain an efficient policy for this object.

P.Wawrzynski, "Reinforcement Learning in Fine Time Discretization," *Lecture Notes in Computer Science 4431,* pp. 470-479, 2007. doi:10.1007/978-3-540-71618-1_52.

ABSTRACT: Reinforcement Learning (RL) is analyzed here as a tool for control system optimization. State and action spaces are assumed to be continuous. Time is assumed to be discrete, yet the discretization may be arbitrarily fine. It is shown here that stationary policies, applied by most RL methods, are improper in control applications, since for fine time discretization they cannot assure bounded variance of policy gradient estimators. As a remedy to that difficulty, we propose the use of piecewise non-Markov policies. Policies of this type can be optimized by means of most RL algorithms, namely those based on likelihood ratio.

P.Wawrzynski, A.Pacut, "Balanced Importance Sampling Estimation," *Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU), Paris, July 2-7, 2006,* pp. 66-73.

ABSTRACT: In this paper we analyze a particular issue of estimation, namely the estimation of the expected value of an unknown function for a given distribution, with the samples drawn from other distributions. A motivation of this problem comes from machine learning. In reinforcement learning, an intelligent agent that learns to make decisions in an unknown environment encounters the problem of judging an arbitrary decision policy (the given distribution) on the basis of previous decisions and their outcomes suggested by previous policies (other distributions).

The problem can be solved with the use of well established importance sampling estimators. To overcome a potential problem of excessive variance of such estimators, we introduce the family of balanced importance sampling estimators, prove their consistency and demonstrate empirically their superiority over the classical counterparts.

Keywords: Estimation, Importance Sampling, Machine Learning, Reinforcement Learning.

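P.Wawrzynski, "Symulacja Płaskich łańcuchów Kinematycznych," Raport nr 05-06 Instytutu Automatyki i Informatyki Stosowanej, Listopad 2005. (Eng. "*Planar Kinematic Chain Simulation,*" Report no 05-06 of the Institute of Control and Computation Engineering, November 2005.)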

ABSTRACT: The report presents an algorithm for simulating the dynamics of planar kinematic chains. It is based on the Euler-Newton method. Accelerations within the object are assumed to be constant over a time quantum. The essence of the algorithm is therefore to find these accelerations. The computational cost of this operation is linear in the number of elements of the object.

The analysis covers planar kinematic chains in the form of rods (links) connected by rotational degrees of freedom (joints). The links are rigid and the entire mass of the object is located in the joints. Joints of several types are analyzed: moving without constraints, moving along a straight line, and moving with a prescribed acceleration. In addition, the angle between the links adjacent to a joint may be fixed or may change according to the joint accelerations induced in the chain. Typical phenomena accompanying the simulation, such as collisions (falls) of joints, are also analyzed.
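
The constant-acceleration assumption over one time quantum corresponds to the standard kinematic update sketched below for a single point mass. This sketch only shows how a state advances once the acceleration is known; it does not reproduce the report's algorithm for finding the accelerations in a chain. The function name and the free-fall example are assumptions for illustration.

```python
def constant_acceleration_step(position, velocity, acceleration, dt):
    """Advance one time quantum dt, assuming acceleration is constant within it."""
    new_position = position + velocity * dt + 0.5 * acceleration * dt ** 2
    new_velocity = velocity + acceleration * dt
    return new_position, new_velocity

# Free fall from rest for 1 second in ten quanta of 0.1 s (g = -9.81 m/s^2).
pos, vel = 0.0, 0.0
for _ in range(10):
    pos, vel = constant_acceleration_step(pos, vel, -9.81, 0.1)
print(pos, vel)  # approximately -4.905 and -9.81
```

When the true acceleration really is constant, as in this free-fall example, the update reproduces the exact trajectory up to floating-point rounding.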

P.Wawrzynski, A.Pacut, "Reinforcement Learning in Quasi-Continuous Time," *Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation, November 2005, Vienna, Austria,* pp. 1031-1036.

Reinforcement Learning (RL) is used here as a tool for control systems optimization. State and action spaces are assumed to be continuous. Time is assumed to be discrete, yet the discretization may be arbitrarily fine. Within the proposed algorithm, a piece of information that leads to a policy improvement is inferred from an experiment that lasts for several consecutive steps, rather than from a single step, as in more traditional RL methods. Simulations reveal that the algorithm is able to optimize the control policies for plants for which it is very difficult to apply the traditional methods.

Keywords: Machine Learning, Reinforcement Learning, Adaptive Control.

P.Wawrzynski, "Intensive Reinforcement Learning," Ph.D. dissertation, Institute of Control and Computation Engineering, Warsaw University of Technology, May 2005.

ABSTRACT: The Reinforcement Learning (RL) problem is analyzed in this dissertation in the language of statistics as an estimation issue. A family of RL algorithms is introduced. They determine a control policy by processing the entire known history of plant-controller interactions. Stochastic approximation as a mechanism that makes the classical RL algorithms converge is replaced with batch estimation. The experimental study shows that the algorithms obtained are able to identify parameters of nontrivial controllers within a few dozen minutes of control. This makes them a number of times more efficient than their existing equivalents.

P.Wawrzynski, A.Pacut, "Model-free off-policy reinforcement learning in continuous environment," *Proceedings of the International Joint Conference on Neural Networks, Budapest, July 2004,* pp. 1091-1096.

ABSTRACT: We introduce an algorithm of reinforcement learning in continuous state and action spaces. In order to construct a control policy, the algorithm utilizes the entire history of agent-environment interaction. The policy is the result of an estimation process based on all available information rather than the result of stochastic convergence, as in classical reinforcement learning approaches. The policy is derived from the history directly, not through any kind of model of the environment.

We test our algorithm in the Cart-Pole Swing-Up simulated environment. The algorithm learns to control this plant in about 100 trials, which corresponds to 15 minutes of the plant's real time. This time is several times shorter than that required by other algorithms.

P.Wawrzynski, A.Pacut, "Intensive versus nonintensive actor-critic algorithms of reinforcement learning," *Lecture Notes in Artificial Intelligence 3070,* pp. 934-941, Springer-Verlag, 2004. doi: 10.1007/978-3-540-24844-6_145.

ABSTRACT: Algorithms of reinforcement learning usually employ the agent's consecutive actions to construct gradient estimators that adjust the agent's policy. The policy is a result of some kind of stochastic approximation. Because stochastic approximation is slow, such algorithms are usually much too slow to be employed, e.g., in real-time adaptive control.

In this paper we analyze replacing the stochastic approximation with estimation based on the entire available history of agent-environment interaction. We design an algorithm of reinforcement learning in continuous space/action domain that is orders of magnitude faster than the classical methods.

P.Wawrzynski, A.Pacut, "A simple actor-critic algorithm for continuous environments," *Proceedings of the 10th IEEE Int. Conf. on Methods and Models in Automation and Robotics, August 2004*, pp. 1143-1149.

ABSTRACT: In reference to methods analyzed recently by Sutton *et al.* and Konda & Tsitsiklis, we propose a modification of them called Randomized Policy Optimizer (RPO). The algorithm has a modular structure and is based on the value function rather than on the action-value function. The modules include neural approximators and a parameterized distribution of control actions. The distribution must belong to a family of *smoothly exploring* distributions, which makes it possible to sample from the control action set in order to approximate a certain gradient. A *pre-action-value function* is introduced similarly to the action-value function, with the first action replaced by the first action distribution parameter.

The paper contains an experimental comparison of this approach to reinforcement learning with model-free Adaptive Critic Designs, specifically with Action-Dependent Adaptive Heuristic Critic. The comparison is favorable for our algorithm.

P.Wawrzynski, P.Podsiadly, G.Lehmann, "IOT Methodology of Frequency Assignment in Cellular Network," *Proceedings of the MOST International Conference, October 2002,* pp. 313-324.

ABSTRACT: We present a constraint-based methodology for solving the frequency assignment problem in a cellular phone network. The methodology is based on taking radio measurements in the territory where a given network operates. The measurements are used to approximate the areas of cells and the areas of interference that would occur if transceivers' frequencies were assigned too close to one another. A set of constraints for the frequency assignment is computed in order to minimize the interference. The frequencies are then assigned in a process of discrete optimization with constraints.

The standard method of dealing with the frequency assignment problem places emphasis on optimization. The shape of the minimized function is determined with the use of signal propagation models. Unfortunately, these models lack precision. Hence the need for an empirical assessment of signal strength. In practice, determining the constraint set as restrictively as possible becomes even more important than the efficiency of the optimization process.
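
As a rough illustration of assignment under separation constraints, the sketch below uses a simple greedy heuristic. It is not the optimization procedure of the paper: the function names, the constraint format (a minimum channel separation per pair of cells) and the example data are all assumptions made for this example, whereas in the paper the constraints are derived from radio measurements.

```python
def violates(cell, channel, assignment, min_separation):
    """Return True if giving `channel` to `cell` breaks a separation constraint."""
    for (a, b), sep in min_separation.items():
        if cell == a:
            other = b
        elif cell == b:
            other = a
        else:
            continue
        if other in assignment and abs(channel - assignment[other]) < sep:
            return True
    return False


def assign_frequencies(cells, channels, min_separation):
    """Greedy heuristic: each cell takes the first channel that is feasible
    with respect to the cells assigned before it."""
    assignment = {}
    for cell in cells:
        for channel in channels:
            if not violates(cell, channel, assignment, min_separation):
                assignment[cell] = channel
                break
        else:
            raise ValueError(f"no feasible channel for cell {cell!r}")
    return assignment


# Hypothetical data: three cells and required channel separations per pair.
constraints = {("A", "B"): 2, ("B", "C"): 1, ("A", "C"): 1}
plan = assign_frequencies(["A", "B", "C"], range(1, 6), constraints)
print(plan)
```

A greedy pass like this depends on the order of the cells and may fail where a feasible assignment exists, which is one reason a proper discrete optimization is used in practice.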

P.Wawrzynski, A.Pacut, "Modeling of distributions with neural approximation of conditional quantiles," *Proceedings of the 2nd IASTED Int. Conf. Artificial Intelligence and Applications, Malaga, Spain, September 2002,* pp. 539-543. [pdf, ps]

ABSTRACT: We propose a method of recurrent estimation of conditional quantiles stemming from stochastic approximation. The method employs a sigmoidal neural network and a specialized training algorithm to approximate the conditional quantiles. The approach may be used in a wide range of fields, in particular in econometrics, medicine, data mining, and modeling.
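
The idea of estimating a quantile by stochastic approximation can be sketched for the simplest, unconditional case. The code below is only an illustration under stated assumptions: it tracks a single scalar quantile rather than training a neural network on conditional quantiles, and the function name, step-size schedule and test data are choices made for this example, not taken from the paper.

```python
import random


def track_quantile(samples, tau, gain=1.0, decay=0.7, q0=0.0):
    """Recursively estimate the tau-quantile of a stream of scalar samples.

    Each sample moves the estimate up by step * tau when the sample lies
    above it, and down by step * (1 - tau) otherwise, so the estimate
    settles where a fraction tau of the samples falls below it.
    """
    q = q0
    for t, y in enumerate(samples, start=1):
        step = gain / t ** decay          # assumed decreasing step size
        indicator = 1.0 if y <= q else 0.0
        q += step * (tau - indicator)
    return q


# Hypothetical check on uniform(0, 1) data, whose 0.9-quantile is 0.9.
rng = random.Random(0)
samples = [rng.random() for _ in range(20000)]
estimate = track_quantile(samples, tau=0.9)
print(f"estimated 0.9-quantile: {estimate:.3f}")
```

In a conditional setting the scalar estimate would be replaced by the output of a function approximator and the same signed error would drive its parameter update.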
