eng / pl

Paweł Wawrzyński

- Autonomous reinforcement learning for optimization of Bioloid walking
- Parameterless on-line learning
- Actor-Critic algorithms with experience replay
- Estimation of an expected value of one distribution with the use of samples from another distribution - balaced importance sampling
- A framework for application of reinforcement learning algorithms to problems with very fine time discretization
- A method of planar kinematic chain simulation
- Predictive control based artificial intelligence of a bot in FPS-type computer game
- Estimation of probabilistic distribution with the use of neural approximation of conditional quantiles
- A method of frequency assigment in a cellular network

My research focus on development of methods of learning or adaptation that enable artificial systems to acquire or improve its capabilities such as:

- control,
- decision making,
- planning,
- perception,
- memory.

My research concerns mainly the following areas:

- Reinforcement learning,
- Machine learning,
- Neural networks,
- Artificial intelligence,
- Adaptive control.

My current research focus on the following issues:

- Application of reinforcement learning with experience replay to a humanoid walking robot.
- Application of reinforcement learning with experience replay to trajectory optimization in robotic systems. The project is co-founded by Polish Ministry of Science and Higher Education.
- Autonomy of on-line learning. In partucular, I investigate methods of adaptive adjustment of step-sizes in on-line learning neural networks.

Problem description.
The objective is to make Bioloid learn to walk forward as fast as possible.

The approach.
Actor-Critic + experience replay + fixed-point method for step-size estimation.

The results.
This research resulted in a conference paper and a journal paper

Problem description.

All algorithms based on stachastic gradient descent, and most on-line learning algorithms among them, require providing a parameter, called step-size, that defines lenght of steps that are being made along gradient estimators. Unfortunatelly, good values of these parameters are problem- and stage-of-process-dependent. The question is how to estimate them on-line during operation of the stochastic gradient descent algorithm?
ASD+M method.

This approach is based on optimization of meta-parameters of the method of stochastic gradient descent with momentum with respect to learning performance. The resulting algorithm was presented in a conference paper and a journal paper.
Fixed point method.

I investigated the approach based on spliting the process into parts in which the gradient estimators are being computed in the consecutive points and in fixed (within parts) points. Comparison of sums of gradient estimators computed in the fixed point and in the changing points gives a clue how to adjust the step-size. The resulting family of algorithms has been presented in a conference paper and a journal paper
Problem description.

Algorithms developed in the area of Reinforcement Learning can be viewed as computational processes that transform observations of states, actions and rewards into policy parameters. Several important RL algorithms, such as Q-Learning (Watkins, 1989) and Actor-Critic methods (Barto, Sutton & Anderson, 1983, Kimura & Kobayashi, 1998, Konda & Tsitsiklis, 2003), process the data sequentially. Each single observation is used for adjusting the algorithms' parameters and then becomes unavailable for further use. Sequential algorithms do not exploit all the information contained in the dataand are known to require a large number of environment steps to obtain a satisfying policy. Usually, this number is large enough to make the learning process detrimental to any real device whose control policy we would hope to optimize by means of RL.
The approach.

I investigated the approach based on collecting observations of state, actions, and rewards in a database and intensive processing thereof in a procedure parallel to the agent-environment interaction. I explored two variants of this approach.
- Within the first variant, the policy is a subject of estimation based on the collected data. The algorithm designed this way was presented in a paper and in a PhD thesis.
- A more mature variant of this approach is as follows: (i) the starting point is any sequential Actor-Critic algorithm, (ii) it is modified such that it performs similar (but not the same) operations as the original algorithm, but many times per one state transition, with the use of data on previous events. This variant is presented in a paper; furthermore, an algorithm designed this way was applied to a model of a cat that learns to run (Half-Cheetah), which was presented in a paper and an article.

In order to evaluate a given policy on the basis of observations from the agent-environment interaction, one has to estimate the expected value of a function of a random variable defined by one distribution, with the use of variables sampled from another distribution. One of ways to perform efficiently such the estimation in a method called balanced importance sampling.

Problem description.

Most RL algorithms optimize stationary policies, i.e. ones that draw an action only on the basis of a current state. Application of RL in control systems requires discretization of time. Good control requires fine time discretization. However, a stationary stochastic policy applied to a deterministic system leads to a deterministic behavior of the system for diminishing time discretization. Clearly, this phenomenon precludes exploration capabilities of such policies.
The approach.

Our remedy for the fine time discretization problem is based on defining a policy in a special way. Namely, the policy divides agent-environment interaction into periods such that it relates actions with each others within the same period. Within each period a coherent experiment is carried out that gives a clue to policy improvement. On the basis of a given Markov Decision Process (MDP) and such the policy we define a new MDP and a stationary policy in the new one. We show that each RL algorithm based on likelihood ratio can be applied to optimize the stationary policy in the new MDP.
Our approach was presented in a paper and its application to Half-Cheetah was presented in another paper.

Efficient reinforcement learning algorithms should be applicable to control optimization of complex dynamical systems. A simulator of planar kinematic chains was developed as a testbed for RL methods. A methodology of its creation was presented in a report.

Capabilities of the simulator are presented below.

Figure 1. Running Half-Cheetah. Slow-motion movie of the run is available here.

An artificial intelligence of a computer game bot may be based on predictive control. This idea as presented in a paper, along with its implementation in HalfLife - a First-Person-Shooter game.

A joint probabilistic distribution of scalar random variables X_{1}, ..., X_{n} may be modeled with arbitrary accuracy by the following entities:

- A set of quantiles of X
_{2}|X_{1}conditional distribution of orders as above, - A set of quantiles of X
_{2}|X_{1}conditional distribution of orders as above, - A set of quantiles of X
_{3}|X_{1},X_{1}as above, - etc.

In order to apply the above method of modeling multidimensional distributions, one requires a tool for modeling scalar conditional distributions. This tool was presented in a paper, in the form of a neural network with appropriate learning scheme.

Cooperation of Warsaw University of Technology with Polska Telefonia Cyfrowa (a Polish cellphone operator) yielded a methodology of frequency assignment to base stations. The methodology is based on measurements of signal from various base stations taken in points covering the area. Basing on the measurements, the conflict relations are calculated for each pair of the stations. Conflicted base stations are assigned different frequencies. This methodology is described in detailed in a paper.

Copyright © Marianna Krzewińska 2016