Paweł Wawrzyński
at Warsaw University of Technology

Research



Leading Idea

My research focuses on the development of learning and adaptation methods that enable artificial systems to acquire or improve capabilities such as:
  • control,
  • decision making,
  • planning,
  • perception,
  • memory.
My research mainly concerns the areas listed below.

Current research

My current research focuses on the following topics:
  • Reinforcement learning
  • Graph neural networks
  • Automated trading
  • Machine vision
  • Continual learning

Graph and recurrent neural networks

Recursive graph autoencoder.
The architecture presented in the article recursively transforms the adjacency matrix of a graph into a fixed-dimension embedding and then, also recursively, performs the inverse transformation. As a result, virtually any graph can be represented by a vector of fixed dimension.
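As a rough illustration of the idea (not the exact architecture from the article; all layer sizes and names below are assumptions), an encoder may merge node embeddings pairwise until a single vector remains, and a decoder may reverse the process:

```python
# Illustrative sketch of a recursive graph autoencoder: node embeddings are
# derived from adjacency-matrix rows and merged pairwise by a shared network
# until a single fixed-dimension vector remains; a decoder reverses the process.
# This is an assumed, simplified variant, not the exact published architecture.
import torch
import torch.nn as nn

class RecursiveGraphAE(nn.Module):
    def __init__(self, max_nodes: int, emb_dim: int = 64):
        super().__init__()
        self.max_nodes = max_nodes
        self.embed = nn.Linear(max_nodes, emb_dim)          # adjacency row -> node embedding
        self.merge = nn.Sequential(                         # two embeddings -> one
            nn.Linear(2 * emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
        self.split = nn.Sequential(                         # one embedding -> two
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, 2 * emb_dim))
        self.readout = nn.Linear(emb_dim, max_nodes)        # node embedding -> adjacency row

    def encode(self, adj: torch.Tensor) -> torch.Tensor:
        # adj: (max_nodes, max_nodes), zero-padded for smaller graphs
        h = self.embed(adj)                                 # (max_nodes, emb_dim)
        while h.shape[0] > 1:                               # recursively halve the node set
            if h.shape[0] % 2 == 1:                         # pad to an even count
                h = torch.cat([h, torch.zeros_like(h[:1])], dim=0)
            h = self.merge(torch.cat([h[0::2], h[1::2]], dim=1))
        return h[0]                                         # fixed-dimension graph embedding

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        h = z.unsqueeze(0)
        while h.shape[0] < self.max_nodes:                  # recursively double the set
            pair = self.split(h)
            h = torch.cat([pair[:, :pair.shape[1] // 2],
                           pair[:, pair.shape[1] // 2:]], dim=0)
        return torch.sigmoid(self.readout(h[:self.max_nodes]))  # reconstructed adjacency

model = RecursiveGraphAE(max_nodes=8)
adj = (torch.rand(8, 8) < 0.3).float()
adj = torch.triu(adj, 1); adj = adj + adj.t()               # random undirected graph
recon = model.decode(model.encode(adj))
loss = nn.functional.binary_cross_entropy(recon, adj)
```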
Least redundant gated recurrent neural network.
The recurrent network presented in the article can perform an arbitrary non-linear transformation of its state within a single time step. The structure of the network contains only one layer of gates. In our experiments the network performs considerably better than LSTM, GRU and RHN.
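A rough sketch of such a cell (an assumed simplification, not the exact published architecture): a deep feed-forward part computes a candidate state capable of an arbitrary non-linear update, and a single layer of gates blends it with the previous state:

```python
# Illustrative sketch of a gated recurrent cell with a single layer of gates:
# a deep feed-forward part can realize an arbitrary non-linear update of the
# hidden state in one time step, and one gate layer blends it with the old state.
# This is an assumed simplification, not the exact published architecture.
import torch
import torch.nn as nn

class SingleGateCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, depth: int = 3):
        super().__init__()
        layers, width = [], input_dim + hidden_dim
        for _ in range(depth):                               # deep candidate-state network
            layers += [nn.Linear(width, hidden_dim), nn.Tanh()]
            width = hidden_dim
        self.candidate = nn.Sequential(*layers)
        self.gate = nn.Linear(input_dim + hidden_dim, hidden_dim)   # the only gate layer

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x, h], dim=-1)
        g = torch.sigmoid(self.gate(xh))                     # single layer of gates
        return (1 - g) * h + g * self.candidate(xh)          # gated state update

# unrolling the cell over a toy sequence
cell = SingleGateCell(input_dim=10, hidden_dim=32)
h = torch.zeros(1, 32)
for x_t in torch.randn(20, 1, 10):                           # 20 time steps, batch of 1
    h = cell(x_t, h)
```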

Continual learning

Problem description.
Suppose a neural model is trained on data arriving in batches (e.g., recordings from subsequent days). None of these batches is representative of the whole distribution (e.g., due to seasonal variation). After training on a given batch, we want the model to be as accurate as if it had been trained on all the available data at once.
BinPlay and LogCL.
Together with colleagues from the Computer Vision Lab at WUT, we have developed a continual learning method based on rehearsal of previous data, combined with a very memory-efficient way of storing those data. The method was described in a conference paper and a journal paper.
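A minimal sketch of the rehearsal idea (the compact memorization of past data used in BinPlay and LogCL is replaced here by a plain bounded buffer; all names and sizes are illustrative):

```python
# Illustrative sketch of rehearsal-based continual learning: while training on a
# new batch of data, previously seen examples kept in a memory buffer are mixed
# into every minibatch so the model does not forget them. The memory-efficient
# encoding of past data used in the published methods is replaced by a plain buffer.
import random
import torch
import torch.nn as nn

def train_on_task(model, optimizer, task_data, memory, rehearsal_size=32, epochs=1):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in task_data:                           # minibatches of the current task
            if memory:                                   # rehearse stored examples
                mx, my = zip(*random.sample(memory, min(rehearsal_size, len(memory))))
                x = torch.cat([x, torch.stack(mx)])
                y = torch.cat([y, torch.stack(my)])
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    # keep a bounded random sample of the finished task for future rehearsal
    for x, y in task_data:
        for xi, yi in zip(x, y):
            if len(memory) < 1000:
                memory.append((xi, yi))
            elif random.random() < 0.1:
                memory[random.randrange(len(memory))] = (xi, yi)

# toy usage on synthetic data arriving in three batches (tasks)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
memory = []
for task in range(3):
    task_data = [(torch.randn(64, 20), torch.randint(0, 5, (64,))) for _ in range(10)]
    train_on_task(model, optimizer, task_data, memory)
```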

Parameterless on-line learning

Problem description.
All algorithms based on stochastic gradient descent, and most on-line learning algorithms among them, require a parameter, called the step-size, that defines the length of the steps taken along gradient estimators. Unfortunately, good values of this parameter depend on the problem and on the stage of the process. The question is how to estimate it on-line, during the operation of the stochastic gradient descent algorithm.
ASD+M method.
This approach is based on optimizing the meta-parameters of stochastic gradient descent with momentum with respect to learning performance. The resulting algorithm was presented in a conference paper and a journal paper. An improved version of the algorithm was presented in another conference paper.
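To give the flavour of on-line meta-parameter adaptation (a simplified stand-in, not the actual ASD+M update rules; all constants are illustrative), the step-size may be nudged according to how well new gradients agree with the recent update direction:

```python
# Illustrative sketch of adapting the step-size of SGD with momentum on-line:
# the step-size is nudged in the direction that locally improves learning
# progress, here via the cosine between the current gradient and the momentum
# direction. This is a simplified stand-in for the published ASD+M rules.
import numpy as np

def adaptive_sgd_momentum(grad, theta, steps=1000,
                          eta=1e-3, beta=0.9, meta_rate=0.01):
    m = np.zeros_like(theta)                       # momentum buffer
    for _ in range(steps):
        g = grad(theta)                            # stochastic gradient estimator
        # if the new gradient still points along the momentum direction, the
        # previous steps were too short, so grow eta; if it points against it,
        # they were too long, so shrink eta (assumed, illustrative rule)
        cos = np.dot(g, m) / (np.linalg.norm(g) * np.linalg.norm(m) + 1e-12)
        eta *= np.exp(meta_rate * cos)
        m = beta * m + g                           # momentum update
        theta = theta - eta * m                    # parameter update
    return theta, eta

# toy usage on a noisy quadratic objective
rng = np.random.default_rng(0)
grad = lambda th: 2.0 * th + rng.normal(scale=0.1, size=th.shape)
theta, eta = adaptive_sgd_momentum(grad, theta=np.ones(5))
```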
Fixed point method.
I investigated an approach based on splitting the process into parts; within each part, gradient estimators are computed both at the consecutive (moving) points and at a point kept fixed throughout the part. Comparing the sums of the gradient estimators computed at the fixed point and at the moving points indicates how to adjust the step-size. The resulting family of algorithms was presented in a conference paper and a journal paper.
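A rough sketch of this scheme, with an assumed placeholder adjustment rule (the published rule differs):

```python
# Illustrative sketch of the fixed-point idea: within each part of the process,
# gradient estimators are accumulated both at the moving point and at a point
# frozen at the beginning of the part; comparing the two sums hints whether the
# step-size is too small or too large. The adjustment rule below is an assumed
# placeholder, not the published one.
import numpy as np

def fixed_point_step_size(grad, theta, parts=50, part_len=20, eta=1e-3):
    for _ in range(parts):
        theta_fixed = theta.copy()                  # fixed point for this part
        sum_moving = np.zeros_like(theta)
        sum_fixed = np.zeros_like(theta)
        for _ in range(part_len):
            sum_fixed += grad(theta_fixed)          # estimator at the fixed point
            g = grad(theta)                         # estimator at the moving point
            sum_moving += g
            theta = theta - eta * g                 # ordinary SGD step
        # assumed rule: if the two sums agree, the part behaved like descent from
        # the fixed point and eta may grow; strong disagreement suggests
        # overshooting, so eta shrinks
        cos = np.dot(sum_moving, sum_fixed) / (
            np.linalg.norm(sum_moving) * np.linalg.norm(sum_fixed) + 1e-12)
        eta *= 1.5 if cos > 0.9 else (0.5 if cos < 0.0 else 1.0)
    return theta, eta

# toy usage on a noisy quadratic objective
rng = np.random.default_rng(0)
grad = lambda th: 2.0 * th + rng.normal(scale=0.1, size=th.shape)
theta, eta = fixed_point_step_size(grad, np.ones(5))
```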

Autonomous reinforcement learning for optimization of Bioloid walking

Problem description. The objective is to make a Bioloid robot learn to walk forward as fast as possible.
The approach. Actor-Critic + experience replay + the fixed-point method for step-size estimation.
The results. This research resulted in a conference paper and a journal paper.

Actor-Critic algorithms with experience replay

Problem description.
Algorithms developed in the area of Reinforcement Learning can be viewed as computational processes that transform observations of states, actions and rewards into policy parameters. Several important RL algorithms, such as Q-Learning (Watkins, 1989) and Actor-Critic methods (Barto, Sutton & Anderson, 1983; Kimura & Kobayashi, 1998; Konda & Tsitsiklis, 2003), process the data sequentially. Each single observation is used to adjust the algorithm's parameters and then becomes unavailable for further use. Sequential algorithms do not exploit all the information contained in the data and are known to require a large number of environment steps to obtain a satisfying policy. Usually, this number is large enough to make the learning process detrimental to any real device whose control policy we would hope to optimize by means of RL.
The approach.
I investigated the approach based on collecting observations of state, actions, and rewards in a database and intensive processing thereof in a procedure parallel to the agent-environment interaction. I explored two variants of this approach.
  • Within the first variant, the policy is estimated on the basis of the collected data. The algorithm designed this way was presented in a paper and in a PhD thesis.
  • A more mature variant of this approach is as follows: (i) the starting point is any sequential Actor-Critic algorithm; (ii) it is modified so that it performs similar (but not identical) operations to the original algorithm, but many times per state transition, with the use of data on previous events. This variant is presented in a paper; furthermore, an algorithm designed this way was applied to a model of a cat that learns to run (Half-Cheetah), which was presented in a paper and an article. A generic sketch of this scheme is given below the list.
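The sketch below shows a generic actor-critic with a replay buffer and truncated importance weights; it is not the exact published algorithms, and the environment interface, network sizes and constants are assumptions:

```python
# Illustrative sketch of an actor-critic that reuses past transitions from a
# replay buffer, with truncated importance weights correcting for the fact that
# the stored actions were drawn from an older policy.
import random
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
replay = []                                            # (s, a, old_logp, r, s2, done)

def act(s):
    dist = torch.distributions.Categorical(logits=actor(s))
    a = dist.sample()
    return a, dist.log_prob(a)

def replay_update(batch_size=64, c_max=5.0):
    """One gradient step on a minibatch of past transitions; called many times
    per environment step, in parallel to the agent-environment interaction."""
    if len(replay) < batch_size:
        return
    s, a, old_logp, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    logp = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    rho = (logp - old_logp).exp().clamp(max=c_max).detach()   # truncated importance weight
    with torch.no_grad():
        target = r + gamma * (1 - done) * critic(s2).squeeze(-1)
    td = target - critic(s).squeeze(-1)                        # temporal-difference error
    critic_loss = (rho * td.pow(2)).mean()
    actor_loss = -(rho * td.detach() * logp).mean()
    opt.zero_grad(); (critic_loss + actor_loss).backward(); opt.step()

# In the interaction loop, transitions are appended to the buffer, e.g.:
#   a, logp = act(torch.as_tensor(s, dtype=torch.float32))
#   replay.append((s, a, logp.detach(), torch.tensor(r), s2, torch.tensor(float(done))))
# and replay_update() is called several times per environment step.
```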

Estimation of an expected value under one distribution with the use of samples from another distribution - balanced importance sampling

In order to evaluate a given policy on the basis of observations from the agent-environment interaction, one has to estimate the expected value of a function of a random variable governed by one distribution, using samples drawn from another distribution. One way to perform such estimation efficiently is a method called balanced importance sampling.
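For reference, the ordinary (here also self-normalized) importance sampling estimator that balanced importance sampling refines looks roughly as follows; the balancing itself is not reproduced here:

```python
# Ordinary importance sampling: the expected value of f(X) under a target
# density p is estimated from samples drawn from another density q by weighting
# each sample with p(x)/q(x). Balanced importance sampling refines how such
# samples and weights are combined; that refinement is not reproduced here.
import numpy as np

def importance_sampling_estimate(f, p_pdf, q_pdf, samples_from_q, self_normalize=True):
    x = np.asarray(samples_from_q)
    w = p_pdf(x) / q_pdf(x)                       # importance weights
    if self_normalize:                            # lower variance, slight bias
        return np.sum(w * f(x)) / np.sum(w)
    return np.mean(w * f(x))

# toy usage: E_p[X^2] for p = N(0, 1), using samples from q = N(1, 2^2)
rng = np.random.default_rng(0)
x_q = rng.normal(loc=1.0, scale=2.0, size=100_000)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-(x - 1.0)**2 / 8) / np.sqrt(8 * np.pi)
print(importance_sampling_estimate(lambda x: x**2, p, q, x_q))   # close to 1.0
```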

A framework for application of reinforcement learning algorithms to problems with very fine time discretization

Problem description.
Most RL algorithms optimize stationary policies, i.e. ones that draw an action only on the basis of the current state. Application of RL in control systems requires discretization of time, and good control requires fine time discretization. However, as the time discretization becomes finer, a stationary stochastic policy applied to a deterministic system produces behaviour that approaches deterministic, because the independent action noise averages out. Clearly, this phenomenon undermines the exploration capabilities of such policies.
The approach.
Our remedy for the fine time discretization problem is based on defining the policy in a special way. Namely, the policy divides the agent-environment interaction into periods and correlates actions within the same period. Within each period a coherent experiment is carried out that provides a clue for policy improvement. On the basis of a given Markov Decision Process (MDP) and such a policy, we define a new MDP and a stationary policy in it. We show that any RL algorithm based on the likelihood ratio can be applied to optimize the stationary policy in the new MDP.
Our approach was presented in a paper, and its application to Half-Cheetah was presented in another paper.
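The flavour of such a policy can be illustrated by exploration noise that is drawn once per period and shared by the actions within that period, instead of being drawn independently at every short time step; the construction below is illustrative, not the published one:

```python
# Illustration of exploration under fine time discretization: instead of adding
# independent noise at every short time step (which averages out as the step
# shrinks), the policy draws one exploration direction per period, so the
# actions within a period stay correlated. Illustrative construction only.
import numpy as np

def rollout(mean_action, step_env, state, horizon=1000,
            period=50, sigma=0.3, action_dim=2, rng=np.random.default_rng(0)):
    states, actions = [state], []
    noise = np.zeros(action_dim)
    for t in range(horizon):
        if t % period == 0:                       # new period: new coherent experiment
            noise = rng.normal(scale=sigma, size=action_dim)
        a = mean_action(state) + noise            # actions within a period share the noise
        state = step_env(state, a)
        states.append(state); actions.append(a)
    return states, actions

# toy usage with a trivial deterministic system x' = x + 0.01 * a
states, actions = rollout(mean_action=lambda s: -0.5 * s,
                          step_env=lambda s, a: s + 0.01 * a,
                          state=np.ones(2))
```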

A method of planar kinematic chain simulation

Efficient reinforcement learning algorithms should be applicable to control optimization of complex dynamical systems. A simulator of planar kinematic chains was developed as a testbed for RL methods. A methodology of its creation was presented in a report.
Capabilities of the simulator are presented below.
Figure 1. Running Half-Cheetah. A slow-motion movie of the run is available here.
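To fix intuition about the object being simulated, a planar kinematic chain is defined by link lengths and joint angles; a minimal forward-kinematics routine (only an illustration; the simulator itself also models the dynamics of such chains) may look as follows:

```python
# Forward kinematics of a planar chain: given link lengths and joint angles
# (each relative to the previous link), compute the joint positions. This only
# illustrates what a planar kinematic chain is; the simulator described in the
# report additionally models the dynamics.
import numpy as np

def planar_chain_positions(link_lengths, joint_angles):
    positions = [np.zeros(2)]                     # chain base at the origin
    heading = 0.0
    for length, angle in zip(link_lengths, joint_angles):
        heading += angle                          # accumulate relative joint angles
        positions.append(positions[-1] +
                         length * np.array([np.cos(heading), np.sin(heading)]))
    return np.array(positions)

# e.g. a three-link "leg": joint positions for given angles (radians)
print(planar_chain_positions([0.5, 0.4, 0.2], [0.3, -1.2, 0.8]))
```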

Predictive-control-based artificial intelligence of a bot in an FPS-type computer game

The artificial intelligence of a computer game bot may be based on predictive control. This idea was presented in a paper, along with its implementation in Half-Life, a First-Person-Shooter game.
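In the spirit of that idea, a predictive controller simulates candidate action sequences on a model of the game, scores them, and executes the first action of the best sequence; the random-shooting sketch below is an illustrative assumption, not the published controller:

```python
# Illustrative predictive-control loop for a game bot: candidate action
# sequences are rolled forward on a model of the game, scored, and the first
# action of the best sequence is executed; then the procedure repeats from the
# new state (receding horizon). Random-shooting search is an assumption here.
import random

def predictive_control_step(state, simulate, score, candidate_actions,
                            horizon=10, n_candidates=200):
    best_value, best_first_action = float("-inf"), None
    for _ in range(n_candidates):
        plan = [random.choice(candidate_actions) for _ in range(horizon)]
        s, value = state, 0.0
        for a in plan:                             # roll the model forward
            s = simulate(s, a)
            value += score(s)
        if value > best_value:
            best_value, best_first_action = value, plan[0]
    return best_first_action                       # executed in the real game

# toy usage: 1-D "chase" toward a target at position 10
actions = [-1, 0, 1]
simulate = lambda s, a: s + a
score = lambda s: -abs(10 - s)
state = 0
for _ in range(5):
    state = simulate(state, predictive_control_step(state, simulate, score, actions))
print(state)                                       # moves toward 10
```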

Estimation of a probability distribution with the use of neural approximation of conditional quantiles

A joint probability distribution of scalar random variables X1, ..., Xn may be modeled with arbitrary accuracy by the following entities:
  • a set of quantiles of the distribution of X1, for a selected set of orders,
  • a set of quantiles of the conditional distribution X2|X1, for the same orders,
  • a set of quantiles of the conditional distribution X3|X1,X2, for the same orders,
  • etc.
In order to apply the above method of modeling multidimensional distributions, one needs a tool for modeling scalar conditional distributions. Such a tool, in the form of a neural network with an appropriate learning scheme, was presented in a paper.
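One standard way to train such a network is the pinball (quantile) loss; a minimal sketch follows (the published learning scheme may differ; the toy data, sizes and quantile orders are illustrative):

```python
# Minimal sketch of neural conditional-quantile estimation: a network maps the
# conditioning variable to several quantiles of the conditional distribution
# and is trained with the pinball (quantile) loss.
import torch
import torch.nn as nn

orders = torch.tensor([0.1, 0.25, 0.5, 0.75, 0.9])           # quantile orders
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, len(orders)))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

def pinball_loss(pred, y, q):
    # pred: (batch, n_quantiles), y: (batch, 1), q: (n_quantiles,)
    diff = y - pred
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

# toy data: Y | X ~ N(sin(3 X), (0.2 + 0.2 |X|)^2)
x = torch.rand(2048, 1) * 2 - 1
y = torch.sin(3 * x) + (0.2 + 0.2 * x.abs()) * torch.randn_like(x)
for _ in range(500):
    opt.zero_grad()
    pinball_loss(net(x), y, orders).backward()
    opt.step()

print(net(torch.tensor([[0.5]])))     # estimated conditional quantiles at X = 0.5
```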

A method of frequency assignment in a cellular network

Cooperation between Warsaw University of Technology and Polska Telefonia Cyfrowa (a Polish cellphone operator) yielded a methodology of frequency assignment to base stations. The methodology is based on measurements of the signal from various base stations taken at points covering the area. Based on these measurements, a conflict relation is computed for each pair of stations. Conflicting base stations are assigned different frequencies. This methodology is described in detail in a paper.
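The assignment step amounts to colouring a conflict graph; below is a greedy sketch under the assumption that the conflict relation has already been computed from the measurements (the published methodology is more elaborate, and all names are illustrative):

```python
# Illustrative greedy assignment: base stations are vertices of a conflict
# graph, an edge joins every conflicting pair, and each station receives the
# lowest-numbered frequency not used by its already-assigned conflicting
# neighbours.
def assign_frequencies(stations, conflicts):
    """stations: iterable of station ids; conflicts: set of frozensets {a, b}."""
    assignment = {}
    # process stations with many conflicts first (a common greedy heuristic)
    order = sorted(stations, key=lambda s: -sum(s in c for c in conflicts))
    for s in order:
        used = {assignment[t] for t in assignment if frozenset((s, t)) in conflicts}
        f = 0
        while f in used:                           # lowest free frequency index
            f += 1
        assignment[s] = f
    return assignment

stations = ["BTS1", "BTS2", "BTS3", "BTS4"]
conflicts = {frozenset(p) for p in [("BTS1", "BTS2"), ("BTS2", "BTS3"), ("BTS1", "BTS3")]}
print(assign_frequencies(stations, conflicts))     # {'BTS1': 0, 'BTS2': 1, 'BTS3': 2, 'BTS4': 0}
```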