Behavioural cloning and dynamic system control by ML

Last update: January 2003

by Dorian Suc, Faculty for Computer and Information Science
Artificial Intelligence Laboratory, Trzaska 25, SI-1001 Ljubljana, Slovenia
e-mail: dorian.suc@fri.uni-lj.si, web address: http://ai.fri.uni-lj.si/dorian/

See my papers and thesis (Machine Reconstruction of Human Control Strategies) for a more complete description of the dynamic domains, simulators and experiments.

The idea of behavioural cloning (a term introduced by Donald Michie, 1993, but probably first carried out by Donaldson, 1964) is to make use of the operator's skill in the development of an automatic controller. A skilled operator's control traces are used as examples for machine learning to reconstruct the underlying control strategy that the operator executes subconsciously.
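As a rough illustration of the idea, the sketch below builds a controller directly from recorded (state, action) pairs. It uses a 1-nearest-neighbour lookup purely for brevity; this is a stand-in learner chosen for the example, not the learning method used in the experiments described on this page, and the names `clone_from_traces` and `toy_traces` are hypothetical.

```python
from math import dist


def clone_from_traces(traces):
    """Build a controller from operator control traces.

    traces: list of (state, action) pairs, where state is a tuple of floats.
    Returns a function that maps a state to the action the operator took
    in the most similar recorded state (1-nearest-neighbour, an assumption
    made for this sketch only).
    """
    if not traces:
        raise ValueError("at least one (state, action) example is required")

    def controller(state):
        # Pick the recorded example whose state is closest to the query.
        _, action = min(traces, key=lambda example: dist(example[0], state))
        return action

    return controller


# Toy trace with a one-dimensional state (an error signal); the imagined
# operator pushes against large errors and does nothing near zero.
toy_traces = [((-1.0,), 1), ((-0.2,), 0), ((0.3,), 0), ((1.2,), -1)]
clone = clone_from_traces(toy_traces)
```

A lookup table like this reproduces the operator's behaviour but explains nothing about it, which is why symbolic learners that yield readable strategies are of more interest for behavioural cloning.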

The goal of behavioural cloning is not only to induce a successful controller, but also to achieve a better understanding of the human operator’s subconscious skill. Behavioural cloning has been used successfully in problem domains such as pole balancing, production line scheduling, piloting (see Claude's page), and operating cranes.

Below are demos and brief descriptions of our experiments in three different domains:

Crane

Bike

Acrobot

Bike

Riding a bike is a challenging control task that requires keeping the bike balanced while driving it to the goal. Balancing alone is very hard: it requires properly adjusting the front wheel direction by applying torque to the handlebars and/or properly displacing the center of mass. The requirement to approach the goal makes the task even harder, since balancing and goal-aiming must be performed simultaneously.

The learning examples for our behavioural cloning were control traces from an experiment (in progress) in which four students learned to control the bike simulator manually and complete the control task. They were required to balance the bike and drive it to the goal, which was located 100 m from the start position along the x coordinate, with the initial frame direction of the bike at -pi/2 rad from the goal. A trial was successful if the bike came within a 5 m radius of the goal within 100 seconds and did not fall on the way.

The state of the bike is described by six variables: the tilt angle of the bicycle from vertical and its velocity, the angle between the front wheel direction and the bicycle direction (due to the deflection of the handlebars) and its velocity, the distance from the goal and the angle of the bike's frame relative to the goal. The system is controlled by two actions: the torque to apply to the handlebars and the displacement of the center of mass.
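The state, the actions and the success criterion described above can be written down as a small data structure. This is only a sketch: the class and field names are hypothetical and are not taken from the simulator, and whether the bike has fallen is passed in as a flag because the fall condition is not specified here. The zero initial values for the tilt and handlebar variables are likewise an assumption.

```python
from dataclasses import dataclass
from math import pi

GOAL_RADIUS_M = 5.0    # success radius around the goal, from the task description
TIME_LIMIT_S = 100.0   # time limit for a trial, from the task description


@dataclass
class BikeState:
    tilt: float                # tilt angle of the bicycle from vertical [rad]
    tilt_velocity: float       # its velocity
    handlebar_angle: float     # angle between front wheel and bicycle direction [rad]
    handlebar_velocity: float  # its velocity
    goal_distance: float       # distance from the goal [m]
    goal_angle: float          # angle of the bike's frame relative to the goal [rad]


@dataclass
class BikeAction:
    torque: float        # torque applied to the handlebars
    displacement: float  # displacement of the center of mass


def trial_successful(state, elapsed_s, fell):
    """True if the bike is inside the goal radius in time and has not fallen."""
    return (not fell
            and elapsed_s <= TIME_LIMIT_S
            and state.goal_distance <= GOAL_RADIUS_M)


# Start of a trial: 100 m from the goal, frame at -pi/2 rad from the goal.
# The remaining variables are set to zero as an assumption.
start = BikeState(0.0, 0.0, 0.0, 0.0, 100.0, -pi / 2)
```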

We used the simulator with the same parameters as in J. Randløv and P. Alstrøm: Learning to Drive a Bicycle using Reinforcement Learning and Shaping, ICML-98 (bicycle.ps.gz: 726 kb).

SEE THE CLONE IN ACTION: our simulator is written for MS-DOS and uses VGA graphics. To try it, download the graphical bike demo for MS-DOS, unzip it and run Bike.exe from MS-DOS or from Windows. The simulator options are described in BkDemo.txt.

Crane

Cranes (see the crane postscript picture) are used in ports to transport a container from the shore to a target position on a ship. This requires two operations: positioning the trolley, which brings it above the target load position, and rope operation, which brings the load to the desired height. The performance requirements include basic safety, stop-gap accuracy and the highest possible capacity. The last requirement means that the transportation time is to be minimized. Consequently, the two operations are to be performed simultaneously. The most difficult aspect of the task is controlling the swing of the rope. When the load is close to the goal position, the swing should ideally be zero.

The state of the system is specified by six variables: trolley position X and its velocity, rope inclination angle Phi and its angular velocity, rope length L and its velocity. Two control forces are applied to the system: force XF to the trolley in the horizontal direction and force YF in the direction of the rope.
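To make the role of the state variables concrete, the sketch below computes the position of the load from the trolley position X, the rope angle Phi and the rope length L. It assumes that Phi is measured from the downward vertical and is positive towards increasing X, and that the vertical coordinate points upwards from the trolley; these conventions, and the names `CraneState` and `load_position`, are assumptions made for the example rather than details of the simulator.

```python
from dataclasses import dataclass
from math import cos, sin


@dataclass
class CraneState:
    x: float                # trolley position X
    x_velocity: float       # trolley velocity
    phi: float              # rope inclination angle Phi [rad]
    phi_velocity: float     # angular velocity of the rope
    length: float           # rope length L
    length_velocity: float  # rate of change of the rope length


def load_position(state):
    """Return (horizontal, vertical) coordinates of the load.

    Assumed convention: Phi = 0 means the rope hangs straight down,
    and the vertical coordinate is zero at the trolley, positive upwards.
    """
    horizontal = state.x + state.length * sin(state.phi)
    vertical = -state.length * cos(state.phi)
    return horizontal, vertical
```

With zero swing the load hangs directly below the trolley, so under these assumptions `load_position(CraneState(2.0, 0.0, 0.0, 0.0, 10.0, 0.0))` gives a horizontal coordinate equal to X and a vertical coordinate of minus the rope length.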

We used experimental data on manual crane control from a previous study (Urbancic, 1994).
In that study, six students volunteered to learn to control the simulator. Remarkable individual differences were observed in the characteristics of the strategies they used. Some operators tended towards fast and less reliable operation, while others were more conservative and slower in order to avoid large rope oscillations.

SEE THE CLONE IN ACTION: to see the demo, download the graphical crane demo for MS-DOS (presented at QR'99), unzip it and run the .bat or .pif files from DOS or Windows.
See also the corresponding paper (cloning the crane):
D. Suc, I. Bratko: Modelling of control skill by qualitative constraints (zipped postscript). Thirteenth International Workshop on Qualitative Reasoning, editor: Price, C., pages 212-220, Aberystwyth: University of Aberystwyth, Loch Awe, Scotland, June 1999.

See also:
Program code in C++ for the crane simulator.
Acrobot and crane dynamics system and our experiments with human learning (postscript document, zipped postscript)

Acrobot

The acrobot (see the acrobot postscript picture) is an underactuated (i.e. possessing fewer actuators than degrees of freedom) double pendulum. It consists of a two-link, two-joint planar robot in a gravitational field. Torque can be applied to the elbow joint q2, but the shoulder joint q1 swings freely. Any rotation at q1 is entirely the result of dynamic coupling from the rest of the system. The dynamics of the Acrobot resemble those of a gymnast on a high bar, where link 1 is analogous to the gymnast's hands, arms and torso, link 2 represents the legs, and joint q2 is the gymnast's waist.

The state of the Acrobot is defined by angle q1 and its velocity and angle q2 and its velocity. One difficult and well-known problem is swing-up control. Here, the task is to move the Acrobot from its stable downward position to its unstable inverted position as fast as possible. A strategy must be found that drives the controllable joint q2 so as to excite oscillation of q1. The oscillation must grow until the system reaches the unstable equilibrium, i.e. the point where its center of mass is directly above the q1 joint.
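The condition that the center of mass is directly above the q1 joint can be checked with a few lines of geometry. The sketch below is illustrative only and rests on several assumptions that are not taken from the simulator: q1 is measured from the downward vertical, q2 is measured relative to link 1, each link is uniform with its mass at the midpoint, and the default lengths and masses are placeholder values.

```python
from math import cos, sin


def center_of_mass(q1, q2, l1=1.0, l2=1.0, m1=1.0, m2=1.0):
    """Return (x, y) of the system's center of mass relative to joint q1.

    Assumed conventions: q1 = 0 is link 1 hanging straight down,
    q2 is the elbow angle relative to link 1, y points upwards.
    Lengths l1, l2 and masses m1, m2 are placeholder parameters.
    """
    # Midpoint of link 1.
    c1x = 0.5 * l1 * sin(q1)
    c1y = -0.5 * l1 * cos(q1)
    # Midpoint of link 2, which starts at the end of link 1.
    c2x = l1 * sin(q1) + 0.5 * l2 * sin(q1 + q2)
    c2y = -l1 * cos(q1) - 0.5 * l2 * cos(q1 + q2)
    total = m1 + m2
    return (m1 * c1x + m2 * c2x) / total, (m1 * c1y + m2 * c2y) / total


def com_above_shoulder(q1, q2, tol=1e-6, **params):
    """True if the center of mass lies directly above the q1 joint."""
    x, y = center_of_mass(q1, q2, **params)
    return y > 0.0 and abs(x) <= tol
```

Under these conventions the hanging rest position is q1 = 0, q2 = 0 and the fully stretched inverted position is q1 = pi, q2 = 0. A swing-up controller would additionally need the velocities to be near zero at that point, which this check does not cover.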

SEE THE CLONE IN ACTION: to see the demo, download the graphical acrobot demo for MS-DOS, unzip it and run the .bat or .pif files from DOS or Windows.

See also:
Acrobot and crane dynamics system and our experiments with human learning (postscript document, zipped postscript)
