Exploring Evolutionary Meta-Learning in Robotics

[Image: googlerobotics.png]
The algorithm quickly adapts a legged robot’s policy to dynamics changes. In this example, the battery voltage dropped from 16.8 to 10 V, which reduced motor power, and a 500 g mass was also placed on the robot's side, causing it to turn rather than walk straight. The policy is able to adapt in only 50 episodes, or 150 s of real-world data.
Credit: Google.

Rapid development of more-accurate simulator engines has given robotics researchers a unique opportunity to generate enough data to train robotic policies for real-world deployment. However, moving trained policies from simulation to reality remains one of the greatest challenges of modern robotics because of the subtle differences between the simulated and real domains, termed the “reality gap.” While some recent approaches, such as imitation learning and offline reinforcement learning, prepare a policy for the reality gap by leveraging existing data, a more common approach is simply to provide more data by varying properties of the simulated environment, a process called domain randomization.

However, domain randomization can sacrifice performance for stability: it optimizes for a decent, stable policy across all tasks but offers little room for improving the policy on any specific task. This lack of a common optimal policy between simulation and reality is a frequent problem in robotic locomotion, where varying physical forces are at play, such as leg friction, body mass, and terrain differences. For example, given the same initial conditions for the robot’s position and balance, the surface type determines the optimal policy: on a flat surface, as typically encountered in simulation, the robot could accelerate to a higher speed, while on the rugged, bumpy surfaces it may meet in the real world, it should walk slowly and carefully to prevent falling.
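In practice, domain randomization amounts to resampling the simulator’s physical properties at the start of every training episode, so a single policy must cope with all of them. The following Python sketch is only a minimal illustration of that idea, not the paper’s training setup; the parameter names, ranges, and the env/policy interfaces are hypothetical.

import random

# Hypothetical parameter ranges; a real setup randomizes many more properties.
RANDOMIZATION_RANGES = {
    "leg_friction": (0.5, 1.25),       # ground-contact friction coefficient
    "body_mass_kg": (3.5, 5.0),        # total body mass
    "motor_strength": (0.8, 1.0),      # fraction of nominal motor torque
    "control_latency_s": (0.0, 0.04),  # actuation latency in seconds
}

def sample_randomized_params():
    """Draw one set of simulator parameters for the next training episode."""
    return {name: random.uniform(low, high)
            for name, (low, high) in RANDOMIZATION_RANGES.items()}

def train_with_domain_randomization(env, policy, num_episodes):
    """Train a single policy across many randomized variants of the simulator.

    env and policy stand in for a simulator and an RL learner; the
    env.reset(physics_params=...), env.rollout(...), and policy.update(...)
    calls are assumed interfaces, not a specific library's API.
    """
    for _ in range(num_episodes):
        env.reset(physics_params=sample_randomized_params())
        trajectory = env.rollout(policy)  # one episode under the sampled dynamics
        policy.update(trajectory)         # any standard RL update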

The paper “Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning” presents a type of meta-learning based on evolutionary strategies (ES), an approach generally believed to work well only in simulation, and uses it to adapt a policy on a real-world robot effectively, efficiently, and in a completely model-free manner. Compared with previous approaches for adapting meta-policies, such as standard policy gradients, which do not allow simulation-to-real adaptation, ES enables a robot to quickly overcome the reality gap and adapt to dynamic changes in the real world, some of which may not be encountered in simulation. This represents the first instance of ES being used successfully for on-robot adaptation.
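To make the model-free adaptation concrete, an evolution-strategies update treats the policy as a black box: perturb its parameters, score each perturbation by the return of a short rollout (on the real robot during adaptation), and shift the parameters toward the perturbations that scored best. No gradients through a dynamics model or through the policy’s action probabilities are needed. The sketch below is a generic ES loop, an assumption about how such an update could look rather than the paper’s algorithm; evaluate_return and the hyperparameters are placeholders.

import numpy as np

def es_adapt(theta, evaluate_return, iterations=10, pop_size=8,
             sigma=0.05, lr=0.02):
    """Black-box evolution-strategies update of a flat policy-parameter vector.

    evaluate_return(params) is assumed to run one episode with the given
    parameters and return its total reward; during real-world adaptation
    that episode runs on the physical robot.
    """
    theta = np.array(theta, dtype=float).ravel()
    for _ in range(iterations):
        # Sample a population of parameter perturbations.
        noise = np.random.randn(pop_size, theta.size)
        returns = np.array([evaluate_return(theta + sigma * eps)
                            for eps in noise])
        # Rank-normalize returns so the update is robust to noisy rewards.
        ranks = returns.argsort().argsort()
        weights = (ranks - ranks.mean()) / (ranks.std() + 1e-8)
        # Move toward perturbations that produced higher returns.
        theta = theta + lr / (pop_size * sigma) * noise.T @ weights
    return theta

Because exploration happens in parameter space rather than in the action chosen at every time step, the deployed policy itself can remain deterministic, which matters for the stochasticity trade-off discussed below.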

Meta-Learning

This research falls under the general class of meta-learning techniques and is demonstrated on a legged robot. At a high level, a meta-learned policy solves an incoming task quickly, without retraining from scratch, by combining past experiences with a small amount of experience from the new task. This is especially beneficial in the simulation-to-real case, where most of the past experience comes cheaply from simulation, while a minimal, yet necessary, amount of experience is generated on the real-world task. The simulated experience gives the policy a general level of competence across a distribution of tasks, while the real-world experience lets it fine-tune to the specific task at hand.
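In code, that split between cheap simulated experience and scarce real-world experience is essentially a two-phase protocol, sketched below under assumed interfaces (sample_sim_task, collect_episodes, and the policy's meta_update/adapt methods are placeholders, not the paper's API).

def meta_train_in_simulation(policy, sample_sim_task, collect_episodes,
                             meta_iterations=1000):
    """Phase 1: learn broadly useful behavior from cheap, plentiful sim tasks."""
    for _ in range(meta_iterations):
        task = sample_sim_task()                       # e.g. randomized dynamics
        experience = collect_episodes(task, policy, n=20)
        policy.meta_update(experience)                 # update that rewards fast adaptation

def adapt_in_real_world(policy, real_robot, collect_episodes, max_episodes=50):
    """Phase 2: fine-tune with a small, expensive batch of real-world episodes."""
    experience = collect_episodes(real_robot, policy, n=max_episodes)
    policy.adapt(experience)                           # few-shot, task-specific update
    return policy

The 50-episode budget in the second phase mirrors the roughly 50 episodes (150 s of data) the paper reports for real-world adaptation.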

In order to train a policy to meta-learn, it is necessary to encourage the policy to adapt during simulation. Normally, this can be achieved by applying model-agnostic meta-learning (MAML), which searches for a meta-policy that can adapt to a specific task quickly using small amounts of task-specific data. The standard approach to computing such meta-policies is to use policy gradient methods, which improve a policy by increasing the likelihood of actions that led to high reward. To assign a likelihood to an action at all, the policy must be stochastic, so the action it selects has a randomized component. The real-world environment in which such robotic policies are deployed is also highly stochastic, because slight differences in motion arise naturally even when starting from exactly the same state and action sequence. The combination of a stochastic policy inside a stochastic environment creates two conflicting objectives:

  • Decreasing the policy’s stochasticity may be crucial because, otherwise, the high-noise problem might be exacerbated by the additional randomness from the policy’s actions.
  • However, increasing the policy’s stochasticity may also benefit exploration because the policy needs to use random actions to probe the type of environment to which it adapts.

These two competing objectives, one pushing to decrease and one to increase the policy’s stochasticity, can complicate training. The sketch below shows where that stochasticity enters a policy-gradient update.
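The following sketch illustrates, for a simple linear-Gaussian policy, how a REINFORCE-style inner-loop update of the kind MAML could use depends on the policy’s action noise: actions are sampled around the policy’s mean output, and the update weights the log-likelihood gradient of each taken action by the return that followed. It is an illustration of the general technique under assumed inputs, not the paper’s implementation; the sigma and learning-rate values are placeholders.

import numpy as np

def gaussian_policy_sample(W, state, sigma=0.1):
    """Stochastic policy: action = W @ state plus Gaussian exploration noise."""
    mean = W @ state
    return mean + sigma * np.random.randn(*mean.shape)

def policy_gradient_adapt(W, episodes, sigma=0.1, lr=1e-3):
    """One REINFORCE-style adaptation step on task-specific episodes.

    episodes is a list of (states, actions, returns) tuples gathered with the
    current stochastic policy; returns[t] is the reward-to-go after step t.
    """
    grad = np.zeros_like(W)
    for states, actions, returns in episodes:
        for s, a, R in zip(states, actions, returns):
            # Gradient of log N(a | W s, sigma^2 I) with respect to W,
            # weighted by the observed return R.
            grad += R * np.outer((a - W @ s) / sigma**2, s)
    return W + lr * grad / max(len(episodes), 1)

Shrinking sigma reduces the action noise that compounds with the environment’s own randomness, but it also shrinks the exploration signal this update relies on; parameter-space methods such as ES can sidestep the tension because their exploration happens in the policy’s parameters rather than in its actions.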

Read the full story here.