Training Configuration

This page covers tips for the agent_config.yaml file, which specifies what kind of training the agent receives and how rewards influence the updating of the neural network.

Here is an example containing all of our recommended options. You are in no way required to use every option included in this example; they are here for your reference if you choose to use them. If you wish to know more about what each parameter does and what typical values look like, check out the Unity Training Configurations docs.

behaviors:
  Agent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 6
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 3
#------------------------------
    behavioral_cloning:
      demo_path: Assets/Demonstrations/testName.demo
      strength: 0.5
      steps: 500000
      batch_size: 512
      num_epoch: 3
      samples_per_update: 0
#------------------------------
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
#------------------------------
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
#------------------------------
      gail:
        strength: 0.1
        gamma: 0.9
        demo_path: Assets/Demonstrations/testName.demo
        encoding_size: 64
        use_actions: true
#------------------------------
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 50000
      swap_steps: 5000
      team_change: 100000
#------------------------------
    time_horizon: 64
    max_steps: 10000000
    summary_freq: 50000

Extrinsic Rewards

These are the normal rewards you have already been using: external rewards given to the agent when it completes a certain task (e.g. picking up a target).

Link to Unity docs on rewards examples and best practices
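
For reference, here is the extrinsic block from the example above, with comments roughly describing what the two parameters control; see the Unity docs for the exact definitions.

    reward_signals:
      extrinsic:
        gamma: 0.99      # discount factor: how strongly future rewards are valued over immediate ones
        strength: 1.0    # multiplier applied to the reward given by the environment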


Behavioral Cloning and GAIL

Learning from human play data can be a great way to train your agent. We have found that using at least one of these methods can be extremely helpful in getting your agent to figure out the basic controls and point of the game. Note that both of these methods require recording a training demonstration using the demonstration recorder (see below).

Behavioral Cloning

This method tries to directly copy the actions from the human data, leading to a closer replication of the demonstration. It runs for the number of steps specified and then turns off. This is great for when the agent is just starting out.

Link to Unity docs for behavioral cloning
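
For reference, here is just the behavioral_cloning block from the example above, with comments sketching what each parameter roughly controls; see the Unity docs linked above for the full definitions.

    behavioral_cloning:
      demo_path: Assets/Demonstrations/testName.demo  # path to the recorded demonstration file
      strength: 0.5          # how strongly the cloning loss influences learning relative to PPO
      steps: 500000          # cloning is annealed over this many steps, then turns off
      batch_size: 512        # demonstration experiences used per gradient update
      num_epoch: 3           # passes over the demonstration buffer per update
      samples_per_update: 0  # 0 means train on the whole demonstration buffer each update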

GAIL

GAIL stands for Generative Adversarial Imitation Learning. In short, it allows your agent to be influenced by the human data without directly copying it.

Link to Unity docs for GAIL

Great Computerphile video explaining Generative Adversarial Networks
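
Below is just the gail block from the example above (it sits under reward_signals), again with rough comments on each parameter; treat these as a sketch and see the Unity GAIL docs for the authoritative descriptions.

    reward_signals:
      gail:
        strength: 0.1      # scale of the GAIL reward relative to the extrinsic reward
        gamma: 0.9         # discount factor for the GAIL reward
        demo_path: Assets/Demonstrations/testName.demo  # the same recorded demonstration file
        encoding_size: 64  # size of the hidden layer used by the GAIL discriminator
        use_actions: true  # discriminate on actions as well as observations, so the agent imitates actions more closely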

Creating a Demonstration Using the Demonstration Recorder

See the Unity docs for how to record demonstrations


Curiosity

In short, curiosity gives the agent rewards when it tries something new. This can be helpful for tasks that require multiple steps, or tasks with easily exploitable local maxima that the agent can get stuck in.

Link to Unity blog post about curiosity
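
For reference, here is the curiosity block from the example above (also under reward_signals) with rough comments; see the Unity docs for full details.

    reward_signals:
      curiosity:
        strength: 0.02       # scale of the curiosity reward; keep it small relative to the extrinsic reward
        gamma: 0.99          # discount factor for the curiosity reward
        encoding_size: 256   # size of the encoding used by the curiosity module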


Self-Play

Self-play is a way to train an agent using itself as an opponent. Normally, you can train two copies of an agent just fine by including two of them with the same name in the same training area (which is what you have been doing so far). But that approach only trains using the explicit rewards we give, and reward is only a proximate measurement of what we really care about: winning. With self-play, win rate becomes the ultimate measurement (tracked using an ELO rating), and snapshots of agents that win a lot are kept and used as opponents for training further agents.

If you wish to use self-play in your training, you must drag and drop two copies of your agent into the game world and manually set their Behaviour Parameters -> Team ID to different numbers (0 and 1 work; see video for example). So long as you have done this and have included the self-play section in the config file, you can run training as normal. We have already added code which awards the correct reward to winners and losers.
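
For reference, here is the self_play block from the example above with rough comments on what each parameter controls; see the Unity self-play docs linked below for the exact definitions.

    self_play:
      window: 10                            # number of past snapshots kept as potential opponents
      play_against_latest_model_ratio: 0.5  # fraction of games played against the latest snapshot rather than an older one
      save_steps: 50000                     # a new snapshot of the agent is saved every 50000 steps
      swap_steps: 5000                      # how often the opponent is swapped for a different snapshot
      team_change: 100000                   # how often the learning team is switched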

Note that self-play currently contains a bug which prevents you from resuming training after it has been stopped. This is fine; it just means that you must train your agent in a single session. If you are careful and make sure to test your agent without self-play first, this is a great way to finish off the training of your agent.

Link to Unity self-play blog post (explains concept)

Link to Unity self-play docs