This page covers tips for the agent_config.yaml file, which specifies what kind of training is run and how rewards influence the updates to the neural network.
Here is an example containing all of our recommended options. You are not required to use every option shown; they are included for reference if you choose to use them. To learn more about what each parameter does and what typical values look like, check out the Unity Training Configurations docs.
behaviors:
  Agent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 6
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 3
    #------------------------------
    behavioral_cloning:
      demo_path: Assets/Demonstrations/testName.demo
      strength: 0.5
      steps: 500000
      batch_size: 512
      num_epoch: 3
      samples_per_update: 0
    #------------------------------
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      #------------------------------
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
      #------------------------------
      gail:
        strength: 0.1
        gamma: 0.9
        demo_path: Assets/Demonstrations/testName.demo
        encoding_size: 64
        use_actions: true
    #------------------------------
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 50000
      swap_steps: 5000
      team_change: 100000
    #------------------------------
    time_horizon: 64
    max_steps: 10000000
    summary_freq: 50000
Extrinsic Rewards
These are the normal rewards you have already been using: external rewards given to an agent for completing a certain task (e.g., picking up a target).
Link to Unity docs on rewards examples and best practices
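For reference, the extrinsic signal in the example above corresponds to the block below (it sits under reward_signals in your behavior's entry); the comments describe what each field controls, and the values simply mirror the example.

reward_signals:
  extrinsic:
    gamma: 0.99     # discount factor: how much the agent values future rewards over immediate ones
    strength: 1.0   # multiplier applied to the rewards your environment code hands out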
Behavioral Cloning and GAIL
Learning from human play data can be a great way to train your agent. We have found that using at least one of these methods can be extremely helpful in getting your agent to learn the basic controls and the point of the game. Note that both of these methods require making a training demonstration using the demonstration recorder.
Behavioral Cloning
This method tries to copy the actions in the human data directly, leading to a closer replication of the demonstrated behavior. It runs for the number of steps specified and then turns off, which makes it great for when the agent is just starting out.
Link to Unity docs for behavioral cloning
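For reference, this is the behavioral_cloning block from the example above, with comments on what each parameter does; point demo_path at whatever .demo file you recorded, and tune the values for your game.

behavioral_cloning:
  demo_path: Assets/Demonstrations/testName.demo  # path to the .demo file you recorded
  strength: 0.5          # how strongly the cloning loss influences the policy, relative to PPO
  steps: 500000          # number of training steps over which cloning is applied before it turns off
  batch_size: 512        # demonstration samples used per gradient update
  num_epoch: 3           # passes over those samples per update
  samples_per_update: 0  # 0 means use the entire demonstration buffer each update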
GAIL
GAIL stands for Generative Adversarial Imitation Learning. In short, it allows your agent to be influenced by human data without copying it directly: a discriminator network learns to tell the agent's behavior apart from the demonstrations, and the agent is rewarded for behavior the discriminator cannot distinguish from them.
Great Computerphile video explaining Generative Adversarial Networks
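For reference, the gail block from the example above (it goes under reward_signals), with comments; as with behavioral cloning, demo_path should point at your recorded demonstration.

gail:
  strength: 0.1       # scales the GAIL reward relative to the extrinsic reward
  gamma: 0.9          # discount factor applied to the GAIL reward
  demo_path: Assets/Demonstrations/testName.demo  # the same recorded .demo file
  encoding_size: 64   # size of the hidden layer used by the discriminator
  use_actions: true   # discriminator compares actions as well as observations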
Creating a Demonstration Using the Demonstration Recorder
See the Unity docs for how to record demonstrations
Curiosity
In short, curiosity rewards the agent for trying something new. This can be helpful for tasks that require multiple steps, or tasks with easily exploitable local maxima that the agent can otherwise get stuck in.
Link to Unity blog post about curiosity
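For reference, the curiosity block from the example above (also nested under reward_signals), with comments on each field:

curiosity:
  strength: 0.02      # scales the curiosity reward relative to the extrinsic reward
  gamma: 0.99         # discount factor applied to the curiosity reward
  encoding_size: 256  # size of the encoding used by the curiosity module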
Self-Play
Self-play is a way to train an agent using itself as its opponent. Normally, you can train two copies of an agent just fine by including two of them with the same name in the same training area (which is what you have been doing so far). But that only lets you train on the explicit rewards you give, and reward is only a proximate measurement of what you really care about: winning. With self-play, win rate (tracked via ELO) becomes the ultimate measurement, and snapshots of agents that win often are kept and used as opponents for further training.
If you wish to use self-play in your training, drag and drop two copies of your agent into the game world and manually set their Behavior Parameters -> Team ID to different numbers (0 and 1 work; see the video for an example). As long as you have done this and included the self_play section in the config file (see the excerpt below), you can run training as normal. We have already added code that awards the correct reward to winners and losers.
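For reference, here is the self_play block from the example above with comments on what each parameter roughly controls; check the Unity docs before changing them.

self_play:
  window: 10                            # number of past snapshots kept as potential opponents
  play_against_latest_model_ratio: 0.5  # probability of facing the current policy instead of a past snapshot
  save_steps: 50000                     # training steps between saving opponent snapshots
  swap_steps: 5000                      # steps between swapping the opponent's snapshot
  team_change: 100000                   # steps between switching which team is learning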
Note that self-play currently contains a bug that prevents you from resuming training after it has been stopped. This is fine; it just means you must train your agent in one session. If you are careful and test your agent without self-play first, this is a great way to finish off its training.