RL algorithms are a nightmare to get working. Below are some of my notes on how to get them to run effectively:
- Use delta-based rewards instead of kernels. For example, if you are trying to get a manipulator to a certain point $\vec{x}$, use $r_t = k\,(d_{t-1} - d_t)$, where $d_t = \lVert \vec{x}_t - \vec{x} \rVert$ is the distance to the goal at timestep $t$ and $k > 0$ (a minimal sketch follows after the next point). In my experience this works better than rewards of the form $r(\vec{x}) = f(\vec{x})$. I suspect this is because, over a single timestep, the difference in reward between making progress towards the goal and not making progress is much larger.
- Also, when designing things from scratch, look at what successful papers have used as reward functions; they will likely have experimented a lot to find the right one, and you don’t want to repeat that work. For robot manipulation & locomotion, many of the Robotic Systems Lab papers are good references, with reward functions proven to transfer to the real world [1, p15] [2].
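To make the delta-based reward above concrete, here is a minimal sketch, assuming the end-effector position and goal are NumPy arrays (the function name, variable names, and default `k` are placeholders, not from any particular codebase):

```python
import numpy as np

def delta_reward(prev_pos, curr_pos, goal, k=1.0):
    """Reward the change in distance to the goal over one timestep.

    Positive when the manipulator moved closer to the goal this step,
    negative when it moved away, zero when it made no progress.
    """
    prev_dist = np.linalg.norm(prev_pos - goal)
    curr_dist = np.linalg.norm(curr_pos - goal)
    return k * (prev_dist - curr_dist)
```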
- Episode length - depending on how you calculate advantage, do updates, and how your environment resets, too long an episode length can slow down training. Conversely, if the episode length is too short your agent may not be able to complete the task.
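If you want episode length to be an explicit hyperparameter rather than something baked into the environment, Gym's `TimeLimit` wrapper is one way to set it. A sketch, where the 500-step limit and the Pendulum env are placeholders and the exact env id depends on your Gym version:

```python
import gym
from gym.wrappers import TimeLimit

# Strip the registered time limit and impose your own, so episode length
# becomes a single number you can tune alongside the rest of the setup.
base_env = gym.make("Pendulum-v1").unwrapped
env = TimeLimit(base_env, max_episode_steps=500)
```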
- Control frequency - if set too high, each completion of your task in the simulator will take many timesteps and it will take longer to train.
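One cheap way to lower the effective control frequency without touching the simulator's physics timestep is to repeat each policy action for several simulator steps. A rough sketch, assuming the classic 4-tuple Gym step API (the wrapper name and default repeat count are placeholders):

```python
import gym

class ActionRepeat(gym.Wrapper):
    """Run the policy at a lower frequency than the simulator by holding
    each action for `repeat` simulator steps and summing the rewards."""

    def __init__(self, env, repeat=4):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```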
- The RL algorithm that will work best for you will depend on what kind of simulator you use. On-policy algorithms such as PPO work better when it is cheap to get environment samples (fast simulator). Off-policy algorithms such as SAC work better when your simulator is very slow.
- Don’t try to implement one of these from scratch yourself (unless you are doing it for educational purposes). (Hopefully) tested code in quasi-official repositories will almost certainly work better than whatever you can hack together. Denis Yarats' SAC implementation is well documented and extensible. It’s also fairly easy to hook it up to Stable Baselines' SubprocVecEnv to parallelise your environment (see the sketch below). RLLib has a wide variety of algorithms implemented, which come with built-in support for large-scale parallelisation (though they are somewhat annoying to modify).
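A sketch of the SubprocVecEnv hookup, shown here with Stable Baselines3 and its own PPO for brevity (wiring it up to an external SAC implementation follows the same pattern of building the vectorised env first, and "MyManipulatorEnv-v0" is a placeholder for your registered env id):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":  # required: SubprocVecEnv spawns worker processes
    # 8 copies of the environment, each stepping in its own process.
    vec_env = make_vec_env("MyManipulatorEnv-v0", n_envs=8,
                           vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)
```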
- Action spaces can have a big impact on the ability to learn tasks (e.g. joint position targets vs. raw joint torques for a manipulator), and this impact will vary with the nature of the task.
- If you are having trouble training an environment, especially one with multiple shaped rewards, try stripping it back to the minimal possible objective, see if it trains on that alone, and then add the other objectives back progressively.
- Completion termination & bonus - if you want to improve the throughput of episodes as your RL algorithm gets better at achieving a goal, try terminating the episode when the agent gets close to the goal. But be careful to give it a bonus reward for doing so, especially if your per-step reward is positive. If you don’t, the optimal strategy for maximising the discounted sum of future rewards will be to *not* achieve the goal, since terminating the episode by reaching it means the future return is limited to the reward at the terminating timestep (a sketch is below).
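A minimal sketch of the termination-plus-bonus idea; the threshold and bonus values are placeholders that need tuning for your task:

```python
import numpy as np

GOAL_THRESHOLD = 0.02     # metres; placeholder value
COMPLETION_BONUS = 10.0   # should outweigh the reward forfeited by ending early

def completion_reward_and_done(ee_pos, goal, shaped_reward):
    """End the episode near the goal and pay a one-off bonus, so that
    finishing early is never worse than dragging the episode out."""
    done = np.linalg.norm(ee_pos - goal) < GOAL_THRESHOLD
    reward = shaped_reward + (COMPLETION_BONUS if done else 0.0)
    return reward, done
```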
- Use fast simulators like IsaacGym or RaiSim. This will both allow you to train on more complex tasks and improve the speed of your tweak env/algo -> train -> tweak env/algo loop.
- If you expect your algorithm to depend on velocity information or other quantities that are not available from the state at a single timestep, make sure it has a way to infer them. You can do this by stacking observations (concatenating the raw observations from the last 3 timesteps, sketched below), by directly passing velocity information, or by using a recurrent policy.
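A sketch of observation stacking, assuming the classic Gym API (reset returns just the observation, step returns a 4-tuple) and a flat Box observation space; Gym also ships a FrameStack wrapper aimed mostly at image observations:

```python
from collections import deque

import gym
import numpy as np

class StackObs(gym.Wrapper):
    """Concatenate the last `n` raw observations so the policy can infer
    velocities and other quantities invisible in a single timestep."""

    def __init__(self, env, n=3):
        super().__init__(env)
        self.n = n
        self.frames = deque(maxlen=n)
        low = np.tile(env.observation_space.low, n)
        high = np.tile(env.observation_space.high, n)
        self.observation_space = gym.spaces.Box(
            low=low, high=high, dtype=env.observation_space.dtype)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.n):   # fill the buffer with the first observation
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames), reward, done, info
```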
Real Robots / Sim2Real
I only have limited experience with transferring RL algorithms to real robots; however, I have worked with Other Algorithms™ (yes, I know, they do exist) a fair bit on real robots, so some of these tips will generalise:
- First ensure you have the basics working - is the scaling of your algorithm's outputs correct, is your action space right, does the control frequency you are running your algorithm at match the simulator, etc.
- Before trying 10001 tricks with Domain Randomisation, tuning simulator/algorithm parameters etc., ensure you can transfer your vanilla algorithm with reasonable performance.
- Better methods than DR for getting sim2real transfer to work are coming online which you can try, especially better calibration of simulators [1] [2]. In general, improving how close your simulator is to reality > adding ever more randomisation to your simulator (it is less work and produces better results).
- In general, don’t try to fix problems in software if there is an issue with your physical robot. Fix that first and you will have much less frustration.