Reinforcement learning (RL) example with RLMolecule and Ray (5:47)

The goal of the rlmolecule library is to enable general-purpose material and molecular optimization using reinforcement learning. It explores molecular space by adding one atom or bond at a time, learning how to build molecules with desired properties. The notebook makes running your own molecular optimization easy and accessible; parameter description tables are shown below.
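As a rough illustration of what a single atom-addition step looks like at the RDKit level (a sketch only; rlmolecule defines its own action space and state representation, which are not shown here):

```python
from rdkit import Chem

# One "action" in the molecule-building process: add an atom and bond it to
# an existing atom. Illustrative only; rlmolecule's actual actions may differ.
mol = Chem.RWMol(Chem.MolFromSmiles("CC"))   # starting molecule: ethane
new_idx = mol.AddAtom(Chem.Atom("O"))        # chosen atom type to add
mol.AddBond(1, new_idx, Chem.BondType.SINGLE)
Chem.SanitizeMol(mol)
print(Chem.MolToSmiles(mol))                 # -> "CCO" (ethanol)
```

An episode is a sequence of such steps, ending when the molecule is complete and its reward is evaluated.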

[Figure RL-schematic-full.png: schematic of the full rlmolecule reinforcement-learning workflow]


| Option | Description |
|---|---|
| Starting molecule | Starting point for each molecule-building episode. |
| Atom additions | Atom types to choose from when building molecules. |
| max-atoms | Maximum number of heavy atoms. |
| max-#-actions | Maximum number of actions to allow when building molecules. |
| SA threshold | Potential molecules with a Synthetic Accessibility (SA) score greater than the threshold are not considered. Used to filter out molecules unlikely to be synthesizable. |
| Output isomeric SMILES | Option controlling whether stereochemistry information from the starting molecule is included in the output SMILES. |
| Stereoisomers | Option to consider stereoisomers as different molecules. |
| Canonicalize tautomers | Option to use RDKit's tautomer canonicalization functionality. |
| 3D embedding | Try to generate a 3D embedding of the molecule; if this fails, the molecule is removed. |
| Cache | Option to cache molecule building for a given SMILES input to speed up subsequent evaluations. |
| GDB filter | Option to apply filters from the GDB-17 paper to get more realistic, drug-like molecules, e.g., no allenes (C=C=C). See Tables 1-3 in https://doi.org/10.1021/ci300415d. |
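As a rough sketch of how several of these options could be implemented with plain RDKit (the sa_threshold default of 3.5 and the function name keep_molecule are illustrative assumptions; rlmolecule's own implementation may differ):

```python
import os
import sys

from rdkit import Chem
from rdkit.Chem import AllChem, RDConfig
from rdkit.Chem.MolStandardize import rdMolStandardize

# RDKit ships the SA scorer in its contrib directory.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer


def keep_molecule(smiles: str, sa_threshold: float = 3.5) -> bool:
    """Apply tautomer, SA-threshold, and 3D-embedding checks to one SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    # Canonicalize tautomers with RDKit's built-in functionality.
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    # SA threshold: discard molecules unlikely to be synthesizable.
    if sascorer.calculateScore(mol) > sa_threshold:
        return False
    # 3D embedding: remove molecules that cannot be embedded.
    if AllChem.EmbedMolecule(Chem.AddHs(mol), randomSeed=0) < 0:
        return False
    return True

# Isomeric SMILES on/off: isomericSmiles=False drops stereochemistry, e.g.,
#   Chem.MolToSmiles(mol, isomericSmiles=False)
```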

Training parameters

| Hyperparameter | Good default value | Range of good values | Description |
|---|---|---|---|
| gamma | 1 | 0.8 - 1.0 | Float specifying the discount factor for future rewards in the Markov decision process. This can be thought of as how far into the future the agent should care about possible rewards. When the agent must act in the present to prepare for rewards in the distant future, this value should be large. |
| lr | 0.001 | 0.0001, 0.001, 0.01 | Learning rate, corresponding to the strength of each gradient descent update step. This should typically be decreased if training is unstable and the reward does not consistently increase. |
| entropy_coeff | 0.001 | 0, 0.001, 0.005, 0.01 | Coefficient of the entropy regularizer. A policy has maximum entropy when all actions are equally likely and minimum entropy when a single action probability dominates. The entropy coefficient is multiplied by the maximum possible entropy and added to the loss, which helps prevent premature convergence, where one action dominates the policy and exploration stops. |
| clip_param | 0.2 | 0.1, 0.2, 0.3 | Hyperparameter for clipping in the policy objective. Roughly, how far the new policy can move from the old policy while still improving the objective function. |
| kl_coeff | 0 | 0.0 - 1.0 | Initial coefficient for the KL divergence penalty between the old and new policies in the objective function. Larger values mean a larger penalty (smaller updates). |
| sgd_minibatch_size | 10 | 10 - 256 | Total SGD (stochastic gradient descent) minibatch size across all devices; this defines the minibatch size within each epoch. A larger batch size typically gives more stable training updates. |
| num_sgd_iter | 5 | 3 - 30 | Number of SGD iterations in each outer loop (i.e., number of epochs to execute per train batch). |
| train_batch_size | 1000 | 100 - 5000 | Training batch size. Each train iteration samples the environment for <train_batch_size> steps. |
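These names map directly onto Ray RLlib's PPO configuration. A minimal sketch, assuming the RLlib 2.x config API; the environment name "rlmolecule_env" is a placeholder for whatever the notebook actually registers with Ray:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: "rlmolecule_env" is a hypothetical environment name.
config = (
    PPOConfig()
    .environment("rlmolecule_env")
    .training(
        gamma=1.0,                # discount factor
        lr=0.001,                 # learning rate
        entropy_coeff=0.001,      # entropy regularization
        clip_param=0.2,           # PPO clipping
        kl_coeff=0.0,             # KL-divergence penalty
        sgd_minibatch_size=10,
        num_sgd_iter=5,
        train_batch_size=1000,
    )
)
```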

GNN policy model

| Hyperparameter | Good default value | Range of good values | Description |
|---|---|---|---|
| features | 64 | 32, 64, 128, 256 | Width of the message-passing layers. |
| num_messages | 3 | 1 - 12 | Number of message-passing steps in the graph neural network. |
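A minimal PyTorch sketch of how these two knobs (layer width and number of messages) typically enter a message-passing policy network; the atom_feature_size and num_actions values are illustrative assumptions, and rlmolecule's actual model architecture may differ:

```python
import torch
import torch.nn as nn


class MessagePassingLayer(nn.Module):
    """One message-passing step: each atom aggregates transformed
    features from its bonded neighbors, then updates its state."""

    def __init__(self, features: int):
        super().__init__()
        self.message = nn.Linear(features, features)
        self.update = nn.GRUCell(features, features)

    def forward(self, node_feats: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # adjacency: (N, N) float 0/1 bond matrix; node_feats: (N, features)
        messages = adjacency @ self.message(node_feats)  # sum over neighbors
        return self.update(messages, node_feats)


class GNNPolicy(nn.Module):
    def __init__(self, features: int = 64, num_messages: int = 3,
                 atom_feature_size: int = 16, num_actions: int = 32):
        super().__init__()
        self.embed = nn.Linear(atom_feature_size, features)
        self.layers = nn.ModuleList(
            [MessagePassingLayer(features) for _ in range(num_messages)]
        )
        self.policy_head = nn.Linear(features, num_actions)  # action logits
        self.value_head = nn.Linear(features, 1)             # state value

    def forward(self, atom_feats, adjacency):
        h = self.embed(atom_feats)
        for layer in self.layers:
            h = layer(h, adjacency)
        pooled = h.mean(dim=0)  # mean-pool atoms into a molecule embedding
        return self.policy_head(pooled), self.value_head(pooled)
```

Increasing `features` widens each layer; increasing `num_messages` lets information propagate across more bonds per forward pass.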

RL run parameters

| Hyperparameter | Good default value | Range of good values | Description |
|---|---|---|---|
| iterations | 10 | 2 - 1000 | During each iteration, a number of "episodes" are run; each episode builds a molecule and calculates its reward. The number of episodes per iteration is set by the train_batch_size option. |
| #-of-rollout workers | 1 | N/A | Number of CPU threads available for rollout workers. The workflow shows the maximum number of threads available. If this is set to fewer than the maximum, the excess threads are distributed across additional grid-search parameters (if a grid search is specified). |
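A sketch of how these run parameters might drive training, continuing the PPOConfig sketch above (the `rollouts()` call and the `episode_reward_mean` result key assume the RLlib 2.x API):

```python
# Continuing the PPO sketch above: 10 iterations, one rollout worker.
config = config.rollouts(num_rollout_workers=1)  # #-of-rollout workers
algo = config.build()

for i in range(10):           # "iterations" option
    result = algo.train()     # samples train_batch_size steps, then updates
    print(i, result["episode_reward_mean"])
```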