.. include:: ../color_roles.rst

Introduction to fragile
========================

.. note::
    The notebook version of this example is available in the examples section as ``01_getting_started.ipynb``.

This is a tutorial that explains how to create a :swarm:`Swarm` to sample Atari games from the OpenAI ``gym`` library. It covers how to instantiate a :class:`Swarm` and the most important parameters needed to control the sampling process.

Structure of a :swarm:`Swarm`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :swarm:`Swarm` is the class that implements the algorithm's evolution loop, and it controls all the other classes involved in solving a given problem:

.. figure:: ../images/fragile_architecture.png
   :alt: swarm architecture

   swarm architecture

For every problem we want to solve, we will need to define callables that return instances of the following classes:

- :env:`Environment`: Represents the problem we want to solve. Given states and actions, it returns the next state.
- :model:`Model`: Provides a strategy for sampling actions (the policy).
- :walkers:`Walkers`: Handles the computations of the evolution process of the algorithm.
- :tree:`StateTree`: Stores the history of states sampled by the :swarm:`Swarm`.
- :critic:`Critic`: Implements additional computations, such as a new reward or extra values for our policy.

Choosing to pass callables to the :swarm:`Swarm` instead of instances is a design decision that simplifies deployment at scale in a cluster, because it avoids writing tricky serialization code for all the classes.

Defining the :env:`Environment`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For playing Atari games we will use the interface provided by the ``plangym`` package. It is a wrapper of OpenAI ``gym`` that makes it easy to set and recover the state of the environments, as well as to step the environment with batches of states.

The following code will initialize a ``plangym.Environment`` for an OpenAI ``gym`` Atari game. The game names use the same convention as the OpenAI ``gym`` library.

In order to use a ``plangym.Environment`` in a :swarm:`Swarm` we will need to define the appropriate callable object to pass as a parameter.

``fragile`` incorporates a wrapper to use a ``plangym.AtariEnvironment`` that will take care of matching the ``fragile`` API and constructing the appropriate :env-st:`StatesEnv` class to store its data.

The environment callable does not take any parameters, and it must return an instance of :env:`fragile.BaseEnvironment`.

.. code:: ipython3

    from plangym import AtariEnvironment
    from fragile.core import DiscreteEnv


    def atari_environment():
        game_name = "MsPacman-ram-v0"
        plangym_env = AtariEnvironment(
            name=game_name,
            clone_seeds=True,
            autoreset=True,
        )
        return DiscreteEnv(env=plangym_env)

We will use the ``ParallelEnv`` located in the ``distributed`` module. It is a wrapper that allows any :env:`Environment` to be run in parallel using the ``multiprocessing`` module. It takes a callable object that returns an :env:`Environment`, and it spawns the target number of processes to run the ``make_transitions`` function of the :env:`Environment` in parallel.

.. code:: ipython3

    from fragile.distributed import ParallelEnv

    env_callable = lambda: ParallelEnv(atari_environment, n_workers=4)
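Because the :swarm:`Swarm` receives callables rather than instances, the construction can be wrapped in a small factory and reused across games. The helper below is a hypothetical sketch (``make_atari_callable`` is not part of ``fragile``), composed only from the classes already imported above:

.. code:: ipython3

    # Hypothetical helper: builds a zero-argument environment callable
    # for any Atari game name supported by plangym.
    def make_atari_callable(game_name):
        def _make_env():
            plangym_env = AtariEnvironment(
                name=game_name,
                clone_seeds=True,
                autoreset=True,
            )
            return DiscreteEnv(env=plangym_env)
        return _make_env

    # The factory composes cleanly with ParallelEnv.
    env_callable = lambda: ParallelEnv(
        make_atari_callable("MsPacman-ram-v0"), n_workers=4
    )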
Defining the :model:`Model`
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :model:`Model` defines the policy that will be used to sample the :env:`Environment`. In this tutorial we will be using a random sampling strategy over a discrete uniform distribution. This means that every time we sample an action, the :model:`Model` will return an integer in the range [0, n_actions) for each state.

By default each action will be applied for one time step. In case you want to apply each action for a different number of time steps, it is possible to use a :critic:`Critic`. The Critics that allow you to sample the number of time steps for each action can be found in ``fragile.core.dt_samplers``. In this example we will apply each sampled action a variable number of time steps using the :critic:`GaussianDt`, which draws the number of time steps for each action from a normal distribution.

The model callable passed to the :swarm:`Swarm` takes the :env:`Environment` as a parameter and returns an instance of :model:`Model`.

.. code:: ipython3

    from fragile.core import DiscreteUniform, GaussianDt

    dt = GaussianDt(min_dt=3, max_dt=1000, loc_dt=4, scale_dt=2)
    model_callable = lambda env: DiscreteUniform(env=env, critic=dt)

Storing the sampled data inside a :tree:`HistoryTree`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to keep track of the sampled data by using a :tree:`HistoryTree`. This data structure will construct a directed acyclic graph that contains the sampled states and their transitions.

With the ``prune`` parameter of the :class:`HistoryTree` we can choose to store only the branches of the :tree:`HistoryTree` that are being explored. If ``prune`` is ``True``, all the branches of the graph with no walkers will be removed after every iteration; if it is ``False``, all the visited states will be kept in memory. In order to save memory we will set it to ``True``.

With the ``names`` parameter we can control the data that is stored in the search tree. In this example we will save the state of the Atari emulator, the actions, and the number of times each action has been applied at every environment step.

.. code:: ipython3

    from fragile.core.tree import HistoryTree

    tree_callable = lambda: HistoryTree(names=["states", "actions", "dt"], prune=True)

Initializing a Swarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once we have defined the problem-specific callables for the :model:`Model` and the :env:`Environment`, we need to define the parameters used by the algorithm:

- ``n_walkers``: The population size of the algorithm. It defines the number of different states that will be explored simultaneously at every iteration, and it will be equal to the ``batch_size`` of the :states:`States` (the size of the first dimension of the data they store).
- ``max_epochs``: Maximum number of iterations that the :swarm:`Swarm` will execute. The algorithm stops either when all the walkers reach a death condition or when the maximum number of iterations is reached.
- ``reward_scale``: Relative importance given to the :env:`Environment` reward with respect to the diversity score of the walkers.
- ``distance_scale``: Relative importance given to the diversity measure of the walkers with respect to their reward.
- ``minimize``: If ``True``, the :swarm:`Swarm` will try to sample states with the lowest reward possible. If ``False``, the :swarm:`Swarm` will undergo a maximization process.

.. code:: ipython3

    n_walkers = 64  # A bigger number will increase the quality of the trajectories sampled.
    max_epochs = 500  # Increase to sample longer games.
    reward_scale = 2  # Rewards are more important than diversity.
    distance_scale = 1
    minimize = False  # We want to get the maximum score possible.
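To build intuition for how ``reward_scale`` and ``distance_scale`` interact: in the Fractal AI literature, the virtual reward that drives the cloning of walkers is, roughly, the product of each walker's normalized reward and its diversity distance, each raised to its corresponding scale. The formula below is a sketch of that idea; the exact normalization applied inside ``fragile`` may differ:

.. math::

   V_i = r_i^{\alpha} \cdot d_i^{\beta}

where :math:`r_i` is the (normalized) cumulative reward of walker :math:`i`, :math:`d_i` its distance to a randomly chosen companion, :math:`\alpha` is ``reward_scale`` and :math:`\beta` is ``distance_scale``. With ``reward_scale = 2`` and ``distance_scale = 1``, differences in reward are amplified relative to differences in diversity when deciding which walkers clone.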
With all the callables and parameters defined, we can instantiate the :swarm:`Swarm`:

.. code:: ipython3

    from fragile.core import Swarm

    swarm = Swarm(
        model=model_callable,
        env=env_callable,
        tree=tree_callable,
        n_walkers=n_walkers,
        max_epochs=max_epochs,
        reward_scale=reward_scale,
        distance_scale=distance_scale,
        minimize=minimize,
    )

By printing a :class:`Swarm` we can get an overview of the internal data it contains.

.. code:: ipython3

    print(swarm)

.. parsed-literal::

    Best reward found: -inf , efficiency 0.000, Critic: None
    Walkers iteration 0 Dead walkers: 0.00% Cloned: 0.00%

    Walkers States:
    id_walkers shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    compas_clone shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    processed_rewards shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    virtual_rewards shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    cum_rewards shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    distances shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    clone_probs shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    will_clone shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    alive_mask shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    end_condition shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    best_reward shape None Mean: nan, Std: nan, Max: nan Min: nan
    best_obs shape None Mean: nan, Std: nan, Max: nan Min: nan
    best_state shape None Mean: nan, Std: nan, Max: nan Min: nan
    distance_scale shape None Mean: nan, Std: nan, Max: nan Min: nan
    critic_score shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000

    Env States:
    states shape (64, 1021) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    rewards shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    ends shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    game_ends shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000

    Model States:
    actions shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    dt shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000

Visualizing the sampling
^^^^^^^^^^^^^^^^^^^^^^^^^

We will use the Atari visualizer to see how the sampling process evolves. For more information about how the visualizer works, please refer to the ``dataviz`` module tutorial.

.. code:: ipython3

    from fragile.dataviz.swarm_viz import AtariViz
    import holoviews

    holoviews.extension("bokeh")

    swarm_viz = AtariViz(swarm, stream_interval=10)
    swarm_viz.plot()
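When working outside a live notebook, the plot can also be written to a standalone HTML file instead of being rendered inline. This is a minimal sketch; it assumes ``plot()`` returns a regular ``holoviews`` object that ``holoviews.save`` can handle, and the filename is arbitrary:

.. code:: ipython3

    # Assumption: swarm_viz.plot() returns a holoviews object,
    # so it can be exported with holoviews.save.
    holoviews.save(swarm_viz.plot(), "pacman_sampling.html", backend="bokeh")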
.. image:: ../images/01_pacman_output.gif

Running the Swarm
^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to execute the algorithm we only need to call ``run``. It is possible to display the internal data of the :class:`Swarm` by using the ``report_interval`` parameter, which indicates the number of iterations that will pass before printing the :class:`Swarm`.

.. code:: ipython3

    _ = swarm_viz.run(report_interval=50)

.. parsed-literal::

    Best reward found: 6530.0000 , efficiency 0.808, Critic: None
    Walkers iteration 451 Dead walkers: 10.94% Cloned: 31.25%

    Walkers States:
    id_walkers shape (64,) Mean: 185485787032307136.000, Std: 4859332449700356096.000, Max: 8627791072916286464.000 Min: -9121479801778132992.000
    compas_clone shape (64,) Mean: 32.891, Std: 19.501, Max: 63.000 Min: 0.000
    processed_rewards shape (64,) Mean: 1.098, Std: 0.156, Max: 1.239 Min: 0.000
    virtual_rewards shape (64,) Mean: 1.308, Std: 0.811, Max: 2.855 Min: 0.000
    cum_rewards shape (64,) Mean: 6477.812, Std: 202.117, Max: 6530.000 Min: 4880.000
    distances shape (64,) Mean: 1.048, Std: 0.590, Max: 1.986 Min: 0.100
    clone_probs shape (64,) Mean: 0.654, Std: 2.171, Max: 10.511 Min: -0.934
    will_clone shape (64,) Mean: 0.312, Std: 0.464, Max: 1.000 Min: 0.000
    alive_mask shape (64,) Mean: 0.891, Std: 0.312, Max: 1.000 Min: 0.000
    end_condition shape (64,) Mean: 0.109, Std: 0.312, Max: 1.000 Min: 0.000
    best_reward shape None Mean: nan, Std: nan, Max: nan Min: nan
    best_obs shape (128,) Mean: 78.711, Std: 82.113, Max: 255.000 Min: 0.000
    best_state shape (1021,) Mean: 45.375, Std: 76.181, Max: 255.000 Min: 0.000
    distance_scale shape None Mean: nan, Std: nan, Max: nan Min: nan
    critic_score shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000

    Env States:
    states shape (64, 1021) Mean: 45.266, Std: 76.287, Max: 255.000 Min: 0.000
    rewards shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    ends shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000
    game_ends shape (64,) Mean: 0.000, Std: 0.000, Max: 0.000 Min: 0.000

    Model States:
    actions shape (64,) Mean: 3.750, Std: 2.562, Max: 8.000 Min: 0.000
    dt shape (64,) Mean: 3.688, Std: 1.059, Max: 7.000 Min: 3.000

Visualizing the sampled game
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will extract the branch of the :tree:`StateTree` that achieved the maximum reward and replay its states and actions in the ``plangym.Environment``. This way we can render the whole trajectory using the ``render`` method provided by the OpenAI ``gym`` API.

The ``iterate_branch`` method of the :class:`HistoryTree` takes the id of a state and returns the data stored along the path that starts at the root node of the search tree and finishes at the state with the provided id.

.. code:: ipython3

    import time

    from fragile.core.utils import get_plangym_env

    env = get_plangym_env(swarm)  # Get the plangym environment used by the Swarm
    for s, a, dt in swarm.tree.iterate_branch(swarm.best_id):
        env.step(state=s, action=a, n_repeat_action=dt)
        env.render()
        time.sleep(0.05)
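The same iterator can also be used to inspect the best trajectory without rendering it. As a minimal sketch that only relies on ``iterate_branch`` and ``swarm.best_id`` from the example above, we can count how many emulator frames the best game spans:

.. code:: ipython3

    # Each step applied its action for `dt` frames, so summing dt
    # gives the total number of emulator frames in the best game.
    n_nodes = 0
    total_frames = 0
    for s, a, dt in swarm.tree.iterate_branch(swarm.best_id):
        n_nodes += 1
        total_frames += dt
    print(f"Best branch: {n_nodes} states, {total_frames} frames")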