How to Write a Customized Environment in ReinforcementLearning.jl?
Last Update: 2021-01-30T21:08:01.778
Julia Version: v"1.5.3"
ReinforcementLearning.jl Version: v"0.8.0"
The first step in applying the algorithms in ReinforcementLearning.jl is to define the problem you want to solve in a form the package can recognize. Here we'll demonstrate how to write many different kinds of environments based on the interfaces defined in ReinforcementLearningBase.jl.
The most commonly used interface for describing reinforcement learning tasks is OpenAI Gym. Inspired by it, we extend those interfaces a little to make use of multiple dispatch in Julia and to cover multi-agent environments.
The Minimal Interfaces to Implement
Many interfaces in ReinforcementLearningBase.jl have a default implementation. So in most cases, you only need to implement the following functions to define a customized environment:
action_space(env::YourEnv)
state(env::YourEnv)
state_space(env::YourEnv)
reward(env::YourEnv)
is_terminated(env::YourEnv)
reset!(env::YourEnv)
(env::YourEnv)(action)
An Example: The LotteryEnv
Here we use an example introduced in Monte Carlo Tree Search: A Tutorial to demonstrate how to write a simple environment.
The game is defined like this: assume you have $10 in your pocket, and you are faced with the following three choices:
Buy a PowerRich lottery ticket (win $100M w.p. 0.01; nothing otherwise);
Buy a MegaHaul lottery ticket (win $1M w.p. 0.05; nothing otherwise);
Do not buy a lottery ticket.
This is a one-shot game: it terminates immediately after an action is taken and a reward is received. First we define a concrete subtype of AbstractEnv named LotteryEnv:

using ReinforcementLearning

Base.@kwdef mutable struct LotteryEnv <: AbstractEnv
    reward::Union{Nothing, Int} = nothing
end

LotteryEnv has only one field named reward, which is initialized with nothing by default. Now let's implement the necessary interfaces:

RLBase.action_space(env::LotteryEnv) = (:PowerRich, :MegaHaul, nothing)

Here RLBase is just an alias for ReinforcementLearningBase.

begin
    RLBase.reward(env::LotteryEnv) = env.reward
    RLBase.state(env::LotteryEnv) = !isnothing(env.reward)
    RLBase.state_space(env::LotteryEnv) = [false, true]
    RLBase.is_terminated(env::LotteryEnv) = !isnothing(env.reward)
    RLBase.reset!(env::LotteryEnv) = env.reward = nothing
end

Because the lottery game is just a simple one-shot game, it has only two states. If the reward is nothing, the game has not started yet and we say it is in state false; otherwise the game is terminated and the state is true. The result of state_space(env) therefore describes the possible states of this environment. To reset! the game, we simply assign nothing to the reward, which puts it back in the initial state.
The only thing left is to implement the game logic:

function (x::LotteryEnv)(action)
    if action == :PowerRich
        x.reward = rand() < 0.01 ? 100_000_000 : -10
    elseif action == :MegaHaul
        x.reward = rand() < 0.05 ? 1_000_000 : -10
    elseif isnothing(action)
        x.reward = 0
    else
        @error "unknown action of $action"
    end
end

Test Your Environment
A method named RLBase.test_runnable! is provided to roll out several simulations and check whether the environment we defined is functional.

env = LotteryEnv()

# LotteryEnv
## Traits
| Trait Type | Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle | ReinforcementLearningBase.SingleAgent() |
| DynamicStyle | ReinforcementLearningBase.Sequential() |
| InformationStyle | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle | ReinforcementLearningBase.Stochastic() |
| RewardStyle | ReinforcementLearningBase.StepReward() |
| UtilityStyle | ReinforcementLearningBase.GeneralSum() |
| ActionStyle | ReinforcementLearningBase.MinimalActionSet() |
| StateStyle | ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Any}() |
## Is Environment Terminated?
No
## State Space
`Bool[0, 1]`
## Action Space
`(:PowerRich, :MegaHaul, nothing)`
## Current State
```
false
```

RLBase.test_runnable!(env)

"random policy with LotteryEnv"
2000
false

It is a simple smell test that works like this:
for _ in 1:n_episode
    reset!(env)
    while !is_terminated(env)
        env |> action_space |> rand |> env
    end
end
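To watch this loop run end to end without loading the package, here is a minimal self-contained sketch. `ToyLottery`, `toy_action_space`, `toy_is_terminated`, `toy_reset!` and `smoke_test!` are hypothetical stand-ins written for this illustration; they mirror LotteryEnv and the RLBase functions but are not part of ReinforcementLearningBase.jl.

```julia
using Random

# Hypothetical stand-in for LotteryEnv; deliberately independent of RLBase.
mutable struct ToyLottery
    reward::Union{Nothing, Int}
end
ToyLottery() = ToyLottery(nothing)

toy_action_space(::ToyLottery) = (:PowerRich, :MegaHaul, nothing)
toy_is_terminated(env::ToyLottery) = !isnothing(env.reward)
toy_reset!(env::ToyLottery) = (env.reward = nothing; env)

# Same game logic as in the text; an unknown action raises an error here.
function (env::ToyLottery)(action; rng = Random.default_rng())
    if action == :PowerRich
        env.reward = rand(rng) < 0.01 ? 100_000_000 : -10
    elseif action == :MegaHaul
        env.reward = rand(rng) < 0.05 ? 1_000_000 : -10
    elseif isnothing(action)
        env.reward = 0
    else
        error("unknown action of $action")
    end
    return env
end

# Roll out n_episode random episodes and count how many steps were taken.
function smoke_test!(env; n_episode = 100)
    n_steps = 0
    for _ in 1:n_episode
        toy_reset!(env)
        while !toy_is_terminated(env)
            env |> toy_action_space |> rand |> env
            n_steps += 1
        end
    end
    return n_steps
end

n_steps = smoke_test!(ToyLottery())
```

Because the game is one-shot, every episode takes exactly one step, so the step count equals the number of episodes.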
One step further is to test that other components in ReinforcementLearning.jl also work. Similar to the test above, let's try the RandomPolicy first:

using Random

run(RandomPolicy(action_space(env)), env, StopAfterEpisode(1_000))

If no error shows up, our environment at least works with the RandomPolicy 🎉🎉🎉. Next, we can add a hook to collect the reward in each episode to see the performance of the RandomPolicy.

using Plots

begin
    hook = TotalRewardPerEpisode()
    run(RandomPolicy(action_space(env)), env, StopAfterEpisode(1_000), hook)
    plot(hook.rewards)
end

A random policy is usually not very meaningful. Here we'll use a tabular Monte Carlo method to estimate the state-action values. (You may choose an appropriate algorithm based on the problem you're dealing with.)

using Flux: InvDecay

p = QBasedPolicy(
    learner = MonteCarloLearner(;
        approximator = TabularQApproximator(;
            n_state = length(state_space(env)),
            n_action = length(action_space(env)),
            opt = InvDecay(1.0)
        )
    ),
    explorer = EpsilonGreedyExplorer(0.1)
)

QBasedPolicy
├─ learner => MonteCarloLearner
│ ├─ approximator => TabularApproximator
│ │ ├─ table => 3×2 Array{Float64,2}
│ │ └─ optimizer => InvDecay
│ │ ├─ gamma => 1.0
│ │ └─ state => IdDict
│ ├─ γ => 1.0
│ ├─ kind => ReinforcementLearningZoo.FirstVisit
│ └─ sampling => ReinforcementLearningZoo.NoSampling
└─ explorer => EpsilonGreedyExplorer
├─ ϵ_stable => 0.1
├─ ϵ_init => 1.0
├─ warmup_steps => 0
├─ decay_steps => 0
├─ step => 1
├─ rng => Random._GLOBAL_RNG
└─ is_training => true

p(env)

MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)
Closest candidates are:
Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30
Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Bool)@monte_carlo_learner.jl:45
- (::ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling})(::Main.workspace3.LotteryEnv)@monte_carlo_learner.jl:44
- (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv, ::ReinforcementLearningBase.MinimalActionSet, ::Tuple{Symbol,Symbol,Nothing})@q_based_policy.jl:27
- (::ReinforcementLearningCore.QBasedPolicy{ReinforcementLearningZoo.MonteCarloLearner{ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay},ReinforcementLearningZoo.FirstVisit,ReinforcementLearningZoo.NoSampling},ReinforcementLearningCore.EpsilonGreedyExplorer{:linear,false,Random._GLOBAL_RNG}})(::Main.workspace3.LotteryEnv)@q_based_policy.jl:21
- top-level scope@Local: 1[inlined]

Oops, we get an error here. So what does it mean?
Before answering this question, let's spend some time understanding the policy we defined above. A QBasedPolicy contains two parts: a learner and an explorer. The learner learns the state-action value function (aka the Q function) during interactions with the env. The explorer selects an action based on the Q values returned by the learner. Here the EpsilonGreedyExplorer(0.1) selects the action with the largest value with probability 0.9 and a random one with probability 0.1. Inside the MonteCarloLearner, a TabularQApproximator is used to estimate the Q values.
That's the problem! A TabularQApproximator only accepts states of type Int.
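To make both points concrete, here is a small stdlib-only sketch of a table lookup keyed by an integer state and of the ε-greedy rule described above. The names `Q`, `q_values` and `epsilon_greedy` are hypothetical and the table values are made up; this is not the package's TabularQApproximator or EpsilonGreedyExplorer, which may differ in details such as tie-breaking.

```julia
using Random

# Toy Q-table laid out like the 3×2 table shown above:
# one row per action, one column per state (values are made up).
Q = [1.0 0.5;
     3.0 0.0;
     2.0 0.0]

# An integer state selects a column directly. Restricting `s` to Int means
# that a Bool state such as `false` has no matching method.
q_values(Q::AbstractMatrix, s::Int) = Q[:, s]

# ε-greedy: with probability ϵ pick a uniformly random action index,
# otherwise pick the index of the largest value.
function epsilon_greedy(rng::AbstractRNG, values::AbstractVector, ϵ::Real)
    return rand(rng) < ϵ ? rand(rng, 1:length(values)) : argmax(values)
end

rng = MersenneTwister(123)
picks = [epsilon_greedy(rng, q_values(Q, 1), 0.1) for _ in 1:10_000]

# Share of picks that landed on the greedy action (row 2 in state 1).
greedy_share = count(==(2), picks) / length(picks)
```

Note that the random branch can also land on the greedy action, so with three actions the greedy share is expected to sit slightly above 0.9.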

p.learner.approximator(1, 1) # Q(s, a)

0.0

p.learner.approximator(1) # [Q(s, a) for a in action_space(env)]

0.0
0.0
0.0

p.learner.approximator(false)

MethodError: no method matching (::ReinforcementLearningCore.TabularApproximator{2,Array{Float64,2},Flux.Optimise.InvDecay})(::Bool)
Closest candidates are:
Any(!Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:30
Any(!Matched::Int64, !Matched::Int64) at /home/tj/.julia/packages/ReinforcementLearningCore/LcIgw/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:31
- top-level scope@Local: 1[inlined]

OK, now we know where the problem is. But how to fix it?
An initial idea is to rewrite the RLBase.state(env::LotteryEnv) function to force it to return an Int. That works, but in some cases we may be using environments written by others, and it's not easy to modify their code directly. Fortunately, some built-in wrappers are provided to help us transform the environment.

wrapped_env = ActionTransformedEnv(
    StateOverriddenEnv(
        env,
        s -> s ? 1 : 2
    ),
    action_space_mapping = _ -> Base.OneTo(3),
    action_mapping = i -> action_space(env)[i]
)

# LotteryEnv |> StateOverriddenEnv |> ActionTransformedEnv
## Traits
| Trait Type | Value |
|:----------------- | ------------------------------------------------:|
| NumAgentStyle | ReinforcementLearningBase.SingleAgent() |
| DynamicStyle | ReinforcementLearningBase.Sequential() |
| InformationStyle | ReinforcementLearningBase.ImperfectInformation() |
| ChanceStyle | ReinforcementLearningBase.Stochastic() |
| RewardStyle | ReinforcementLearningBase.StepReward() |
| UtilityStyle | ReinforcementLearningBase.GeneralSum() |
| ActionStyle | ReinforcementLearningBase.MinimalActionSet() |
| StateStyle | ReinforcementLearningBase.Observation{Any}() |
| DefaultStateStyle | ReinforcementLearningBase.Observation{Any}() |
## Is Environment Terminated?
Yes
## State Space
`Bool[0, 1]`
## Action Space
`Base.OneTo(3)`
## Current State
```
1
```

p(wrapped_env)

1

Nice job! Now we are ready to run the experiment:

begin
    h = TotalRewardPerEpisode()
    run(p, wrapped_env, StopAfterEpisode(1_000), h)
    plot(h.rewards)
end

If you are observant enough, you'll find that our policy is not updating at all!!!

p.learner.approximator.table

3×2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
 0.0  0.0

Well, actually the policy is running in the evaluation mode here. We'll explain it in another blog. For now, you only need to know that we can wrap the policy in an Agent to train the policy.

agent = Agent(; policy=p, trajectory=VectorSARTTrajectory())

Agent
├─ policy => QBasedPolicy
│ ├─ learner => MonteCarloLearner
│ │ ├─ approximator => TabularApproximator
│ │ │ ├─ table => 3×2 Array{Float64,2}
│ │ │ └─ optimizer => InvDecay
│ │ │ ├─ gamma => 1.0
│ │ │ └─ state => IdDict
│ │ ├─ γ => 1.0
│ │ ├─ kind => ReinforcementLearningZoo.FirstVisit
│ │ └─ sampling => ReinforcementLearningZoo.NoSampling
│ └─ explorer => EpsilonGreedyExplorer
│ ├─ ϵ_stable => 0.1
│ ├─ ϵ_init => 1.0
│ ├─ warmup_steps => 0
│ ├─ decay_steps => 0
│ ├─ step => 1002
│ ├─ rng => Random._GLOBAL_RNG
│ └─ is_training => true
└─ trajectory => Trajectory
└─ traces => NamedTuple
├─ state => 0-element Array{Int64,1}
├─ action => 0-element Array{Int64,1}
├─ reward => 0-element Array{Float32,1}
└─ terminal => 0-element Array{Bool,1}

new_hook = TotalRewardPerEpisode()

0.0

run(agent, wrapped_env, StopAfterStep(100_000), new_hook)

-10.0
-10.0
0.0
0.0
-10.0
-10.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
-10.0
0.0
0.0
0.0
0.0
1.0e6
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
-10.0
0.0

p.learner.approximator.table

3×2 Array{Float64,2}:
 0.0  1.00773e6
 0.0  47660.8
 0.0  0.0

Note
Always remember that each algorithm usually only works in some specific environments, just like the `QBasedPolicy` above. So choose the right tool wisely 😉.
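As a sanity check on the learned table, we can compute the exact expected reward of each action from the probabilities and the -10 outcome used in the game logic. This is a small worked calculation rather than anything the package provides; `expected_reward` and `ev` are hypothetical names.

```julia
# Exact expected reward of a ticket: win `prize` with probability `p_win`,
# otherwise receive -cost (the -10 used in the game logic).
expected_reward(p_win, prize; cost = 10) = p_win * prize - (1 - p_win) * cost

ev = (
    PowerRich = expected_reward(0.01, 100_000_000),  # 1_000_000 - 9.9
    MegaHaul  = expected_reward(0.05, 1_000_000),    # 50_000 - 9.5
    NoTicket  = 0.0,
)
```

The estimates 1.00773e6 and 47660.8 in the second column of the table are close to these exact values (about 999990.1 and 49990.5), and the third row stays at 0 as expected for not buying a ticket.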
More Complicated Environments
The above LotteryEnv is quite simple. Many environments we are interested in fall into the same category. Beyond that, there are still many other kinds of environments. You may take a glimpse at the table to see how many different types of environments are supported in ReinforcementLearningZoo.jl.
To distinguish different kinds of environments, some common traits are defined in ReinforcementLearningBase.jl. Now we'll explain them one-by-one.
StateStyle
In the above LotteryEnv, state(env::LotteryEnv) simply returns true or false. But in some other environments, the function name state may be rather vague. People with different backgrounds often talk about the same thing under different names. You may be interested in this discussion: What is the difference between an observation and a state in reinforcement learning? To avoid confusion when executing state(env), the environment designer can explicitly define state(env::YourEnv, ::AbstractStateStyle), so that users can fetch the information they need on demand. The following are some built-in state styles:

subtypes(RLBase.AbstractStateStyle)

GoalState
InformationSet
InternalState
Observation

Note that every state style may have different representations, such as String, Array, Graph and so on. All the above state styles can accept a data type as a parameter. For example:

RLBase.state(env::LotteryEnv, ::Observation{String}) = is_terminated(env) ? "Game Over" : "Game Start"

For environments which support many different kinds of states, developers should specify all the supported state styles. For example:

tp = TigerProblemEnv();

StateStyle(tp)

state(tp, Observation{Int64}())

1

state(tp, InternalState{Int64}())

2

state(tp)

1

DefaultStateStyle(tp)

DefaultStateStyle
The DefaultStateStyle trait returns the first element in the result of StateStyle by default.
Algorithm developers usually don't care about the state style. They can assume that the default state style is always well defined and simply call state(env) to get the right representation. So for environments with many different representations, state(env) is dispatched to state(env, DefaultStateStyle(env)). We can also use the DefaultStateStyleEnv wrapper to override the pre-defined DefaultStateStyle(::YourEnv).
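The forwarding described here is ordinary trait-based dispatch. Below is a minimal sketch of the pattern in plain Julia, following the argument order of the state(tp, Observation{Int64}()) calls shown earlier. Every name in it (`ToyStateStyle`, `ToyObservation`, `ToyInternalState`, `ToyTiger`, `toy_state` and friends) is a hypothetical stand-in, not RLBase API.

```julia
# Hypothetical stand-ins; none of these names come from RLBase.
abstract type ToyStateStyle end
struct ToyObservation <: ToyStateStyle end
struct ToyInternalState <: ToyStateStyle end

struct ToyTiger
    tiger_door::Int   # true position, hidden from the agent
    heard_door::Int   # what the agent perceives
end

# The environment lists every style it supports; the first one is the default.
toy_state_styles(::ToyTiger) = (ToyObservation(), ToyInternalState())
toy_default_style(env) = first(toy_state_styles(env))

# One method per supported style.
toy_state(env::ToyTiger, ::ToyObservation) = env.heard_door
toy_state(env::ToyTiger, ::ToyInternalState) = env.tiger_door

# Algorithms call the one-argument form, which forwards to the default style.
toy_state(env) = toy_state(env, toy_default_style(env))

tiger = ToyTiger(2, 1)
default_view = toy_state(tiger)                       # uses ToyObservation
internal_view = toy_state(tiger, ToyInternalState())  # explicit style
```

The design choice is that algorithm code only ever calls the one-argument form, while environment authors add one method per representation they want to expose.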
RewardStyle
For games like Chess, Go, or many card games, we only get the reward at the end of a game. We say this kind of game is of TerminalReward style; otherwise we define it as StepReward. Actually TerminalReward is a special case of StepReward (for non-terminal steps, the reward is 0). The reason we still want to distinguish the two cases is that some algorithms may have a more efficient implementation for TerminalReward style games.

RewardStyle(tp)

RewardStyle(MontyHallEnv())

ActionStyle
For some environments, the valid actions may differ from step to step. We say this kind of environment is of FullActionSet style. Otherwise, we say the environment is of MinimalActionSet style. A typical built-in environment with FullActionSet is the TicTacToeEnv. Two extra methods must be implemented:

ttt = TicTacToeEnv();

ActionStyle(ttt)

legal_action_space(ttt)

1
2
3
4
5
6
7
8
9

legal_action_space_mask(ttt)

true
true
true
true
true
true
true
true
true

NumAgentStyle
In the above LotteryEnv, only one player is involved in the environment. In many board games, usually multiple players are engaged.

NumAgentStyle(env)

NumAgentStyle(ttt)

For multi-agent environments, some new APIs are introduced. The meaning of some APIs we've already seen is also extended.
First, multi-agent environment developers must implement players to distinguish different players.

players(ttt)

current_player(ttt)

| Single Agent | Multi-Agent |
|---|---|
| state(env) | state(env, player) |
| reward(env) | reward(env, player) |
| env(action) | env(action, player) |
| action_space(env) | action_space(env, player) |
| state_space(env) | state_space(env, player) |
| is_terminated(env) | is_terminated(env, player) |
Note that the single-agent APIs are still valid; they simply fall back to the perspective of the current_player(env).
UtilityStyle
In multi-agent environments, sometimes the sum of rewards from all players is always 0. We call the UtilityStyle of these environments ZeroSum. ZeroSum is a special case of ConstantSum. In cooperative games, the reward of each player is the same; in this case, the style is called IdenticalUtility. Other cases fall back to GeneralSum.
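To make these definitions concrete, here is a small sketch that classifies a game from sampled reward vectors (one entry per player). It only illustrates the definitions above; `classify_utility` is a hypothetical helper, and in the package the UtilityStyle is declared by the environment author rather than inferred from data.

```julia
# Classify sampled per-step reward vectors (one entry per player).
# Illustrative check only; checks run from most to least specific.
function classify_utility(reward_samples)
    sums = [sum(r) for r in reward_samples]
    if all(r -> all(==(first(r)), r), reward_samples)
        return :IdenticalUtility   # every player always gets the same reward
    elseif all(iszero, sums)
        return :ZeroSum            # rewards always cancel out
    elseif all(==(first(sums)), sums)
        return :ConstantSum        # rewards always add up to the same total
    else
        return :GeneralSum
    end
end

zero_sum  = classify_utility([(1, -1), (0, 0), (-1, 1)])
identical = classify_utility([(2, 2), (0, 0)])
constant  = classify_utility([(3, 1), (2, 2)])
general   = classify_utility([(1, 0), (2, 2)])
```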
InformationStyle
If all players can see the same state, then we say the InformationStyle of the environment is PerfectInformation. It is a special case of ImperfectInformation environments.
DynamicStyle
All the environments we've seen so far were of Sequential style, meaning that at each step, only ONE player was allowed to take an action. Alternatively there are Simultaneous environments, where all the players take actions simultaneously without seeing each other's action in advance. Simultaneous environments must take a collection of actions from different players as input.
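As a rough illustration of what "a collection of actions" means, the sketch below scores one round of rock-paper-scissors from a tuple holding one action per player. `MOVES`, `BEATS`, `rps_rewards` and `joint_actions` are hypothetical names for this example; it is not the package's RockPaperScissorsEnv.

```julia
# Hypothetical stand-in, not the package's RockPaperScissorsEnv.
const MOVES = (:rock, :paper, :scissors)
const BEATS = Dict(:rock => :scissors, :paper => :rock, :scissors => :paper)

# Both players act at once, so one step consumes a tuple with one action
# per player and returns one reward per player.
function rps_rewards((a1, a2)::Tuple{Symbol, Symbol})
    a1 == a2 && return (0, 0)
    return BEATS[a1] == a2 ? (1, -1) : (-1, 1)
end

# The joint action space is the Cartesian product of the per-player choices.
joint_actions = Tuple((a, b) for a in MOVES for b in MOVES)

round_result = rps_rewards((:rock, :scissors))
```

With three moves per player the joint action space has nine entries, one for each pair of moves.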

rps = RockPaperScissorsEnv();

action_space(rps)

('💎', '💎')
('💎', '📃')
('💎', '✂')
('📃', '💎')
('📃', '📃')
('📃', '✂')
('✂', '💎')
('✂', '📃')
('✂', '✂')

rps(rand(action_space(rps)))

true

ChanceStyle
If there's no rng in the environment and everything is deterministic after each action is taken, then we say the ChanceStyle of the environment is Deterministic. Otherwise, we call it Stochastic. One special case is Extensive Form Games, in which a chance node is involved and the action probability of this special player is known. For these environments, we need to have the following methods defined:

kp = KuhnPokerEnv();

chance_player(kp)

prob(kp, chance_player(kp))

0.333333
0.333333
0.333333

chance_player(kp) in players(kp)

true

Examples
Finally we've gone through all the details you need to know for how to write a customized environment. You're encouraged to take a look at the examples provided in ReinforcementLearningEnvironments.jl. Feel free to create an issue there if you're still not sure how to describe your problem with the interfaces defined in this package.