Update 04_ppo_with_sb3.ipynb

daphne-cornelisse · web-flow · commit ae2a0643e73a · 2024-02-17T12:13:34.000-05:00
diff --git a/examples/04_ppo_with_sb3.ipynb b/examples/04_ppo_with_sb3.ipynb
@@ -130,7 +130,7 @@
     "\n",
     "Now take a look at information you've logged over training; did we learn?\n",
     "\n",
-    "One important metric for assess the effectiveness of your policy is the average cumulative reward per episode. In our case, the **maximum** achievable return per episode is approximately between 9 and 10 (it varies per traffic scene and per agent). With the configurations above, your policy should approach this value in 150,000 steps. Here, steps (the `global_step`) represents the total number of **frames** our policy network has seen, you can think of it as the accumulated experience."
+    "One important metric for assess the effectiveness of your policy is the average cumulative reward per episode. In our case, the **maximum** achievable return per episode is 1 per agent. With the configurations above, your policy should approach this value in 150,000 steps. Here, steps (the `global_step`) represents the total number of **frames** our policy network has seen, you can think of it as the accumulated experience."
    ]
   }
  ],
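
Note on the metric the changed cell describes: the average cumulative reward per episode can be sketched as below. This is a minimal illustration only, not the notebook's or Stable-Baselines3's logging code. The function name `mean_episode_return` and the sample `rewards`/`dones` values are hypothetical, and the sketch assumes a single agent whose return per episode is at most 1, as the updated text states.

```python
def mean_episode_return(rewards, dones):
    """Average cumulative reward over completed episodes for one agent.

    Hypothetical helper: `rewards` holds per-step rewards and `dones`
    flags the last step of each episode.
    """
    returns, running = [], 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            # Episode finished: store its return and start a new one.
            returns.append(running)
            running = 0.0
    # Steps after the last `done` belong to an unfinished episode and are ignored.
    return sum(returns) / len(returns) if returns else float("nan")


# Made-up data: three episodes with returns 1, 0, and 1.
rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
dones = [False, False, True, False, True, False, True]
avg_return = mean_episode_return(rewards, dones)
```

With a maximum return of 1 per agent, a value of `avg_return` near 1 would mean the agent earns the full reward in almost every episode.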
