From 173b2bb6e0e3b2feb7587c98bb54f63b1d3867d5 Mon Sep 17 00:00:00 2001
From: Kevin Black <12429600+kvablack@users.noreply.github.com>
Date: Thu, 13 Jul 2023 12:37:22 -0700
Subject: [PATCH] Update README.md (add reward curves)

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index 3d0d1c6..c521e06 100644
--- a/README.md
+++ b/README.md
@@ -46,3 +46,14 @@ accelerate launch scripts/train.py --config config/dgx.py:aesthetic
 ```
 
 If you want to run the LLaVA prompt-image alignment experiments, you need to dedicate a few GPUs to running LLaVA inference using [this repo](https://github.com/kvablack/LLaVA-server/).
+
+## Reward Curves
+
+
+
+
+
+
+As you can see from the aesthetic experiment, if you train for long enough the algorithm eventually becomes unstable. This might be remedied by decaying the learning rate. Interestingly, however, the qualitative samples you get after the instability sets in are mostly fine -- the drop in the mean reward is caused by a few low-scoring outliers. This is clear from the full reward histogram, which you can see by opening an individual run in wandb.
+
+
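The learning-rate decay suggested in the added section could be implemented in several ways; below is a minimal sketch of one option, a half-cosine decay schedule. The `cosine_lr` helper and its parameters are hypothetical illustrations and are not part of this repo's config.

```python
import math

# Hypothetical helper (not part of this repo): cosine-decay schedule that
# lowers the learning rate late in training, when the aesthetic run starts
# to become unstable.
def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Decay from base_lr to min_lr over total_steps along a half-cosine."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# The schedule starts at base_lr and reaches min_lr at the final step,
# so late epochs take much smaller gradient steps than early ones.
print(cosine_lr(0, 100))    # base_lr at the start
print(cosine_lr(100, 100))  # min_lr at the end
```

In a training loop, the returned value would be written into the optimizer's learning rate once per epoch before the update step.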