
Coauthored with Mark Kurzeja
A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It's hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.
This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines and reflected on lessons learned.
I'd also like to note that @Ryan Greenblatt's skepticism [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.