
Sign up to save your podcasts
Or


Coauthored with Mark Kurzeja
A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It's hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness /factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages—particularly for targeted loss mitigation.
This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines and reflect on lessons learned.
I'd also like to note that @Ryan Greenblatt's skepticism [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrongCoauthored with Mark Kurzeja
A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.
...
BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It's hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness /factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages—particularly for targeted loss mitigation.
This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines and reflect on lessons learned.
I'd also like to note that @Ryan Greenblatt's skepticism [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.

112,664 Listeners

130 Listeners

7,216 Listeners

530 Listeners

16,132 Listeners

4 Listeners

14 Listeners

2 Listeners