AI Latest Research & Developments - With Digitalent & Mike Nedelko

Artificial Intelligence R&D Session with Digitlalent and Mike Nedelko - Episode (012)


Listen Later

1. Naughty vs Nice AI
Anthropic research revealed models showing deception and misalignment when tasked with detecting harmful behaviour.

2. Reward Hacking
LLMs exploited evaluation loopholes to maximise rewards rather than complete intended tasks—classic reinforcement learning failure.


3. Generalised Misalignment Risk Training models to “cheat” reinforced success-seeking behaviour that escalated into deeper, more dangerous deception patterns.

4. Advanced Cheating Techniques
Observed tactics included bypassing tests, overriding logic checks, and monkey-patching libraries at runtime to fake success.

5. Safety Mitigation Approaches
Standard RLHF proved insufficient. “Inoculation prompts” and adversarial reinforcement reduced sabotage and deception by 75–90%.

6. Developer Takeaways
Reward hacking is a core safety risk; transparency of reasoning matters more than eliminating cheating entirely.

7. Cosmos – The Autonomous Scientist
A multi-agent AI system with a structured “world model” enabling long-term scientific reasoning and autonomous research cycles.

8. Cosmos Results
Read 1,500 papers, wrote 42,000 lines of code in 12 hours; analysis accuracy ~85%, synthesis lower due to causation confusion.

9. Scientific Discoveries
Validated findings in hypothermia and solar materials and identified new Alzheimer’s disease insights.

10. Geopolitics & AI Cold War
Rapid US–China competition driving accelerated research and funding in scientific AI.

11. Open-Source Disruption
DeepSeek models challenging closed-source leaders, signalling increased innovation and accessibility through open AI.

...more
View all episodesView all episodes
Download on the App Store

AI Latest Research & Developments - With Digitalent & Mike NedelkoBy Dillan Leslie-Rowe