Gerald Tesauro’s TD-Gammon (early 1990s, IBM) proved that reinforcement learning could reach world-class backgammon by learning from self-play alone. A small neural network used temporal-difference learning to bootstrap its way toward better play, training on roughly 1.5 million self-played games with a 3-layer architecture (198 inputs, ~80–160 hidden units, and 4 outputs predicting a White or Black win, with or without a gammon). It lost only narrowly to top human players and, in doing so, shifted human opening theory (notably the 2-1 roll) and helped spark the modern RL breakthroughs that culminated in Deep Q-Networks and AlphaGo/AlphaZero. The TD error signal also draws a provocative parallel to dopamine-based learning in the brain, hinting at learning principles that may be shared across biological and artificial systems.
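For listeners who want to see the core idea in code, here is a minimal TD(λ) sketch in the spirit of TD-Gammon: a tiny value network is updated from its own successive predictions, with the game result as the only external reward. The 198-dimensional board encoding, the single win-probability output, and the hyperparameter values are illustrative assumptions, not Tesauro's exact setup (his network had 4 outputs).

```python
import numpy as np

# Minimal TD(lambda) value-network sketch (illustrative, not TD-Gammon's exact code).
rng = np.random.default_rng(0)

N_IN, N_HIDDEN = 198, 80      # input features and hidden units, as described above
ALPHA, LAM = 0.1, 0.7         # assumed learning rate and trace-decay values

W1 = rng.normal(0.0, 0.1, (N_HIDDEN, N_IN))
W2 = rng.normal(0.0, 0.1, N_HIDDEN)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(x):
    """Predicted win probability for board encoding x, plus hidden activations."""
    h = sigmoid(W1 @ x)
    v = sigmoid(W2 @ h)
    return v, h

def gradients(x, h, v):
    """Gradient of the scalar output with respect to W1 and W2 (sigmoid units)."""
    dv = v * (1.0 - v)
    gW2 = dv * h
    gW1 = np.outer(dv * W2 * h * (1.0 - h), x)
    return gW1, gW2

def td_lambda_episode(states, outcome):
    """Update weights over one self-played game.

    states:  list of 198-dim feature vectors for successive positions
    outcome: 1.0 if the learning side won, else 0.0 (reward only at game end)
    """
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)
    v, h = value(states[0])
    for t in range(len(states)):
        gW1, gW2 = gradients(states[t], h, v)
        e1 = LAM * e1 + gW1                    # eligibility traces accumulate gradients
        e2 = LAM * e2 + gW2
        if t + 1 < len(states):
            v_next, h_next = value(states[t + 1])
        else:
            v_next, h_next = outcome, None     # terminal target is the game result
        delta = v_next - v                     # TD error (no discounting)
        W1 += ALPHA * delta * e1
        W2 += ALPHA * delta * e2
        v, h = v_next, h_next
```

In a self-play loop, moves would be chosen by evaluating value() on each legal successor position, and each finished game's positions and final result would be passed to td_lambda_episode.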
Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
Sponsored by Embersilk LLC