Deep Learning With The Wolf

The Wolf Reads AI – Day 10: Playing Atari with Deep Reinforcement Learning


Paper: Playing Atari with Deep Reinforcement Learning

Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller

Published: December 19, 2013 (Nature version 2015)

Link: https://arxiv.org/abs/1312.5602

🧠 What’s This Paper About?

Imagine dropping an eight‑bit rookie into Breakout with no rule book, only a flickering screen. (In case you were deprived of this coming‑of‑age experience, Breakout is an old Atari game where you bounce a ball to break bricks.) Ten million frames of joystick wiggling later, it’s smashing virtual bricks like a seasoned gamer. That’s DeepMind’s Deep Q‑Network (DQN): the first deep learning agent to learn how to play directly from raw pixels using reinforcement learning. In the original paper it beat earlier methods on six of seven Atari games and a human expert on three; the 2015 Nature version extended the idea to 49 games.

How It Works (Sans Equations)

Picture a kid who has never played Breakout. At first they just wiggle the joystick randomly. Every time the ball smashes a brick, they get a little jolt of satisfaction; every time they miss, they feel a tiny pang of regret.

That rookie = the neural network.

The jolts and pangs = rewards and penalties the game sends back.

The Eyes: A Tiny Vision System

Instead of seeing the whole TV screen in vibrant color, the kid is handed a small, black‑and‑white snapshot (84 × 84 pixels) of the last 4 frames. It’s like squinting at the TV through frosted glass—just enough to notice where the paddle, ball, and bricks are.
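Here is a minimal sketch of that squinting step, assuming a raw frame arrives as rows of (r, g, b) tuples. The helper names, the luminance weights, and the nearest‑neighbor resize are illustrative assumptions, not the paper’s exact pipeline (the paper downsamples and then crops).

```python
from collections import deque

SIZE = 84   # side length of the shrunken frame, from the article
STACK = 4   # number of recent frames the agent sees at once


def to_gray(frame):
    """Collapse each (r, g, b) pixel to one brightness value (assumed luminance weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def resize_nearest(gray, size=SIZE):
    """Shrink a 2-D image to size x size by nearest-neighbor sampling (an assumption)."""
    height, width = len(gray), len(gray[0])
    return [
        [gray[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


class FrameStack:
    """Keeps only the most recent STACK preprocessed frames."""

    def __init__(self, stack=STACK):
        self.frames = deque(maxlen=stack)

    def push(self, raw_frame):
        small = resize_nearest(to_gray(raw_frame))
        # At the start of an episode, repeat the first frame to fill the stack.
        while len(self.frames) < self.frames.maxlen - 1:
            self.frames.append(small)
        self.frames.append(small)  # the deque silently drops the oldest frame
        return list(self.frames)   # 4 frames, each 84 x 84
```

Stacking four frames is what lets the network infer motion: a single snapshot cannot show which way the ball is heading.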

These snapshots feed into a Convolutional Neural Network (CNN)—basically a pattern detector that learns to recognize “ball coming left,” “paddle under ball,” and other useful visual cues.
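To see how quickly the pattern detector shrinks the picture, here is a small arithmetic sketch. The filter sizes and strides are the ones reported for the 2013 network (8 × 8 with stride 4, then 4 × 4 with stride 2, 32 filters in the second layer); the helper name `conv_out` is made up for illustration.

```python
def conv_out(size, kernel, stride):
    """Side length of the output of an unpadded convolution."""
    return (size - kernel) // stride + 1


side = 84                                                  # input snapshot is 84 x 84
after_first = conv_out(side, kernel=8, stride=4)           # first layer of filters
after_second = conv_out(after_first, kernel=4, stride=2)   # second layer of filters
features = after_second * after_second * 32                # 32 filters in layer two

print(after_first, after_second, features)  # 20 9 2592
```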

The Brain: A Table of “If‑This‑Then‑That” Hunches

After the CNN spots patterns, it hands the information to a simple calculator that keeps one hunch for each joystick move (left, right, fire, etc.). The number of moves depends on the game, anywhere from 4 to 18. Each hunch is a guess at “How many points will I rack up if I do this now?”

Those guesses are called Q‑values.

Highest guess → network presses that joystick direction.
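In code, “press the move with the highest guess” is just an argmax. A tiny sketch, with made‑up Q‑values and hypothetical action labels:

```python
ACTIONS = ["noop", "fire", "right", "left"]  # hypothetical labels for illustration


def greedy_action(q_values):
    """Index of the largest Q-value; ties go to the earliest index."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.2, 1.5, 3.1, -0.4]   # made-up hunches, one per action
best = greedy_action(q_values)
print(ACTIONS[best])  # right
```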

Practice, Practice, Practice (With a Clever Notebook)

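Every single move the rookie makes is written into a giant notebook: what the screen looked like, which move was made, what reward came back, and what the screen looked like next.

This notebook is the experience‑replay buffer, holding up to a million of the most recent memories of success and failure.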

Instead of learning only from the latest move (which can be noisy), the rookie shuffles the notebook and rereads random memories in mini study sessions. That random shuffle (replay) prevents them from fixating on one lucky streak or one bad moment.
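A minimal sketch of such a notebook, assuming each memory is stored as a (state, action, reward, next state, done) record. The class name, field names, and default sizes are illustrative assumptions, not code from the paper.

```python
import random
from collections import deque, namedtuple

# One page of the notebook.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size notebook: once full, the oldest memory is dropped."""

    def __init__(self, capacity=1_000_000):   # assumed default capacity
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=32):           # assumed default batch size
        # Uniform random draw without replacement: the "shuffle and reread" step.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Drawing at random breaks up the strong correlation between consecutive frames, which is what makes learning from the latest moves alone so noisy.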

A Calm, Older Sibling Gives Stable Advice

If the kid updates their hunches after every single play, they’ll swing wildly—one lucky bounce could make “go left” seem golden even when it’s not.

So DQN keeps a second, frozen copy of its own brain: the target network (a refinement added in the 2015 Nature version). During study sessions the rookie asks, “Big sib, based on your cooler head, how good was that choice?” Every so often (about every 10,000 moves), the frozen copy is refreshed to match the rookie’s current brain. Because the advice changes only periodically, it stops runaway optimism or pessimism.
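A rough sketch of how the big sibling’s advice enters the math, assuming the standard Q‑learning target: reward now plus a discounted best guess about the next state. The discount value, the `Learner` class, and the idea of storing weights in a plain dict are illustrative assumptions.

```python
import copy

GAMMA = 0.99          # discount factor (an assumed value)
SYNC_EVERY = 10_000   # refresh interval quoted in the article


def td_target(reward, next_q_from_target, done, gamma=GAMMA):
    """Study target: reward now plus the frozen copy's best guess for what comes next."""
    if done:
        return reward  # game over, nothing more to collect
    return reward + gamma * max(next_q_from_target)


class Learner:
    """Holds a live brain and a frozen copy that is refreshed only periodically."""

    def __init__(self, weights):
        self.online = weights                  # changes after every study session
        self.target = copy.deepcopy(weights)   # frozen copy (the calm sibling)
        self.steps = 0

    def after_step(self):
        self.steps += 1
        if self.steps % SYNC_EVERY == 0:
            self.target = copy.deepcopy(self.online)  # sibling catches up
```

The key design choice is that `next_q_from_target` comes from the frozen copy, so the goalposts stay still between refreshes.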

The Exploration Trick

At first the rookie still needs to mash buttons randomly (explore) to discover tricks. So the system flips a weighted coin every move:

* Heads (rare as it gets smarter): try a random move.

* Tails: pick the move with the highest hunch.

As training goes on, the coin is rigged to come up tails more often—meaning more exploitation of hard‑won skills and less random flailing.
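The rigged coin can be sketched as an epsilon‑greedy rule with a linearly shrinking epsilon. The start value, end value, and schedule length below are assumptions chosen for illustration, and the function names are made up.

```python
import random

EPS_START, EPS_END = 1.0, 0.1   # assumed: fully random at first, 10% random later
ANNEAL_STEPS = 1_000_000        # assumed length of the linear schedule


def epsilon_at(step):
    """Probability of 'heads' (a random move) at a given training step."""
    fraction = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)


def choose_action(q_values, step, rng=random):
    """Flip the weighted coin: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))                      # heads: random move
    return max(range(len(q_values)), key=lambda i: q_values[i])  # tails: best hunch
```

Passing `rng` in as a parameter is only a convenience here, so the coin can be seeded or replaced when testing.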

Graduation Day

After about 200 million game frames (roughly 38 days of nonstop play on fast‑forward), the rookie internalizes “When the ball is here and paddle is there, nudge right,” with near‑reflex speed. It now outperforms most human players—without ever being told the rules.
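The “38 days” figure is easy to sanity‑check, assuming the console draws 60 frames per second (the frame rate is the assumption here):

```python
FRAMES = 200_000_000   # game frames, from the article
FPS = 60               # assumed frames per second at normal speed

seconds = FRAMES / FPS
days = seconds / (60 * 60 * 24)
print(round(days, 1))  # 38.6
```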

Real-World Ripples

Warehouse bots optimise picking paths; energy grids schedule power; YouTube recommends cat videos—each a cousin of DQN.

TL;DR

* See pixels.

* Guess which joystick move leads to more points.

* Write every experience in a big notebook.

* Study shuffled memories, using advice from a calm frozen twin.

* Repeat until the guesses become so good the rookie looks like a pinball wizard.

That’s DQN—a clever blend of vision (CNN), trial‑and‑error learning (Q‑values + rewards), and good study habits (experience replay + a stable tutor).

🎧 Podcast Note

Today’s podcast episode was produced with the Audio Overview tool in Google NotebookLM. The sources used to create the “notebook” included all of the sources listed below, plus this article. The hosts you hear are AI-generated.

📚 Appendix A: Sources

[1] https://arxiv.org/abs/1312.5602

[2] https://research.google/blog/from-pixels-to-actions-human-level-control-through-deep-reinforcement-learning/

[3] https://paperswithcode.com/method/dqn

[4] https://fanpu.io/assets/research/atari_deeprl.pdf

[5] https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

[6] https://desirivanova.com/uploads/202103_deepprob_DQN.pdf

[7] https://www.pbs.org/newshour/science/artificial-intelligence-program-teaches-play-atari-games-can-beat-high-score

[8] http://artent.net/2014/12/10/a-review-of-playing-atari-with-deep-reinforcement-learning/

[9] http://llcao.net/cu-deeplearning15/presentation/DeepMindNature-preso-w-David-Silver-RL.pdf

[10] https://github.com/adhiiisetiawan/atari-dqn

[11] https://www.alexirpan.com/2018/02/14/rl-hard.html

[12] https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf

#playingatariwithdeepreinforcementlearning #deepmind #30daysofAIpapers #deeplearning #deeplearningwiththewolf #AIfundamentals #AIbasics



This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit dianawolftorres.substack.com

Deep Learning With The Wolf, by Diana Wolf Torres