In this episode of Big Ideas Only, host Mikkel Svold joins Andreas Møgelmose (Associate Professor of AI, Aalborg University; Visual Analysis & Perception Lab) for a theoretical deep dive into how computers “see.”
We unpack the neural-network ideas behind modern vision: why 2012 was a turning point, how convolutional networks work, and what separates training from fine-tuning and from adding context, plus explainability, bias traps, multimodality, and what still needs solving.
In this episode, you’ll learn about:
1. How a 2012 vision breakthrough reshaped speech and language research
2. Neural networks explained simply — how they learn patterns from data (a minimal code sketch follows this list)
3. CNNs: how computers spot shapes and textures in images
4. Training, fine-tuning, and adding context to make models smarter
5. From hand-crafted features to fully data-driven learning
6. Explainability: the “ruler in skin-cancer photos” bias trap and what it teaches us
7. Multimodal systems: models combining text, images, and tools
8. Depth sensing with stereo, lidar, radar, and time-of-flight — and when 3D is essential
10. Privacy and governance: why the real risk lies in implementation, not in the vision technology itself
10. Open challenges: fine-grained recognition, explainability, and machine unlearning
11. The pace of progress: steady research with headline-making leaps
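The episode itself stays conceptual, but the “weighted sum plus activation” idea behind items 2 and 4 is compact enough to sketch in code. The snippet below is not from the episode; every input, weight, and label is invented for illustration. It shows a single artificial neuron making a prediction and then taking one backpropagation-style weight update, the same mechanism that, scaled to millions of units, drives the training discussed in the episode.

```python
# A toy "neuron": a weighted sum of inputs pushed through an activation
# function, then nudged by a single gradient step. All numbers are made up.

import math

def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs (e.g., two pixel intensities) and starting weights.
inputs = [0.8, 0.2]
weights = [0.5, -0.3]
bias = 0.1
target = 1.0          # the label we want the neuron to predict
learning_rate = 0.5

# Forward pass: weighted sum plus bias, then activation.
z = sum(w * x for w, x in zip(weights, inputs)) + bias
prediction = sigmoid(z)

# Backpropagation on a squared-error loss: work out how much each weight
# contributed to the error, then nudge it in the opposite direction.
error = prediction - target
grad_z = error * prediction * (1.0 - prediction)  # chain rule through sigmoid
weights = [w - learning_rate * grad_z * x for w, x in zip(weights, inputs)]
bias -= learning_rate * grad_z

print(f"prediction before update: {prediction:.3f}")
print(f"updated weights: {[round(w, 3) for w in weights]}")
```

Training a real network repeats this update over millions of images, which is the pretraining-then-fine-tuning loop covered from 07:06.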
Episode Content
01:09 How computer vision differs from other AI fields
01:16 The 2012 breakthrough: neural networks in vision that spread to speech and text
04:05 Neural networks 101: neurons, weights, and simple math scaled up to complex decisions
07:06 Training at scale: millions of images, pretraining, and fine-tuning for specific tasks
10:39 Fine-tuning vs. adding context in large language models; backpropagation explained
16:52 Layered learning: from edges to shapes, faces, and full objects (a toy convolution sketch follows the timestamps)
18:22 Before deep learning: feature engineering and why it hit its limits
20:44 How it’s built: data collection, architecture design, training loops, and learning plateaus
22:54 Bias pitfalls: the “ruler in skin-cancer photos” example and why explainability matters
25:23 Regulation and trust: high-risk uses and the demand for transparency
26:13 Connecting vision to action: from black-box outputs to robots with “vision in the loop”
27:41 Ensemble systems: language models coordinating other models (e.g., text-to-image)
29:03 True multimodality: training models jointly on text and images
30:17 AGI reflections: embodiment, experience, and the limits of data
32:44 Human vision vs. computer vision: depth of field, aperture, and why machines see everything in focus
34:40 Is progress slowing or steady? Research milestones versus quiet, continuous work
36:43 Public perception: many systems exist, but most people still see “just ChatGPT”
37:41 Why the research pace feels natural — more people means faster progress
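The “edges first, then shapes” idea from the 16:52 segment boils down to convolution: sliding a small filter across an image and noting where it responds strongly. Below is a toy sketch, not from the episode; the 5x5 image (dark left half, bright right half) and the vertical-edge filter are invented for illustration.

```python
# A toy convolution: slide a 3x3 filter over a tiny image and record
# where it responds. Large output values mark a vertical edge.

image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# A classic vertical-edge filter: brightness on the right minus on the left.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, k):
    """Valid (no-padding) 2D convolution; returns a smaller response map."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = sum(
                img[i + di][j + dj] * k[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(acc)
        out.append(row)
    return out

# Strong responses trace the boundary between the dark and bright columns.
for row in convolve(image, kernel):
    print(row)
```

Running it prints a response map whose large values sit on the dark-to-bright boundary. A convolutional network stacks many such layers and, instead of hand-writing the filters as done here, learns them from data, which is exactly the shift from feature engineering described at 18:22.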
This podcast is produced by Montanus.