O'Reilly Data Show Podcast

How machine learning can be used to write more secure computer programs



In this episode of the Data Show, I spoke with Fabian Yamaguchi, chief scientist at ShiftLeft. His 2015 Ph.D. dissertation sketched out how the combination of static analysis, graph mining, and machine learning can be used to develop tools that augment security analysts. In a recent post, I argued for machine learning tools to augment the teams responsible for deploying and managing models in production (machine learning engineers). Both are part of a general trend of using machine learning to develop and manage the software systems of tomorrow. Yamaguchi’s work is step one in this direction: using machine learning to reduce the number of security vulnerabilities in complex software products.
Here are some highlights from our conversation:
Machine learning to find code vulnerabilities
I was not trying to build something that would just automatically take the code and give you all of the vulnerabilities. Instead, I was looking at the typical kind of tasks that I would encounter myself when doing these security audits, and I would ask myself, how can I automate these subtasks? As an example, when you find a vulnerability in code, the question that often arises is whether there are similar vulnerabilities still in that same program. That’s one of those subtasks you can automate well because what you’re actually doing is saying: ‘Hey, here’s an example of what a bug looks like. Can you scan the rest of the code? Can you use machine learning to actually determine other locations in the code that implement the same bug?’
… In machine learning, you never have enough data. In this case, this is actually an unsupervised learning approach. You’re taking all of the functions that you can get and you extract the dominant programming patterns in there. … It’s a bit like what you would do to find similar text documents, but it’s used for code.
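The "similar text documents" analogy can be made concrete with a small sketch. This is not Yamaguchi's tooling, only a minimal illustration under stated assumptions: each function is treated as a bag of identifier tokens, weighted with TF-IDF, and compared by cosine similarity. The function names, the C-like bodies in FUNCTIONS, and the regex tokenizer are all hypothetical stand-ins for what a real parser would provide.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: function name -> C-like body (illustrative only).
FUNCTIONS = {
    "read_header": "int n = get_len(pkt); char *buf = malloc(n); memcpy(buf, pkt, n);",
    "read_body": "int len = get_len(pkt); char *out = malloc(len); memcpy(out, pkt, len);",
    "log_event": "printf(msg); fflush(stdout);",
    "close_conn": "shutdown(fd); close(fd);",
}

def tokenize(code):
    """Split source text into identifier-like tokens (a crude stand-in for a parser)."""
    return re.findall(r"[A-Za-z_]\w*", code)

def tfidf_vectors(functions):
    """Map each function name to a sparse TF-IDF vector over its tokens."""
    counts = {name: Counter(tokenize(body)) for name, body in functions.items()}
    n = len(counts)
    # Document frequency: in how many functions does each token appear?
    doc_freq = Counter(tok for c in counts.values() for tok in c)
    return {
        name: {tok: tf * math.log((1 + n) / (1 + doc_freq[tok])) for tok, tf in c.items()}
        for name, c in counts.items()
    }

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank_similar(functions, known_bug):
    """Rank every other function by similarity to the known vulnerable one."""
    vecs = tfidf_vectors(functions)
    query = vecs[known_bug]
    scores = [(cosine(query, v), name) for name, v in vecs.items() if name != known_bug]
    return [name for _, name in sorted(scores, reverse=True)]

print(rank_similar(FUNCTIONS, "read_header"))
```

No labels are involved, which is what makes the approach unsupervised: the only input is the code itself plus one example of a bug to search from.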
From source code to graph analytics
By transforming software code into a graph, you can actually extract different properties from that code by analyzing the graph.
… Let’s take a smaller function that might have one IF block. One of the graph structures that’s first generated is called an abstract syntax tree. That’s a tree that you’d get by just parsing the code. … For each IF and for each variable, for each statement, there’s going to be a node. For each operator, like if there’s an assignment, there’s also going to be a node, and they are all connected by edges. You soon run into a lot of nodes and edges. If you take something like, let’s say, the Linux kernel, you’ll have several hundreds of thousands of nodes.
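To see how quickly nodes and edges accumulate, here is a minimal sketch using Python's standard `ast` module. It is an assumption-laden stand-in: the conversation is about analyzing other codebases (such as the Linux kernel, which is written in C), and the `clamp` function below is a hypothetical example chosen only because it contains a single IF block.

```python
import ast

# Hypothetical small function with exactly one IF block.
SOURCE = """
def clamp(x, limit):
    if x > limit:
        x = limit
    return x
"""

def ast_edges(tree):
    """Return (parent, child) pairs of node-type names for every AST edge."""
    return [
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    ]

tree = ast.parse(SOURCE)
nodes = list(ast.walk(tree))   # every node visited during traversal
edges = ast_edges(tree)        # every parent -> child connection

print(len(nodes), "nodes,", len(edges), "edges")
for parent, child in edges:
    print(f"{parent} -> {child}")
```

Even this five-line function yields a node for the IF, the comparison, the assignment, and each variable reference, which gives a feel for why a large codebase produces graphs of the size described.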
… You can do a lot by essentially solving reachability problems in these graphs.
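A reachability question of this kind can be sketched as a breadth-first search over a directed graph. The graph below is hypothetical: its node names (`recv`, `parse_len`, `memcpy`, and so on) are invented for illustration, with an edge meaning "data flows from this step to that one". The query asks whether attacker-controlled input can reach a sensitive operation.

```python
from collections import deque

# Hypothetical data-flow graph: node -> nodes that receive its data.
FLOW = {
    "recv": ["parse_len"],
    "parse_len": ["check_bounds", "memcpy"],
    "check_bounds": ["memcpy"],
    "config_load": ["log"],
}

def reachable(graph, start):
    """Return the set of nodes reachable from start (including start) via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Can network input reach the memory copy? Can the config loader?
print("memcpy" in reachable(FLOW, "recv"))
print("memcpy" in reachable(FLOW, "config_load"))
```

A real analysis would ask richer questions, for example whether some path reaches the sink without passing through a bounds check, but the underlying operation is still graph traversal.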
Related resources:
“How machine learning will accelerate data management systems”
Artificial intelligence in the software engineering workflow: A 2017 AI Conference keynote by Peter Norvig
“Responsible deployment of machine learning”: Why we need to build machine learning tools to augment our machine learning engineers
“Architecting and building end-to-end streaming applications”
Data is only as valuable as the decisions it enables

