The Nonlinear Library

AF - More findings on maximal data dimension by Marius Hobbhahn



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More findings on maximal data dimension, published by Marius Hobbhahn on February 2, 2023 on The AI Alignment Forum.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
I’d like to thank Wes Gurnee, Aryan Bhatt, Eric Purdy and Stefan Heimersheim for discussions and Evan Hubinger, Neel Nanda, Adam Jermyn and Chris Olah for mentorship and feedback.
The post contains a lot of figures, so the suggested length is deceiving. Code can be found in this colab notebook.
This is the second in a series of N posts on trying to understand memorization in NNs.
Executive summary
I look at a variety of settings and experiments to better understand memorization in toy models. My primary motivation is to increase our general understanding of NNs but I also suspect that understanding memorization better might increase our ability to detect backdoors/trojans. This post specifically focuses on measuring memorization with the maximal data dimensionality metric.
In a comment on the “Superposition, Memorization, and Double Descent” paper, Chris Olah introduces maximal data dimensionality D, a metric that supposedly tells us to what degree a network memorized a datapoint rather than representing it with features shared between datapoints. I extend the research on this metric with the following findings (a rough sketch of one possible formalization of D follows the findings):
In the double descent setting, the metric behaves exactly as we would predict, i.e. with few inputs the network memorizes all datapoints and with many inputs it learns some features.
On MNIST, I can reproduce the shape of the D curve and also the findings that memorized datapoints have high D, datapoints that share many features sit in the middle, and datapoints that the network is confused about have low D. However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D spectrum. I would have expected them all to have low D, since the network didn’t learn them.
When we train the network to different levels of accuracy, we find that the distribution of errors is actually slightly left-heavy instead of right-heavy. I have not yet understood why this is the case, but I’d be interested in follow-up research to see whether it tells us something interesting.
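To make the error-distribution findings concrete, here is a minimal sketch of how one might check where misclassified training points land on the D spectrum. The arrays D and misclassified are hypothetical placeholders (random data) standing in for per-datapoint maximal data dimensionality values and a misclassification mask from a real training run, so the snippet runs on its own.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholders: per-datapoint D values and a misclassification mask.
rng = np.random.default_rng(0)
D = rng.random(10_000)
misclassified = rng.random(10_000) < 0.02

# Rank every datapoint by D, then histogram the ranks of the misclassified ones.
# A flat histogram means errors are spread evenly across the D spectrum;
# a left-heavy histogram means errors concentrate at low D.
order = np.argsort(D)
ranks = np.empty(len(D), dtype=int)
ranks[order] = np.arange(len(D))

plt.hist(ranks[misclassified], bins=30)
plt.xlabel("rank in D ordering (low D to high D)")
plt.ylabel("misclassified training points")
plt.show()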
Different classes are not evenly distributed across the spectrum, e.g. “8” is far more regular than “5” according to D. This is what we would expect.
Across different hidden sizes, the shape of the D curve stays nearly the same, but the Spearman rank correlation of the datapoints’ D values decreases the larger the difference in hidden size. This means the more similar the number of neurons, the more similar the order in which D sorts the datapoints.
Networks of the same size trained on the same data with different seeds show nearly identical D curves and have a high Spearman rank correlation. This is what we would expect.
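The rank-correlation comparisons in the last two findings can be computed with scipy’s spearmanr. The sketch below uses synthetic stand-ins for the per-datapoint D values of two models (e.g. two seeds or two hidden sizes), since the real values come from trained networks.

import numpy as np
from scipy.stats import spearmanr

# Stand-ins for the D values of two models evaluated on the same datapoints.
rng = np.random.default_rng(0)
D_model_a = rng.random(1_000)
D_model_b = D_model_a + 0.1 * rng.standard_normal(1_000)  # similar but noisy ordering

# A high rho means the two models sort the datapoints by D in nearly the same order.
rho, p_value = spearmanr(D_model_a, D_model_b)
print(f"Spearman rank correlation of D orderings: {rho:.3f}")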
Different dataset sizes produce different shapes of D, e.g. larger datasets have more shared features (they are flatter in the middle). This seems plausible.
Different levels of weight decay have nearly no effect on the shape of D. The minor effect they have is the opposite of what I would have expected.
The shape of D changes very little between initialization and the end of training. This was unexpected, and I have no good explanation for this phenomenon yet. When we measure D over different batches, we find the same effect.
Working with D can be a bit tricky (see the Appendix for practical tips). The more I play around with D, the more I’m convinced that it tells us something interesting. In particular, the questions about misclassifications and error rates, and the unexpectedly small change between initialization and the end of training, seem like they could tell us something about NNs that we don’t yet know.
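Since all of the findings above hinge on D, a minimal sketch may help make it concrete. I am not certain this matches the post’s exact definition, so treat the formula as an assumption: for datapoint i with hidden activation h_i, take D_i as the maximum over directions v of (v . h_i)^2 / sum_j (v . h_j)^2, i.e. how much of the total squared activation along the best direction for datapoint i comes from that datapoint alone; this maximum has the closed form h_i^T (H^T H)^{-1} h_i, where H stacks all hidden activations.

import numpy as np

def maximal_data_dimensionality(H, eps=1e-8):
    """Assumed formalization of D (not necessarily the post's exact definition).

    For each datapoint i with hidden activation h_i (row i of H),
    D_i = max_v (v . h_i)^2 / sum_j (v . h_j)^2, which has the closed form
    h_i^T (H^T H)^{-1} h_i. D_i is roughly 1 for a datapoint with its own
    dedicated direction (memorization) and smaller when directions are shared.
    """
    G = H.T @ H + eps * np.eye(H.shape[1])  # Gram matrix, regularized for invertibility
    G_inv = np.linalg.inv(G)
    return np.einsum("ij,jk,ik->i", H, G_inv, H)

# Example on random activations: the per-datapoint values sum to roughly hidden_dim.
H = np.random.default_rng(0).standard_normal((1_000, 64))
D = maximal_data_dimensionality(H)
print(D.shape, round(D.sum(), 1))  # (1000,) ~64.0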
Maximal data dimensionality
There are two models u...