


Introduction
For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table from Microsoft's Azure documentation:
Our experimental results didn't match what we expected from these definitions. So we ran controlled experiments to reverse-engineer the metrics.
What we found:
The loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS) token plus another special token, though this is not mentioned in the documentation.
Some calculations
To be concrete, suppose that you are performing SFT on the following conversation:
User: blah blah blah
Assistant: TOKEN1
where TOKEN1 is a single token. We claim that there are two additional tokens, TOKEN2 and TOKEN3, such that the loss is
\(\text{Loss} = -\frac{1}{3}\left(\log p(\text{TOKEN1}) + \log p(\text{TOKEN2}) + \log p(\text{TOKEN3})\right)\)
and the accuracy is
\(\text{ACC} = \frac{\text{NUMBER OF CORRECTLY [...]}\)
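The claimed metrics can be sketched numerically. The probabilities below are hypothetical, and the identities of TOKEN2 and TOKEN3 as extra special tokens are this article's inference, not documented behavior; accuracy counts a token as correct when it is the model's top prediction, and a probability above 0.5 is used here as a sufficient proxy for being the argmax.

```python
import math

# Hypothetical probabilities the model assigns under teacher forcing to
# the visible assistant token (TOKEN1) plus the two inferred special
# tokens (likely EOS + one more, per the claim above).
token_probs = {"TOKEN1": 0.90, "TOKEN2": 0.60, "TOKEN3": 0.30}

# Loss: mean negative log-likelihood over all three scored tokens.
loss = -sum(math.log(p) for p in token_probs.values()) / len(token_probs)

# Accuracy: fraction of scored tokens the model predicts correctly.
# p > 0.5 guarantees the token is the argmax over the vocabulary.
correct = sum(p > 0.5 for p in token_probs.values())
accuracy = correct / len(token_probs)

print(round(loss, 4), accuracy)  # loss ≈ 0.6067, accuracy = 2/3
```

Note that the averaging denominator is 3, not 1: even a single-token assistant response is scored over three positions, which is why reported loss and accuracy diverge from naive expectations.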
---
Outline:
(00:13) Introduction
(01:21) Some calculations
(02:14) Experiments
(03:40) Results
(05:28) Conclusions
(06:12) Acknowledgments
The original text contained 1 footnote which was omitted from this narration.
---
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
