July 22, 2023

Is Morse really built around the most popular letters in English?

6 minutes

Foundations of Amateur Radio

Thanks to several high-profile races, we already know that sending Morse is faster than SMS. Recently I started digging into the underpinnings of Morse code to answer the question, "Can you send Morse faster than binary encoded ASCII?" Both ASCII, the American Standard Code for Information Interchange, and Morse are techniques to encode information for electronic transmission. Morse is built for humans, ASCII for computers.

To find out which is faster, I set out to investigate. I'm using the 2009 ITU, or International Telecommunication Union, standard for Morse code.

Morse is said to be optimised for sending messages in English. In Morse the letter "e", represented by "dit" is the quickest to send, the next is the letter "t", "dah", followed by "i", dit-dit, "a", dit-dah, "n", dah-dit, and "m", dah-dah.
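To make "quickest to send" concrete, here is a minimal sketch that measures each of those six letters in dit units. It assumes the usual timing rules: a dit is one unit, a dah is three, and the gap between the elements of a single character is one unit. The names MORSE and duration are just for illustration.

```python
# Timing assumption: dit = 1 unit, dah = 3 units,
# gap between elements inside one character = 1 unit.
MORSE = {
    "e": ".",
    "t": "-",
    "i": "..",
    "a": ".-",
    "n": "-.",
    "m": "--",
}


def duration(code):
    """Return the time to send one character, measured in dit units."""
    elements = sum(1 if symbol == "." else 3 for symbol in code)
    gaps = len(code) - 1  # one unit of silence between neighbouring elements
    return elements + gaps


for letter, code in MORSE.items():
    print(letter, code, duration(code))
```

Under those assumptions "e" costs 1 unit, "t" and "i" cost 3, "a" and "n" cost 5, and "m" costs 7.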

The underlying idea is that communication speed is increased by making the most common letter the fastest to send, the next most common the next fastest, and so on. Using a computer this is simple to test. I counted the letters of almost 400,000 words of my podcast and discovered that "e" is indeed the most common letter, the letter "t" is next, then "a", "o", and "i". Note that I said "letter". The most common character in my podcast is the "space", which in Morse takes seven dits to send.
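The counting itself can be sketched in a few lines. This is not the script I actually used, just one way to do it with Python's standard library. The function names and the sample sentence are made up for illustration.

```python
from collections import Counter


def count_characters(text):
    """Count every character, including spaces and punctuation."""
    return Counter(text.lower())


def top_letters(text, n=5):
    """Return the n most common letters a-z as a string, most common first."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return "".join(letter for letter, _ in counts.most_common(n))


# Hypothetical sample text; replace it with the contents of a real file.
sample = "the quick brown fox jumps over the lazy dog"
print(top_letters(sample))
print(count_characters(sample)[" "])
```

Counting letters and counting characters are kept separate on purpose, because the "space" would otherwise top every list.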

Also note that the Morse top-5 is "etian", while the letter "o" is 14th on the list when the codes are ordered from shortest to longest, dits before dahs. In my podcast it's the fourth most popular letter. Mind you, my name is "Onno", so you might think that is skewing the data.

Not so much.

I figured that the combined works of Shakespeare might give a different result, given that they represent an older and less technical use of language and don't feature my name. The top-5 in his words is "etoai", the letter "o" is the third most popular, and "space" still leads the charge, by nearly 3 times.

I also had access to a listing of 850 job advertisements, yes, still looking, and the character distribution top-5 there is "eotin", with the letter "o" as the second most popular letter.

Because I can, and I'm, well, me, I converted the ITU Morse code standard to text and counted the characters there too. The top-5 letters are "etion", but the full stop is a third more popular than the letter "e". Mind you, that might be because the people at the ITU still need to learn how to use a computer. Seriously, storing documents inside the "Program Files" directory under the ITU_Admin user, what were you thinking? I digress. The "space" is still on top, nearly six times as common as the letter "e".

As an aside, it's interesting to note that you cannot actually transmit the ITU Morse standard using standard Morse, since the document contains square brackets, a multiplication symbol, asterisks, a copyright symbol, percent signs, em-dashes, and both opening and closing quotation marks, none of which exist as valid symbols.

Back to Morse. The definition has other peculiarities. For example, the open parenthesis takes less time to send than the closing one, but you would think that they are equally common, given that they come in pairs. If you look at numbers, "5" takes the least time to send, "0" the longest. In my podcast text "0" is a third more common than "1", and "9" is the least common. In Shakespeare, "9" is the most common and "8" the least. In the job listings, "0" and "2" go head-to-head, and both are four times as common as the number "7", which is the least common.

All this to say that character distribution is clearly not consistent across different texts, and Morse is built around more than the popularity of letters of the alphabet. For example, the difference between the left and right parenthesis is a dah at the end. If you know one of the characters, you know the other. The numerical digits follow a logical progression too. Each digit has five elements: from "1" to "5" the dits fill in from the front until "5" is all dits, then from "6" to "0" the dahs fill in from the front until "0" is all dahs. In other words, the code appears to be designed with humans in mind.
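The regularity of the digits is easy to show in code. The sketch below generates the five-element code for each digit from a simple rule rather than looking it up, and spells out the two parentheses for comparison. The function name digit_code is made up for this example, and the parenthesis codes are written out as I understand them from the standard.

```python
def digit_code(digit):
    """Build the five-element Morse code for a single digit from 0 to 9."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    if 1 <= digit <= 5:
        # 1 to 5: that many leading dits, padded with dahs.
        return "." * digit + "-" * (5 - digit)
    # 6 to 9, then 0 treated as 10: leading dahs, padded with dits.
    position = 10 if digit == 0 else digit
    return "-" * (position - 5) + "." * (10 - position)


for digit in [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]:
    print(digit, digit_code(digit))

# The closing parenthesis is the opening one with a single dah added.
OPEN_PAREN = "-.--."
CLOSE_PAREN = OPEN_PAREN + "-"
print("(", OPEN_PAREN)
print(")", CLOSE_PAREN)
```

A rule this short is much easier for a person to memorise than ten unrelated patterns, whatever it costs in sending time.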

There are other idiosyncrasies. Most of the code builds in sequences, but there are gaps. If you visualise Morse as a tree, the letter "e" has two children, both starting with a dit, one followed by another dit, or dit-dit, the letter "i", and the other, followed by a dah, dit-dah, the letter "a". Similarly, the letter "t", a dah, has two children dah-dit, "n" and dah-dah, "m". This sequence can be built for many definitions, but not all. The letter "o", dah-dah-dah, has no direct children. There's no dah-dah-dah-dit or dah-dah-dah-dah sequence in Morse. The letter "u", dit-dit-dah has one child "f", dit-dit-dah-dit, but the combination dit-dit-dah-dah is not valid Morse.
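The gaps in the tree can be listed by brute force. This sketch builds every dit and dah combination up to four elements long and reports the ones that aren't assigned to any of the 26 letters. It only considers letters, on the assumption that the table below matches the standard. Digits and punctuation are all five elements or longer, so they don't fill these gaps.

```python
from itertools import product

# The 26 letters, written with "." for dit and "-" for dah.
LETTERS = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".",
    "f": "..-.", "g": "--.", "h": "....", "i": "..", "j": ".---",
    "k": "-.-", "l": ".-..", "m": "--", "n": "-.", "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.", "s": "...", "t": "-",
    "u": "..-", "v": "...-", "w": ".--", "x": "-..-", "y": "-.--",
    "z": "--..",
}


def unused_codes(max_length=4):
    """List every sequence up to max_length elements that no letter uses."""
    used = set(LETTERS.values())
    unused = []
    for length in range(1, max_length + 1):
        for combination in product(".-", repeat=length):
            code = "".join(combination)
            if code not in used:
                unused.append(code)
    return unused


print(unused_codes())
```

There are 30 possible sequences of one to four elements and 26 letters, which leaves four gaps: the two missing children of "o", one under "u", and one under "r".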

It's those missing combinations that led me to believe that Morse isn't as efficient as it could be, and that originally prompted me to investigate the underpinnings of this language.

I think it's fair to conclude at this point that Morse isn't strictly optimised for English, or if it is, only for a very small subset of the language. It has several eccentricities, not unlike the most popular computer keyboard layout, QWERTY. That layout is commonly said to have been designed not for humans or speed typing but rather the opposite: to slow a typist down and prevent keys from getting in each other's way, back when there was still a mechanical arm punching each letter onto a page.

In other words, Morse code has a history.

Now I'm off to start throwing some CPU cycles at the real question. Is Morse code faster than binary encoded ASCII?

I'm Onno VK6FLAB


By Onno (VK6FLAB)
