My Hovercraft Is Full of Eels and My AAC Device Full of N-grams


A hovercraft with or without eels

Back in my college days – that’s 1977 to 1982 for those who like historical perspective – a friend of mine was taking East European studies with a view, I think, to improving his chances of joining the Socialist Workers Party. Although it wasn’t actually obligatory to speak any of the languages from Communist Europe, he clearly felt it might help. And come the day of the Glorious Revolution, when the Working Class of England would cast off their Capitalist shackles and take control of the means of production to become part of the global Socialist world, he’d be one of the intellectual elite who would help the under-educated proletariat rise to power. Sadly for him, the down-trodden workers decided to vote for Margaret Thatcher and usher in a new age of Capitalism where owning the means of production meant buying shares in British Telecom, British Aerospace, British Gas, and a host of companies that they already owned as taxpayers! This was Mrs. T’s version of Clause IV socialism [1].

And that is why I happened upon a Czechoslovakian phrasebook.

I have to admit that my brief flirtation with radical socialism was fueled at that time by the fact that the local Labour club served subsidized beer, and another of my friends who worked behind the campus bar would serve Russian vodka as doubles or triples while still charging for a single. Hardly a rock upon which to build a firmly held political perspective but unlike my socialist buddy, I wasn’t at college to change the world – I was there to get a degree in Psychology and Linguistics so I could become a professor with a job for life [2].

Like many foreign language phrasebooks, it contained plenty of “useful sentences” that one could simply trot out in the appropriate situation. Although I no longer have the book itself – and can’t for the life of me remember the title – I did keep the following short list of examples:

Can this be invisibly mended?
I have broken this denture.
How high is that mountain?
The clutch engages too quickly.
To whom does this concrete sports pavilion belong?

The last of these, if memory serves me correctly, had an example answer along the lines of “It belongs to the people of the glorious Czech Republic.”

There’s actually a name to describe these types of sentences: postilion sentences. The term was coined by the UK linguist David Crystal in a 1995 article [3] where he talked about sentences used in teaching English as a Second Language:

A postilion sentence is one which has little or no chance of ever being useful in real-life. It could be used, obviously, because it is grammatically well-formed; but the contexts in which it would be natural to use it are either so restricted or so adult that the chances of a child encountering it, or finding it necessary to use it, are remote. In short, it is uncommunicative. It conveys a structural meaning, and a lexical content, but that is all.

Crystal refers to a sentence from an early 20th century Hungarian-English phrasebook that went “The postilion has been struck by lightning.” It’s not perhaps a coincidence that the British Monty Python’s Flying Circus comedy group came up with a skit called “The Dirty Hungarian Phrasebook,” where Hungarian phrases were translated into obscene, or simply ridiculous, English phrases, one of which has taken on a life of its own: “My hovercraft is full of eels.” It has become such a popular example of a postilion sentence that the linguists at the Omniglot website have devoted a page to providing translations in over 130 languages, from Afrikaans (“My skeertuig is vol palings”) to Zulu (“Umkhumbi wami ugcwele ngenyoka zemanzini”). So should you ever find yourself needing to explain the fishy condition of your water-skimming vehicle while vacationing in Iceland (“Svifnökkvinn minn er fullur af álum”) remember to bookmark that page!

The royal postilion

In fairness to phrasebook creators, creating lists and lists of sentences can appear to be a reasonable goal. After all, should you find yourself in the middle of a crowded street in some foreign land with the need to scream “That organ grinder’s monkey has stolen my wallet,” having it written down at your fingertips would clearly be of benefit [4]. Similarly, if you’re out on a dark and stormy night in Transylvania and your postilion does indeed suffer a lightning-related injury, you’d also be covered (“A légpárnás hajóm tele van angolnákkal”).

The limitation – and it’s a pretty big one – is that it is impossible to predict all the sentences that a traveler could potentially need. The best you can do is create a selection of fairly generic sentences that can be used across situations, such as “I like that” or “That’s not what I wanted” or “Excuse me but I need some help.” Now, if you put your lexical and statistical hats on, ask yourself why “That’s not what I wanted” seems like a better choice than “My hovercraft is full of eels.” If you said it’s because the former seems to be a more probable sentence than the latter, then you’re definitely on the right track. When you consider strings of words, one way of analyzing them is in terms of frequency of use, and words like that, not, want, what, my, and is, are far more frequently used than hovercraft, eels, monkey, and postilion [5].

If you wanted to perform a simple test at a bar, a much underrated and underused experimental venue, write down the following cloze sentence [6] and ask as many folks as possible to fill in the blanks:

My <blank> is full of <blank>

In truth, I have no idea what you’ll get in response, although glass and beer may well score higher than most nouns, but the chances you’ll get hovercraft and eels are very low. What I can predict is that the missing words will be nouns because when you look at sub-strings of words, the inherent rules of how the English language behaves start to bias our choices. In computational and corpus linguistics, folks talk about such strings as n-grams, where n is the number of words in the string.

The n-gram [my <blank> is] is a trigram, and the words my and is limit the words that could fit into the blank. In fact, if we look at the bigram [my <blank>], even that excludes certain choices. This is because when we use a possessive adjective such as my, the probability is that the word to follow will be a noun. If it’s the trigram [my <blank> is], that probability actually goes up. For example, we can find examples of the bigram [my <blank>] as follows:

[My dog]. (my + NOUN)
[My old] dog. (my + ADJECTIVE)
[My very] old dog. (my + ADVERB)

As you see, the blank in the bigram [my <blank>] doesn’t have to be a noun if it’s part of a longer string. But if we contrast that with the trigram [my <blank> is] then we are much more limited:

[My dog is] hungry. (my + NOUN + is)
[My *old is] hungry.
[My *very is] hungry.
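
For anyone who likes to see the contrast counted, here is a minimal sketch in Python. The four tagged sentences are a toy data set I made up to mirror the examples above, and the function name blank_tags is likewise just an illustration; a real analysis would run over a tagged corpus.

```python
from collections import Counter

# Hypothetical, hand-tagged toy data: each sentence is a list of (word, tag) pairs.
TAGGED_SENTENCES = [
    [("my", "DET"), ("dog", "NOUN"), ("is", "VERB"), ("hungry", "ADJ")],
    [("my", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("is", "VERB"), ("hungry", "ADJ")],
    [("my", "DET"), ("very", "ADV"), ("old", "ADJ"), ("dog", "NOUN"),
     ("is", "VERB"), ("hungry", "ADJ")],
    [("my", "DET"), ("glass", "NOUN"), ("is", "VERB"), ("full", "ADJ")],
]


def blank_tags(sentences, left, right=None):
    """Count the tags of the word right after `left`.

    If `right` is given, only count cases where `right` is the word
    immediately after the blank, i.e. the pattern [left <blank> right].
    """
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - 1):
            if sent[i][0] != left:
                continue
            if right is not None:
                if i + 2 >= len(sent) or sent[i + 2][0] != right:
                    continue
            counts[sent[i + 1][1]] += 1
    return counts


bigram_fillers = blank_tags(TAGGED_SENTENCES, "my")         # [my <blank>]
trigram_fillers = blank_tags(TAGGED_SENTENCES, "my", "is")  # [my <blank> is]
print(bigram_fillers)
print(trigram_fillers)
```

In this toy set the bigram lets nouns, adjectives, and adverbs into the blank, while the trigram only ever lets in nouns.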

Those of us who work in the field of augmentative and alternative communication (AAC) are actually more familiar with the science of n-grams than we might have realized, because this is essentially how word prediction works. Outside of AAC, anyone who uses a mobile phone will have seen next-word prediction and not necessarily worked out that it’s based on algorithms that use n-grams to estimate the most likely next word.
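
As a rough illustration of the idea, and not a description of how any particular phone or AAC device actually does it, a bare-bones bigram predictor might look like the sketch below. The tiny corpus and the function names train_bigrams and predict_next are my own inventions for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real predictor would be trained on far more text.
CORPUS = "i like tea . i like coffee . i want tea . do you want tea . do you like it ."


def train_bigrams(text):
    """Map each word to a Counter of the words seen immediately after it."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model, word, k=3):
    """Return up to k candidate next words, most frequent first."""
    return [w for w, _ in model.get(word, Counter()).most_common(k)]


model = train_bigrams(CORPUS)
print(predict_next(model, "i"))
print(predict_next(model, "you"))
```

Real systems typically use longer n-grams, smoothing for unseen strings, and some learning from the individual user, but the core move of ranking candidates by how often they followed the preceding words is the same.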

Of course, probabilities are simply that: probabilities. Given a word or n-gram as a starting point, we can make good guesses as to what word or words may come next but we can never be 100% sure. A number of AAC vocabulary sets have a feature whereby if you bring up the n-gram [SUBJECT PRONOUN + TO BE] a selection of verbs appears that are all in the progressive form, i.e. VERB+ING. This is based on the thinking that whenever you say something like “I am…” or “he is…” or “we are…” any following verb is likely to be along the lines of eating, drinking, running, finishing etc. But that’s a probability only – I might want to say “I am finished” or “He is done” or even “I am really thinking about…” or “we are certainly not wanting…” where the verb is actually in the ED form or there are other words (typically adverbial) before the following verb. If I want to say “I am doing something” then having doing appear automatically after “I am” can save keystrokes; but if I want to say “I am done,” I have to delete the word doing then find done as a word on its own, which adds keystrokes and takes more time.

Designing AAC systems to take advantage of n-grams is not a bad idea. Back in the 1990s when I was working with the team that developed the Unity symbol-based language program for devices built by the Prentke Romich Company, we included a number of bigrams and trigrams based on the thinking that phrases such as “I like” and “do you want” or “she doesn’t feel” have frequencies that are comparable to individual words and actually much higher than the vast majority of nouns. At the time, we didn’t have the resources to check the figures but nowadays it’s pretty easy to do that with online corpora. A phrase such as “do you want” has a frequency score of 11126 in the Corpus of Contemporary American English (COCA), which is way above words like postilion (9), hovercraft (109), eels (464), and even lightning (6724). Another example is “I don’t like,” which comes in at 5282 but when you look for “I don’t like <blank>,” the frequencies drop dramatically:

I don’t like (5282)
I don’t like it (682)
I don’t like this (211)
I don’t like being (128)

What you see is that in general, as the length of the string increases, the frequency drops, to the point that “I don’t like eels” and “I don’t like hovercrafts” score a big fat zero. It’s only those bigrams and trigrams that seem to have frequencies that make them practical within an AAC vocabulary set.
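
To make the size of that drop concrete, here is a small Python sketch that expresses each longer string as a share of the trigram it extends. The counts are simply the COCA figures quoted above; the function name share_of_base is mine.

```python
# Frequencies as quoted in the text above (COCA counts at the time of writing).
COCA_COUNTS = {
    "I don't like": 5282,
    "I don't like it": 682,
    "I don't like this": 211,
    "I don't like being": 128,
}


def share_of_base(counts, base):
    """Express each count as a fraction of the count for the base phrase."""
    base_count = counts[base]
    return {phrase: count / base_count for phrase, count in counts.items()}


shares = share_of_base(COCA_COUNTS, "I don't like")
for phrase, share in shares.items():
    print(f"{phrase:<20} {share:6.1%}")
```

On these figures even the most common four-word continuation, “I don’t like it,” accounts for only about 13% of the occurrences of “I don’t like,” and the other two come in at roughly 4% and 2%.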

You can now probably work out why sentence-based AAC systems are not only impossible to design but unlikely to be of use. Sentences are in effect simply n-grams with a large n value. “My hovercraft is full of eels” and “My postilion has been struck by lightning” are a 6-gram and a 7-gram respectively, and because probability is cumulative (the sum of the probability of each word) you can imagine how stunningly low the frequencies can be for sentences. Word-based systems, supplemented with high-frequency bigrams and trigrams, provide access to vocabulary sets that are flexible and practical. Having the individual words eels, full, hovercraft, is, my, of as building blocks from which to construct novel sentences turns out to be much better than having thousands upon thousands of prefab sentences stored “just in case.”
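
For a feel of just how low, here is a minimal Python sketch. Strictly speaking the per-word probabilities are multiplied together rather than added, which is exactly why the figure collapses so quickly as the string gets longer. The probabilities below are invented purely for illustration, and treating each word as independent of its neighbors is a simplifying assumption that real n-gram models do not make.

```python
import math

# Hypothetical per-word probabilities, invented purely for illustration.
WORD_PROB = {
    "my": 0.005,
    "is": 0.01,
    "of": 0.02,
    "full": 0.0002,
    "hovercraft": 0.0000002,
    "eels": 0.000001,
}


def sentence_probability(sentence, word_prob):
    """Multiply per-word probabilities, treating the words as independent.

    The product is computed as a sum of logs to avoid numerical underflow
    on long sentences.
    """
    log_total = sum(math.log(word_prob[w]) for w in sentence.lower().split())
    return math.exp(log_total)


p_short = sentence_probability("my hovercraft", WORD_PROB)
p_full = sentence_probability("My hovercraft is full of eels", WORD_PROB)
print(p_short, p_full)
```

Every extra word multiplies the running total by a number smaller than one, so a six-word sentence ends up many orders of magnitude less likely than any of the words in it.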

[1] The phrase “Clause Four Socialism” came from the fourth clause in the UK Labour Party constitution of 1918, which read: “To secure for the workers by hand or by brain the full fruits of their industry and the most equitable distribution thereof that may be possible upon the basis of the common ownership of the means of production, distribution and exchange, and the best obtainable system of popular administration and control of each industry or service.” Although it sounds like it was written by a lawyer and has more embedded clauses than a convention of Santas, it formed the basis for the Socialist ticket of Britain in the 1970s, where the country came as close to being a satellite of the USSR as it had ever been.

[2] That turned out to be yet another dream unfulfilled with my life taking a very different path that kept me well out of the world of academia. But you’re not here to read about me so go back to the article and keep reading 😉

[3] Crystal, D. (1995). Postilion sentences. Child Language Teaching and Therapy, 11(1), 79-90.

[4] Technophiles will point out that the better way to do this is to shout “That organ grinder’s monkey has stolen my wallet” into their smart phone with translation facilities. That may be true but even machine translation can get a little iffy at times, and there’s a good chance that if the aforementioned simian is smart enough to target your wallet, it’s probably going to snatch your iPhone too. No-one reads books any more – not even monkeys – so your pocket phrasebook would be safe.

[5] I suppose now is a good time to add a little bit about postilions for those who are curious. On horse-drawn carriages, the postilion is a person who sits on the leading left-hand horse and who can guide the carriage if there isn’t an actual coachman on the carriage itself. The word derives from the French postillon meaning “the person who rides the post horse,” and the post horse was the one reserved for a mail carrier who would use it to take letters from one location to another. The earlier Middle French noun poste referred to “Any of a series of men stationed at suitable places along appointed post-roads, the duty of each being to ride with, or forward speedily to the next stage, the monarch’s (and later also other) letters and dispatches, and to provide fresh horses for express messengers riding through.” (OED).

[6] A cloze sentence is one where words are purposely left out so that readers can add appropriate choices. It’s a standard tool for research and education, especially when teaching literacy. The word is simply a shortened version of the word closure, hence the pronunciation of /kləʊz/ and not /kləʊs/. It’s not a “close” sentence but one that needs “closure!” It was first noted in 1953 so is relatively new.


5 responses to “My Hovercraft Is Full of Eels and My AAC Device Full of N-grams”

  1. hi nice post thanks

    regarding your example –
    I don’t like (5282)
    I don’t like it (682)
    I don’t like this (211)
    I don’t like being (128)

    is “i don’t like” a typical phrase?

    we can investigate this by looking at COCA for i, i don’t:
    1 i 4810712 22.197819
    2 I don’t 210940 17.68647317
    3 I don’t like 5,282 12.36686859
    (the last column is log to the base 2 of the frequency)

    compare the above with figures from google:
    1 i 25,270,000,000 34.55670661
    2 i don’t 1,210,000,000 30.1723599
    3 I don’t like 391,000,000 28.54259337

    line plots show that COCA line is much steeper than the google line, this could mean according to COCA “i don’t like” is not really a phrase unit compared to what google shows

    we assume that a more tightly bounded phrase would show less of a drop off as each word is added

    this is after Shei, C. (2008). Discovering the hidden treasure on the Internet: using Google to uncover the veil of phraseology. Computer Assisted Language Learning, 21(1), 67-85.

    i.e. maybe a bigger corpus would help?


    • Thanks for the analysis, Mura – and the reference, which I’ll be sure to grab a hold of before I leave work tonight! I suppose it’s a sign of the times that the COCA is now a “small sample” of 450 million words 😉 If I were doing a more detailed analysis of those phrases you’re right to suggest I try something larger. I’d be looking at the frequencies for all PRON + (NOT) + LIKE, so “I don’t like” would be one among a table that includes “I/you/it/he/she/we/they + (don’t) + like/likes.” If I’d chosen the BE/HAVE/DO verbs as an example that might have been better. My hope is that what folks take away is that when designing a vocabulary set for clients who require augmentative systems that it’s worth considering the frequency of “common” (and that’s always a discussion) bigrams and trigrams along with the frequency of single words. It’s not uncommon for folks to see “teddy,” “banana,” and “sneakers” as somehow more important than “I have” or “does he have,” typically because (a) the former are easier to represent as pictures and (b) intuitions about word frequencies can tend towards only considering individual words and not phrases.

      Russell (Dude 1)

      • hi Russell i was wondering if you know of any work where augmentative systems have been used for language learning?

        p.s. if you want a copy of that paper let me know

  2. An edifying and entertaining post as always, but I take exception to you referring to probability as “cumulative (the sum of the probability of each word)”. Probability is instead multiplicative (the product of the probability of each word). If I roll a standard six-sided die, there is a one in six chance of a single pip appearing on top when it comes to rest (1:6, 1/6, 0.1666…, or ~17%). The chance that if I roll two dice that each one shows a single pip after coming to rest is the product of each die’s individual probability (1:36, 1/36, 0.02777…, or ~3%).

  3. Mark, you’re right to take exception and point out the error! I’m afraid I did indeed use the word “cumulative” in a rather fast-and-loose manner, which is a shameful thing to do considering that we Dudes try to promote accurate use of words. The probabilities are multiplied, not summed – as is my guilt at making the mistake! Your correction is duly acknowledged and I thank you for keeping me on my metaphorical toes 😉
