Tag Archives: BNC

The State of the Union Address 2015: “We Are Family…”

Within seconds of a President turning off the autocue, political pundits stop trembling, wipe the drool from their lips, and spend the next two years talking incessantly about what was said. A single speech that clocks in at just under 6,500 words can single-handedly generate more web pages than the callipygian [1] Kim Kardashian can generate page clicks. Being a dude, you might think that this post is now about to become an excuse to share a picture of the ample Ms. Kardashian’s gluteus maximus in all its shiny glory – but you’d be wrong! What I’m actually more interested in doing is taking a more detailed look at the vocabulary that Barack Obama used, from the standpoint of corpus linguistics and with the help of concordance software. At this point, 90% of the guys who found this post by googling “Kim Kardashian’s ass” will leave. Sorry, dudes.

The data came from a transcript available from Time.com, which I then used as input for WordSmith 6.0 software, a corpus analysis tool. Of the many things this software will let you analyze, the ones we’ll look at here are word frequencies, keywords, and concordances.

Keywords are those words that are used in a sample significantly more or less often than they are typically used in the general population. In the case of WordSmith, the “general population” is a list known as the British National Corpus, a sample of some 100 million words used in British English (BrE).

The “teachable moment” here is to think about why I chose this sample. Now I know – because I have an ear for these things – that Barack Obama does not use British English; his accent is also a bit of a giveaway. However, for the purpose of this analysis, I don’t think the frequency differences between BrE and American English (AmE) are significant enough to warrant worrying about them. I could have used a different sample called the American National Corpus but that’s only good for 14 million words, which is much smaller than the BNC. Therefore, I chose to go for the larger corpus, knowing that there may be some variations between the two but not, in my opinion, enough to skew the analysis.

Fig 1: Top 25 words by frequency

If you take a look at the most frequently used words in the speech, you’ll see that they are pretty much what you might expect on the basis of typical distributions. The word the is the most frequent in the English language and seeing it atop the President’s list is uninteresting. What is interesting is that the pronouns we and our are right up there above I and you. Pronouns regularly score high on frequency lists, and that’s one of the reasons practitioners in the field of Augmentative and Alternative Communication (AAC) should make sure these words are targeted. But the fact that we and our appear so high up the list (at #4 and #8 respectively) made me wonder: is this what we might expect to see in general? And that, my friends, is why we turn to a keyness list.
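For anyone who wants to try this without WordSmith, a frequency list like the one in Fig 1 can be approximated in a few lines of Python. This is only a rough sketch and not a description of what WordSmith does internally: the tokenizing rule and the function name word_frequencies are my own assumptions, and the sample text is just one sentence from the speech rather than the full transcript.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=25):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens).most_common(top_n)

# One sentence from the speech, standing in for the whole transcript.
sample = ("We are a strong, tight-knit family who has made it through "
          "some very, very hard times.")

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

Run over the whole transcript instead of one sentence, a list like this is the raw material for both the frequency table and the keyness comparison that follows.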

Fig 2: Top 25 words by keyness

Take a look at that keyness column and notice how both we and our are way up there at #2 and #3. Ignoring for now the intricacies of how those keyness figures are calculated [2], what is significant is that the Pres is using those two pronouns significantly more than they are used in general, and that reflects a conscious effort to come across as one of “us” and not an “I” or “me” who is doing things. He’s appealing to a “Spirit of Unity.”

You can see more evidence for this appeal if you simply look at the keyness of #4 and #8 – America and Americans. He’s certainly using the words with more frequency than you’d find in a regular sample but we can perform one more kind of analysis in order to see just how he’s using them, and that’s to create a concordance.

A concordance is a list that shows instances of a word in context, along with the words that go before and after it. Below is a concordance for the word Americans as used alongside we:

Fig 3: Concordance showing WE and AMERICANS

Given that there were 19 instances of the word Americans in total, its pairing with we accounts for over 30% of the uses of Americans. So as well as using the pronouns themselves to paint a picture of unity, he’s yoking one of them with Americans to further that underlying message.
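If you’d like to build a rough concordance yourself, the idea is simple enough to sketch in Python. The function name concordance, the three-word window, and the tokenizing rule are all my own assumptions for illustration; WordSmith’s concordancer offers far more (sorting, wider context, and so on). The sample is one sentence from the speech.

```python
import re

def concordance(text, keyword, width=3):
    """List each hit of keyword with up to `width` words of context on either side."""
    # Assumption: words are runs of letters and apostrophes; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

sample = "America, Rebekah and Ben's story is our story."

# Print the hits in the usual keyword-in-context layout, keyword in the middle.
for left, hit, right in concordance(sample, "story"):
    print(f"{left:>25}  {hit}  {right}")
```

Counting how many of those context windows contain we (or our) is then all it takes to get a percentage like the one above.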

Casting your eyes just a few more lines down the keyword list you’ll see the words jobs and economy coming in at #11 and #12, not too far above families (#14) and childcare (#16). Here we see Obama invoking notions of family and economics, both of which are important to voters because we are all involved at some level with both! But take a look at the concordance for how the word families is being used and see if you can spot some familiar words:

Fig 4: Concordance of the word FAMILIES

Notice how our and American are also used along with families, further reinforcing that Spirit of Unity. In fact, Obama even makes that relationship between families and the United States explicit in the following few sentences:

“It is amazing,” Rebekah wrote, “what you can bounce back from when you have to…we are a strong, tight-knit family who has made it through some very, very hard times.” We are a strong, tight-knit family who has made it through some very, very hard times. America, Rebekah and Ben’s story is our story.

So not only do we hear this explicit appeal to family but by analyzing the words he uses throughout the speech using keywords and concordances, we can tease out those subliminal nods and pointers toward an underlying message: We are family [3].

Notes
[1] Callipygian is one of my favorite words and, like many of them, deserves to be used much more than it is. The Oxford English Dictionary defines the word as, “of, pertaining to, or having well-shaped or finely developed buttocks,” which in turn comes from the Greek words kalli meaning “beauty” and pygi meaning “buttocks or rump.” Incidentally, an old word for someone who engages in anal intercourse is a pygist, and the adjective dasypygal means “having hairy buttocks.” Try using the last one next time you want to insult folks – especially if they’re making asses of themselves!

[2] So for that one person out there who has less of a life than I have, you basically count the number of times your target word occurs out of a sample of X words in total, then match that against the number of times the same word occurs in your reference corpus of Y words in total. Here’s the word we in a little 2 x 2 box:

Measure of usage of the word WE

Because I always prefer an easy life when it comes to all things numerical, I fed these figures into an online calculator to get a “log-likelihood” figure – the “keyness” number. You can find that site here: http://sigil.collocations.de/wizard.html

When the site works its magic, you see the score expressed as G-Squared below:

SOTUA2015 LogLiklihood
Take a look at that G-Squared figure and then look back at Fig 2 and you’ll see the keyness figure is (almost) the same. You can try this with any of the values in Fig 2 and you’ll see that the online calculator scores match those of the WordSmith software.
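If you’d rather not rely on a website, the log-likelihood (G-Squared) figure for a 2 x 2 box can be worked out by hand. What follows is a minimal sketch of the standard calculation, in which each observed cell is compared with the count you’d expect if the word were equally common in both samples. The function name log_likelihood and the counts in the example are made up purely for illustration; they are not the real figures for we, and I’m not claiming this is exactly how WordSmith arrives at its number.

```python
import math

def log_likelihood(target_count, sample_size, ref_count, ref_size):
    """G-squared for a word seen target_count times in sample_size words
    versus ref_count times in a reference corpus of ref_size words."""
    observed = [
        [target_count, sample_size - target_count],  # the sample (e.g. the speech)
        [ref_count, ref_size - ref_count],           # the reference corpus
    ]
    total = sample_size + ref_size
    row_totals = [sample_size, ref_size]
    col_totals = [target_count + ref_count, total - target_count - ref_count]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2

# Hypothetical counts for illustration only: 50 hits in 1,000 words
# against 1,000 hits in 1,000,000 words of reference text.
print(round(log_likelihood(50, 1_000, 1_000, 1_000_000), 2))
```

A word that turns up at the same rate in both samples scores zero, and the score climbs as the two rates pull apart, which is why unusually frequent words float to the top of a keyness list.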

[3] It was the end of the 70s and tight spandex leggings were all the rage – for the ladies – and Sister Sledge had a monster hit with “We Are Family” from the album of the same name. Apparently the Sisters are still touring to this very day – although I’m not sure if they’re still wearing spandex.

The Dudes Dissect “Closing the Gap” 2013: Day 2 – Of Speech and Sessions

Having looked at the vocabulary used in the Closing the Gap 2013 preconference sessions, it’s time to cast a lexical eye on the over 200 regular presentations that took place over two-and-a-half days. For most attendees, these are the “bread and butter” of the conference and choosing which to attend is a skill in and of itself. It’s not uncommon [1] to have over ten sessions run concurrently, which means you’re only getting to attend a tenth of the conference!

So let’s take a look at the vocabulary used in the titles of all these presentations to get a flavor of the topics on offer.

Conference Presentations: Titles

The total number of different words used in the session titles was 629 after adjusting for the top 50 words used in English [2]. As a minor diversion, kudos to all who used the word use correctly instead of the irritatingly misused utilize. Only one title included utilizes – and it was used incorrectly; the rest got it right! For those who are unsure about use versus utilize, the simple rule is to use use and forget about utilize. The less simple rule is to remember that utilize means “to use something in a way in which it was never intended.” So, you use a pencil for drawing while you utilize it for removing wax from your ear; you use an iPad to run an application while you utilize it as a chopping board for vegetables; and you use a hammer to pound nails but utilize it to remove teeth. Diversion over.

Top 20 Most Frequent Words in Titles

No prizes for guessing that the hot topic is using iPad technology in AAC. Your best bet for a 10-word title for next year’s conference is:

How your students use/access iPad AAC apps as assistive technology

This includes the top 10 of those top 20 words so your chances of getting accepted are high.

Conference Presentations: Content Words

The total word count for the session description text is 2,532 different words (excluding the Stop List), which is a sizable number to play with. And when I say “different words,” I mean that I am basically counting any text string that is different from another as a “word.” So I count use, uses, used, and using as four words, and iPad and iPads as two. A more structured analysis would take such groups and count them as one “item” – or what we call a LEMMA. We’d then have a lemma of <USE> to represent all the different forms of use, which lets us treat use/used/uses/using as one “word” that changes its form depending on the environment in which it is sitting [3].
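To make the difference between counting text strings and counting lemmas concrete, here is a small Python sketch. The lemma table is a hypothetical, hand-built one covering only the examples above; a real analysis would need a proper lemmatizer and a far bigger table, and the function name is my own invention.

```python
from collections import Counter

# Hypothetical hand-built lemma table, covering only the examples in the text.
LEMMAS = {
    "use": "USE", "uses": "USE", "used": "USE", "using": "USE",
    "ipad": "IPAD", "ipads": "IPAD",
}

def count_types_and_lemmas(tokens):
    """Count distinct text strings (types) and distinct lemmas in a token list."""
    lowered = [t.lower() for t in tokens]
    types = Counter(lowered)
    # Any string missing from the table is treated as its own lemma.
    lemmas = Counter(LEMMAS.get(t, t.upper()) for t in lowered)
    return types, lemmas

tokens = ["use", "uses", "used", "using", "iPad", "iPads"]
types, lemmas = count_types_and_lemmas(tokens)
print(len(types), "different strings")
print(len(lemmas), "lemmas:", dict(lemmas))
```

Six different strings collapse into two lemmas here, which gives a feel for how much a count like 2,532 might shrink under a lemma-based analysis.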

Top 50 Words By Frequency in Session Content

A graphic of all 2,532 words would be rather large so I opted to illustrate the top 50 most frequently used words. As you can see, the top words seem to be the same as those in the titles, which suggests that presenters have, on balance, done a good job of summarizing their presentation contents when creating their titles – which is exactly the strategy you should use.

Keywords in Content

Finally, let’s take a look at the keywords in the session content descriptions. Remember, the keywords are those that appear in a piece of text with a frequency much higher than you would expect in relation to the norm.

Top 20 words by Keyness score

Top of our list here are apps with the iPad coming in at three. Fortunately this fetish for technology is tempered by the inclusion in our top 20 of words like strategies, learn, how, and skills, all critical parts of developing success in AAC that are extra to the machinery. It’s good to think that folks are remembering that how we teach the use of tools is far, far more important than obsessing over the tools themselves.

Coming next… The Dudes Dissect Closing the Gap: Day 3 – Of Content and Commerce. In which the Dudes look at the marketing blurbs of the Closing the Gap exhibitors to discover the “hot button” words intended to make you want to buy!

Notes
[1] WordPress’s spell and grammar checker flagged the phrase “it’s not uncommon” as a double negative and told me that I should change it because, “Two negatives in a sentence cancel each other out. Sadly, this fact is not always obvious to your reader. Try rewriting your sentence to emphasize the positive.” Well, although I generally agree that you shouldn’t use no double negatives, the phrase “not uncommon” felt to me to be perfectly OK and not at all unusual. I therefore took a look at the Corpus of Contemporary American English and found that “it’s not uncommon” occurs 313 times while “it’s common” scores 392. That’s roughly a 44/56 split, which is close enough to 50/50 that I suggest to the nice people at WordPress that “it’s not uncommon” is actually quite common and thus quite acceptable – despite it being a technical double negative.

[2] For the curious among you, here are the contents of the Stop List I have been using, which is based on the top 50 most frequently used words in the British National Corpus (BNC): THE, OF, AND, TO, A, IN, THAT, IS, IT, FOR, WAS, ON, I, WITH, AS, BE, HE, YOU, AT, BY, ARE, THIS, HAVE, BUT, NOT, FROM, HAD, HIS, THEY, OR, WHICH, AN, SHE, WERE, HER, ONE, WE, THERE, ALL, BEEN, THEIR, IF, HAS, WILL, SO, NO, WOULD, WHAT, UP, CAN. This is pretty much the same as the top 50 for the Corpus of Contemporary American English, except that the latter includes the words about, do, and said instead of the BNC’s one, so, and their. Statistically, this isn’t significant so I suggest you don’t go losing any sleep over it.
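For the even more curious, applying a Stop List is just a matter of throwing away any word that appears on it before you start counting. Here’s a minimal Python sketch using the 50 BNC words listed above; the function name remove_stop_words and the simple letters-only tokenizing rule are my own assumptions, and the sample is the suggested title from earlier in the post.

```python
import re

# The 50-word Stop List given above, based on the BNC's most frequent words.
STOP_LIST = set("""
THE OF AND TO A IN THAT IS IT FOR WAS ON I WITH AS BE HE YOU AT BY
ARE THIS HAVE BUT NOT FROM HAD HIS THEY OR WHICH AN SHE WERE HER ONE
WE THERE ALL BEEN THEIR IF HAS WILL SO NO WOULD WHAT UP CAN
""".split())

def remove_stop_words(text):
    """Split text into words and drop any that appear on the Stop List."""
    # Assumption: a word is a run of letters, so "use/access" becomes two words.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.upper() not in STOP_LIST]

title = "How your students use/access iPad AAC apps as assistive technology"
print(remove_stop_words(title))
```

Only as gets dropped from that title, which is a reminder that a 50-word Stop List removes the grammatical glue and leaves the content words alone.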

[3] When you create and use lemmas, you also have to take into account that words can have multiple meanings and cross boundaries. In the example of use/used/uses/using, clearly we’re talking about a verb. But when we talk about a user and several users, we are now talking about nouns. So, we don’t have one lemma <USE> for use/used/user/users/uses/using but two lemmas <use(v)> and <use(n)> to mark this difference. It gets even more complicated when you have strings such as lights, which can be a verb in “He lights candles at Christmas” but a noun in “He turns on the lights when it’s dark.” When you do a corpus analysis of text strings, these sorts of things are a bugger!

Baths and Showers: “Taking” or “Having”?

In the 3rd century BCE, the philosopher Archimedes was taking a long bath and playing with his rubber duck. To be honest, it may not have been a rubber duck but he was dunking something in and out of the water because according to legend, he leapt out, ran down the street, and shouted “Eureka!” which is Greek for “I’ve found it!” [1] What he’d found was a way to tell whether a gold crown was actually made of gold without melting it down, which you can do by dropping it in water and measuring the amount of liquid that gets displaced. This became known as Archimedes’ Principle but sadly he neglected to trademark the phrase or sell the slogan on togas so he failed to make a fortune from this well-known piece of intellectual property.

Archimedes shouts Eureka

Eureka!

Having ideas in the bath is something with which most people are familiar. There’s clearly something about being submerged in warm water that gets the brain a-buzzing, doubtless supported by a slew of research studies that talk about expanded arteries, endorphins, and brain scans.

So this morning while I was in the shower, I got to thinking about how I actually talk about the process of showering, i.e., do I say “I’m going to take a shower” or “I’m going to have a shower”? Now before you read any further, think about which of those two sentences sounds “right” to you.

If you’re American, I’m predicting you said “take” whereas if you’re British, I’m going to say you said “have.” If you’re Canadian, Australian, or a New Zealander, I’d be happy to hear from you because I’m less sure – but if I had to take a guess, I think you’re a “haver” not a “taker.”

The reason I can be so confident is that I checked out the incidence of the use of the verbs have and take in relation to bathing and showering using the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). I’ve mentioned these corpora before and I encourage you again to think about using them to help make decisions about real world language usage. [2]

All I did was to search for the phrases “take a bath/shower” and “have a bath/shower” in each corpus and use a simple percentage score to create the following table:

have versus take as verb with bath

"have" versus "take"

Feel free to perform a Chi-square analysis on this if you want but the figures look significant enough without whipping out the calculator. Notice that the have/take skew is much more pronounced for American English than British English but even the latter is pretty big.

Because I work primarily in AAC, I use this sort of information about language use in the real world for developing systems. And such data are also critical for teaching communication strategies. It’s not enough to simply aim to teach words as individual items because words exist within the context of other words, and those relationships are critical to understanding. For example, given the data I’ve just demonstrated, teaching the word bath along with take would make perfect sense if I’m working in the US but back in the UK, I’d be better served focusing on using have with bath.

Knowledge of word collocation can be tremendously useful when creating intervention plans, and tools such as the COCA and BNC can supply it. Staying with the word bath, I did a collocation search for the words that appear immediately before and after it. The words hot and bubble are the top two that go before bath, with water appearing both before and after in almost equal amounts. With this sort of collocation information, I can be confident in teaching the words hot, bubble, and water along with bath, which not only adds new words to my client’s lexicon but also provides real contextual information about how the word bath is used.
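The COCA and BNC interfaces do the collocation search for you, but the underlying idea is easy to sketch in Python if you have some text of your own to explore. The function name neighbors and the tokenizing rule are my own assumptions, and the sample sentences are invented for illustration; they are not corpus data.

```python
import re
from collections import Counter

def neighbors(text, node):
    """Count the words found immediately before and immediately after node."""
    # Assumption: punctuation is ignored, so neighbors can span sentence breaks.
    tokens = re.findall(r"[a-z']+", text.lower())
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token == node:
            if i > 0:
                before[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                after[tokens[i + 1]] += 1
    return before, after

# Invented sample text, not taken from either corpus.
sample = ("She ran a hot bath. The bath water was cold, "
          "so no bubble bath and no hot bath tonight.")

before, after = neighbors(sample, "bath")
print("before:", before.most_common(3))
print("after:", after.most_common(3))
```

Sorting the two counters by frequency gives you the same kind of “what goes before, what goes after” picture that the corpus search provides, just on a much smaller scale.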

For more about the COCA and BNC corpora – and others – go to Mark Davies’ corpus.byu.edu site and explore the interface. It’s a wonderful resource and much underused by speech pathologists methinks.

Notes
[1] The Greek word εὑρίσκω means “I find” and εὕρηκα is the perfect form meaning “I have found.” Greek conjugations aside, Archimedes was clearly pretty excited about something.

[2] I’m aware that the COCA and BNC differ in relation to when they were created; the BNC data is from 1980-1993 whereas the COCA is more current with data from 1990-2011. However, given that this is a known variable, it’s still reasonable to make comparisons.