Tag Archives: Concordance

The State of the Union Address 2015: “We Are Family…”

Within seconds of a President turning off the autocue, political pundits stop trembling, wipe the drool from their lips, and spend the next 2 years talking incessantly about what was said. A single speech that clocks in at just under 6,500 words can single-handedly generate more web pages than the callipygian [1] Kim Kardashian can generate page clicks. Because I’m a dude, you might think that this post is now about to become an excuse to share a picture of the ample Ms. Kardashian’s gluteus maximus in all its shiny glory – but you’d be wrong! What I’m actually more interested in doing is taking a more detailed look at the vocabulary that Barack Obama used, from the standpoint of corpus linguistics and with the help of concordance software. At this point, 90% of the guys who found this post by googling “Kim Kardashian’s ass” will leave. Sorry, dudes.

The data came from a transcript available from Time.com, which I then used as input for WordSmith 6.0 software, a corpus analysis tool. Of the many things this software will let you analyze, the ones we’ll look at here are word frequencies, keywords, and concordances.

Keywords are those words that appear in a sample as being used significantly more or less than they are typically used in the general population. In the case of WordSmith, the “general population” is a list known as the British National Corpus, a sample of some 100 million words used in British English (BrE).

The “teachable moment” here is to think about why I chose this sample. Now I know – because I have an ear for these things – that Barack Obama does not use British English; his accent is also a bit of a giveaway. However, for the purpose of this analysis, I don’t think the frequency differences between BrE and American English (AmE) are significant enough to warrant worrying about it. I could have used a different sample called the American National Corpus but that’s only good for 14 million words, which is much smaller than the BNC. Therefore, I chose to go for the larger corpus, knowing that there may be some variations between the two but not, in my opinion, enough to skew the analysis.

Fig 1: Top 25 words by frequency
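If you don't have WordSmith to hand, a frequency list along the lines of Fig 1 can be roughly sketched in a few lines of Python. This is only an illustration: the tokenizing rule and the sample sentence are assumptions, not WordSmith's method or the actual transcript, so counts from a real tool may differ.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lowercase word tokens found in text."""
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens)

# Made-up sample text; swap in the full transcript to build a real list.
sample = "We are a strong, tight-knit family. We are family, and our story is our story."
freqs = word_frequencies(sample)

# Print the five most frequent words with their counts.
for word, count in freqs.most_common(5):
    print(f"{word:<10}{count}")
```

Note that this splits a hyphenated word such as tight-knit into two tokens, which is one of many small decisions where tools can disagree.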

If you take a look at the most frequently used words in the speech, you’ll see that they are pretty much what you might expect on the basis of typical distributions. The word the is the most frequent in the English language and seeing it atop the President’s list is uninteresting. What is interesting is that the pronouns we and our are right up there above I and you. Pronouns regularly score high on frequency lists, and it’s one of the reasons practitioners in the field of Augmentative and Alternative Communication (AAC) should make sure these words are targeted. But the fact that we and our appear so high up the list (at #4 and #8 respectively) made me wonder: is this what we might expect to see in general? And that, my friends, is why we turn to a keyness list.

Fig 2: Top 25 words by keyness

Take a look at that keyness column and notice how both we and our are way up there at #2 and #3. Ignoring for now the intricacies of how those keyness figures are calculated [2], what is significant is that the Pres is using those two pronouns significantly more than how anyone else would use them in general, and that reflects a conscious effort to come across as one of “us” and not an “I” or “me” who is doing things. He’s appealing to a “Spirit of Unity.”

You can see more evidence for this appeal if you simply look at the keyness of #4 and #8 – America and Americans. He’s certainly using the words with more frequency than you’d find in a regular sample but we can perform one more kind of analysis in order to see just how he’s using them, and that’s to create a concordance.

A concordance is a list that shows instances of a word in context, along with the words that go before and after it. Below is a concordance for the word Americans as used alongside we:

Concordance of instances of the words americans and we

Fig 3: Concordance showing WE and AMERICANS

Given that there were 19 instances of the word Americans in total, this pairing of Americans and we accounts for over 30% of them. So as well as using the pronouns themselves to paint a picture of unity, he’s yoking one of them with Americans to further that underlying message.
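For the curious, here is a minimal sketch of how a concordance of this kind, and the share of hits that sit near we, might be produced in Python. The sample sentences are invented for illustration (they are not lines from the speech), and the window of three words either side is an arbitrary assumption rather than a WordSmith setting.

```python
import re

def concordance(text, target, width=3):
    """Return (left context, keyword, right context) for every hit of target."""
    tokens = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, not taken from the transcript.
sample = ("As Americans we look out for each other. "
          "We know Americans work hard. Americans expect more.")

hits = concordance(sample, "Americans", width=3)

# Keep only the hits that have "we" somewhere in the surrounding window.
paired = [h for h in hits if "we" in (h[0] + " " + h[2]).lower().split()]

for left, keyword, right in hits:
    print(f"{left:>22}  {keyword.upper()}  {right}")
print(f"{len(paired)} of {len(hits)} hits occur near 'we'")
```

Widening or narrowing the window changes how many pairings you find, so any percentage is only meaningful alongside the window size used.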

Casting your eyes just a few more lines down the keyword list you’ll see the words jobs and economy coming in at #11 and #12, not too far above families (#14) and childcare (#16). Here we see Obama invoking notions of family and economics, both of which are important to voters because we are all involved at some level with both! But take a look at the concordance for how the word families is being used and see if you can spot some familiar words:

Fig 4: Concordance of the word FAMILIES

Notice how our and American are also used along with families, further reinforcing that Spirit of Unity. In fact, Obama even makes that relationship between families and the United States explicit in the following few sentences:

“It is amazing,” Rebekah wrote, “what you can bounce back from when you have to…we are a strong, tight-knit family who has made it through some very, very hard times.” We are a strong, tight-knit family who has made it through some very, very hard times. America, Rebekah and Ben’s story is our story.

So not only do we hear this explicit appeal to family but by analyzing the words he uses throughout the speech using keywords and concordances, we can tease out those subliminal nods and pointers toward an underlying message: We are family [3].

Notes
[1] Callipygian is one of my favorite words and, like many of them, deserves to be used much more than it is. The Oxford English Dictionary defines the word as, “of, pertaining to, or having well-shaped or finely developed buttocks,” which in turn comes from the Greek words kalli meaning “beauty” and pygi meaning “buttocks or rump.” Incidentally, an old word for someone who engages in anal intercourse is a pygist, and the adjective dasypygal means “having hairy buttocks.” Try using the last one next time you want to insult folks – especially if they’re making asses of themselves!

[2] So for that one person out there who has less of a life than I have, you basically count the number of times your target word occurs out of a sample of X words in total, then match that against the number of times the same word occurs in your reference corpus of Y words in total. Here’s the word we in a little 2 x 2 box:
Measure of usage of the word WE

Because I always prefer an easy life when it comes to all things numerical, I used an online calculator to turn these figures into a “log-likelihood” figure – the “keyness” number. You can find that site here: http://sigil.collocations.de/wizard.html

When the site works its magic, you see the score expressed as G-Squared below:

SOTUA2015 LogLiklihood
Take a look at that G-Squared figure and then look back at Fig 2 and you’ll see the keyness figure is (almost) the same. You can try this with any of the values in Fig 2 and you’ll see that the online calculator scores match those of the WordSmith software.
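If you would rather see the arithmetic than trust a website, the calculation can be sketched in Python as below. This is the standard four-cell log-likelihood (G-squared) statistic for a 2 x 2 table; I'm assuming that is the variant the calculator and WordSmith report. The counts in the example are hypothetical placeholders, not the figures from the table above.

```python
import math

def log_likelihood(count_sample, size_sample, count_ref, size_ref):
    """G-squared for one word, from a 2 x 2 table of observed counts."""
    # Rows: the target word vs. all other words. Columns: sample vs. reference.
    observed = [
        [count_sample, count_ref],
        [size_sample - count_sample, size_ref - count_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                total += o * math.log(o / expected)
    return 2 * total

# Hypothetical counts for illustration only.
g2 = log_likelihood(count_sample=200, size_sample=6_500,
                    count_ref=350_000, size_ref=100_000_000)
print(round(g2, 2))
```

The statistic only says how surprising the difference is, not its direction, so it is worth comparing the two relative frequencies as well to see whether a word is over-used or under-used.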

[3] It was the end of the 70s and tight spandex leggings were all the rage – for the ladies – and Sister Sledge had a monster hit with “We Are Family” from the album of the same name. Apparently the Sisters are still touring to this very day – although I’m not sure if they’re still wearing spandex.

The Dudes Do ASHA 2012: Day 4 11/17/12

It was the last day of ASHA and I had the special honor of closing the AAC strand for the convention. In short, I was last on the list of AAC presenters. In a curious twist of fate, my colleague from Germany opened the AAC strand at the first session of Thursday so between us we’d bracketed the field!

ASHA at Georgia World Congress Center

ASHA at GWCC

A less charitable viewpoint might be that I had to present after lunch on the last day, when many folks were leaving to catch planes home or taking the opportunity to spend one last day in the wonderful city of Atlanta. So the fact that folks turned up, including one of my #slpeeps from the Twitterverse, was quite a relief. [2]

The topic was on how to use the data generated by an AAC device to plan therapy sessions. A number of AAC technologies have the facility to track data but few people seem to use it. The purpose of the presentation was to show folks that there is immense value in using such logging in order to help clients improve their communication skills.

Basically, automated data logging tracks events over time; you can see what someone is saying and when they are saying it. And with just these two pieces of information, you can provide a much better service to your clients. [1] You can gather information about:

  • Vocabulary – the words your client uses
  • Morphology – the way your client uses morphemes to indicate tense, number, intensity etc.
  • Syntax – how your client uses words in a systematic way along with other words
  • Function – how your client’s language is used (questions, imperatives, requests etc.)

To facilitate this, you can use the QUAD Profile, a paper-based checklist that provides guidelines on what to look for. Developed in 2005 as a quick and dirty evaluation tool, the QUAD is simple enough that you don’t have to be a specialist in AAC to use it [3]. You can click on the graphic below to download a copy.

Download the QUAD Profile

QUAD Profile

You can also take user-generated text data and analyze it using either Concordance or WordSmith, two pieces of software into which you can feed large amounts of text and then measure word frequencies, type/token ratios, or find keywords – those words in a sample that occur more frequently than you would expect by chance. I’ve covered both of these – and discussed core versus fringe versus keywords – in The Dudes Do ISAAC 2012: Day 4 – Of Corpora and Concordances, so take a look there for more details.
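As a rough illustration of one of those measures, a type/token ratio is simply the number of different words (types) divided by the total number of words (tokens). The sketch below assumes a very simple tokenizing rule and uses a made-up sample; Concordance and WordSmith may count things slightly differently.

```python
import re

def type_token_ratio(text):
    """Return distinct words divided by total words, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented utterances standing in for logged client output.
sample = "I want juice. I want more juice. I want to go."
print(f"{type_token_ratio(sample):.2f}")
```

Bear in mind that the ratio tends to fall as a sample gets longer, so it is best used to compare samples of similar size.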

What I failed to spend any time talking about was the excellent BYU Corpora created by Mark Davies at Brigham Young University. If you’re wanting to find out how a particular word is used in contemporary American English – or slightly less contemporary British English – you could do a lot worse than use one of these corpora, such as the Corpus of Contemporary American English, or COCA [4]. As an example, I previously talked about the difference between “taking a bath/shower” and “having a bath/shower,” arguing that in British English you’d teach “having” whereas in American English you’d focus on “taking.” The key point is that you can use the COCA to quantify this difference. And quantifying is a step towards evidence-based practice.

Here’s another example of where using the COCA can help you decide on which words to teach: which should you teach first – look or see? Well, if you want to focus on bigger semantic bang-for-buck, you should go for see, which is used in speech twice as often as look. Or how about need and want? It turns out that want is three times more likely to be used than need, so want is much more useful.

Another thing the COCA does is to show how words are used in context. This turns out to be very valuable knowledge to have when teaching language because you can’t just teach a word in isolation. For example, if we go back to the example of the word look, the COCA shows that it very frequently appears immediately before a preposition. Here are specifics:

the word look with prepositions

“look” and PREP
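You can get a feel for this kind of pattern in your own text samples with a short Python sketch that counts whichever word comes immediately after a target word. This is not how the COCA does it (the COCA is tagged for part of speech, and this sketch is not, so it counts every following word and leaves you to spot the prepositions), and the sample sentences are invented.

```python
import re
from collections import Counter

def following_words(text, target):
    """Count the words that appear immediately after target in text."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    # Pair each token with its successor and keep the successors of target.
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

# Invented sample text for illustration.
sample = "Look at me. I look for my keys. Look at that! We look up and look at the sky."
after_look = following_words(sample, "look")

for word, count in after_look.most_common():
    print(f"look {word:<6}{count}")
```

Because punctuation is ignored, a pair can straddle a sentence boundary, which is usually harmless in a big sample but worth knowing about in a small one.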

So if you are going to teach look, think about look at followed by look for as contextual phrases because that’s how the word is used in real life! Here’s a link to download my slides and notes as a PDF handout.

DOWNLOAD: Using AAC device-generated client data to develop therapy sessions

By 2:30, I was done. My target was to be in my room at 3:00 with my shoes off, feet up, and a coffee in my hand. And this turned out to be a success!

At 5:00, I left for an early dinner with a friend at the Sweet Georgia’s Juke Joint at 200 Peachtree Street. Being in the South, I plumped for fried chicken with collard greens, a peach cobbler for dessert, and a delicious Millionaires Mojito. To make the night breeze along, we were entertained by Nat George and the Nat George Players, a band so smooth you could spread ’em on toast.

The video doesn’t really do the band justice but that’s all the more reason for you to put a trip to see them on your list of “Things to do in Atlanta” on your next trip out.

Another memorable night, and yet another example of why I guess I could spend much more time exploring the city. But tomorrow it’s back home. Ah well. C’est la vie.

Downtown Atlanta

Notes
[1] At this point, you might wonder why I don’t leap into the discussion about privacy, security, and ethics. Well, that’s because if I need to do that, I’d rather spend an entire post on it. But the short answer is that in all the years I’ve worked with clients who have data logging capabilities I have yet to have ONE tell me that I can’t see their data. After I have a short conversation about why I want to track their data and what I intend to do with it, they’ve been happy to allow me to have access. It’s important to have this discussion prior to turning on monitoring, and critical to explain the value, but once you do that, there’s no problem. Informed consent is a wonderful thing.

[2] OK, so it was @MeganPanatier – Thanks for stopping by and for tweeting some of my comments during the presentation!

[3] Cross, R.T. (2010). Developing Evidence-Based Clinical Resources, in Embedding Evidence-Based Practice in Speech and Language Therapy: International Examples (eds H. Roddam and J. Skeat), John Wiley & Sons, Ltd., Chichester, UK.

[4] The site includes the Corpus of Contemporary American English (450 million words), the British National Corpus (100 million words), the Corpus of Historical American English (400 million words), the Time Magazine corpus (100 million words) and the new Corpus of American Soap Operas (100 million words), which I have yet to test run!

The Dudes Do ISAAC 2012: Day 4 – Of Corpora and Concordances

Pittsburgh from Station Square

Marketing applies as much to conference presentations as it does to selling beans. Or coffee. Or bagels[1]. Picking a good title is more important than the presentation itself. Really, it is. Which explains why my “first-thing-in-the-morning” session was not exactly standing room only. The presentation title was the technically accurate but marketingly disastrous Using Concordance Software and online Corpora in AAC. A much better title would have been Using Velcro and a Free iPad for New Simple Gamechanging Therapy. You see, this has all the buzz words that people scan for when reading a conference program. Free is always a winner; iPad is currently sexy; new suggests you will be surprised and maybe first to do something in your part of the world; Velcro® is something ALL therapists relate to; and game-changer is an over-used, over-hyped, almost meaningless vogue word that can be applied to anything in order to make it sound impressive. People who use the word “game-changer” should be hung, drawn, quartered, and made to read a thesaurus.

But I did, fortunately, hear from a number of folks who told me they wanted to come to the presentation but it clashed with another. It clashed, in fact, with several! So given a finite number of potential attendees divided by the number of sessions, as concurrency goes up, individual session attendance goes down. Therefore for those who were unable to attend, I can at least give you a brief summary of what I was talking about. And for those folks who couldn’t make it to ISAAC 2012 in the first place, I’m also including this link to my PowerPoint files and Resources List via the Dudes’ Dropbox account.

The first thing I covered was the difference between core, fringe, and keyword vocabulary. In AAC, the use of core and fringe is now fairly common but we need to make another distinction for something called keyword vocabulary. Here’s how these three words can be defined:

Core word: A word that has a high frequency of use value that is statistically expected when compared to a large reference corpus.
Fringe word: A word that has a low frequency of use value that is statistically expected when compared to a large reference corpus.
Keyword: A word that has a frequency of use that is significantly higher than expected when compared to a large reference corpus.

Notice that these definitions do not include any notion of “importance.” A common mistake is for people to say things like, “but Tommy loves Transformers so ‘Optimus Prime’ is an important core word for him.” No, “Optimus Prime” is a keyword for him. It may seem like a trivial distinction but it is useful. Sure, it may be an “important” word for him but that still does not make it core. Thus, when people talk about a “personal core” for an individual, they are really talking about a person’s keyword set. It is much better to use this term because talking about a “personal core” seems to me to be confusing and changes the definition of core.

The notion of keywords has been taken straight from the field of Corpus Linguistics:

Keywords are words which are significantly more frequent in a sample of text than would be expected, given their frequency in a large general reference corpus. (Stubbs, 2010) [2]

Corpus Linguistics uses large data samples, or corpora, to look for patterns in language. The larger the samples are, the more reflective the data is of “real world” language use. One of the largest online sets of such corpora is the one developed and maintained by Mark Davies at Brigham Young University in Utah. The Corpus of Contemporary American English [3] is based on a sample of 425 million words, and can provide frequency data of individual items, as well as contextual information on how these are typically used. This type of data can be useful for the AAC practitioner in determining which words to include in a system and in answering questions about how a word may be used (e.g. is the word light used more as a noun than a verb?).

Another tool used by corpus linguists is concordance software. Such software allows investigators to input text and create output in the form of frequency lists, key word lists, and key words in context. The AAC practitioner can use client-generated data and run it through concordance software to build personal vocabulary lists. It’s also possible to compare a client’s data with other samples, which can also be very instructive for a clinician who wants to see how an individual’s use of language matches with a “standard.”

Concordance software

Concordance

Concordance is a flexible text analysis program which lets you gain better insight into e-texts and analyze language objectively and in depth. It lets you count words, make word lists, word frequency lists, and indexes.

You can select and sort words in many ways, search for phrases, do proximity searches, sample words, and do regular expression searches. You can also see statistics on your text, including word types, tokens, and percentages, type/token ratios, character and sentence counts and a word length chart.

WordSmith concordance software

WordSmith is a popular word-analysis program that includes features to generate word lists, frequency lists, usage lists, and keyword lists.

It also has the option to download the British National Corpus word frequency list to use as a large comparative data set. This is a great tool for investigating keywords among small data sets.

Now, a number of commercial devices have a data-logging feature included as an option, providing a record of events over time. With the client’s consent, being able to track such usage can be invaluable in helping clinicians and educators see exactly what the client is currently capable of doing and, by extension, create teaching plans that will develop their ability to use the device. But if you are prepared to clean the raw data from an AAC device up a little, you can drop it into concordance software and work some magic. You can see how a client’s use of vocabulary matches what you might expect; you can discover a client’s keyword vocabulary by filtering out core words; and you can look at how clients use vocabulary in context, e.g. where do they use the word light and how is it being used.
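To make the “clean it up and filter out the core” step a little more concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the log format (a timestamp, a tab, then the utterance), the tiny core word list, and the sample entries are all invented, and real device logs will need their own cleaning rules.

```python
import re
from collections import Counter

# Hypothetical core list; a real one would be drawn from a large reference corpus.
CORE_WORDS = {"i", "want", "to", "go", "the", "a", "is", "it", "more", "you"}

def non_core_counts(log_lines, core_words=CORE_WORDS):
    """Count logged words, leaving out anything that is in core_words."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: 'HH:MM:SS<tab>utterance'; keep only the utterance part.
        _, _, utterance = line.partition("\t")
        tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", utterance.lower())
        counts.update(t for t in tokens if t not in core_words)
    return counts

# Invented log entries for illustration.
log = [
    "09:01:12\tI want Optimus Prime",
    "09:01:40\tI want to go",
    "09:03:05\tOptimus Prime is the best",
]
candidates = non_core_counts(log)
print(candidates.most_common())
```

What is left over is only a list of candidate keywords; deciding whether a word is genuinely key still means comparing its frequency against a reference corpus.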

In summary, what I’m suggesting is that using (a) large online corpora and (b) concordance software can enhance the way in which we develop and expand AAC systems, and that both of these are based on actual usage of language and not some hypothetical construct of what we think is happening with vocabulary.

Enough of the academic stuff; I just want to alert you to an unmissable experience at Tonic Bar & Grill on the corner at 971 Liberty Avenue, just outside the David L. Lawrence Convention Center. Those of a nervous gastronomic disposition may want to stop reading now – as may folks who are on any diet other than the “Let’s See How Fat I Can Get Before My Arteries Explode” diet.

At any time of day, you should prop yourself up at the bar, order one of their small selection of draught beers, and place an order for Poutine Fries [4]. This is a heavenly bowl of hot potato fries, smothered in slippery, creamy cheese, and topped with a generous helping of tender braised short ribs. You can choose to experience this ambrosial feast either by eating it or having a cardiologist smear it directly on to your arteries: we recommend the former. How we managed to eat just one bowl is still a mystery to us but our hearts will undoubtedly thank us.

Poutine Fries

Notes
[1] As it was an early presentation, I skipped breakfast, which meant that by the time I’d finished I was hungry. So a shout out to the good folks at Bruegger’s Bagels on Grant Street in downtown Pittsburgh who supplied me with their Breakfast Bagel, a mouth-watering treat of egg, cheese, and bacon on a crusty whole-wheat bagel. I’m pretty sure it’s not the healthiest of starts to the day but it sure is one of the tastiest.

[2] Stubbs, M. (2010). Three concepts of keywords. In M. Bondi and M. Scott (Eds.) Keyness in Texts: Studies in Corpus Linguistics. John Benjamins Publishing: Philadelphia.

[3] Davies, M. (2008-) The Corpus of Contemporary American English: 425 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.

[4] As our Canadian friends will know, Poutine Fries originated in Quebec and therefore represent a form of biological warfare against America, the intent being to bring the country to its knees by making everyone too fat to get up off of them. Rest assured that on their next trip to Montreal, the Dudes will make sure they take advantage of sampling the local Poutine Fries and would encourage anyone taking a trip to Canada to do the same!