Getting one’s teeth into a corpus


Guy Aston

University of Bologna



December 11th. This morning I have my weekly session with a group of Italian undergraduates taking what, for many of them, may be the last English language course they ever attend. After this, they’ll be on their own, as - I hope - autonomous learners who know how to take care of themselves. I'm late as usual, as I arrive clutching a pile of photocopies, my laptop and a portable data projector, and I find them all sitting finishing the remains of a large panettone Christmas cake. It will take me a couple of minutes to set the computer and projector up, so I wave the photocopies at them with the words “When you've all finished eating, here's something else for you to get your teeth into”. One of them puts down her cake to scribble a note. They start looking at the photocopy while I get the hardware switched on and start up the two programmes I expect to be using - the Oxford English Dictionary on CD-ROM, and the SARA interface to the world edition of the British National Corpus.1


The photocopy is an article from The Guardian, reporting on a scandal of a few days earlier - the discovery that the English versions of a number of pages on the Italian government’s official website were machine translations of the Italian originals. (Many of these students hope to become professional interpreters and translators, so this should reassure them that machine translation is not yet a serious threat to their employment prospects.)  The article starts:


Lost in translation

Allegations that the Italian government has been economical with the truth about comical translation blunders on an official website should come as no surprise, writes Philip Willan


Silvio Berlusconi has promised to dust the cobwebs from Italy's sclerotic state bureaucracy. A gifted communicator, successful businessman and master of modern technologies, to many electors he appeared just the man for the job. So it was particularly embarrassing when it emerged last week that the presentation of his government to the world had been entrusted to a hamfisted translating machine.


I ask them to just look at the headlines and first paragraph, and to underline any words with which they are unfamiliar. Then we draw up a list on the board: blunders, cobwebs, and hamfisted. They have been able to guess the meaning of the first - “mistakes” - from context. They have also deduced that dust the cobwebs from here means “modernise”, even if they have no idea of the literal meaning of the expression. For hamfisted they have no clear idea at all - though one of them suggests it might mean “automatic”.


We start off by looking up blunder in the OED. I get a student to come and do this on my laptop, and the rest of us comment on how she does it and watch the results. Given their previous guess, she has no problem in deciding that we are interested in the second sense of the noun blunder, namely


A gross mistake; an error due to stupidity or carelessness.

The words of Talleyrand as to the murder of the Duc d'Enghien---‘ces paroles stoïquement politiques, “C'est plus qu'un crime, c'est une faute”’ (Lucien Bonaparte Mem. an. 1804 (1882) I. 432) have been englished, ‘It is worse than a crime, it is a blunder,’ and are often quoted or alluded to.


So the dictionary confirms that a blunder is a stupid mistake. But - apart from the Talleyrand example, which no-one in the class recognises - it tells us little about the actors and actions involved in blunders. So we turn to the BNC, which I ask another student to come and interrogate.


The SARA index tells us there are 197 occurrences of the word-form blunder in the BNC - a large enough number to make it interesting for these advanced learners, who, I have told them, should consider any word occurring more than once every million words of running text (i.e. more than 100 times in the 100-million word BNC) to be “worth learning”. This is a very rough rule of thumb: for instance, if a word only occurs in a particular type of text, it may be less - or more - important, depending on whether that type of text is important to them. For multi-word phrases, the “worth learning” threshold, as we shall see, needs to be set lower.


You can't easily look at 197 concordance lines on a data display, so we just download a random 20:



Looking at these, it seems that the kinds of people who make blunders are doctors, managers, politicians and footballers - along with the institutions they work for (hospitals, companies, government departments, football teams ...).


We then change the display to show a larger amount of context for each occurrence, and go through some of these citations one at a time. While doing so, we check where each one comes from. All seem to be from written texts, and many from newspaper reports - often in headlines, students note.2 So we do a couple of quick follow-up queries to check these impressions. First we look for blunder just in newspapers (of which there are just under 10 million words in the BNC), finding 111 occurrences (23 of them in headlines). Then we look just in the 10-million word spoken component of the BNC, where there is only one occurrence - from a radio news broadcast. As there are 197 occurrences in the entire 100-million word corpus, we would expect to find about 20 occurrences in any 10-million word part of it: but 111 is a lot more than 20, and 1 a lot less. This confirms that blunder is very much a “written” word, in particular a “newspaper” one.
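For readers who like to see the sums written out, the arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration and has nothing to do with SARA itself: the function names are my own, and the newspaper component is rounded up to 10 million words (it is in fact just under that).

```python
# Illustrative sketch only; the function names are hypothetical, not part of SARA.

def per_million(hits, corpus_words):
    """Occurrences per million words of running text."""
    return hits / corpus_words * 1_000_000

def expected_hits(total_hits, corpus_words, part_words):
    """Hits expected in one part of the corpus if the word were evenly spread."""
    return total_hits * part_words / corpus_words

BNC_WORDS = 100_000_000
PART_WORDS = 10_000_000   # assumption: newspapers rounded up to 10 million words

rate = per_million(197, BNC_WORDS)                    # blunder in the whole BNC
expected = expected_hits(197, BNC_WORDS, PART_WORDS)  # about 20

print(f"whole corpus: {rate:.2f} per million words")
for label, observed in [("newspapers", 111), ("spoken", 1)]:
    print(f"{label}: observed {observed}, expected {expected:.1f}, "
          f"ratio {observed / expected:.2f}")
```

A ratio well above 1 suggests the word is over-represented in that part of the corpus, and a ratio well below 1 that it is under-represented. With counts as small as these, the ratios are best read as rough indications rather than precise measures.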


Going back to the original concordance, I ask each member of the class to choose a line which they like, and to then write it down with two or three sentences of context as an example of the use of blunder. The most popular is:


He shook his head at his blunder. ‘Oh shit!’ he said loudly and then quietly: ‘Look what you've made me do.’


A good example, I think, of the sort of thing you can often find in a corpus but rarely in a dictionary (would even Sinclair’s Cobuild venture the definition A blunder is the sort of mistake which makes you shake your head and say “Oh shit!” loudly ... ?). There is some amusement at the idea of Berlusconi uttering an Italian equivalent on learning about his government's mistake - and at what the machine translation of that obscenity might be ...


Hastily switching from the role of facilitator to that of censor, I move on to our next word, cobweb. The OED tells us its literal and figurative denotation:


The web or fine network spun by a spider for the capture of its prey; also, the substance.


a. Anything of flimsy, frail, or unsubstantial texture; esp. fanciful fine-spun reasoning.

b. Any musty accumulation, accretion, or obstruction, which ought to be swept away, like dusty cobwebs in a room.


This definition contains a number of strange words: these students guess prey, can get a rough idea of flimsy and frail from unsubstantial, but are flummoxed by musty. I point out that its meaning is pretty well explained in the definition: “like dusty cobwebs in a room”, and tell them to write down all these words, which they might like to look up for themselves (they have free access to the OED and the BNC in the computer laboratory). That done, we turn again to the BNC.


The SARA word index tells us there are 173 occurrences of the noun cobweb (50 singular and 123 plural), but when we look at a random sample of 20, all but one turn out to involve the literal rather than the metaphorical meaning. What to do? One student suggests looking for dust away the cobwebs, which could be a recurrent idiomatic expression. After a brief discussion of the various forms this might take (involving separation of the particle, passivisation, adjectival/adverbial modification), we decide to look for cases where forms of the noun cobweb and of the verb dust co-occur within a space of five words. This finds five occurrences, all of which do indeed suggest that there is an idiom dust away/off the cobwebs. At this point I upset things by warning that maybe other verbs could be used as alternatives to dust, so we redo the query, this time looking for co-occurrences of cobwebs with away or off, again within a space of five words. There are 20 of these, an ideal number to project! All seem metaphorical, and we find that as well as dusting, you can shake the cobwebs off, or blow or clear or sweep them away - the most common form being blow away. Overall, the expression clearly isn’t very frequent (even if the frequencies of phrasal expressions always tend to be lower than those of single words), so rather than wasting time going through all the citations, I quickly select three of the simplest examples and ask students to write down blow away the cobwebs, along with one example as a memory prompt:


The first is that spring is in the air, so it's time to dust away the winter cobwebs and treat yourself to a new look!


Second Karl Böhm's magnificent performance with the Vienna Philharmonic Orchestra. On LP this never sounded particularly impressive but the transfer to CD on DG ‘Galleria’ has blown away the sonic cobwebs to reveal a blazing treasure and a worthy comparison for his incandescent Bruckner 4 with the same orchestra (Decca).


Streamline has swept away the cobwebs from the English class. With most courses even the keenest students are put off by what is artificial or repetitive, dry or academic. Streamline proves that English learning can be realistic and varied, useful and fun.


We’re not using Streamline in our faculty, so we can’t verify the truth of this claim. After a glance up at the ceiling (currently fairly free from cobwebs, since it collapsed a couple of months ago), it’s back to Berlusconi.
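For readers without SARA to hand, the two cobweb queries can be roughly approximated in Python. What follows is a minimal sketch, not a description of how SARA works: the word forms are listed by hand rather than lemmatised, the toy "corpus" is just fragments of the three citations above, and I am assuming that "within a space of five words" means up to five words on either side.

```python
import re

# Toy corpus: fragments of the three BNC citations quoted above.
SENTENCES = [
    "spring is in the air, so it's time to dust away the winter cobwebs "
    "and treat yourself to a new look!",
    "the transfer to CD on DG 'Galleria' has blown away the sonic cobwebs "
    "to reveal a blazing treasure",
    "Streamline has swept away the cobwebs from the English class.",
]

# Hand-listed word forms (an assumption: a real query would use the corpus index).
COBWEB = {"cobweb", "cobwebs"}
DUST = {"dust", "dusts", "dusted", "dusting"}
PARTICLES = {"away", "off"}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def cooccurs(tokens, node_forms, other_forms, span=5):
    """True if a node form has one of other_forms within `span` words either side."""
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            if any(w in other_forms for w in window):
                return True
    return False

with_dust = [s for s in SENTENCES if cooccurs(tokenize(s), COBWEB, DUST)]
with_particle = [s for s in SENTENCES if cooccurs(tokenize(s), COBWEB, PARTICLES)]
print(len(with_dust), "with dust;", len(with_particle), "with away/off")
```

As in the lesson, the query on the particle casts the wider net: it picks up blown away and swept away, which the query on dust misses.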


The last unknown word on the list is hamfisted. What can it mean? It is clearly separable into two components (ham + fisted), but my students can see no way in which their meanings fit together. The OED doesn't give this form, but one of the good things about its search software is that it also finds any hyphenated matches for the string you type, and it comes up with the hyphenated ham-fisted, meaning “having large or clumsy hands; heavy-handed, awkward, bungling”. Here again, the dictionary introduces some new words (heavy-handed and bungling), both of which appear to be synonyms of awkward. More stuff for them to look up for themselves, I suggest.


Forewarned is forearmed. When we look up hamfisted in the BNC, we use a pattern which will match both the hyphenated and unhyphenated forms, finding that there are 19 of the former, and only 5 of the latter. So the OED seems to be right to give the hyphenated version. With a total of only 24 occurrences, the issue is clearly not one of learning the word, but simply of understanding what it means in the context of the article we are reading. Downloading all 24 solutions, we find that several - like the Berlusconi instance - concern politicians and governments:


The large defence cuts that Labour proposes would be ruinous to job prospects, and its ham-fisted intervention plans would not work.


The supporters of O M O V have put their arguments in what can only be said to be a ham-fisted and insulting way. They may have been intending to talk about reform of the Labour Party constitution, but what ordinary trade unionists heard was senior members of the Party talking as if they were ashamed of the trade union connection.


These examples make it clear that ham-fisted implies the writer’s negative evaluation of the behaviour being described, which is judged offensive or ruinous - good enough to understand what is meant in our text by a hamfisted translating machine.
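The idea of a pattern matching both spellings can be illustrated with an ordinary regular expression. This is a sketch in Python's own regex syntax, which I do not claim to be the pattern syntax SARA uses; the three test strings are fragments of the Guardian article and of the two citations above.

```python
import re
from collections import Counter

# The optional hyphen lets one pattern match both "hamfisted" and "ham-fisted".
PATTERN = re.compile(r"\bham-?fisted\b", re.IGNORECASE)

TEXTS = [
    "had been entrusted to a hamfisted translating machine.",
    "its ham-fisted intervention plans would not work.",
    "what can only be said to be a ham-fisted and insulting way.",
]

# Tally how many matches use each spelling.
counts = Counter()
for text in TEXTS:
    for match in PATTERN.finditer(text):
        form = "hyphenated" if "-" in match.group() else "unhyphenated"
        counts[form] += 1

print(dict(counts))
```

Counting the two forms separately, as here, is what lets us say which spelling predominates.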


Having finished with our list of unfamiliar words, this is a good point for questions. One student asks about the OED entry for blunder, where we found Talleyrand's C'est plus qu'un crime, c'est une faute. The dictionary stated that this was “englished” as It is worse than a crime, it is a blunder. She is intrigued by the use of english as a verb. We quickly look this up in the BNC index, finding 24 occurrences. So while rare, it does exist - even if, as all these occurrences come from only 6 texts, it seems pretty idiosyncratic. I suggest she tries generating and analysing a concordance of the verb english for homework.


There are also a number of other features in the Berlusconi text which have aroused curiosity. The first is the headline expression “has been economical with the truth”. None of the class have encountered this previously, even if they guess that it means not telling the whole truth. A search in the BNC for economical followed by truth within 5 words finds 16 occurrences. Glancing through them, we see that several refer to a certain Sir Robert Armstrong:


When the British civil servant Sir Robert Armstrong struggled in the Spycatcher trial to deny that he had lied by claiming that he had been ‘economical with the truth’ he met with well-earned derision. There was an almost audible sigh of relief around the world that someone had been caught tampering with the dictionary.


But I was not surprised when, several years later, I read that Sir Robert Armstrong, appearing for the government in the Spycatcher trial in Sydney, had admitted to cross-examining counsel that he had been what he called ‘economical with the truth’.


Armstrong's patrician Oxbridge performance was not very successful in the brash atmosphere of the Australian courts - significantly less so than his performances before House of Commons committees. In a near-fatal aperçu, Armstrong's comment on the civil servant's need to be ‘economical with the truth’ seemed to set the seal on a growing process of encroachment on individual freedoms by a strong-minded, centralist government. Meanwhile, travellers abroad found no difficulty in buying Spycatcher in airport bookstalls.


These three citations all use the expression economical with the truth in inverted commas, thereby suggesting a specific, and quite famous, quotation. I suggest that as homework, a couple of them try to find out more about the trial at which Armstrong used these words, by reading more extensively in these three texts. They might also try looking for (and at) other references to Armstrong or to Spycatcher in the BNC:3 a corpus can often be a good place to find information about the world, not just about the language.


We then turn to another expression which some of them have noted as curious - the phrase man for the job. Like most dictionaries, the OED can be very difficult to consult where common words are involved (there are 50 screenfuls of information on the noun man), so we go directly to the BNC. We find 47 different occurrences of this phrase, and a glance at a random 20 immediately shows recurrent patterns: the word best or right often precedes man. I ask if they see any other patterns: yes, the phrase is often followed by a full stop, i.e. it is in sentence-final position. In order to get more information about these collocational and colligational regularities, we redo the query and this time download all 47 solutions - too many to look at all of them carefully, but at the moment I want them to concentrate on numbers. Sorting them by the left reveals that there are 11 occurrences with best (and one with no better), 15 with right, and 16 with the directly preceding man (including 4 with just the, as in the Berlusconi text). Like blow away the cobwebs, this looks like an idiomatic expression with a few standard variants. Sorting the solutions by the right shows that around half are followed by a full stop, and I ask the class to look more closely at the ones that aren't. Several are followed by closing quotes and a reporting verb, several more are followed by and - all in all there is a clear tendency for this expression to appear in final position in the clause, when not in the sentence.


But what does it mean? Returning to our random set, we start looking at the situations in which man for the job is used. As with economical with the truth, we find that several occurrences refer to one particular event - or at any rate one particular problem - namely, the manager of the England football team, and whether he is the best/right man for the job. Along with other sporting allusions, we also find references to the suitability of a number of politicians for office. One student points out that there generally seems to be doubt about the matter: when we are told that X is the man for the job, it implies that there are good reasons to believe they are not - as also seems to be the case in our Berlusconi text.


FAIR TACKLE|  When Souness was first appointed, I applauded the move. But he has hardly had a trouble-free run since then with most of his problems self-inflicted. Only time will tell if he is the right man for the job. At the moment, I'm still to be convinced.


A REFEREE by the name of A Spirin could prove a headache for Leeds and Rangers when they clash in tomorrow's climax to their European Cup battle.|  The Russian official recently so upset his domestic association that they suspended him for two months, although UEFA insist: ‘He's still one of the best referees in Europe and the right man for the job.’


Strange news: the Sun is sponsoring one of the challengers for the Labour leadership: ‘Sun man's bid to be Labour leader’. Ken Livingstone, of course, writes a column for the paper, and so gets its unstinting support: ‘We're right behind Ken Livingstone - we reckon he's just the man for the job. And to prove it, we're giving away stickers and badges so you can back him too.’ Send them a sturdy stamped addressed envelope if you want one.


Once more, they notice, a lot of occurrences seem to be from newspapers. I leave the class to check this for themselves, having seen how to do it earlier (they'll discover they're right: not only the Sun likes this expression).
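The sorting by left and right context that we used for man for the job can be imitated, very crudely, by tallying what stands immediately on either side of the phrase. The sketch below is in Python and uses only fragments of the three citations above as data; the pattern and variable names are my own, and a real concordancer does a good deal more than this.

```python
import re
from collections import Counter

# Fragments of the three citations quoted above.
CITATIONS = [
    "Only time will tell if he is the right man for the job. At the moment, "
    "I'm still to be convinced.",
    "one of the best referees in Europe and the right man for the job.",
    "we reckon he's just the man for the job. And to prove it, "
    "we're giving away stickers and badges",
]

# Capture the word immediately left of "man" and the character right of "job".
PATTERN = re.compile(r"(\w+) man for the job(.?)")

left = Counter()
right = Counter()
for line in CITATIONS:
    for m in PATTERN.finditer(line):
        left[m.group(1).lower()] += 1
        right[m.group(2)] += 1

# Most frequent left-hand neighbours first, as in a left-sorted concordance.
for word, n in left.most_common():
    print(f"{n:2d}  {word} man for the job")
```

With the full set of 47 solutions, the same kind of tally would give the counts for best, right and the reported earlier, and the right-hand tally would show how often a full stop follows.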


Time is running out. We haven't got far with the Berlusconi article, but we've done some interesting work on its vocabulary, and they've certainly all learned something - linguistically, methodologically, and about the world. We've examined a number of problems relating to querying and interpreting corpora, and seen how various kinds of information can be obtained from counts and concordances, with as usual plenty of stimuli for further independent work. And we have also tacitly dealt with a fair number of features which are relevant to comprehension of the article - the attitude of the journalist, for example - so I feel few qualms in leaving them to finish reading it at home, and to look up and report back on other interesting features they find in it.


A further thing they can do is to look up interesting expressions found in the concordances we've been looking at. I certainly don't expect them to understand all the citations in every concordance - indeed I expect them to focus on those they do more or less understand - but it will be rare for them not to find words, usages or texts which stimulate their curiosity. For instance, in the last citation above, what does unstinting mean? Is its collocation with support a frequent one? (Answer: we also find it with help and service, so there would seem to be a tendency for it to collocate with words from a particular semantic area rather than with just support.) And what about its colligations? Is it only used attributively or also predicatively? (Answer: also predicatively. You can be unstinting in your praise of somebody or something.)


Plenty for them to do. I make it clear that I don’t expect them to do it all, but to choose things from their lists which they find interesting, and to work together if they want. If they meet other members of the class during the week, they should talk about what they have discovered and how, since their work will create lots of information gaps (and reasoning gaps) for them to bridge. I tell the student who was interested in the verb english that it would be nice if she gave a ten-minute presentation of her findings at the beginning of the next lesson - and suggest that she might extend the inquiry to other languages: is french or italian ever used as a verb? I myself don’t know the answer to this question, and she might find it out.


Finally, I use the last ten minutes to do what I always do when I have the OED and BNC available - to look at some features of my (or their) spoken English. I always ask students to try and note down one or two things which they hear said during the lesson which strike them as strange or new, and at the end get them to say what they have written down, and we look some of these things up. I already know one of the items on today’s list: get your teeth into, which I saw one student write down at the beginning of the lesson. The OED tells us the meaning all right, and provides three brief examples:


8h. (various phrases) to get one's teeth into; to become engrossed in, to come to grips with, to begin serious work on.

1935 D. L. Sayers Gaudy Night i. 23 If one could work here steadily - getting one's teeth into something dull and durable.

1961 B. Fergusson Watery Maze vi. 140 American eagerness to get their teeth into the enemy.

1983 G. Mitchell Cold, Lone, & Still x. 111 He's not the man to let go while he's got his teeth into a suspect.


The BNC contains 37 occurrences of forms of get followed by teeth into within 5 words. A memorable(?) example comes from another ELT textbook review:


English now is a varied, informative and intelligent course. It gives students something to get their teeth into because many students enjoy, and demand, this level of challenge - and because learning a language well demands this level of enjoyment.


But, one bright student points out, like dusting away the cobwebs, this expression too may be variable, with other verbs occurring in the place of get. So we do a new query, this time for teeth into.

SARA finds 83 occurrences of teeth into in the BNC, so this looks like an expression worth learning. Preceded by possessives, there are 37 with forms of get, 32 with sink, 4 with dig, plunge or clamp, and 1 with have. Sink, dig, plunge and clamp seem to denote literal bites (particularly by animals, or else in erotic contexts), while get and have seem to be metaphorical, and almost always human. But in quite a few cases (generally from newspapers) there seems to be deliberate ambiguity, as in:


JACKIE Wintle, pictured above, from Barlaston's planning department, has a hobby she can really get her teeth into.|  In her spare time she makes and decorates cakes of different shapes and sizes for all occasions.


A BIG star of 1993 will be Sadie Frost, 25, above, already making a big name for herself as she sinks her teeth into Dracula.


STALLONE’S girlfriend Jennifer Flavin parades the new look that Sly loves to get his teeth into.


In one broadcast news item, this strategy is repeated ad nauseam:


The British banger has taken a bit of a grilling today. Researchers from Which Magazine scoffed dozens of the things. Instead of meat they found fat, rind and gristle they tell us. The Government, though, is battling to save the sausages’ skin. It's a subject Mark Phillips has been getting his teeth into.


But the most fascinating citation, which goes a long way towards a historical explanation of those above, is the following:


Reading and eating were always metaphorically linked at the time. You get Du Bellay, in the 16th century, playing the cannibal and ‘digesting’ his literary forebears. And you get Milton proclaiming the nutritious value of books which are, for him, ‘as meats and viands’. But unlike Du Bellay, whose visceral fantasies of sinking his teeth into Homer and Virgil are frankly rather nasty, Norbrook isn't out to appropriate past authors for his own devices.


And this also explains, of course, better than I have any right to expect, my joke at the beginning of the lesson, when I proposed the Berlusconi article as something for them to get (or did I say sink?) their teeth into, as an alternative to the Christmas cake. Enough for today: I think it’s time for lunch.


* * *


Let me try to ease the reader’s digestion by listing the main pedagogic points I have tried to illustrate through this (mainly truthful) account. For theoretical justifications and further detail concerning these points, see Aston (2001).


As a resource, a corpus like the BNC can be used to complement traditional tools (dictionaries, grammars and textbooks) in the roles of:

- a reference tool to resolve particular problems encountered in reading or listening;

- a means to study particular linguistic features for their own sake;

- a resource to browse in, letting one curiosity lead to another.


From an information perspective, a corpus can provide:

- information concerning frequency;

- information concerning distribution in different text-types;

- information concerning variants;

- information concerning collocations, colligations, and discourse functions;

- as a consequence of the above, information about meaning in the sense of how and when an expression is used;

- information about the world as well as about the language;

- multiple examples, from which learners can identify recurrent patterns and select instances which they find memorable.


From an activity perspective, a corpus can provide:

- practice in reading;

- practice in inferring and testing generalisations about language use;

- opportunities for spoken or written communication when discussing and reporting strategies and findings.


From an autonomy perspective, a corpus can provide:

- a powerful tool for independent or collaborative learning, reducing dependence on the teacher or the native-speaker as language expert;

- a motivating resource (as I hope to have demonstrated here).






1. The world edition of the British National Corpus is available for single-installation purchase (a 20GB hard disk is recommended) for £50. A version for network installation is available at £250. The BNC can also be consulted over the Internet, with a maximum of 50 solutions with one sentence of context per query, free of charge. For full details see


2. In the screenshot above, headlines are preceded by multiple vertical bars.


3. The BNC SARA software allows the user to read any text in the corpus, starting at the beginning (or end) or at the point of a particular occurrence. For reasons of space, this facility cannot be illustrated here.






Aston, G. (ed.) (2001). Learning with corpora. Houston TX: Athelstan.







Guy Aston teaches English Linguistics at the University of Bologna in Italy. He has recently edited a volume on the use of corpora in language learning, Learning with Corpora (Athelstan, 2001), and is co-author (with Lou Burnard) of The BNC handbook (Edinburgh University Press, 1998).