WHAT CORPORA FOR ESP?

Guy Aston

Scuola Superiore di Lingue Moderne per Interpreti e Traduttori
University of Bologna at Forlì


Paper presented at a conference on ESP at the University of Pavia, November 1996


1. Corpora in language teaching and learning

By corpora, I mean collections of homogeneously-encoded computer-readable text compiled for linguistic purposes. These may be corpora which are already publicly available, or ones which are constructed ad hoc for particular purposes by the teacher or researcher. In either case, such corpora can be analysed with concordancing software to provide three main types of secondary data:

  • frequency counts of particular features in the corpus;
  • concordances listing contextualised examples, or citations, of particular features;
  • frequency counts of further features within a set of citations, for instance of collocations (lexical co-occurrences) and colligations (grammatical, semantic and pragmatic co-occurrences).

In language teaching contexts, such data can be used:

  • to investigate how particular language features are used, by examining concordances of them and comparing frequencies;
  • to practise using language by interpreting citations.

The first of these pedagogic uses of corpora has a long-established tradition in ESP. Well before the advent of the computer, researchers had created and analysed corpora manually as an aid to syllabus and materials design. For instance, West (1953) analysed the most common words and word-senses in 5 million words of written text, with the aim of facilitating the selection and grading of vocabulary for EFL purposes, as well as for the writing of popular science and technology. The growth of ESP in universities in the 1970s and 1980s coincided with the spread of personal computers and of optical character recognition and concordancing software, allowing researchers to create and analyse their own corpora of specific domains with far greater ease. For instance, Skehan (1981) wrote a simple programme to analyse a corpus of economics texts with the aim of identifying the proportions of general, sub-technical, and specialised lexis. He concluded that "if our aim is to put [ESP] students into a position from which they can understand 80% to 90% of the total running words in texts, the emphasis in vocabulary teaching must be on sub-technical words", such as rate and process (Skehan 1981: 117). Flowerdew (1993) directly derived the lexicogrammatical content of an EAP course from an analysis of a corpus of biology textbooks. Several studies have compared collections of authentic texts with ones of published teaching materials, well summed up in the title of a paper by Ljung (1991), "Swedish TEFL meets reality". In recent years, corpus-linguistic studies of register have provided empirical justifications for such approaches:

    The markedly different patterns of linguistic form and function that occur across registers indicate that there is no single set of linguistic features that should be emphasized for all students, once they have mastered the rudiments of English grammar. Rather, it is important to teach the linguistic characteristics and functions of particular target registers, so that students will be able to control the language structures they encounter in actual discourse and to adjust their language use appropriately for different registers.
    (Biber et al 1994: 174)

    While these studies underline the function of corpora in defining the contents of ESP work, seeing them as instruments for the teacher and materials writer, a similar perspective can also be adopted from the learner's viewpoint. By investigating corpus data, learners can discover how language is used in the domain they are concerned with in a process of "data-driven learning", drawing their own inferences about the meanings and uses of particular features (Johns 1991, 1994). Aston (1995, 1996) underlines how such activity can be learner- rather than teacher-directed, with appropriate corpora providing a self-access resource from which learners can derive information for themselves, complementing that available from dictionaries, grammars, and encyclopaedias. Figure 1 shows a concordance of the word jaundice taken from a collection of research papers in hepatology (FHC: see below). From it, ESP learners may be able to derive much potentially relevant information: that jaundice is something that patients can have or develop, that it may be severe, and that it is often judged a symptom of hepatitis, along with such disturbances as anorexia and nausea. They may also opt to examine a rather wider context in order to find out what "postpump" jaundice is:

    Fig. 1. Jaundice in FHC

     1 became raised and 10 patients developed jaundice. All of the nine patients who had no
     2  full-blown clinical symptoms including jaundice (1). Those who have mild fatigue oft
     3 ent, taking particular note of previous jaundice, hepatitis, blood transfusion, drug 
     4 rexia, rigors, joint pains, dark urine, jaundice). Virological studies- Markers of he
     5 d a history of cirrhosis, hepatitis, or jaundice; three studies excluded patients wit
     6 esence of one or more of the following: jaundice, loss of appetite, weight loss, or f
     7 thin two postoperative days) "postpump" jaundice not associated with raised transamin
     8 e recruitment (Table 3). No patient had jaundice, but hepatomegaly was present in 28 
     9 our weeks apart. Ten patients developed jaundice, and six of these were acutely ill (
    10  6, 11, 17) though they did not develop jaundice, had classical symptoms of acute hep
    11 of factor VIII, and all four had severe jaundice and anorexia and suffered from nause
    12 gy. Among the 10 patients who developed jaundice the incubation period was under five
    13 ncubation period was 12 weeks developed jaundice (figure). 
    14 phase of illness; approximately 15% had jaundice. None of those in whom chronic hepat
    15 . As only ten of our patients developed jaundice it is possible that the hepatitis wo
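
A concordance of the kind shown in Figure 1 can be approximated in a few lines of code. The following is only a minimal sketch, not the software used for the paper: the function name `kwic`, the `width` parameter and the sample sentences (adapted from citations 8 and 9) are assumptions made for illustration.

```python
import re

def kwic(text, node, width=40):
    """Return keyword-in-context lines for every occurrence of `node` (hypothetical helper)."""
    # Collapse line breaks and repeated spaces so contexts read as running text.
    text = " ".join(text.split())
    lines = []
    # Whole-word, case-insensitive match of the node word.
    for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the node words line up in one column.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Illustrative sample only, adapted from two citations in Figure 1.
sample = ("Ten patients developed jaundice, and six of these were acutely ill. "
          "No patient had jaundice, but hepatomegaly was present.")

for line in kwic(sample, "jaundice", width=30):
    print(line)
```

Truncating the context at a fixed number of characters, rather than at word boundaries, reproduces the clipped words at the edges of the citations in Figure 1.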
    

    Whether from the teacher's or the learner's perspective, the approaches so far described view corpora as a source of information. They focus on the product of the analysis. The knowledge derived is, in the terms of cognitive psychology, declarative: a series of facts about the language and the culture. Such knowledge may not of course be accurate: this will depend on the appropriacy of the corpus data and the ability of the user to analyse and generalise from it. But most if not all declarative knowledge involves simplification, just as interlanguage research has shown that second language acquisition effectively involves successive approximations to the target system.

    When considering the use of corpora by learners as well as by teachers and researchers, however, we should also consider a second aspect of corpus use, focussing instead on the analytic process. Work with corpora can also provide a means of developing knowledge which is procedural, in the sense of an ability to use language to carry out problem-solving operations. Brodine (forthcoming) argues that the interpretation of concordance data primarily involves "bottom-up" interpretation based on the local co-text, as opposed to the "top-down" interpretative processes based on the global context and overall text structure which are emphasised by most reading materials. In the jaundice example in Figure 1, several of the citations include a list of symptoms in a complex nominal group: to understand these citations, the reader must identify this group, analyse its internal structure, and assign it a syntactic role in the clause and sentence. Such procedures of chunking and grammatical analysis are of obvious relevance to the development of reading skills. Figure 2 illustrates how a concordance of the pronoun they can provide the learner with practice in resolving anaphoric reference:

    Fig. 2. They in FHC

    
     1 SA test systems became available that contain additional viral antigens. Experience gained so far
    with these systems has shown that they provide increased sensitivity (Craxi et al., 1991). The
    polymerase chain reaction (PCR) (Saiki et al., 1988) is an 
     2 cyrrhosis, and hepatocellular carcinoma are now accepted sequelae, their frequency their rate of
    development, and the degree to which they contribute to mortality are not yet well established, since
    current data come largely from retrospective studies. Prospectively 
     3  the 29 acute hepatitis C patients. Anticapsid antibody appeared earlier than the antinonstructural
    antibody in 10 seroconverters. They appeared simultaneously in 15 seroconverters but anticapsid
    antibody appeared later than the antinonstructural antibody in 3
     4  clinical symptoms including jaundice (1). Those who have mild fatigue often blame it, perhaps
    correctly, on the disease for which they received the transfusion. Very few in whom prolonged
    abnormalities in serum aminotransferase levels develop 
     5 he absence of symptoms and may become the focus of attention or of therapy (8). Disease is perceived
    by patients, however, if they have symptoms. Signs such as an abnormal ALT level or histologic
    evidence of inflammation do not themselves interfere
    

    Concordances which are carefully selected can provide learners with systematic practice in such procedural skills, while at the same time calling on declarative knowledge to facilitate these interpretative processes.

    If the use of corpora can potentially develop both declarative and procedural knowledge in the learner, a key question is what kinds of corpora are most appropriate for these purposes. This question is clearly multi-faceted, and in this paper I will only touch on a few of the issues involved. It is evident that the choice of a corpus, and of particular data from that corpus, will always primarily depend upon the nature of the specific pedagogic situation, and on the need to train learners to use corpora effectively (Gavioli, forthcoming). But there is also, I think, a more general issue of the kinds of corpora that are most appropriate for ESP, be it from the perspective of the teacher or researcher working in syllabus and materials design, or from that of the learner who needs not only to understand the workings of language in a particular domain but also to practise engaging with it. In this paper I consider three related aspects of this issue: the extent to which the texts contained in a corpus should be limited to the particular domain of ESP (corpus specificity); the number and length of the text samples it should contain (corpus size); and the extent to which these mirror the universe of discourse which the corpus aims to capture (corpus representativeness).

  • Specificity. If we take at face value Biber et al's claim that different registers are linguistically distinct (1994: cited above), it follows that corpora for ESP should be as specific as possible. From this perspective, we may feel that it is desirable that a corpus for use with (say) postgraduate students of hepatology should consist exclusively of hepatology research papers. Corpora of such a highly specific domain are unlikely to exist in publicly available form, however, and will probably have to be created ad hoc by the teacher/researcher. In contrast, corpora which contain a broader base of texts, from a variety of specialisations and/or from a variety of genres (textbooks, divulgative texts, etc.), are more likely to be already available.
  • Size. Specificity also militates against size. It is widely accepted that corpora should, ceteris paribus, be as large as possible. Most linguistic phenomena are very rare, and as we shall see, the larger the corpus, the better is the chance of their being adequately documented. If a researcher has to construct a specific corpus by first locating appropriate texts, and then keyboarding, scanning, or downloading them, the costs of doing so are likely to pose severe restrictions of scale.
  • Representativeness. Specificity and size limitations tend to decrease representativeness. Home-made specific corpora tend to be fairly opportunistic collections, consisting of a small number of texts which happen to be easy to assemble in electronic form. More representative sampling of a domain requires the inclusion of numerous text samples from a wide range of sources (Atkins et al 1992). Larger and more general corpora are more likely to be representative, albeit of a larger universe of discourse.

    In the next section I shall compare three corpora which range from the relatively specific, small and opportunistic to the relatively general, large and representative, and discuss their potential relevance to the hypothetical ESP context of postgraduate hepatology students. They are:

  • The Forlì hepatitis-C corpus (FHC) (42,857 running words). This contains keyboarded versions of 14 recently-published hepatology research articles, taken from medical journals in the university library. It is thus highly specific, while very small and of limited representativeness.
  • The MicroConcord corpus B medical texts (MCB-M) (203,419 running words). MicroConcord corpus B contains 33 similarly-sized samples from a variety of academic books published by Oxford University Press (Murison-Bowie 1993). 7 of these are classed as medical, dealing with a variety of topics: none deals specifically with hepatology. Taken together, these 7 texts constitute a corpus which is thus larger but less specific than FHC, though the limited number and restricted provenance of the texts cast some doubt on its representativeness.
  • The British National Corpus written applied science texts (BNC-WAS) (7,369,290 running words). There are 364 texts classified as applied science in the written component of the BNC - just under 10% of the entire corpus, which aims to represent contemporary British English as a whole (Burnard 1995). Constructed using rigorous sampling techniques, the number and variety of texts mean that while less specific, the applied science section is almost certainly more representative of its domain than either of the corpora described above.

    By comparing some of the information available from each of these corpora and the procedural work involved in their use, I hope to illustrate some of their virtues and limitations. Overall, I shall suggest that none of these corpora is in itself fully adequate to meet ESP requirements, and that pedagogy is likely to be best served by making use of a range of corpora of different types.

2. Comparing corpora

    I begin with some general comparisons of the lexical content of the first two of these corpora (FHC and MCB-M), carried out using the Wordlist and Keywords components of WordSmith Tools (Scott 1996). I shall then go on to make a number of comparisons for specific items with the third, much larger corpus, BNC-WAS. While lexical analysis is not the only possible pedagogic use of corpora, it is likely to be a major one, and I believe that the problems posed by a lexical comparison can largely be extended by analogy to other areas.

    There are substantial differences in the frequencies of word-forms in FHC and MCB-M. Figure 3 lists those items which are significantly more frequent in FHC:

    Fig. 3. Word-forms more frequent in FHC than MCB-M (in descending order of significance)

                                FHC               MCB-M         chi-square
    
         1  HEPATITIS           817   (1.90%)      16            3787.9 
         2  PATIENTS            741   (1.73%)     245  (0.12%)   2298.9 
         3  HCV                 301   (0.70%)       0            1427.9 
         4  TRANSFUSION         300   (0.70%)       0            1423.2 
         5  CHRONIC             238   (0.55%)      15            1032.8 
         6  ANTI                226   (0.53%)      11             999.9 
         7  LIVER               233   (0.54%)      20             980.1 
         8  NON                 309   (0.72%)      96  (0.05%)    977.4 
         9  WERE                583   (1.36%)     522  (0.26%)    966.2 
        10  ALT                 194   (0.45%)       0             917.9 
        11  STUDY               228   (0.53%)      72  (0.04%)    715.3 
        12  B                   268   (0.62%)     142  (0.07%)    655.8 
        13  BLOOD               226   (0.53%)      89  (0.04%)    646.0 
        14  CIRRHOSIS           136   (0.32%)       0             641.6 
        15  C                   212   (0.49%)      84  (0.04%)    604.0 
        16  SERUM               116   (0.27%)       4             520.5 
        17  NANB                110   (0.26%)       0             517.8 
        18  POSITIVE            136   (0.32%)      33  (0.02%)    464.8 
        19  POST                112   (0.26%)      14             444.3 
        20  DISEASE             191   (0.45%)     127  (0.06%)    401.4 
        21  BIOPSY               80   (0.19%)       0             375.0 
        22  ANTIBODY             82   (0.19%)       2             371.5 
        23  HBC                  77   (0.18%)       0             360.7 
        24  PCR                  75   (0.17%)       0             351.2 
        25  RECIPIENTS           64   (0.15%)       0             298.8 
        26  MONTHS              121   (0.28%)      63  (0.03%)    297.1 
        27  HBSAG                63   (0.15%)       0             294.1 
        28  DONORS               63   (0.15%)       0             294.1 
        29  VIRUS               112   (0.26%)      52  (0.03%)    293.0 
        30  TRANSAMINASE         58   (0.14%)       0             270.3 
        31  FOLLOW               93   (0.22%)      35  (0.02%)    268.9 
        32  LEVELS               74   (0.17%)      17             254.9 
        33  CORE                 60   (0.14%)       4             254.9 
        34  ACUTE                78   (0.18%)      22  (0.01%)    252.0 
        35  SAMPLES              67   (0.16%)      12             245.8 
        36  RNA                  57   (0.13%)       4             240.8 
        37  DEVELOPED           100   (0.23%)      54  (0.03%)    239.6 
        38  UNITS                86   (0.20%)      35  (0.02%)    239.5 
        39  TESTED               80   (0.19%)      29  (0.01%)    234.6 
        40  DONOR                51   (0.12%)       1             230.5 
        41  POSTTRANSFUSION      49   (0.11%)       0             227.4 
        42  VIII                 50   (0.12%)       1             225.7 
        43  TRANSFUSED           47   (0.11%)       0             217.9 
        44  AMINOTRANSFERASE     47   (0.11%)       0             217.9 
        45  PROSPECTIVE          48   (0.11%)       2             210.0 
        46  VIRAL                61   (0.14%)      14             209.4 
        47  HBV                  45   (0.10%)       0             208.4 
        48  SEROCONVERSION       43   (0.10%)       0             198.9 
        49  TRANSFUSIONS         42   (0.10%)       0             194.1 
        50  FACTOR               58   (0.14%)      15             191.8 
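
The chi-square column in Figure 3 can be approximated from the raw counts and the two corpus sizes. The sketch below assumes a plain Pearson chi-square on a 2x2 table (word vs. all other tokens, FHC vs. MCB-M) with no continuity correction. The exact procedure used by the Keywords tool is not stated here, so the function `chi_square_2x2` and its results are illustrative and need not match the table to the decimal.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for one word in two corpora (assumed formula, no correction)."""
    a, b = freq_a, freq_b                      # occurrences of the word in each corpus
    c, d = size_a - freq_a, size_b - freq_b    # all other tokens in each corpus
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Corpus sizes and counts as reported in the paper.
FHC_SIZE = 42857
MCBM_SIZE = 203419
counts = {"HEPATITIS": (817, 16), "PATIENTS": (741, 245), "HCV": (301, 0)}

scores = {w: chi_square_2x2(f, FHC_SIZE, m, MCBM_SIZE) for w, (f, m) in counts.items()}
# Print in descending order of the statistic, as in Figure 3.
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:<12}{score:10.1f}")
```

Because the statistic depends on corpus size as well as on raw frequency, a word with no occurrences at all in MCB-M (such as HCV) can still rank below one which occurs in both corpora.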
    

    Not surprisingly, FHC shows much higher frequencies of domain-specific lexis. We can note that of the first 10 significantly more frequent words, nearly all are lexical rather than functional: hepatitis, patients, HCV, transfusion, chronic, anti, liver, and ALT (the name of a measure used in testing for hepatitis). Several of these (HCV, transfusion, ALT) do not occur at all in MCB-M. The only function words in the first 10 are non and were, in 8th and 9th position respectively. These data suggest that FHC provides a much more appropriate tool than MCB-M for highlighting lexis relevant to hepatology research. This is confirmed by a comparison of wordlists sorted by frequency for the two corpora. In MCB-M, the 25 most frequent forms are all function words, while in FHC they include a number of lexical ones (hepatitis, patients, HCV, transfusion, chronic, liver, study and anti), as well as some new functional ones (were, non, had and from) (figure 4):

    Fig. 4. Most frequent words in FHC and MCB-M

    
            FHC                  Freq.      %         MCB-M             Freq.     %
    
         1  THE                2149   (5.01%)         THE               12525   (6.14%)
         2  OF                 2059   (4.80%)         OF                 7607   (3.73%)
         3  IN                 1077   (2.51%)         TO                 6116   (3.00%)
         4  AND                1012   (2.36%)         AND                5222   (2.56%)
         5  HEPATITIS           817   (1.90%)         A                  4866   (2.38%)
         6  TO                  744   (1.73%)         IN                 4499   (2.20%)
         7  PATIENTS            741   (1.73%)         IS                 3354   (1.64%)
         8  A                   681   (1.59%)         THAT               2937   (1.44%)
         9  WERE                583   (1.36%)         BE                 2245   (1.10%)
        10  WAS                 523   (1.22%)         IT                 1946   (0.95%)
        11  FOR                 487   (1.14%)         FOR                1810   (0.89%)
        12  WITH                464   (1.08%)         AS                 1618   (0.79%)
        13  NON                 309   (0.72%)         THIS               1485   (0.73%)
        14  HCV                 301   (0.70%)         WITH               1475   (0.72%)
        15  TRANSFUSION         300   (0.70%)         OR                 1453   (0.71%)
        16  OR                  280   (0.65%)         ARE                1411   (0.69%)
        17  B                   268   (0.62%)         NOT                1283   (0.63%)
        18  HAD                 267   (0.62%)         WAS                1274   (0.62%)
        19  BY                  261   (0.61%)         YOU                1258   (0.62%)
        20  THAT                241   (0.56%)         BY                 1246   (0.61%)
        21  CHRONIC             238   (0.55%)         WHICH              1084   (0.53%)
        22  LIVER               233   (0.54%)         ON                 1048   (0.51%)
        23  FROM                231   (0.54%)         HAVE                996   (0.49%)
        24  STUDY               228   (0.53%)         AN                  884   (0.43%)
        25  ANTI                226   (0.53%)         BUT                 831   (0.41%)
    

    These differences suggest a greater specialisation of FHC not only in topic but also in genre: for instance the greater presence of past tense forms may be due to the fact that the FHC texts are research papers containing detailed narratives of experimental methods and results, as opposed to the book chapters contained in MCB-M.
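
Frequency lists of the kind compared in Figure 4 are straightforward to produce. The following sketch assumes a deliberately crude tokenisation (runs of letters and digits, upper-cased); the helper names `tokenize` and `frequency_list` are hypothetical and do not correspond to any particular package.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into upper-cased word-forms (assumed tokenisation rule)."""
    return re.findall(r"[A-Za-z0-9]+", text.upper())

def frequency_list(text, top=25):
    """Return (word-form, count, percentage of running words) for the `top` forms."""
    tokens = tokenize(text)
    total = len(tokens)
    return [(word, n, 100.0 * n / total)
            for word, n in Counter(tokens).most_common(top)]

# Invented sample sentence, for illustration only.
sample = "The patients were tested and the samples were stored."
for word, n, pct in frequency_list(sample, top=3):
    print(f"{word:<12}{n:5d}   ({pct:.2f}%)")
```

The percentage is computed against the total number of running words, which is what makes lists from corpora of very different sizes comparable.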

    This same lexical/functional difference appears from a comparison of the most frequent 4-word clusters in the two corpora, where we again find a clear contrast. While numbers of occurrences of these are, bar the first two clusters in FHC, similar in the two corpora, the percentage figures show that their relative frequency is much smaller in MCB-M (figure 5):

    Fig. 5. Most frequent 4-word clusters in FHC and MCB-M

    
     FHC                                 Freq.  %     MCB-M                            Freq.  %
    
     1   NON-A NON-B                     133 (0.31%)  AT THE END OF                     24 (0.01%)
     2   A NON-B HEPATITIS                99 (0.23%)  IN THE CASE OF                    23 (0.01%)
     3   OF NON-A NON                     35 (0.08%)  PRE-EXPOSURE TO THE               23 (0.01%)
     4   OF POST-TRANSFUSION HEPATITIS    28 (0.07%)  A CHANGE OF CONTEXT               22 (0.01%)
     5   HEPATITIS B SURFACE ANTIGEN      19 (0.04%)  AT THE SAME TIME                  22 (0.01%)
     6   WITH NON-A NON                   19 (0.04%)  IN THE UNITED STATES              20
     7   NON-B HEPATITIS IN               18 (0.04%)  PER CENT OF CASES                 20
     8   POST-TRANSFUSION HEPATITIS C     17 (0.04%)  THE SEXUALLY TRANSMITTED DISEASES 19
     9   RELATED TO LIVER DISEASE         17 (0.04%)  IN THE ABSENCE OF                 18
    10   A NON-B POST                     16 (0.04%)  AS A RESULT OF                    17
    11   B POST-TRANSFUSION HEPATITIS     15 (0.03%)  EXPOSURE TO THE CONTEXT           17
    12   NON-B POST-TRANSFUSION           15 (0.03%)  IS LIKELY TO BE                   17
    13   THE NUMBER OF UNITS              15 (0.03%)  ON THE BASIS OF                   17
    14   THE UPPER LIMIT OF               15 (0.03%)  THE EXTENT TO WHICH               17
    15   UPPER LIMIT OF NORMAL            15 (0.03%)  WILL BE ABLE TO                   16
    

    Being more varied, the list for MCB-M highlights items which are dispersed across texts of various kinds - understandably, mainly functional ones. While FHC highlights domain-specific lexis and clusters, these more general and functional items would seem poorly documented in it.
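
The cluster lists in Figure 5 amount to counts of recurrent 4-word sequences. The sketch below assumes that hyphens separate tokens, which is what entries such as NON-A NON-B (listed as a 4-word cluster) suggest. The function names and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Upper-cased word-forms; hyphens are treated as separators (assumption)."""
    return re.findall(r"[A-Za-z0-9]+", text.upper())

def clusters(tokens, n=4):
    """Count every run of `n` consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample, for illustration only.
sample = ("Non-A non-B hepatitis was common. "
          "Cases of non-A non-B hepatitis were followed.")
for cluster, freq in clusters(tokenize(sample)).most_common(2):
    print(" ".join(cluster), freq)
```

This simple version lets clusters run across sentence boundaries; a fuller implementation would probably split on punctuation first.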

    This tendency is confirmed if, rather than comparing particularly frequent items in these corpora, we examine particularly infrequent ones. Figure 6 lists some hapax legomena in each corpus, or word-forms which only occur once:

    Fig. 6. Hapax legomena beginning with the letters GO


            FHC            BNC z-score        MCB-M          BNC z-score

         1  GO                 8.15           GOA               -0.01
         2  GOAL               0.52           GODDEN            -0.01
         3  GOING              5.97           GODSEND           -0.01
         4  GOOD               7.27           GOODS              0.90
         5  GORDON             0.22           GOOSE              0.03
         6  GOT                8.39           GORDON             0.22
         7                                    GORY              -0.01
         8                                    GOSIO             -0.01
         9                                    GOVERNORS          0.13
        10                                    GOWLAND           -0.01

    The FHC list contains several word-forms which are very frequent in the language as a whole, judging from their z-scores in the entire British National Corpus (these indicate the extent to which a particular word-form is more or less frequent than the mean of all such forms). This would seem to reflect not only the specificity of FHC, but also its size, which is clearly insufficient to document many general and sub-technical items. While perhaps less common in this domain than in the language as a whole, such words as go, good, or got would hardly seem to be excluded from the potential lexis of hepatology research a priori.
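
The z-scores referred to here can be read as standardised frequencies. As a hedged sketch, assuming the usual definition (frequency minus the mean frequency of all word-forms, divided by their standard deviation), and using invented toy counts rather than real BNC figures:

```python
from statistics import mean, pstdev

def z_scores(freqs):
    """Standardise each word-form's frequency against all forms (assumed definition)."""
    values = list(freqs.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation over all word-forms
    if sigma == 0:
        return {word: 0.0 for word in freqs}
    return {word: (f - mu) / sigma for word, f in freqs.items()}

# Toy frequencies, invented for illustration; they are NOT BNC counts.
toy = {"GO": 900, "GOOD": 800, "GOAL": 60, "GODSEND": 2, "GORY": 1}
for word, z in z_scores(toy).items():
    print(f"{word:<10}{z:6.2f}")
```

On this reading, a score near zero marks a form of roughly average frequency, while a clearly positive one marks a form that is much more frequent than most.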

    Investigating the hapax legomena in a corpus provides a way of estimating its reliability for pedagogic purposes. If we count their number and divide this by the total number of word-tokens in the corpus, this tells us how often we encounter a word-form which occurs only once. This proportion can be taken as an estimate of the probability of encountering a word-form not present in the corpus if we were to read a further text from the same population. For FHC, this figure is 3.3% (1435/42,857), while for MCB-M it is only 2.2% (4456/203,419). Notwithstanding its lesser specialisation, in other words, with its greater size MCB-M does in fact seem to provide greater predictive power at a lexical level for its domain.
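
The hapax calculation described above is simple enough to sketch directly. The function `hapax_ratio` is a hypothetical helper; the two divisions at the end use the counts reported in the text.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Proportion of running words accounted for by forms which occur only once."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens)

# Counts reported in the paper: hapax legomena / running words.
fhc_ratio = 1435 / 42857
mcbm_ratio = 4456 / 203419
print(f"FHC:   {fhc_ratio:.4f}")
print(f"MCB-M: {mcbm_ratio:.4f}")

# The same measure computed from a tokenised text (invented sample).
print(hapax_ratio("the liver and the serum".split()))
```

The smaller the ratio, the less often a reader of a further text from the same population should expect to meet a form the corpus has never documented.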

    The overall picture that emerges from these comparisons is that FHC, by virtue of its specificity, gives more information concerning specialised lexis relating to this domain, while MCB-M gives more information concerning general and sub-technical items. The most striking example of this difference is found in the lists of 4-word clusters in the two corpora (figure 5 above), where MCB-M gives prominence to a series of conjunctive and modal expressions which act as general text-organising devices: in the case of, in the absence of, as a result of, on the basis of, is likely to be, will be able to. But similar differences emerge when we examine sub-technical items and their collocations. For instance, the word problems occurs only three times in FHC (figure 7):

    Fig. 7. Problems in FHC

    
     1 subsequently experienced significant clinical problems related to hepatic inflammation;
     2  located. Questions were designed to identify problems related to hepatic failure.
     3  known; the family was not aware of any liver problems in this patient. 
    
    While these data suggest some possible collocations - that one is aware of, experiences or identifies problems related to a particular disease - the number of examples is clearly insufficient to assess these reliably. If, on the other hand, we turn to MCB-M, we find many more occurrences of problems: 45 in all. There are no examples of problems related, but there are instead several of problems associated (figure 8):

    Fig. 8. Problems associated in MCB-M

    
     1 fat. There will also be additional health problems associated with lack of exercise. If your leg 
     2  worsen the symptoms of `jet-lag" and any problems associated with shift-work but also it impairs  
     3 have a much more open view of sex and the problems associated with it and have a less punitive ap
     4 errelated. They are: 1. Assessment of the problems associated with the overdose, of the overdose 
     5 ndition which superficially resembles the problems associated with disseminated gonococcal infect
     6  husband over a year previously. When the problems associated with the microscopic diagnosis, par
    
    This suggests that this collocation may also be appropriate in the hepatitis domain.
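
Counting what follows a node word, as in the comparison of problems related and problems associated, can be sketched as below. The function name `right_collocates` and the `span` parameter are assumptions; the sample strings are shortened from citations in Figures 7 and 8.

```python
import re
from collections import Counter

def right_collocates(tokens, node, span=1):
    """Count the word-forms found within `span` positions to the right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Shortened from citations in Figures 7 and 8, for illustration only.
sample = ("additional health problems associated with lack of exercise "
          "identify problems related to hepatic failure "
          "the problems associated with the overdose")
tokens = re.findall(r"[a-z]+", sample.lower())

for word, n in right_collocates(tokens, "problems").most_common():
    print(word, n)
```

With only a handful of citations such counts say very little, which is the point made in the text about the three occurrences in FHC.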

    The available evidence is nonetheless still insufficient to judge with any confidence whether problems are related or associated, or what differences there may be between the two. To gather yet more information concerning these collocations we can turn to the BNC applied science texts. Here we find 38 occurrences of problems associated, and only 3 of problems related (figures 9 and 10):

    Fig. 9. Problems associated in BNC-WAS

    
     1 chieved without individual and fiddling adjustment. The problems associated with the need for perfect
     2 essed over this issue is actually only addressing those problems associated with animal species. Plan
     3 nt bactericidal characteristics would solve many of the problems associated with both cleaning and di
     4  built in wet conditions. I have known the condensation problems associated with `drying out" to last
     5                                        Flat Roofs | The problems associated with felted flat roofs ar
     6  etailed geographic information to overcome some of the problems associated with linking data sets.  
     7 een developed to provide tools for the solution of many problems associated with emergency planning (
     8 aken overall the exercise had focussed attention on the problems associated with the filing systems o
     9 quality petrol during the Second World War. Many of the problems associated with hydrocarbon producti
    10 to study. Compendia of information cannot solve all the problems associated with mapping out one's li
    11 the features of company and partnership but without the problems associated with limited partnerships
    12 y as impractical. They do however perhaps highlight the problems associated with nomination, and dire
    13                                          In view of the problems associated with training and staff m
    14 ed that indistinct analysis of needs will contribute to problems associated with evaluation. If evalu
    15 esh as an appropriate technology by trying to avoid the problems associated with the pre-packaged sal
    16  level nuclear waste had admitted it has not solved the problems associated with the migration of rad
    17 water extracted from the ground. However, sometimes the problems associated with large trees can only
    18 as an aid for one person will not usually encounter the problems associated with elaborate design tea
    19 f statistical survey reveals the scale and diversity of problems associated with the mind at work. Th
    20 rounded off the day with a highly technical look at the problems associated with the issuing of flyin
    21 makes sense on logical grounds, since it gets round the problems associated with grandmother cells, l
    22 nologies. The software is said to solve the fundamental problems associated with software development
    23  short supply, and help to alleviate some of the health problems associated with living in confined a
    24 definitions, of which 88% was correct), there are still problems associated with the processing of id
    25  of current practices has shown that the main practical problems associated with these schemes did no
    26 g is to encourage a multidisciplinary discussion of the problems associated with the use of subsymbol
    27 look very attractive, because it eliminates many of the problems associated with open-loop control (m
    28  using a special keyboard having 21 keys. There are two problems associated with converting the steno
    29  encyclopaedic knowledge, semantic networks). There are problems associated with extracting semantic 
    30 ormation as possible between the two tagsets. There are problems associated with combining the two ta
    31 ound. Furthermore as the matrix becomes less sparse the problems associated with storage increase. Al
    32  organized by a Data Base Management System (DBMS). The problems associated with data collection, inp
    33  immunohistochemical technique has overcome many of the problems associated with the tritiated thymid
    34 re caused by operational mishandling and other physical problems associated with the constant mountin
    35 e transferred in one cycle. The Geneva report lists the problems associated with age, cycles attempte
    36 ught up in this debate because of the health and social problems associated with alcohol abuse - even
    37 t abuse alcohol and do not suffer from health or social problems associated with their drinking. Medi
    38  discriminate against imported products. The logistical problems associated with the scheme have also
    

    Fig. 10. Problems related in BNC-WAS

    
     1 d on the form of the emergency. Most would have tackled problems related to sheep on their own, but i
     2 made to simplify insertion and use of the machine. Yet, problems related to patients could not be mod
     3 ed to a failure to recognise Crohn's disease, technical problems related to pouch construction and il
    
    The frequency data here seem fairly conclusive.

    These data suggest that on the one hand FHC is unreliable where relatively infrequent features are concerned, and on the other that larger and more general corpora can provide appropriate data for the study of features which are less domain-specific. We cannot deduce from FHC that problems are more frequently related than associated in hepatology texts, as the numbers are too small, but from the BNC we can deduce that there seems no reason why problems associated should not also occur in such texts. By comparing results from small specific and large general corpora, an idea can be formed of the extent to which a particular feature is or is not restricted to a specific domain. The larger corpus, by indicating the generality of the expression in question, highlights patterns of use which are not apparent in the small one, but which might well be found were the latter larger.
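    The comparison described above, concordancing the same expression in a small and a large corpus and setting the frequencies side by side, can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used to produce the figures in this paper: the function names kwic and per_million, the sample sentence, and the corpus sizes in the example are all assumptions made for the purpose of illustration.

```python
import re

def kwic(text, phrase, width=30):
    """Return keyword-in-context lines for each occurrence of `phrase`.

    Each line shows up to `width` characters of context on either side,
    with the search phrase aligned in the same column.
    """
    pattern = r"\b" + re.escape(phrase) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}{m.group(0)}{right:<{width}}")
    return lines

def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

# Invented sample text, standing in for a corpus.
sample = ("We discuss the problems associated with storage. "
          "Other problems related to cost remain.")

for line in kwic(sample, "problems associated", width=20):
    print(line)

# Hypothetical counts and corpus sizes, for illustration only.
print(per_million(5, 250_000))
```

    Normalising to a common base such as occurrences per million words is what makes counts from corpora of very different sizes comparable. It does not, of course, make a handful of occurrences in a small corpus any more reliable as evidence.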

    3. Using different corpora

    One argument in favour of the use of large, general corpora as well as small, specific ones has now been outlined. On the level of lexis, the literature bears witness to a widespread conviction that, in ESP, terminology is rarely the central learning problem. More commonly, learning problems relate to a sub-technical lexical level, which is, by definition, less specialised (Skehan 1981, Mparutsa et al 1991): one of the ways in which terminology can be automatically identified is in terms of its greater tendency to occur only in a limited range of texts (Sta 1995). Such lexis is therefore likely to be better documented in a larger, more general corpus than in a small specific one, thereby better allowing reliable declarative knowledge in this area to be inferred by the user. I now turn to a second argument, which also focusses on the procedural aspects of the question.

    It is arguably as important for ESP learners to develop strategies for dealing with the unpredicted as it is for them to be familiar with the predictable. All learners are likely to have to face situations which cannot be fully predicted linguistically, and which do not conform to the content of any given corpus. However reliable a corpus may be as a sample, its reliability in pedagogic terms will always be limited where the learner's target requirements cannot be precisely defined. ESP courses, which must gloss over individual differences in needs, are both unable to define these requirements precisely, and unable to avoid averaging them out. In any case, we have seen that any new text, even from the same domain, will almost certainly contain features unmatched within a corpus, however representative, large and specific it may be. There is no way of reducing the proportion of hapax legomena in a corpus to zero.
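    The point about hapax legomena and unmatched features can be made concrete with a short sketch. The following Python fragment is a minimal illustration under stated assumptions: the function names tokenize, hapax_ratio and unseen_types are my own, the tokenisation is a crude regular expression, and the two sample "texts" are invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def hapax_ratio(text):
    """Proportion of word types that occur exactly once in `text`."""
    counts = Counter(tokenize(text))
    if not counts:
        return 0.0
    hapaxes = [word for word, n in counts.items() if n == 1]
    return len(hapaxes) / len(counts)

def unseen_types(corpus_text, new_text):
    """Word types in `new_text` that never occur in `corpus_text`."""
    return set(tokenize(new_text)) - set(tokenize(corpus_text))

# Invented miniature "corpus" and "new text".
corpus = "the cat saw the dog"
new = "the dog bit the cat"

print(hapax_ratio(corpus))
print(unseen_types(corpus, new))
```

    Run on real corpora, measures of this kind give a rough indication of how much of a new text a given corpus leaves undocumented.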

    In pedagogic terms, the essence of this point has been well put by Widdowson, who criticises approaches to ESP which aim to specify the syllabus in terms of the features of target registers:

    Communicative competence means the ability to enact discourse and so to exploit a knowledge of rules (usage and use) in order to arrive at a negotiated settlement. It is essentially a capacity for solving problems, not a facility for producing prepared utterances. So if we are going to specify a restricted repertoire, it should be represented as a range of problem-solving strategies, involving the contingent use of language, not a collection of items.
    (Widdowson 1984: 197-8)

    This brings us back to the second use of corpora in teaching mentioned in the first section of this paper, that of providing opportunities for learners to engage in interpreting discourse in a controlled manner: chunking text, assigning anaphoric reference, and the like. For such purposes, it can be argued that the choice of texts need not necessarily be restricted to a limited target domain, since the kinds of problems on which problem-solving strategies can be developed will also be found elsewhere: research is not limited to hepatology. Many such problems are in fact likely to involve interpreting functional and sub-technical items, which are, as we have seen, very differently distributed from specialised lexical ones: indeed variation in the distribution of these items may be no higher in a more general corpus than in a domain-specific one. In their analysis of scientific texts, Biber and Finegan (1994) found that for features relating to factors of informativeness and abstractness, the extent of variation across different sections of medical texts (introduction, methods, results, discussion) was almost as great as overall variation within scientific texts in general.

    The tendency for certain features to cluster in particular parts of texts can be seen as a matter of functional specialisation which cuts across domains. For instance, of 68 occurrences of the word our in FHC, no fewer than 60 occur in discussion sections, 22 of these in the collocation our study (figure 11):

    Fig. 11. Our in discussion sections of FHC

    
     1 able nonviral causes of transaminase elevation by our clinical review committee may also have played
     2 view had advanced disease-a finding that supports our conclusion. Recent reports have more clearly d
     3 th chronic posttransfusion type c hepatitis (19). Our data corroborate the clinical impression that 
     4 ion recipients in whom hepatitis did not develop. Our data also indicate that the frequency of death
     5  some time. For them, liver disease truly exists. our experience, however, is that, at least for the
     6 empts to address both these issues, although only our findings on the frequency of fatal liver disea
     7 pective of the mode of transmission. Furthermore, our group of IVDU cases differed from the usual ur
     8 vely short follow-up periods of 5 to 10 years. To our knowledge, no previous study has evaluated as 
     9 patients who resolved or became chronic. However, our long-term follow-up seems to show that continu
    10  it was more common in the posttransfusion group. Our mean total HAI score (10.2) for the posttransf
    11  this final hypothesis. Whatever the explanation, our observation is of particular interest from a c
    12 disease may become apparent in later years. Thus, our observation of these study cohorts continues. 
    13 s have been recognized. Several groups, including our own, have expressed concern over the potential
    14 mic or other demographic factors, we believe that our patient population is suited for study of the 
    15 ear aetiology. The prevalence of HCV infection in our patients group was about 90%, which is in good
    16 immunodeficiency virus and HBsAg. Because most of our patients were educated suburbanites who had us
    17 the experience reported elsewhere. As only ten of our patients developed jaundice it is possible tha
    18 e described. A high percentage (more than 70%) of our patients developed chronic hepatitis. in a rec
    19 16 years, this occurrence is less common. Most of our patients have continued to do well with regard
    20 PHT may be an overestimate. The fact that 100% of our patients were NANB/H and that none had hepatit
    21 etectable levels of HBsAg. Eighty-nine percent of our patients with PHT had a short incubation perio
    22 lso to be reactive for anti-HBc. In fact, none of our patients with non-A, non-B hepatitis developed
    23 recent acquisition of HCV infections since all of our patients suffered from chronic hepatitis for a
    24 middle-aged before they are exposed to the virus; our patients were, on average, in their sixth deca
    25  may account for the antibody-negative results in our patients. First, it can be speculated that ser
    26 g the treatment occurred in approximately half of our patients. The relatively small number of patie
    27  did not exclude HBeAg-negative HBsAg carriers in our prospective study of PTH. In these carriers th
    28 n lost. The pattern of illness in the patients in our prospective study may indicate that at least t
    29 United States developed a hepatitis-like picture, our reported frequency of PHT may be an overestima
    30 degree of hepatic inflammatory activity (17, 18). Our results suggest that if a safe and effective t
    31 irus, i.e., there may be agent(s) other than HCV. Our results in these 6 patients were consistent wi
    32 ve than anti-C100 EIA in detecting HCV infection. Our results of HCV PCR also confirmed that primers
    33 n-A, non-B hepatitis (3.4%) closely comparable to our results. We conclude that non-A, non-B hepatit
    34 more, only 7% of the patients with hepatitis C in our series were considered sporadic, and this figu
    35 the hepatitis was most likely also due to HCV. In our series, serum HCV RNA was detected right at th
    36 EIA3 were highly concordant with serum HCV RNA in our series. Only 1 patient (Patient 18, Table 4) w
    37 uction of HBsAg positive HBV carrier chimpanzees. Our study also showed transient suppression of HBs
    38 r of units discarded. Nearly 8% of donor units in our study would have been lost if we had screened 
    39 ther reports. In support of other studies (27) in our study patients with chronic active hepatitis w
    40 d in percentages which range from 8.9% to 33%. In our study 13% of the chronic patients showed a spo
    41 19), 53% in Italy (20), and 67% in Spain (21). In our study the patient's age and sex, the number of
    42 s biweekly for 6 mo. The high frequency of PHT in our study may reflect the high carrier rate for NA
    43 H/NANB/H. This question could not be addressed in our study as ALT is not stable in stored blood and
    44  non-B hepatitis after cardiac surgery was low in our study (3.2%) as compared with other, similar s
    45 rgical procedure rather than the transfusions. In our study there was no difference in the incidence
    46 antibody to C 100-3 could be detected. However in our study group viral serology testing was conduct
    47  The risk profiles for hepatitis C transfusion in our study population shoe that massive transfusion
    48 ients who progressed from acute to chronic PHT in our study was similar to that reported from Bethes
    49 on. In addiction, the chronicity rate observed in our study (25/29, i.e., 86%) may be even higher, b
    50 mmunity in HCV infection. Sera from 2 subjects in our study were persistently positive by PCR beginn
    51 ransfused hospitalised patients. In the design of our study a control group of nontransfused hospita
    52 anfusion hepatitis. The most striking findings of our study were that each of 55 cases of probable p
    53 s. Despite normal aminotransferase levels some of our study patients may have histologic evidence of
    54 y can only be established by prospective studies. Our study is the only Canadian prospective study a
    55 nsfusion were more severe than in sporadic cases. Our study, of 83 liver biopsy specimens from patie
    56 m (28, 29). Had we excluded these four cases from our study, the difference between the two groups w
    57 which might prevent only 30% of cases (19-22). In our study, however, the more clinically severe cas
    58 e are inaccuracies in cause-specific diagnosis in our study. There is no reason to believe, however,
    59 ion to fatal disease must await the completion of our testing. The majority of cases of non-A, non-B
    60 s may contribute to more aggressive disease (25). Our univariate analysis supports the finding of Ma
    
    This tendency is well documented in FHC, contrasting notably with MCB-M, where our study does not appear at all. BNC-WAS, however, provides no fewer than 171 occurrences of this phrase, and confirms its use in concluding research papers, as well as in a number of recurrent collocations. Almost 60% of the BNC-WAS instances take the form in our study, and there are also clear patterns of associated verbs: suggests, shows, showed, found. While several of these also occur with our study in FHC, none does so more than once, making it impossible to draw conclusions from FHC alone.
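    Counts of this kind (how often our study is preceded by in, and which verbs follow it) are the third type of secondary data mentioned at the start of this paper: frequencies of further features within a set of citations. A minimal Python sketch is given below. The function names tokenize and neighbours are hypothetical, the sample text is invented, and sentence boundaries are ignored, so it should be read as an illustration of the procedure rather than as the method used to obtain the figures quoted here.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def neighbours(tokens, phrase, offset):
    """Count the words found at `offset` from each occurrence of `phrase`.

    `phrase` is a list of tokens. offset=-1 means the word immediately
    before the phrase; offset=1 means the word immediately after it.
    """
    n = len(phrase)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != phrase:
            continue
        j = i + offset if offset < 0 else i + n - 1 + offset
        if 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts

# Invented sample text, standing in for a set of citations.
sample = ("In our study we found X. Our study suggests Y. "
          "In our study the rate was low.")
tokens = tokenize(sample)

print(neighbours(tokens, ["our", "study"], -1).most_common())
print(neighbours(tokens, ["our", "study"], 1).most_common())
```

    With only a few dozen citations, as in FHC, most of the resulting counts will be ones, which is precisely why the larger corpus is needed before any pattern can be claimed.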

    Here again, we find that the larger corpus provides the user with a source of further evidence, a means of testing hypotheses and investigating more detailed patterns with respect to the smaller one. But it also offers other advantages. It shows how far a feature may be generally applicable. Insofar as learners' precise future needs are unpredictable, such more generally applicable knowledge is relevant to them, just as it may be relevant for them to be aware that certain features are not more generally applicable, but restricted to a specific domain. The majority of ESP learners come from an EGP background, and one of their requirements is to understand the extent to which their previously-acquired knowledge of the language can be applied in the ESP context. Access to a large general corpus (such as the BNC) allows teachers and learners to compare uses in different text-types, thereby making for an awareness of the distinctive and shared features of the specific domain being investigated. To cite Halliday,

    if we recognise departure from a norm, then there has to be a norm to depart from. If we characterise register variation as variation in probabilities, as I think we must, it seems more realistic to measure it against observed global probabilities than against some arbitrary norm such as the assumption of equiprobability in every case.
    (Halliday 1992: 69)

    A large balanced corpus provides an opportunity for teacher and learner to relate the characteristics of a specific text-type, as revealed by a specialised corpus, to more general characteristics of the language, seen as a whole.

    To illustrate this point, let me focus further on the use of study as a noun. The concordance in figure 12 shows 30 randomly selected citations out of a total of 18,967 in the entire BNC. What appears from these citations is that as a noun, study has three main uses, the most frequent being that of a specific research project. Less frequent uses denote a general area of research ("the historical study of coins", "the study of literature/animal perception/experimental philosophy"), and the activity of students in an institution ("study courses", "programmes of study"):

    Fig. 12. Study as a noun in the BNC

    
    1  our knowledge of the past. Since the historical study of coins began, in the Italian Renaissance, man
    2     Williams made similar deductions following a study carried out in the United States.| All three wr
    3  Ramsden and Snee used data from the DHSS Cohort Study. The Cohort Study interviewed a group of unempl
    4                                             CASE STUDY: NEW DIVISION| The division was formed two year
    5                  The principal conclusion of the study was that all three of the series of square-head
    6  eived less attention than the motor aspect, the study of animal perception (as opposed to sensory phy
    7  ful that finding was obtained to set up a panel study. A sample of 5,362 of the original children was
    8  ugh the BMA's anxieties were not supported by a study carried out by the Department of Health's Advis
    9  wever a β type error cannot be excluded in this study because the prestudy assumption was that microa
    10 tion come mainly from a participant observation study at Oxford United and from conversations at a 
    11 67). Sprat suggested that, of all pursuits, the study of experimental philosophy was most likely to e
    12 is acceptable.| In the case of the eye movement study, subjects were presented with an array of four 
    13 ormation on reactions to imprisonment. In a USA study, Tittle found that women associated with a best
    14 m down as much as she could, by semi-obligatory study courses, and quasi-essential trips to the conti
    15                      4.3.6 The rural hinterland study | This part of the Belfast project does not hav
    16  and Vagg, 1988; Waddington et al., 1990). In a study carried out by Morris and Heal (1981), resident
    17  here some of its positive implications for the study of literature.
    18 on reactions to Chernobyl. Award Title:  A case study of consumer acceptance of new technology in a b
    19 tive dissemination of research results from the study). Award Title: Choice in scientific work: A com
    20 seemed to support notions of repression. In the study (Levinger & Clark, 1961) emotional words were s
    21 unts of chemical elements in rocks. Geophysics: study of the Earth by quantitative physical methods. 
    22 2-term course on Theory and Methods of Literary Study or Theory and Criticism and two l-term courses 
    23 t entirely supported.| For example, in an ARTEP study of export zones in Sri Lanka, South Korea, the 
    24 ortant features in patients with DU. and in our study we found that these were both increased in the 
    25 d for a significantly greater proportion of the study time than in normal controls. This is consisten
    26 ilirubin and were included consecutively in the study. Informed written consent was obtained before t
    27 ients had surgery during the second year of the study; two with total ulcerative colitis had a colect
    28 ded from the date of diagnosis. In the Survival Study, children could move from one treatment group t
    29 n's High School and Woodfarm High School.| Case study - Langside College, Glasgow
    30 ese regulations are effective for programmes of study starting in and after September 1990.
    

    This concordance suggests that the use of study to refer to a specific research project is quite the most common in the language as a whole, and it reveals something of the domains and collocations in which it occurs. Medicine is well represented, confirming the FHC data in this respect (the one occurrence here of our study, as it happens, is from a medical text), but a wide range of other academic fields are also present, from linguistics to economics. The term case study occurs three times, as well as a number of other premodifiers indicating the type, source or subject of the study in question. (A) study can be of something, or carried out by someone. The frequency of these collocations in the whole BNC can also be tested: perhaps surprisingly, carried out appears immediately following study in only 53 cases, or under 0.3%. The two most frequent postmodifying participles of study in FHC, conducted and done, occur less often still, accounting respectively for 0.2% and 0.1% of the cases of study as a noun.
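    The percentage quoted for carried out can be checked with a line or two of arithmetic. The sketch below assumes that the denominator is the total of 18,967 citations of study as a noun given earlier; the function name share is my own.

```python
def share(count, total):
    """Return `count` as a percentage of `total`."""
    return 100 * count / total

# Figures quoted in the text: 53 cases of "study carried out",
# out of an assumed total of 18,967 citations of "study" as a noun.
carried_out = share(53, 18_967)
print(f"{carried_out:.2f}%")
```

    The result is consistent with the "under 0.3%" reported above.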

    This kind of analysis potentially allows teacher and learner to contextualise uses encountered in a small specific corpus against a broader linguistic background. By revealing patterns not present in the small corpus, it highlights potentially distinctive characteristics of the latter. At the same time it allows other meanings and uses to be noted and incidentally learnt. And here we return to the procedural issue. The process of negotiating meaning in such cases engages users in dealing with unknown and unforeseen text, as they will inevitably have to. There may, of course, be considerable problems in interpreting data drawn from such a wide spectrum, whose content is likely to be less familiar and less predictable to the user (Aston 1997). But there is nothing to stop such data being used selectively. The teacher can edit concordances of this kind before passing them to the learner; learners can focus on those lines of which they are able to make sense, and which appear to be of relevance to their interests. Faced with the concordance above, ESP learners might simply be asked to identify which of these citations use study in the sense of a piece of research - an activity which engages and develops procedures of disambiguation based on collocational and colligational patternings.

    Whether the aim is to develop declarative or procedural knowledge relevant to the learner, or both, it seems to me that there are thus strong arguments for not confining our attention to any one type of corpus. Small specific corpora have obvious virtues in highlighting recurrent specialised features, but only larger and more general ones seem able to capture less specialised ones, and to contextualise such features against a broader spectrum of abilities and awareness. How, in a practical pedagogy, such multiple resources can be exploited to maximum effect is a further methodological issue which goes beyond the limits of this brief discussion.

    References

  • Aston, G. 1995. "Corpora in language pedagogy: matching theory and practice". In G. Cook & B. Seidlhofer (eds), Principle and practice in applied linguistics. Oxford: Oxford University Press.
  • Aston, G. 1996. "Involving learners in developing learning methods: exploiting text corpora in self-access". In P. Benson & P. Voller (eds), Autonomy and independence in language learning. London: Longman.
  • Aston, G. 1997. "Small and large corpora in language learning". Paper presented at PALC'97, Lodz.
  • Aston, G. (ed) forthcoming. Learning with corpora.
  • Atkins, S., J. Clear & N. Ostler 1992. "Corpus design criteria". Literary and linguistic computing, 7/1. 1-16.
  • Biber, D., S. Conrad & R. Reppen 1994. "Corpus-based approaches to issues in applied linguistics". Applied linguistics, 15/2. 168-189.
  • Biber, D. & E. Finegan 1994. "Intra-textual variation within medical research articles". In N. Oostdijk & P. de Haan (eds), Corpus-based research into language. Amsterdam: Rodopi.
  • Brodine, R. forthcoming. "Integrating corpus work into an academic reading course". In Aston, forthcoming.
  • Burnard, L. (ed) 1995. Users reference guide for the British National Corpus. Oxford: Oxford University Computing Services.
  • Flowerdew, J. 1993. "Concordancing as a tool in course design". System, 21. 213-229.
  • Gavioli, L. forthcoming. "The learner as researcher: introducing corpus concordancing in the classroom". In Aston, forthcoming.
  • Halliday, M.A.K. 1992. "Language as system and language as instance". In J. Svartvik (ed) Directions in corpus linguistics. Berlin: Mouton De Gruyter.
  • Johns, T.F. 1991. "Should you be persuaded: two examples of data-driven learning". In Johns & King.
  • Johns, T.F. 1994. "From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning". In T. Odlin (ed.), Perspectives on pedagogical grammar. Cambridge: Cambridge University Press.
  • Johns, T.F. & P. King (eds) 1991. Classroom concordancing. English language research journal, 4.
  • Ljung, M. 1991. "Swedish TEFL meets reality". In S. Johansson & A-B. Stenström (eds), Computer corpora: selected papers and research guide. Berlin: Mouton de Gruyter.
  • Mparutsa, C., A. Love & A. Morrison 1991. "Bringing concord to the ESP classroom". In Johns & King.
  • Murison-Bowie, S. 1993. Microconcord: manual. Oxford: Oxford University Press.
  • Scott, M. 1996. Wordsmith tools. Oxford: Oxford University Press.
  • Skehan, P. 1981. "ESP teachers, computers and research". ELT documents, 112. 106-125.
  • Sta, J-D. 1995. "Comportement statistique des termes et acquisition terminologique à partir de corpus". Traitement automatique des langues, 36/1-2. 119-132.
  • West, M. (ed) 1953. A general service list of English words. London: Longman.
  • Widdowson, H.G. 1984. Explorations in applied linguistics 2. Oxford: Oxford University Press.