By corpora, I mean collections of homogeneously-encoded computer-readable text compiled for linguistic purposes. These may be corpora which are already publicly available, or ones which are constructed ad hoc for particular purposes by the teacher or researcher. In either case, such corpora can be analysed with concordancing software to provide three main types of secondary data:
The first of these pedagogic uses of corpora has a long-established tradition in ESP. Well before the advent of the computer, researchers had created and analysed corpora manually as an aid to syllabus and materials design. For instance, West (1953) analysed the most common words and word-senses in 5 million words of written text, with the aim of facilitating the selection and grading of vocabulary for EFL purposes, as well as for the writing of popular science and technology. The growth of ESP in universities in the 1970s and 1980s coincided with the spread of personal computers and of optical character recognition and concordancing software, allowing researchers to create and analyse their own corpora of specific domains with far greater ease. For instance, Skehan (1981) wrote a simple programme to analyse a corpus of economics texts with the aim of identifying the proportions of general, sub-technical, and specialised lexis. He concluded that "if our aim is to put [ESP] students into a position from which they can understand 80% to 90% of the total running words in texts, the emphasis in vocabulary teaching must be on sub-technical words", such as rate and process (Skehan 1981: 117). Flowerdew (1993) directly derived the lexicogrammatical content of an EAP course from an analysis of a corpus of biology textbooks. Several studies have compared collections of authentic texts with ones of published teaching materials, well summed up in the title of a paper by Ljung (1991), "Swedish TEFL meets reality". In recent years, corpus-linguistic studies of register have provided empirical justifications for such approaches:
The markedly different patterns of linguistic form and function that occur across
registers indicate that there is no single set of linguistic features that should be
emphasized for all students, once they have mastered the rudiments of English
grammar. Rather, it is important to teach the linguistic characteristics and functions
of particular target registers, so that students will be able to control the language
structures they encounter in actual discourse and to adjust their language use
appropriately for different registers.
(Biber et al 1994: 174)
While these studies underline the function of corpora in defining the contents of ESP work, seeing them as instruments for the teacher and materials writer, a similar perspective can also be adopted from the learner's viewpoint. By investigating corpus data, learners can discover how language is used in the domain they are concerned with in a process of "data-driven learning", drawing their own inferences about the meanings and uses of particular features (Johns 1991, 1994). Aston (1995, 1996) underlines how such activity can be learner- rather than teacher-directed, with appropriate corpora providing a self-access resource from which learners can derive information for themselves, complementing that available from dictionaries, grammars, and encyclopaedias. Figure 1 shows a concordance of the word jaundice taken from a collection of research papers in hepatology (FHC: see below). From it, ESP learners may be able to derive much potentially relevant information: that jaundice is something that patients can have or develop, that it may be severe, and that it is often judged a symptom of hepatitis, along with such disturbances as anorexia and nausea. They may also opt to examine a rather wider context in order to find out what "postpump" jaundice is:
Fig. 1. Jaundice in FHC
1 became raised and 10 patients developed jaundice. All of the nine patients who had no
2 full-blown clinical symptoms including jaundice (1). Those who have mild fatigue oft
3 ent, taking particular note of previous jaundice, hepatitis, blood transfusion, drug
4 rexia, rigors, joint pains, dark urine, jaundice). Virological studies- Markers of he
5 d a history of cirrhosis, hepatitis, or jaundice; three studies excluded patients wit
6 esence of one or more of the following: jaundice, loss of appetite, weight loss, or f
7 thin two postoperative days) "postpump" jaundice not associated with raised transamin
8 e recruitment (Table 3). No patient had jaundice, but hepatomegaly was present in 28
9 our weeks apart. Ten patients developed jaundice, and six of these were acutely ill (
10 6, 11, 17) though they did not develop jaundice, had classical symptoms of acute hep
11 of factor VIII, and all four had severe jaundice and anorexia and suffered from nause
12 gy. Among the 10 patients who developed jaundice the incubation period was under five
13 ncubation period was 12 weeks developed jaundice (figure).
14 phase of illness; approximately 15% had jaundice. None of those in whom chronic hepat
15 . As only ten of our patients developed jaundice it is possible that the hepatitis wo
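A concordance of the kind shown in Fig. 1 can be approximated with a few lines of code. The following is only a minimal sketch, not the concordancing software actually used for this paper: the function name `kwic`, the `width` parameter, and the sample text are illustrative assumptions.

```python
import re

def kwic(text, node, width=40):
    """Return keyword-in-context lines for every occurrence of `node`.

    `width` is the number of characters of co-text kept on each side
    (a hypothetical default; real concordancers let the user choose).
    """
    lines = []
    # \b stops the node word from matching inside longer word-forms
    pattern = r"\b%s\b" % re.escape(node)
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left co-text so that the node words line up
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Invented sample text, loosely modelled on the citations in Fig. 1
sample = ("Ten patients developed jaundice, and six of these were acutely ill. "
          "No patient had jaundice, but hepatomegaly was present.")
for line in kwic(sample, "jaundice", width=30):
    print(line)
```

Truncating the co-text at a fixed number of characters, rather than at word boundaries, is what produces the clipped words at the edges of each citation.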
Whether from the teacher's or the learner's perspective, the approaches so far described view corpora as a source of information. They focus on the product of the analysis. The knowledge derived is, in the terms of cognitive psychology, declarative: a series of facts about the language and the culture. Such knowledge may not of course be accurate: this will depend on the appropriacy of the corpus data and the ability of the user to analyse and generalise from it. But most if not all declarative knowledge involves simplification, just as interlanguage research has shown that second language acquisition effectively involves successive approximations to the target system.
When considering the use of corpora by learners as well as by teachers and researchers, however, we should also consider a second aspect of corpus use, focussing instead on the analytic process. Work with corpora can also provide a means of developing knowledge which is procedural, in the sense of an ability to use language to carry out problem-solving operations. Brodine (forthcoming) argues that the interpretation of concordance data primarily involves "bottom-up" interpretation based on the local co-text, as opposed to the "top-down" interpretative processes based on the global context and overall text structure which are emphasised by most reading materials. In the jaundice example in Figure 1, several of the citations include a list of symptoms in a complex nominal group: to understand these citations, the reader must identify this group, analyse its internal structure, and assign it a syntactic role in the clause and sentence. Such procedures of chunking and grammatical analysis are of obvious relevance to the development of reading skills. Figure 2 illustrates how a concordance of the pronoun they can provide the learner with practice in resolving anaphoric reference:
Fig. 2. They in FHC
1 SA test systems became available that contain additional viral antigens. Experience gained so far with these systems has shown that they provide increased sensitivity (Craxi et al., 1991). The polymerase chain reaction (PCR) (Saiki et al., 1988) is an
2 cyrrhosis, and hepatocellular carcinoma are now accepted sequelae, their frequency their rate of development, and the degree to which they contribute to mortality are not yet well established, since current data come largely from retrospective studies. Prospectively
3 the 29 acute hepatitis C patients. Anticapsid antibody appeared earlier than the antinonstructural antibody in 10 seroconverters. They appeared simultaneously in 15 seroconverters but anticapsid antibody appeared later than the antinonstructural antibody in 3
4 clinical symptoms including jaundice (1). Those who have mild fatigue often blame it, perhaps correctly, on the disease for which they received the transfusion. Very few in whom prolonged abnormalities in serum aminotransferase levels develop
5 he absence of symptoms and may become the focus of attention or of therapy (8). Disease is perceived by patients, however, if they have symptoms. Signs such as an abnormal ALT level or histologic evidence of inflammation do not themselves interfere
If the use of corpora can potentially develop both declarative and procedural knowledge in the learner, a key question is what kinds of corpora are most appropriate for these purposes. This question is clearly multi-faceted, and in this paper I will only touch on a few of the issues involved. It is evident that the choice of a corpus, and of particular data from that corpus, will always primarily depend upon the nature of the specific pedagogic situation, and on the need to train learners to use corpora effectively (Gavioli, forthcoming). But there is also, I think, a more general issue of the kinds of corpora that are most appropriate for ESP, be it from the perspective of the teacher or researcher working in syllabus and materials design, or from that of the learner who needs not only to understand the workings of language in a particular domain but also to practise engaging with it. In this paper I consider three related aspects of this issue: the extent to which the texts contained in a corpus should be limited to the particular domain of ESP (corpus specificity); the number and length of the text samples it should contain (corpus size); and the extent to which these mirror the universe of discourse which the corpus aims to capture (corpus representativeness).
I begin with some general comparisons of the lexical content of the first two of these corpora (FHC and MCB-M), carried out using the Wordlist and Keywords components of WordSmith Tools (Scott 1996). I shall then go on to make a number of comparisons for specific items with the third, much larger corpus, BNC-WAS. While lexical analysis is not the only possible pedagogic use of corpora, it is likely to be a major one, and I believe that the problems posed by a lexical comparison can largely be extended by analogy to other areas.
There are substantial differences in the frequencies of word-forms in FHC and MCB-M. Figure 3 lists those items which are significantly more frequent in FHC:
Fig. 3. Word-forms more frequent in FHC than MCB-M (in descending order of significance)
FHC MCB-M chi-square
1 HEPATITIS 817 (1.90%) 16 3787.9
2 PATIENTS 741 (1.73%) 245 (0.12%) 2298.9
3 HCV 301 (0.70%) 0 1427.9
4 TRANSFUSION 300 (0.70%) 0 1423.2
5 CHRONIC 238 (0.55%) 15 1032.8
6 ANTI 226 (0.53%) 11 999.9
7 LIVER 233 (0.54%) 20 980.1
8 NON 309 (0.72%) 96 (0.05%) 977.4
9 WERE 583 (1.36%) 522 (0.26%) 966.2
10 ALT 194 (0.45%) 0 917.9
11 STUDY 228 (0.53%) 72 (0.04%) 715.3
12 B 268 (0.62%) 142 (0.07%) 655.8
13 BLOOD 226 (0.53%) 89 (0.04%) 646.0
14 CIRRHOSIS 136 (0.32%) 0 641.6
15 C 212 (0.49%) 84 (0.04%) 604.0
16 SERUM 116 (0.27%) 4 520.5
17 NANB 110 (0.26%) 0 517.8
18 POSITIVE 136 (0.32%) 33 (0.02%) 464.8
19 POST 112 (0.26%) 14 444.3
20 DISEASE 191 (0.45%) 127 (0.06%) 401.4
21 BIOPSY 80 (0.19%) 0 375.0
22 ANTIBODY 82 (0.19%) 2 371.5
23 HBC 77 (0.18%) 0 360.7
24 PCR 75 (0.17%) 0 351.2
25 RECIPIENTS 64 (0.15%) 0 298.8
26 MONTHS 121 (0.28%) 63 (0.03%) 297.1
27 HBSAG 63 (0.15%) 0 294.1
28 DONORS 63 (0.15%) 0 294.1
29 VIRUS 112 (0.26%) 52 (0.03%) 293.0
30 TRANSAMINASE 58 (0.14%) 0 270.3
31 FOLLOW 93 (0.22%) 35 (0.02%) 268.9
32 LEVELS 74 (0.17%) 17 254.9
33 CORE 60 (0.14%) 4 254.9
34 ACUTE 78 (0.18%) 22 (0.01%) 252.0
35 SAMPLES 67 (0.16%) 12 245.8
36 RNA 57 (0.13%) 4 240.8
37 DEVELOPED 100 (0.23%) 54 (0.03%) 239.6
38 UNITS 86 (0.20%) 35 (0.02%) 239.5
39 TESTED 80 (0.19%) 29 (0.01%) 234.6
40 DONOR 51 (0.12%) 1 230.5
41 POSTTRANSFUSION 49 (0.11%) 0 227.4
42 VIII 50 (0.12%) 1 225.7
43 TRANSFUSED 47 (0.11%) 0 217.9
44 AMINOTRANSFERASE 47 (0.11%) 0 217.9
45 PROSPECTIVE 48 (0.11%) 2 210.0
46 VIRAL 61 (0.14%) 14 209.4
47 HBV 45 (0.10%) 0 208.4
48 SEROCONVERSION 43 (0.10%) 0 198.9
49 TRANSFUSIONS 42 (0.10%) 0 194.1
50 FACTOR 58 (0.14%) 15 191.8
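The chi-square column in Fig. 3 can be approximated with a standard Pearson test on a 2x2 table (word vs. all other tokens, in each corpus). The sketch below is a hedged illustration rather than a reconstruction of the Keywords procedure: it assumes corpus totals of 42,857 tokens for FHC and 203,419 for MCB-M (the totals quoted later for the hapax legomena counts), applies no continuity correction, and uses a hypothetical function name. Its results should therefore be expected to come close to, but not exactly match, the figures in the table.

```python
def chi_square_2x2(freq_a, total_a, freq_b, total_b):
    """Pearson chi-square for one word-form in two corpora.

    The 2x2 table is: occurrences of the word vs. all other tokens,
    in corpus A vs. corpus B. No continuity correction is applied.
    """
    a, b = freq_a, freq_b                         # the word itself
    c, d = total_a - freq_a, total_b - freq_b     # every other token
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Assumed corpus sizes (tokens), taken from the hapax legomena passage
FHC_TOKENS = 42_857
MCBM_TOKENS = 203_419

# HEPATITIS: 817 occurrences in FHC, 16 in MCB-M (Fig. 3, row 1)
print(round(chi_square_2x2(817, FHC_TOKENS, 16, MCBM_TOKENS), 1))
```

A word that is equally frequent, proportionally, in both corpora yields a value of zero, which is why function words shared by the two corpora mostly drop out of a list of this kind.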
Not surprisingly, FHC shows much higher frequencies of domain-specific lexis. We can note that of the first 10 significantly more frequent words, nearly all are lexical rather than functional: hepatitis, patients, HCV, transfusion, chronic, anti, liver, and ALT (the name of a measure used in testing for hepatitis). Several of these (HCV, transfusion, ALT) do not occur at all in MCB-M. The only function words in the first 10 are non and were, in 8th and 9th position respectively. These data suggest that FHC provides a much more appropriate tool than MCB-M for highlighting lexis relevant to hepatology research. This is confirmed by a comparison of wordlists sorted by frequency for the two corpora. In MCB-M, the 25 most frequent forms are all function words, while in FHC they include a number of lexical ones (hepatitis, patients, HCV, transfusion, chronic, liver, study and anti), as well as some new functional ones (were, non, had and from) (figure 4):
Fig. 4. Most frequent words in FHC and MCB-M
     FHC           Freq.   %          MCB-M    Freq.   %
 1   THE           2149    (5.01%)    THE      12525   (6.14%)
 2   OF            2059    (4.80%)    OF       7607    (3.73%)
 3   IN            1077    (2.51%)    TO       6116    (3.00%)
 4   AND           1012    (2.36%)    AND      5222    (2.56%)
 5   HEPATITIS     817     (1.90%)    A        4866    (2.38%)
 6   TO            744     (1.73%)    IN       4499    (2.20%)
 7   PATIENTS      741     (1.73%)    IS       3354    (1.64%)
 8   A             681     (1.59%)    THAT     2937    (1.44%)
 9   WERE          583     (1.36%)    BE       2245    (1.10%)
10   WAS           523     (1.22%)    IT       1946    (0.95%)
11   FOR           487     (1.14%)    FOR      1810    (0.89%)
12   WITH          464     (1.08%)    AS       1618    (0.79%)
13   NON           309     (0.72%)    THIS     1485    (0.73%)
14   HCV           301     (0.70%)    WITH     1475    (0.72%)
15   TRANSFUSION   300     (0.70%)    OR       1453    (0.71%)
16   OR            280     (0.65%)    ARE      1411    (0.69%)
17   B             268     (0.62%)    NOT      1283    (0.63%)
18   HAD           267     (0.62%)    WAS      1274    (0.62%)
19   BY            261     (0.61%)    YOU      1258    (0.62%)
20   THAT          241     (0.56%)    BY       1246    (0.61%)
21   CHRONIC       238     (0.55%)    WHICH    1084    (0.53%)
22   LIVER         233     (0.54%)    ON       1048    (0.51%)
23   FROM          231     (0.54%)    HAVE     996     (0.49%)
24   STUDY         228     (0.53%)    AN       884     (0.43%)
25   ANTI          226     (0.53%)    BUT      831     (0.41%)
This same lexical/functional difference appears from a comparison of the most frequent 4-word clusters in the two corpora, where we again find a clear contrast. While numbers of occurrences of these are, bar the first two clusters in FHC, similar in the two corpora, the percentage figures show that their relative frequency is much smaller in MCB-M (figure 5):
Fig. 5. Most frequent 4-word clusters in FHC and MCB-M
     FHC                             Freq.  %          MCB-M                               Freq.  %
 1   NON-A NON-B                     133    (0.31%)    AT THE END OF                       24     (0.01%)
 2   A NON-B HEPATITIS               99     (0.23%)    IN THE CASE OF                      23     (0.01%)
 3   OF NON-A NON                    35     (0.08%)    PRE-EXPOSURE TO THE                 23     (0.01%)
 4   OF POST-TRANSFUSION HEPATITIS   28     (0.07%)    A CHANGE OF CONTEXT                 22     (0.01%)
 5   HEPATITIS B SURFACE ANTIGEN     19     (0.04%)    AT THE SAME TIME                    22     (0.01%)
 6   WITH NON-A NON                  19     (0.04%)    IN THE UNITED STATES                20
 7   NON-B HEPATITIS IN              18     (0.04%)    PER CENT OF CASES                   20
 8   POST-TRANSFUSION HEPATITIS C    17     (0.04%)    THE SEXUALLY TRANSMITTED DISEASES   19
 9   RELATED TO LIVER DISEASE        17     (0.04%)    IN THE ABSENCE OF                   18
10   A NON-B POST                    16     (0.04%)    AS A RESULT OF                      17
11   B POST-TRANSFUSION HEPATITIS    15     (0.03%)    EXPOSURE TO THE CONTEXT             17
12   NON-B POST-TRANSFUSION          15     (0.03%)    IS LIKELY TO BE                     17
13   THE NUMBER OF UNITS             15     (0.03%)    ON THE BASIS OF                     17
14   THE UPPER LIMIT OF              15     (0.03%)    THE EXTENT TO WHICH                 17
15   UPPER LIMIT OF NORMAL           15     (0.03%)    WILL BE ABLE TO                     16
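Cluster lists like those in Fig. 5 amount to counting every run of four consecutive word-forms. The sketch below illustrates the idea under stated assumptions: the function name `clusters` and the tokenisation rule are my own, and hyphens are treated as word separators because Fig. 5 counts NON-A NON-B as a 4-word cluster. Unlike the figure, the sketch does not restore the hyphens when displaying a cluster.

```python
import re
from collections import Counter

def clusters(text, n=4):
    """Count every run of `n` consecutive word-forms in `text`."""
    # Assumed tokenisation: letters and digits only, so that hyphens and
    # punctuation act as separators ("non-A" becomes "non" + "a")
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Invented sample text
sample = "At the end of the study and at the end of the trial"
for cluster, freq in clusters(sample).most_common(2):
    print(freq, cluster.upper())
```

Dividing each count by the total number of tokens in the corpus gives relative frequencies comparable to the percentage columns of the figure.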
This tendency is confirmed if, rather than comparing particularly frequent items in these corpora, we examine particularly infrequent ones. Figure 6 lists some hapax legomena in each corpus, or word-forms which only occur once:
Fig. 6. Hapax legomena beginning with the letters GO
     FHC      BNC z-score      MCB-M       BNC z-score
 1   GO       8.15         1   GOA         -0.01
 2   GOAL     0.52         2   GODDEN      -0.01
 3   GOING    5.97         3   GODSEND     -0.01
 4   GOOD     7.27         4   GOODS       0.90
 5   GORDON   0.22         5   GOOSE       0.03
 6   GOT      8.39         6   GORDON      0.22
                           7   GORY        -0.01
                           8   GOSIO       -0.01
                           9   GOVERNORS   0.13
                          10   GOWLAND     -0.01
Investigating the hapax legomena in a corpus provides a way of estimating its reliability for pedagogic purposes. If we count their number and divide this by the total number of word-tokens in the corpus, this tells us how often we encounter a word-form which occurs only once. This can be read as an estimate of the probability of encountering a word-form not present in the corpus if we were to read a further text from the same population. For FHC, this figure is 3.3% (1435/42,857), while for MCB-M it is only 2.1% (4456/203,419). Notwithstanding its lesser specialisation, in other words, with its greater size MCB-M does in fact seem to provide greater predictive power at a lexical level for its domain.
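The hapax legomena ratio described above is simple to compute for any machine-readable text. The following is a minimal sketch under my own assumptions (the function name `hapax_ratio` and the tokenisation rule are illustrative, not those of the software used here); the final lines simply redo the division for the counts quoted above, and small differences from the quoted percentages may be a matter of rounding.

```python
import re
from collections import Counter

def hapax_ratio(text):
    """Proportion of tokens that are word-forms occurring exactly once."""
    # Assumed tokenisation: lower-cased runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(tokens)

# Counts quoted in the text: hapax legomena / total word-tokens
fhc_ratio = 1435 / 42857
mcbm_ratio = 4456 / 203419
print(f"FHC:   {fhc_ratio:.2%}")
print(f"MCB-M: {mcbm_ratio:.2%}")
```

Because the ratio tends to fall as a corpus grows, comparing corpora of very different sizes in this way measures size as much as specificity, which is the point being made here.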
The overall picture that emerges from these comparisons is that FHC, by virtue of its specificity, gives more information concerning specialised lexis relating to this domain, while MCB-M gives more information concerning general and sub-technical items. The most striking example of this difference is found in the lists of 4-word clusters in the two corpora (figure 5 above), where MCB-M gives prominence to a series of conjunctive and modal expressions which act as general text-organising devices: in the case of, in the absence of, as a result of, on the basis of, is likely to be, will be able to. But similar differences emerge when we examine sub-technical items and their collocations. For instance, the word problems occurs only three times in FHC (figure 7):
Fig. 7. Problems in FHC
1 subsequently experienced significant clinical problems related to hepatic inflammation;
2 located. Questions were designed to identify problems related to hepatic failure.
3 known; the family was not aware of any liver problems in this patient.
While these data suggest some possible collocations - that one is aware of, experiences or identifies problems related to a particular disease - the number of examples is clearly insufficient to assess these reliably. If, on the other hand, we turn to MCB-M, we find many more occurrences of problems: 45 in all. There are no examples of problems related, but there are instead several of problems associated (figure 8):
Fig. 8. Problems associated in MCB-M
1 fat. There will also be additional health problems associated with lack of exercise. If your leg
2 worsen the symptoms of `jet-lag" and any problems associated with shift-work but also it impairs
3 have a much more open view of sex and the problems associated with it and have a less punitive ap
4 errelated. They are: 1. Assessment of the problems associated with the overdose, of the overdose
5 ndition which superficially resembles the problems associated with disseminated gonococcal infect
6 husband over a year previously. When the problems associated with the microscopic diagnosis, par
This suggests that this collocation may also be appropriate in the hepatitis domain.
The available evidence is nonetheless still insufficient to judge with any confidence whether problems are related or associated, or what differences there may be between the two. To gather yet more information concerning these collocations we can turn to the BNC applied science texts. Here we find 38 occurrences of problems associated, and only 3 of problems related (figures 9 and 10):
Fig. 9. Problems associated in BNC-WAS
1 chieved without individual and fiddling adjustment. The problems associated with the need for perfect
2 essed over this issue is actually only addressing those problems associated with animal species. Plan
3 nt bactericidal characteristics would solve many of the problems associated with both cleaning and di
4 built in wet conditions. I have known the condensation problems associated with `drying out" to last
5 Flat Roofs | The problems associated with felted flat roofs ar
6 etailed geographic information to overcome some of the problems associated with linking data sets.
7 een developed to provide tools for the solution of many problems associated with emergency planning (
8 aken overall the exercise had focussed attention on the problems associated with the filing systems o
9 quality petrol during the Second World War. Many of the problems associated with hydrocarbon producti
10 to study. Compendia of information cannot solve all the problems associated with mapping out one's li
11 the features of company and partnership but without the problems associated with limited partnerships
12 y as impractical. They do however perhaps highlight the problems associated with nomination, and dire
13 In view of the problems associated with training and staff m
14 ed that indistinct analysis of needs will contribute to problems associated with evaluation. If evalu
15 esh as an appropriate technology by trying to avoid the problems associated with the pre-packaged sal
16 level nuclear waste had admitted it has not solved the problems associated with the migration of rad
17 water extracted from the ground. However, sometimes the problems associated with large trees can only
18 as an aid for one person will not usually encounter the problems associated with elaborate design tea
19 f statistical survey reveals the scale and diversity of problems associated with the mind at work. Th
20 rounded off the day with a highly technical look at the problems associated with the issuing of flyin
21 makes sense on logical grounds, since it gets round the problems associated with grandmother cells, l
22 nologies. The software is said to solve the fundamental problems associated with software development
23 short supply, and help to alleviate some of the health problems associated with living in confined a
24 definitions, of which 88% was correct), there are still problems associated with the processing of id
25 of current practices has shown that the main practical problems associated with these schemes did no
26 g is to encourage a multidisciplinary discussion of the problems associated with the use of subsymbol
27 look very attractive, because it eliminates many of the problems associated with open-loop control (m
28 using a special keyboard having 21 keys. There are two problems associated with converting the steno
29 encyclopaedic knowledge, semantic networks). There are problems associated with extracting semantic
30 ormation as possible between the two tagsets. There are problems associated with combining the two ta
31 ound. Furthermore as the matrix becomes less sparse the problems associated with storage increase. Al
32 organized by a Data Base Management System (DBMS). The problems associated with data collection, inp
33 immunohistochemical technique has overcome many of the problems associated with the tritiated thymid
34 re caused by operational mishandling and other physical problems associated with the constant mountin
35 e transferred in one cycle. The Geneva report lists the problems associated with age, cycles attempte
36 ught up in this debate because of the health and social problems associated with alcohol abuse - even
37 t abuse alcohol and do not suffer from health or social problems associated with their drinking. Medi
38 discriminate against imported products. The logistical problems associated with the scheme have also
Fig. 10. Problems related in BNC-WAS
1 d on the form of the emergency. Most would have tackled problems related to sheep on their own, but i
2 made to simplify insertion and use of the machine. Yet, problems related to patients could not be mod
3 ed to a failure to recognise Crohn's disease, technical problems related to pouch construction and il
The frequency data seem fairly conclusive here.
These data suggest that on the one hand FHC is unreliable where relatively infrequent features are concerned, and on the other that larger and more general corpora can provide appropriate data for the study of features which are less domain-specific. We cannot deduce from FHC that problems are more frequently related than associated in hepatology texts, as the numbers are too small, but from the BNC we can deduce that there seems no reason why problems associated should not also occur in such texts. By comparing results from small specific and large general corpora, an idea can be formed of the extent to which a particular feature is or is not restricted to a specific domain. The larger corpus, by indicating the generality of the expression in question, highlights patterns of use which are not apparent in the small one, but which might well be found were the latter larger.
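The comparison of problems associated with problems related rests on counting which word-forms immediately follow a given node word. A minimal sketch of such a count is given below; the function name `right_collocates`, the tokenisation, and the sample text are assumptions made for illustration, and only the position immediately to the right of the node is considered.

```python
import re
from collections import Counter

def right_collocates(text, node):
    """Count the word-forms that immediately follow `node` in `text`."""
    # Assumed tokenisation: lower-cased runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(
        following
        for current, following in zip(tokens, tokens[1:])
        if current == node
    )

# Invented sample text
sample = ("There are problems associated with storage. "
          "The problems associated with training remain. "
          "Technical problems related to pouch construction were reported.")
counts = right_collocates(sample, "problems")
print(counts["associated"], counts["related"])
```

Running the same count over a small specific corpus and a large general one, and comparing the resulting proportions, is one way of carrying out the comparison described above.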
One argument in favour of the use of large, general corpora as well as small, specific ones has now been outlined. On the level of lexis, the literature bears witness to a widespread conviction that, in ESP, terminology is rarely the central learning problem. More commonly, learning problems relate to a sub-technical lexical level, which is, by definition, less specialised (Skehan 1981, Mparutsa et al 1991): one of the ways in which terminology can be automatically identified is in terms of its greater tendency to occur only in a limited range of texts (Sta 1995). Such lexis is therefore likely to be better documented in a larger, more general corpus than in a small specific one, thereby better allowing reliable declarative knowledge in this area to be inferred by the user. I now turn to a second argument, which also focusses on the procedural aspects of the question.
It is arguably as important for ESP learners to develop strategies for dealing with the unpredicted as it is for them to be familiar with the predictable. All learners are likely to have to face situations which cannot be fully predicted linguistically, and which do not conform to the content of any given corpus. However reliable a corpus may be as a sample, its reliability in pedagogic terms will always be limited where the learner's target requirements cannot be precisely defined. ESP courses, which must gloss over individual differences in needs, are both unable to define these requirements precisely, and unable to avoid averaging them out. In any case, we have seen that any new text, even from the same domain, will almost certainly contain features unmatched within a corpus, however representative, large and specific it may be. There is no way of reducing the proportion of hapax legomena in a corpus to zero.
In pedagogic terms, the essence of this point has been well put by Widdowson, who criticises approaches to ESP which aim to specify the syllabus in terms of the features of target registers:
Communicative competence means the ability to enact discourse and so to exploit
a knowledge of rules (usage and use) in order to arrive at a negotiated settlement.
It is essentially a capacity for solving problems, not a facility for producing
prepared utterances. So if we are going to specify a restricted repertoire, it should
be represented as a range of problem-solving strategies, involving the contingent use
of language, not a collection of items.
(Widdowson 1984: 197-8)
This brings us back to the second use of corpora in teaching mentioned in the first section of this paper, that of providing opportunities for learners to engage in interpreting discourse in a controlled manner: chunking text, assigning anaphoric reference, and the like. For such purposes, it can be argued that the choice of texts need not necessarily be restricted to a limited target domain, since the kinds of problems through which problem-solving strategies can be developed will also be found elsewhere: research is not limited to hepatology. Many such problems are in fact likely to involve interpreting functional and sub-technical items, which are, as we have seen, very differently distributed from specialised lexical ones: indeed variation in the distribution of these items may be no higher in a more general corpus than in a domain-specific one. In their analysis of scientific texts, Biber and Finegan (1994) found that for features relating to factors of informativeness and abstractness, the extent of variation across different sections of medical texts (introduction, methods, results, discussion) was almost as great as overall variation within scientific texts in general.
The tendency for certain features to cluster in particular parts of texts can be seen as a matter of functional specialisation which cuts across domains. For instance, of 68 occurrences of the word our in FHC, no less than 60 occur in discussion sections, 22 of these in the collocation our study (figure 11):
Fig. 11. Our in discussion sections of FHC
1 able nonviral causes of transaminase elevation by our clinical review committee may also have played
2 view had advanced disease-a finding that supports our conclusion. Recent reports have more clearly d
3 th chronic posttransfusion type c hepatitis (19). Our data corroborate the clinical impression that
4 ion recipients in whom hepatitis did not develop. Our data also indicate that the frequency of death
5 some time. For them, liver disease truly exists. our experience, however, is that, at least for the
6 empts to address both these issues, although only our findings on the frequency of fatal liver disea
7 pective of the mode of transmission. Furthermore, our group of IVDU cases differed from the usual ur
8 vely short follow-up periods of 5 to 10 years. To our knowledge, no previous study has evaluated as
9 patients who resolved or became chronic. However, our long-term follow-up seems to show that continu
10 it was more common in the posttransfusion group. Our mean total HAI score (10.2) for the posttransf
11 this final hypothesis. Whatever the explanation, our observation is of particular interest from a c
12 disease may become apparent in later years. Thus, our observation of these study cohorts continues.
13 s have been recognized. Several groups, including our own, have expressed concern over the potential
14 mic or other demographic factors, we believe that our patient population is suited for study of the
15 ear aetiology. The prevalence of HCV infection in our patients group was about 90%, which is in good
16 immunodeficiency virus and HBsAg. Because most of our patients were educated suburbanites who had us
17 the experience reported elsewhere. As only ten of our patients developed jaundice it is possible tha
18 e described. A high percentage (more than 70%) of our patients developed chronic hepatitis. in a rec
19 16 years, this occurrence is less common. Most of our patients have continued to do well with regard
20 PHT may be an overestimate. The fact that 100% of our patients were NANB/H and that none had hepatit
21 etectable levels of HBsAg. Eighty-nine percent of our patients with PHT had a short incubation perio
22 lso to be reactive for anti-HBc. In fact, none of our patients with non-A, non-B hepatitis developed
23 recent acquisition of HCV infections since all of our patients suffered from chronic hepatitis for a
24 middle-aged before they are exposed to the virus; our patients were, on average, in their sixth deca
25 may account for the antibody-negative results in our patients. First, it can be speculated that ser
26 g the treatment occurred in approximately half of our patients. The relatively small number of patie
27 did not exclude HBeAg-negative HBsAg carriers in our prospective study of PTH. In these carriers th
28 n lost. The pattern of illness in the patients in our prospective study may indicate that at least t
29 United States developed a hepatitis-like picture, our reported frequency of PHT may be an overestima
30 degree of hepatic inflammatory activity (17, 18). Our results suggest that if a safe and effective t
31 irus, i.e., there may be agent(s) other than HCV. Our results in these 6 patients were consistent wi
32 ve than anti-C100 EIA in detecting HCV infection. Our results of HCV PCR also confirmed that primers
33 n-A, non-B hepatitis (3.4%) closely comparable to our results. We conclude that non-A, non-B hepatit
34 more, only 7% of the patients with hepatitis C in our series were considered sporadic, and this figu
35 the hepatitis was most likely also due to HCV. In our series, serum HCV RNA was detected right at th
36 EIA3 were highly concordant with serum HCV RNA in our series. Only 1 patient (Patient 18, Table 4) w
37 uction of HBsAg positive HBV carrier chimpanzees. Our study also showed transient suppression of HBs
38 r of units discarded. Nearly 8% of donor units in our study would have been lost if we had screened
39 ther reports. In support of other studies (27) in our study patients with chronic active hepatitis w
40 d in percentages which range from 8.9% to 33%. In our study 13% of the chronic patients showed a spo
41 19), 53% in Italy (20), and 67% in Spain (21). In our study the patient's age and sex, the number of
42 s biweekly for 6 mo. The high frequency of PHT in our study may reflect the high carrier rate for NA
43 H/NANB/H. This question could not be addressed in our study as ALT is not stable in stored blood and
44 non-B hepatitis after cardiac surgery was low in our study (3.2%) as compared with other, similar s
45 rgical procedure rather than the transfusions. In our study there was no difference in the incidence
46 antibody to C 100-3 could be detected. However in our study group viral serology testing was conduct
47 The risk profiles for hepatitis C transfusion in our study population shoe that massive transfusion
48 ients who progressed from acute to chronic PHT in our study was similar to that reported from Bethes
49 on. In addiction, the chronicity rate observed in our study (25/29, i.e., 86%) may be even higher, b
50 mmunity in HCV infection. Sera from 2 subjects in our study were persistently positive by PCR beginn
51 ransfused hospitalised patients. In the design of our study a control group of nontransfused hospita
52 anfusion hepatitis. The most striking findings of our study were that each of 55 cases of probable p
53 s. Despite normal aminotransferase levels some of our study patients may have histologic evidence of
54 y can only be established by prospective studies. Our study is the only Canadian prospective study a
55 nsfusion were more severe than in sporadic cases. Our study, of 83 liver biopsy specimens from patie
56 m (28, 29). Had we excluded these four cases from our study, the difference between the two groups w
57 which might prevent only 30% of cases (19-22). In our study, however, the more clinically severe cas
58 e are inaccuracies in cause-specific diagnosis in our study. There is no reason to believe, however,
59 ion to fatal disease must await the completion of our testing. The majority of cases of non-A, non-B
60 s may contribute to more aggressive disease (25). Our univariate analysis supports the finding of Ma
This tendency is well documented in FHC, contrasting notably with MCB-M, where our study does not appear at all. BNC-WAS, however, provides no less than 171 occurrences of this phrase, and confirms its use in concluding research papers, as well as in a number of recurrent collocations. Almost 60% of the BNC-WAS instances take the form in our study, and there are also clear patterns of associated verbs: suggests, shows, showed, found. While several of these also occur with our study in FHC, none does so more than once, making it impossible to draw conclusions.
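Counts of the kind just described (the proportion of instances preceded by in, and the words that most often follow the phrase) can be approximated with a few lines of code. The following is a minimal sketch in Python, not the concordancing software used for the analyses reported here: the function names kwic, right_collocates and share_preceded_by are hypothetical, the sample text is invented for illustration, and a real corpus would need proper tokenisation and text loading.

```python
import re
from collections import Counter

def kwic(text, phrase, width=40):
    """Return (left, node, right) context triples for each occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def right_collocates(lines):
    """Count the first word that follows the node phrase in each line."""
    counts = Counter()
    for _, _, right in lines:
        words = re.findall(r"[A-Za-z]+", right)
        if words:
            counts[words[0].lower()] += 1
    return counts

def share_preceded_by(lines, word):
    """Proportion of lines whose left context ends with the given word."""
    if not lines:
        return 0.0
    ending = re.compile(r"\b" + re.escape(word) + r"\s+$", re.IGNORECASE)
    hits = sum(1 for left, _, _ in lines if ending.search(left))
    return hits / len(lines)

# Invented toy text (an assumption for illustration, not corpus data).
sample = ("In our study the patient's age was recorded. "
          "Our study suggests a link. However, in our study group testing was late. "
          "Our study shows that risk is low.")
lines = kwic(sample, "our study")
print(share_preceded_by(lines, "in"))
print(right_collocates(lines).most_common())
```

With so few instances, as in FHC, the collocate counts are dominated by words occurring only once, which is the practical reason why a larger corpus is needed before any pattern can be claimed.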
Here again, we find that the larger corpus provides the user with a source of further evidence, a means of testing hypotheses and investigating more detailed patterns with respect to the smaller one. But it also offers other advantages. It shows how far a feature may be generally applicable. Insofar as learners' precise future needs are unpredictable, knowledge of this more general kind is relevant to them, just as it may be relevant for them to be aware that certain features are not more generally applicable, but restricted to a specific domain. The majority of ESP learners come from an EGP background, and one of their requirements is to understand the extent to which their previously-acquired knowledge of the language can be applied in the ESP context. Access to a large general corpus (such as the BNC) allows teachers and learners to compare uses in different text-types, thereby making for an awareness of the distinctive and shared features of the specific domain being investigated. To cite Halliday,
if we recognise departure from a norm, then there has to be a norm to depart from.
If we characterise register variation as variation in probabilities, as I think we must,
it seems more realistic to measure it against observed global probabilities than
against some arbitrary norm such as the assumption of equiprobability in every
case.
(Halliday 1992: 69)
A large balanced corpus provides an opportunity for teacher and learner to relate the characteristics of a specific text-type, as revealed by a specialised corpus, to more general characteristics of the language, seen as a whole.
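One simple way of relating a specialised corpus to a general one is to compare normalised frequencies of the same item in each. The sketch below, in Python, is an illustrative assumption rather than a procedure taken from this paper: per_million and frequency_ratio are hypothetical names, tokenisation is deliberately crude, and the two strings stand in for corpora that would in practice be read from file.

```python
import re

def per_million(text, phrase):
    """Occurrences of phrase per million running words of text."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    if not tokens:
        return 0.0
    hits = len(re.findall(r"\b" + re.escape(phrase.lower()) + r"\b", lowered))
    return hits * 1_000_000 / len(tokens)

def frequency_ratio(specialised, general, phrase):
    """Ratio of the specialised rate to the general rate.

    Returns None when the phrase is absent from the general corpus,
    since there is then no observed norm to measure against.
    """
    general_rate = per_million(general, phrase)
    if general_rate == 0:
        return None
    return per_million(specialised, phrase) / general_rate

# Invented stand-ins for a specialised and a general corpus.
specialised = "our study shows our results"
general = "our house and our study and the garden were sold"
print(frequency_ratio(specialised, general, "our study"))
```

A ratio well above 1 would suggest that the item is characteristic of the specialised domain, a ratio near 1 that it is shared with the language at large; with real corpora a significance measure would also be needed before drawing such conclusions.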
To illustrate this point, let me further focus on the use of study as a noun. The concordance in figure 12 shows 30 randomly selected citations out of a total of 18,967 in the entire BNC. What appears from these citations is that as a noun, study has three main uses, the most frequent being that of a specific research project. Less frequent uses denote a general area of research ("the historical study of coins", "the study of literature/animal perception/experimental philosophy"), and the activity of students in an institution ("study courses", "programmes of study"):
Fig. 12. Study as a noun in the BNC
1 our knowledge of the past. Since the historical study of coins began, in the Italian Renaissance, man
2 Williams made similar deductions following a study carried out in the United States.| All three wr
3 Ramsden and Snee used data from the DHSS Cohort Study. The Cohort Study interviewed a group of unempl
4 CASE STUDY: NEW DIVISION| The division was formed two year
5 The principal conclusion of the study was that all three of the series of square-head
6 eived less attention than the motor aspect, the study of animal perception (as opposed to sensory phy
7 ful that finding was obtained to set up a panel study. A sample of 5,362 of the original children was
8 ugh the BMA's anxieties were not supported by a study carried out by the Department of Health's Advis
9 wever a ž type error cannot be excluded in this study because the prestudy assumption was that microa
10 tion come mainly from a participant observation study at Oxford United and from conversations at a
11 67). Sprat suggested that, of all pursuits, the study of experimental philosophy was most likely to e
12 is acceptable.| In the case of the eye movement study, subjects were presented with an array of four
13 ormation on reactions to imprisonment. In a USA study, Tittle found that women associated with a best
14 m down as much as she could, by semi-obligatory study courses, and quasi-essential trips to the conti
15 4.3.6 The rural hinterland study | This part of the Belfast project does not hav
16 and Vagg, 1988; Waddington et al., 1990). In a study carried out by Morris and Heal (1981), resident
17 here some of its positive implications for the study of literature.
18 on reactions to Chernobyl. Award Title: A case study of consumer acceptance of new technology in a b
19 tive dissemination of research results from the study). Award Title: Choice in scientific work: A com
20 seemed to support notions of repression. In the study (Levinger & Clark, 1961) emotional words were s
21 unts of chemical elements in rocks. Geophysics: study of the Earth by quantitative physical methods.
22 2-term course on Theory and Methods of Literary Study or Theory and Criticism and two l-term courses
23 t entirely supported.| For example, in an ARTEP study of export zones in Sri Lanka, South Korea, the
24 ortant features in patients with DU. and in our study we found that these were both increased in the
25 d for a significantly greater proportion of the study time than in normal controls. This is consisten
26 ilirubin and were included consecutively in the study. Informed written consent was obtained before t
27 ients had surgery during the second year of the study; two with total ulcerative colitis had a colect
28 ded from the date of diagnosis. In the Survival Study, children could move from one treatment group t
29 n's High School and Woodfarm High School.| Case study - Langside College, Glasgow
30 ese regulations are effective for programmes of study starting in and after September 1990.
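Where a search returns far more citations than can be read, as with the 18,967 instances of study here, a random subset can be drawn for inspection. The following is a minimal Python sketch of such a selection, assuming the citations are already held in a list; sample_citations is a hypothetical name, and this is not necessarily how the concordancer used for figure 12 makes its selection.

```python
import random

def sample_citations(citations, k=30, seed=None):
    """Return k randomly chosen citations, kept in their original corpus order.

    A fixed seed makes the selection repeatable, so that teacher and
    learners can be given the same set of lines.
    """
    rng = random.Random(seed)
    if k >= len(citations):
        return list(citations)
    chosen = sorted(rng.sample(range(len(citations)), k))
    return [citations[i] for i in chosen]

# Placeholder strings standing in for real concordance lines.
citations = [f"citation {i}" for i in range(18967)]
subset = sample_citations(citations, k=30, seed=1)
print(len(subset))
```

A teacher could equally edit the resulting subset by hand before passing it to learners, removing lines likely to prove uninterpretable.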
This kind of analysis potentially allows teacher and learner to contextualise uses encountered in a small specific corpus against a broader linguistic background. By revealing patterns not present in the small corpus, it highlights potentially distinctive characteristics of the latter. At the same time it allows other meanings and uses to be noted and incidentally learnt. And here we return to the procedural issue. The process of negotiating meaning in such cases engages users in dealing with unknown and unforeseen text, as they will inevitably have to. There may, of course, be considerable problems in interpreting data drawn from such a wide spectrum, whose content is likely to be less familiar and less predictable to the user (Aston 1997). But there is nothing to stop such data being used selectively. The teacher can edit concordances of this kind before passing them to the learner; learners can focus on those lines of which they are able to make sense, and which appear to be of relevance to their interests. Faced with the concordance above, ESP learners might simply be asked to identify which of these citations use study in the sense of a piece of research - an activity which engages and develops procedures of disambiguation based on collocational and colligational patternings.
Whether the aim is to develop declarative knowledge or procedural knowledge relevant to the learner, or both, it seems to me that there are thus strong arguments for not confining our attention to any one type of corpus. Small specific corpora have obvious virtues in highlighting recurrent specialised features, but only larger and more general ones seem able to capture less specialised ones, and to contextualise such features against a broader spectrum of abilities and awareness. How, in a practical pedagogy, such multiple resources can be exploited to maximum effect is a further methodological issue which goes beyond the limits of this brief discussion.