Concordancing with Language Learners: Why? When? What?

Vance Stevens, Courseware Publishing International

Published in CAELL Journal, vol 6 #2, Summer 1995 pp. 2-10


A history of concordancing in literature and linguistic analysis begins well before computers were brought to the task. Tribble and Jones (1990) briefly trace the history of concordancing from the 13th century, when Hugo de San Charo enlisted 500 monks in producing a complete concordance of the Latin Bible. Fortunately, current applications of concordances in applied linguistics are not nearly so labor intensive.

The use of concordancing as a tool for language learning is a much more recent phenomenon, dating largely from the 1980s, when computing power was scaled down into small, affordable personal computers that have since appeared on the desks of even the humblest of teachers. Skeehan (1981) mentions concordancing in relation to ESP, but only briefly, and more in the context of mainframe or minicomputer applications, though he does assess its potential for ESP as "considerable". At a time when contemporary books on CALL rarely mentioned the topic, Higgins and Johns (1984) presented one of the first practical discussions of concordancing in language learning, while Ahmad et al. (1985) only mentioned concordancing as a means of exploiting text (pp.126-7). In 1987, Goethals presented a paper at the AILA conference in Sydney in which concordancing was reported to have been used in the preparation of CALL-based vocabulary materials (leading, I believe, to the production of Adam & Eve). As a matter of perspective, Jones and Fortescue (in the same year, 1987) didn't mention concordances at all, while two years later, Hardisty and Windeatt (1989) included several activities using concordancers in their CALL resource book. Obviously, by the latter date, a popular awareness of the power of concordancing in language learning was beginning to emerge.

To paraphrase Stevens (1993), language teachers fall into three groups: those who have never heard of concordances, those who haven't yet taken them seriously, and those who swear by them. Many teachers in the last group feel that concordancers are the type of software that most closely approaches fulfilling the potential of computers in language learning. In a sense, they are working approximations of expert systems. They bring cognitive and analytic skills in students to bear on the manipulation of comprehensive databases for the purpose of solving real-language problems.


Why does Johns (1988) use concordancing in language learning? First, it interjects authenticity (of text, purpose, and activity) into the learning process. Second, learners assume control of that process. And third, the predominant metaphor for learning becomes the research metaphor, as embodied in the concept of data-driven learning (DDL), which builds learners' competence by giving them access to the facts of linguistic performance. As Johns puts it (1991a), "we simply provide the evidence needed to answer the learner's questions, and rely on the learner's intelligence to find answers." (p. 2) Support for this approach can be found in McDonough (1986:31) who points out that with deductive methods of learning grammar: "First, it is unlikely that the students can understand the rule statement until they have tested it against the various examples ... Secondly, [giving a rule first] imposes a rule formulation rather than encouraging the student to make one up in his own terms. In cases where the discrimination necessary is relatively simple, an imposed classification is usually less easy to remember and therefore less efficient than one invented for oneself." With DDL, the learner's own discovery of grammar based on evidence from authentic language use becomes central to the learning process. Yet DDL is distinct from other inductive models of learning in that the teacher facilitates student research into the language without knowing in advance what rules or patterns the learners will discover.

Aside from these benefits, concordancing is efficient, and the potential for innovation is great. In Stevens (1990:5) I characterized concordancing as one form of text manipulation and pointed out that it is "economical in terms of time to implement because it requires only a program plus a text base, where the text base could be the concatenated sum (or subset, or superset) of all the texts used for text reconstruction. Because text reconstruction and concordance programs could easily feed off the same text base, they can be combined." As an example, I cite Cobb (1992), who uses concordances as a form of help in cloze exercises. When students want to know more about the word in a gap, they can see a concordance of that word used elsewhere in the text base, but with the word itself masked.
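The masked concordance just described can be illustrated with a short sketch. This is not Cobb's implementation, only a minimal illustration under stated assumptions: the function name `masked_concordance`, the `width` and `mask` parameters, and the two-sentence sample text base are all invented here for the purpose.

```python
import re

def masked_concordance(texts, word, width=30, mask="_____"):
    """Return context lines for `word`, with every occurrence hidden.

    Hypothetical helper: `texts` is a list of strings standing in for
    the text base; `width` is the number of characters of context kept
    on each side. Assumes the mask string does not occur in the texts.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        flat = " ".join(text.split())      # collapse line breaks and extra spaces
        masked = pattern.sub(mask, flat)   # hide the word everywhere first
        for hit in re.finditer(re.escape(mask), masked):
            left = masked[max(0, hit.start() - width):hit.start()]
            right = masked[hit.end():hit.end() + width]
            lines.append(left + mask + right)
    return lines

# Invented sample text base, for illustration only.
corpus = [
    "Heat is a form of energy. Energy can be stored or released.",
    "The potential energy of the ball depends on its height.",
]
for line in masked_concordance(corpus, "energy"):
    print(line)
```

Masking the word throughout the text before cutting out the context windows keeps a second occurrence in the same line from giving the answer away.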

Many other novel uses for concordancing in language learning will be cited further on. But first, I will mention where you can get both the mill and its grist: concordancers, and the large quantities of text they need.


Concordancers are at everyone's fingertips. Pienemann and Jansen (1992:208) point out that "the function of a concordance program is included in standard database systems," and Rezeau (1988) reports ways of exploiting Ashton Tate's Framework for concordancing (as per Tribble, 1990). Simpler still, Stevens (1991a) reveals the following DOS command which will, when run as a batch file with two parameters (search-string and output device) concordance the text contained in the files specified within parentheses and send the data to screen, file or printer:

for %%a in (*.asc) do find "%1" %%a >> %2

Thus there is no reason why an MS-DOS-using educator should be without this simple yet powerful tool.
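For readers without MS-DOS, roughly the same behaviour can be sketched in Python. This is an approximation rather than a reimplementation of the DOS `find` command: the function name `find_lines` is hypothetical, the default `*.asc` pattern simply mirrors the batch line above, and the line numbers in the output are an addition.

```python
import glob

def find_lines(search_string, file_pattern="*.asc"):
    """Yield (filename, line_number, line) for every line containing the string.

    Hypothetical helper. Matching is case-sensitive, and bytes outside
    ASCII are replaced rather than raising an error.
    """
    for filename in sorted(glob.glob(file_pattern)):
        with open(filename, encoding="ascii", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if search_string in line:
                    yield filename, number, line.rstrip("\n")

# Example use (assumes some *.asc files exist in the current directory):
#     for filename, number, line in find_lines("could"):
#         print(f"{filename}:{number}: {line}")
```

As with the batch file, redirecting the printed output sends the results to a file instead of the screen.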

Of course, the preferred option is to buy a commercially available concordancer, and the contrastive review of commercially available concordancers in Higgins (1991a) provides an excellent overview of these. A concordancer intended for use with students should be fast and responsive. It must load quickly, so as to not waste valuable class time simply accessing its database. Sorts on output should be instantaneous and permitted both one-over and proximally left and right of the key word. Because second-language learners frequently make mistakes with string specifications, loading and query should be interruptible at any point, and the user should be able to work with the portion of the corpus or concordance processed up to that point if desired (Cobb, 1994, has implemented an interesting idea: process the text for each unique string and then limit learner choice to exactly those strings). Keyword selection should allow spaces, to permit searching for strings comprising more than one word. It also helps if more than one keyword can be searched at one time (e.g. both 'could' and 'would'), if Boolean operators can be used ('could' or 'would' but not 'mould') and if keywords can contain wild cards (e.g. * to stand for any number of characters; ? to represent a single character; hence '?ould').
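To make some of these features concrete, here is a minimal keyword-in-context sketch that accepts several keywords at once (including multi-word strings) and sorts the output by the word immediately left or right of the keyword. It does not describe any commercial product; the names `kwic`, `sort_left` and `sort_right`, the `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text, keywords, width=30):
    """Return (left, keyword, right) tuples for each whole-word hit."""
    flat = " ".join(text.split())  # single spaces, so multi-word keys match
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def _words(fragment):
    """Lower-cased words in a context fragment, punctuation ignored."""
    return [w.lower() for w in re.findall(r"[\w']+", fragment)]

def sort_right(hits):
    """Order hits by the first word to the right of the keyword."""
    return sorted(hits, key=lambda h: _words(h[2])[:1])

def sort_left(hits):
    """Order hits by the last word to the left of the keyword."""
    return sorted(hits, key=lambda h: _words(h[0])[-1:])

# Invented sample text, for illustration only.
sample = ("She could go if she would try. We would help, and they "
          "could stay. The mould would spread.")
for left, key, right in sort_right(kwic(sample, ["could", "would"])):
    print(f"{left:>30} [{key}] {right}")
```

Sorting on the neighbouring word is what brings recurrent patterns together on the screen, which is why the text above asks for it to be fast.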


Machine-readable text is ubiquitous; so there is little to prevent the steady accumulation of material towards building a sizable corpus of text. An obvious source is the Internet, with its wealth of text that can be downloaded. It is also possible in the USA to use closed caption television, especially now that closed caption has been mandated for all TV sets sold there, and to send transcribed text to disk using a transcoder such as that produced by Pacific Lotus Co. Tribble (1990) notes the benefit of using student texts. Stevens (1991b) suggests using materials created for students on word processors by teachers in various departments in the educational environment (over a quarter million words of such text as well as science lectures transcribed in the course of research by John Flowerdew, Roger Griffiths, and Chris Arden-Close have been made available through the MS-DOS Users' Group, CALL-IS, TESOL). Norm Johnson (personal communication) has suggested scanning in U.S. Government Printing Office documents. Titles of uncopyrighted publications are listed in the periodically issued New Books: Publications for sale by the Government Printing Office, available from the Superintendent of Documents, US Government PO, Washington, D.C. 20402. Finally, there is text available on CD-ROM, such as that which can be downloaded from ENCARTA, which could be exploited for concordancing purposes.

For teachers in need of ready-made corpora, MICROCONCORD (Oxford University Press) is supplied with an optionally purchasable corpus of business and general English texts, a million words each. Some other examples of specially prepared corpora are the Oxford Tape Archive cited in Louw (1991), and the Brown University Corpus and the Lancaster-Oslo/Bergen (LOB) Corpus discussed in Sampson's (1992) survey of machine-readable analysed corpora of English. In addition, the Longman / Birkbeck Corpus of Learner English (ASCII text annotated for text type, L1, nationality, level, etc.) is mentioned in Tribble (1989).


One of the most interesting aspects of using concordances with students is the insight into the language that this grants teachers as well as students. Sinclair (1986:202) notes that such computer tools "challenge our current linguistic descriptions quite fundamentally", freeing linguists from reliance on intuition and enabling them to "find explanations that fit the evidence, rather than adjusting the evidence to fit a pre-set explanation."

Ready availability of concordance tools has recently spawned a rash of research into features of English, one of many languages studied in this way. Fox (1991), for example, produces evidence from the COBUILD project to suggest that verbs are rarely absolutely transitive or intransitive in English, and that attempting to classify them as such is misleading. Louw (1991) mentions that one result of work with COBUILD has been the Lexical Syllabus, the principle behind which "is that if the top 2000 or so most frequent words in English are systematically taught in all of their forms and in well-structured materials, they will carry with them most of the grammatical and discourse detail that second and foreign language learners are ever likely to need." (p.152; see Sinclair & Renouf, 1987, for more on the Lexical Syllabus). Louw has used concordances to study progressive delexicalization, the phenomenon by which words tend to lose their "dictionary" meaning; e.g. 'take the money', 'take a bus', 'take a look'.

Tribble (1991) used concordances to help distinguish between informal and formal register by looking at noun phrases and post-modification, verb phrase structure, and theme/grammatical structure. Garton (1991) used transcripts of teacher talk to discover that of 53 questions (discerned by concordancing the string '?'), only 16 were meant to elicit information; the others were rhetorical or for nominating, reformulating, or confirming student responses. Higgins (1991b) used a concordancer to find out what analogous forms there are to 'the *ing of' in order to discover what linguistic data may have prompted one of his students to produce the phrase 'the remaining of his army'.

In the remaining of this article, I would like to focus on what concordances can teach students (and their teachers) who use them in the classroom.


Concordancers are certainly not tools that computer novices can be turned loose on without proper preparation beforehand. In many instances, both students and teachers must be made aware of the methodological considerations underpinning use of such software. Inherent limitations in the database are rarely intuitively understood. Why, for example, should the word 'potential' never occur in a corpus of biology readings, yet occur repeatedly in a corpus of physics texts, always as a property of energy? The relationship between raw data and output is not obvious to all, and the very existence of the text base, its particular bias, and its relevance to the students must all be explained and emphasized. Formulation of productive queries is particularly difficult for language learners, who may need assistance until they have become familiar with the technique. Misspellings which spoil productive searches are common, and successful use of wild cards requires near-native competence in anticipating word derivations. It is also difficult for language learners to independently phrase queries so that they will expose subtle patterns in the language. Such patterns will likely have to be pre-considered by the teacher/facilitator, and until students have got the hang of concordancing, heuristics for getting at patterns will likely have to be worked out in advance and spelled out to students as well.

Stevens (1993:11) highlights the complexity of such a heuristic for studying conditionals through concordance of the string 'if': "As the verb in the 'then' clause might either precede or follow at some distance from that of the 'if' clause, students need to know how to extract more than one line of context surrounding the target string. And they should be warned that the 'then' portion of the conditional is sometimes implied and thus impossible to locate in the corpus. As a final exercise, students could be asked to compare their findings with the formulas for conditional sentences often taught in grammar books. The discrepancy is sometimes so large as to render further use of the grammar book potentially embarrassing. Obviously, you wouldn't want to use such insightful tools if you feel that your 'teaching' is more valid than your students' 'learning'." Similar observations have prompted Johns (1991b:31) to conclude that "the description of English underlying our teaching ... needs major reassessment."

Toward such reassessment, Johns (1991a) convincingly (persuasively?) addresses the question: "Teacher, what is the difference between 'convince' and 'persuade'?" The concordancer finds that 'convince' is usually followed by a 'that' clause, and 'persuade' by 'to', leading one student to note that 'that' clauses were more factual, which seemed to Johns a better explanation than his own. A second exploration was prompted by a student who sometimes found 'should's that aren't real 'should's. These turned out to be factive 'should's ('it is surprising that the New Scientist should have convinced itself ...'). Another insight was that epistemic and deontic 'should's both tell what will happen if the rules are followed, which Johns thinks is why they translate the same in many languages.

Tribble (1989) uses concordance output to determine and then teach differences between native speaker and non-native speaker usage of "procedural lexis". Tribble made a list of all words occurring at least 5 times in a native-speaker corpus ("less than five gets unwieldy." p.12) and extracted all non-content items from this list. The same words in a NNS corpus were concordanced and differences found in use of words such as 'however', 'moreover', 'nevertheless', 'thereby', 'therefore', 'thus', 'while', and 'yet'. For example, only 25% of the NS use of 'however' was sentence initial, while 81% of the 48 'however's in the NNS corpus were. "This type of proportional differentiation was maintained across all of the samples drawn from the two text corpora and was supported by a concordance of the same items run on the small Queen Mary College file of student academic scripts" (p.13). Tribble recommends having students do the same kind of analysis: "I have already found it particularly fruitful to give students the opportunity to engage in the sort of analysis considered in this paper. By giving students the responsibility for assessing and constructing models for effective academic or formal writing I have found a way of avoiding the imposition of prescriptive and (frequently) inappropriate modes of expression and creating an enhanced awareness of the meaning potential of English as well as bringing about an improved performance in writing tasks" (p.13).
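The kind of proportional comparison Tribble reports for 'however' is easy to approximate. The sketch below counts how many occurrences of a word open a sentence. It is not Tribble's procedure: the function name `sentence_initial_share` is hypothetical, the sentence splitter is deliberately naive (it breaks after '.', '!' or '?' followed by white space), and the two sample passages are invented, not drawn from his corpora.

```python
import re

def sentence_initial_share(text, word):
    """Return (sentence_initial_count, total_count) for `word` in `text`."""
    # Naive sentence split; abbreviations and quotations will confuse it.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    initial = total = 0
    for sentence in sentences:
        matches = list(pattern.finditer(sentence))
        total += len(matches)
        if matches and matches[0].start() == 0:
            initial += 1  # the word is the very first thing in the sentence
    return initial, total

# Invented passages standing in for a native-speaker (NS) and a
# non-native-speaker (NNS) corpus.
ns_text = ("The results were clear. The method, however, was slow. "
           "However, it was cheap. Costs fell, however.")
nns_text = ("However, the test was hard. However, we passed. "
            "We were, however, tired.")

for label, text in (("NS", ns_text), ("NNS", nns_text)):
    initial, total = sentence_initial_share(text, "however")
    print(f"{label}: {initial} of {total} sentence-initial")
```

Students could run such a count on their own writing and on a reference corpus, then turn to the concordance lines themselves to see what the non-initial uses look like.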

Tribble (1990) presents many other ideas for classroom concordancing, including the investigation of:

- full stops for insights into thematic structure of text;

- specific nouns for range of adjectival collocations;

- *LY for all derived adverbs and information on verb/adverb word order;

- forms of 'come';

- UN*ED ('unfinished', 'unchanged' etc.);

- 'there is/are/seems'.

Exercises are also described where students do a contrastive study of tricky prepositions. An exercise on articles leads to a flowchart determining article use and application of the same chart on concordance output which has blanks before nouns. Tribble claims (anecdotally) that this technique leads to gains with Thai and Punjabi students in particular, and allows students and teachers to discuss problems in a shared metalanguage. Tribble concludes that the concordancer (simple to use, requiring no authoring, and having an exceptionally wide range of applications) "will perhaps be the pre-eminent software tool in this next stage in the development of computer assisted language learning" (p.15).

Stevens (1991b) details a technique for creating vocabulary exercises from concordance output (the effectiveness of which is studied in Stevens, 1991c). Essentially the same technique, independently derived, is reported in Butler (1991), who reports also that such exercises are a quick and reliable means of assessing expectancy grammar for placement of students (Oller, 1976; Feldmann and Stemmer, 1987). Regarding the preparation of such exercises, Johns (1991b) warns that bias can enter into concordance data if the materials preparer selects data based on preconceived notions of what ought to be there, or on pedagogic grounds (e.g. includes only self-contained or the most illustrative examples).

Higgins (1991b) says that concordancing accounts for "well over half" the computer work he does with students. He mentions productive work on explorations of "LISTEN and HEAR, TAKE BRING and FETCH, ANY, *CAUSE, and of course now with *ING" (p.6). Another project is with *ATE homographs that can be pronounced /eit/ or /it/ depending on part of speech (he doesn't mention *USE homographs that can be voiced or not as verbs or nouns). Elsewhere, Higgins suggests discovering, of all words in English ending in '-id', what percent have negative connotations (1991c). Similar practical advice is found in Tribble and Jones (1990), who suggest several possible search patterns; e.g.

- on sentence endings for theme/rheme patterns

- on semantically related "headwords"

- on ??????ed for words like 'invested', 'deprived', 'collated', 'achieved', 'despised', 'destined', 'detailed', 'returned'

- on *???ing to rule out 'bring' and 'sing'

- on phrases such as 'while this is' or 'interest in'

- on collocations like 'pretty ... tired/well/uncomfortable'

- on 'it ... that'

- and on 'there ... (is/are/were/being/be/been)'.
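Search patterns of this kind map readily onto regular expressions. The sketch below treats '?' as exactly one word character and '*' as any number of word characters, and adds a separate helper for gapped patterns such as 'it ... that'. The function names (`wildcard_to_regex`, `find_words`, `gapped`) and the `max_gap` limit of four intervening words are assumptions made for illustration; individual concordancers may interpret wild cards differently.

```python
import re

def wildcard_to_regex(pattern):
    """Compile a whole-word regex from a pattern using ? and * wild cards."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # any number of word characters
        elif ch == "?":
            parts.append(r"\w")    # exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

def find_words(text, pattern):
    """Return every word in `text` matching the wild-card pattern."""
    return wildcard_to_regex(pattern).findall(text)

def gapped(first, second, max_gap=4):
    """Compile a regex for `first ... second` with up to `max_gap` words between.

    The gap may cross sentence boundaries in this simple version.
    """
    return re.compile(
        r"\b" + re.escape(first) + r"\b(?:\W+\w+){0," + str(max_gap) + r"}?\W+"
        + re.escape(second) + r"\b",
        re.IGNORECASE,
    )

# Invented sample sentence, for illustration only.
sample = "He invested in training and was detailed to bring the singing ring."
print(find_words(sample, "??????ed"))   # eight-letter words ending in 'ed'
print(find_words(sample, "*???ing"))    # '-ing' words, excluding 'bring' and 'ring'
print(find_words("could would mould should", "?ould"))
```

Requiring three characters before 'ing' is what screens out short words such as 'bring' and 'sing', exactly as in the search pattern listed above.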

Concordancing has been used effectively in ESP. Ilse (1991) for example reports on a vocational training lesson in which banking students concordanced words like share*, debenture*, unit* etc. and then compiled the information in text files. Students reported results to the class at large, and found the concordance program "fascinating."

Mparutsa et al. (1991) had students study English by means of concordances of textbook materials in different subjects. In economics, searches were done on common Greek and Latin roots, locating definitions, and discovering differences in American and British usage. With geology texts, concordance output supported analysis of texts in terms of a "cycle" of process in the following way: concordances were started on caus*, moved on to verbs of process, and then analyzed in terms of a model ('as a result' was identified as problematic because it tends to have the cause in the previous sentence). With philosophy texts, students queried the database for abstract objects. A "keen" volunteer trialed tasks and became a tutor when it came time for students to perform tasks, such as distinguishing when words were used in a technical or everyday sense; deciding what marked definitions or when concepts were dealt with generally vs. specifically; and locating words denoting categories and their components.

One aim of the project was to determine how Zimbabwean students, whose exposure to research and technology is minimal, would do with concordancing. At the start of university training, these students' reading was "rarely analytical, and frequently unreflective" and they "view the textbook as a monolithic authority, and read it in a linear fashion." (p.130) The concordancer "challenges the role of a set text in the learning process. The text shifts from being an inviolable authority to something which students can question, explore and hopefully come to understand." It provided a "window" of access to text structure, allowing students to interact with text actively and analytically. It "liberated the students from a linear approach to reading, and assisted them to see patterns which might be in counterpoint to the linear progression of the text [and] their texts as resources to be explored for individual learning ... the potential for transfer of the skills developed in the concordancing sessions to regular individual study appeared high." Once they'd seen how to use the program, students "set their own agendas for use of the program, as illustrated in the case studies." (p. 131) Motivation ran high, and peer learning was encouraged.


These examples support Higgins's (1991b:5) contention that "What is slowly becoming clear is that the most valuable contribution a computer can make to language learning is in supplying, on demand and in an organised fashion, masses and masses of authentic language. ...The most powerful of these tools is a concordancer." (p.6) Accordingly, Stevens (1993:11) maintains that "with concordance software and a corpus of natural English, language learners can short-cut the process of acquiring competence in the target language, because the computer is able to help students organize huge amounts of language data so that patterns are more easily discerned." Tribble (1990:11) sums it up quite nicely: "What the concordancer does is make the invisible visible."

Why then isn't concordancing more widely used by language learners and their teachers? I think the conclusion to Stevens (1990:8) bears repeating and sums up this article as well:

"Predisposition to text manipulation requires acceptance of the notion that language learners can benefit from teaching materials promoting inductivity, authenticity, and learner responsibility for learning. Whereas these ideas underpin current language methodologies, teachers may tend toward traditional ways of instruction, especially when change involves massive retooling and when students seem most comfortable with traditional roles. Particularly where technology is involved, there is much ignorance, misunderstanding, and "indifference" to putting into practice new approaches to language teaching while acquiring new skills in operating complex hardware and software. Although text manipulation is conveniently implemented and consistent with current language learning pedagogy, its benefits are difficult to intuit; hence the genre is easily misunderstood. Education of teachers and students on their roles and responsibilities in learning, and the relationship of these to CALL, is a desirable solution to this problem."


