Sinclair, Corpus, Concordance, Collocation. Ceng Zeng.

Evaluating instances

Introduction

For the last four chapters, we have been studying concordances in one form or another.

Each instance has been taken to be as important as any other, and has had to be accounted for. This is a valuable discipline, but only the very first step towards the automation of text study. In this chapter and the next, we begin to evaluate concordances and devise new kinds of information about language. The vast majority can be safely ...

Throw away your evidence

The policy of discarding examples, and particularly examples which do not fit a description, is likely to have to struggle for popularity in linguistics.

The Cult of the Counter-example is still very strong, in myth if not always in observance, and it is important for students of text to define a careful position in this regard, which will be quite different from that of students of sentences.

Hardly millio... One is forced to conclude that the authors were corpo... In the LOB corpus, for ... Hence, it is important to fix ... There a... If that procedure is adopted in language work, it soon becomes necessary to acquire very ...

...merous instances of grammatical words, sufficient to enable conventional grammatical statements to be made.

They are generally agreed positions on English grammar. The new evidence sugges... The langua... Linguists have had to c... The great dictionaries of English used human beings ... their ... ...prehensive Grammar of the English Language (CGEL; Quirk et al. ...)

... creative, or expedient, or casual, or confused; or they have unusual ... Many words have more than one meaning, sense, or usage, and these occur in very uneven distribution. If we divide and number senses in the conventional dictionary manner, we may discover a statistical relationship between ...

The common... The former choice characterized the field linguists in the first half of this ...

So if we need, say, fifty occurrences ... of a wo... in order to describe it thoroughly, then the corpus has to b...

But wherever this limit is fixed, we shall observe huge disc... The mass ...

...uld exemplify the dominant structural patterns of the language, and this will produce a heavy demand for very long texts.

This is clearly an inadequate point of view, because we do not end up with anything like text by 'generating' word strings from grammars. In particular, ... If text (including, and in particular, spoken text) is not a strict realization of meaningful abstract decisions, then either it is subject to random distortion, or it is in part the result of decisions which are not recorded in the abstract system, but which take precedence over those which are. Random factors will certain... Actual text will always be deviant with respect to structural rules of the conventional kind.

Language i... ...bodied in Saussure's langue and parole or Chomsky's competence and performance. The existence of these dichotomies is to allow us to insulate the abstract system. I have already conceded that some proportion of the complexity of text may be attributable to accidental or random factors, but that is far from sufficient explanation. It may indeed have obscured what actually goes on. In fact, the main sim... It is merely the decoupling of lexis and syntax. For example, it is rare for a grammar to note that a certain structure is only ...

... inattention, confusion, and the need to express ... Another ...

In contrast, grammars attribute independent meaning to syntactic arrangements.

Pedagogical dictionaries are... The implicit object is obvious. But if evidence accumulates to suggest that a substantial proportion of the language description is of this mixed nature ... We can ... The evidence now becoming availab... A phrase can be defined for the moment as a co-occurrence of words which creates a sense that is not the simple combination of the sense of each of the words. One is first struck by the fixity and regularity of phrases, then by their flexibility and variability, then by t...

If sense and structure are not independent of each other and not inseparable, then they must be associated. Here we can frame a hypothesis that can ... Our descriptive task then becomes the identification of the regular and typical associations, leading to the identification of one or more 'citation forms' for each distinct sense. The distinguishing features of the citation forms could then be stated, and explanations could be ...

In the spirit of the preceding argument, I shall define structure as any privileges of occurrence of morphemes; we do not in the first analysis have to decide whether these are lexical ... Is it then best to hypothesize that sense and structure are inseparable? Unfortunately not. ... simplest case, by the same word. If it is much more than incidental ... see it as a sporadic and almost accidental coincidence of ...

The word was chosen as being fairly frequent, over 1,... occurrences in 7... It was found that the first ... ...currences in the 50 most typical. The next pass, omitting the 14 phrases, identified a major sense which was strongly associated with preceding the, occasionally his or her, and with words like first, third, ...

I propose to outline ... which gives promise of valuable results. The same principle can be applied to other structural features. The procedure begins with a machine-generated concordance to a large ... The usual kind ... retrieved, each in the middle of a line of text. A line of text may contain as ... In order to do this, a list is compiled in frequency order, ... These are called the ... of the node. At present we are experimenting with environments of ... There is no point in considering very infrequent collocates, and there is usually a long tail to the frequency lists. A suitable cut-off point, for example less than ten per cent of the frequency of the node, should be determined.
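The frequency-ordered collocate list and its cut-off can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the program used in the study: the function name `collocate_list`, the window size `span`, the whitespace tokenization, and the reading of the cut-off as "discard collocates that occur less often than ten per cent of the node's frequency" are all assumptions made for the example.

```python
from collections import Counter


def collocate_list(tokens, node, span=4, cutoff_ratio=0.10):
    """Count the words occurring within `span` tokens of each `node` occurrence.

    Returns (node_frequency, collocates), where collocates is a list of
    (word, count) pairs in descending frequency order. Collocates whose
    count is below cutoff_ratio * node_frequency are dropped (an assumed
    reading of the "ten per cent of the frequency of the node" cut-off).
    """
    counts = Counter()
    node_freq = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        node_freq += 1
        # Words to the left and right of the node, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(w for w in window if w != node)
    threshold = cutoff_ratio * node_freq
    kept = [(w, c) for w, c in counts.most_common() if c >= threshold]
    return node_freq, kept


# Hypothetical toy text; a real study would use a large corpus.
text = "the second time the second place a second chance".split()
node_freq, collocates = collocate_list(text, "second", span=1, cutoff_ratio=0.5)
print(node_freq, collocates)
```

With a corpus of realistic size the default ratio of 0.10 would be the natural starting point; the higher ratio in the toy call is there only so that the cut-off visibly removes the one-off collocates.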

The next pass c... A number of similar in... The two main meanings of second, then, are associated one with definiteness and the other with indefiniteness. This is at least as important as the observation that one is a modifier and one a noun. There is, however, a third fairly prominent use of second which does not emerge in the collocational analysis. This requires neither a definite nor an indefinite determiner, and the word functions as a discourse organizer. It is quite often preceded by and. It is not ... No doubt a study of secondly would identify the third sense of second as a discourse ...

...pus. ...tes, by adding up the weightings of each collocate ... ment. The concordance is now re-sorted into an order ... and the most typical instances should come to the ... shed, and the study continues largely on a subjective basis. First, any obvious phrases are identified and removed. Next, there is a search for the clustering of collocates and their mutual attraction and repulsion, for example, pairs and groups of collocates which frequently ... an attempt is made to isolate a sense, using explicit criteria. When a sense is fully described, all the lines that exemplify it are then removed, ...
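The scoring, re-sorting, and removal steps described in this passage can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original procedure: the source does not say how the weightings are derived, so the `weights` dictionary is simply supplied by the caller (collocate frequencies would be one plausible choice), each distinct collocate is counted once per line, and the names `typicality`, `resort`, and `remove_described` are hypothetical.

```python
def typicality(line, node, weights):
    """Add up the weights of the distinct collocates present in one line."""
    words = set(line.lower().split()) - {node}
    return sum(weights.get(w, 0) for w in words)


def resort(lines, node, weights):
    """Re-sort concordance lines so the most typical instances come first."""
    return sorted(lines, key=lambda ln: typicality(ln, node, weights), reverse=True)


def remove_described(lines, is_described):
    """Drop the lines that exemplify a sense already described,
    leaving a smaller concordance for the next cycle."""
    return [ln for ln in lines if not is_described(ln)]


# Hypothetical weights and concordance lines for the node "second".
weights = {"the": 2, "time": 1}
lines = ["a second chance", "the second time", "the second place"]

ranked = resort(lines, "second", weights)
remaining = remove_described(ranked, lambda ln: ln.startswith("the "))
print(ranked)
print(remaining)
```

The `is_described` test stands in for the explicit criteria that isolate a sense; in practice it would be written by the analyst for each sense in turn, which is why it is passed in as a function.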

Gradually, this procedure should identify the distinct senses of a word. Each cycle will, however, reduce the size of the remaining ...

The study of seconds might add new features and new uses, and so on. In due course, we shall see. The technique has at least managed to isolate the most basic contrast of meaning.

The model of a highly generalized formal syntax, with slots into which fall neat lists of words, is suitable only in rare uses ... By far the majori...

Most everyday words do not have an independent meaning, or meanings, but are components of a rich repertoire of multi-word patterns that make up text. This is totally ...

This chapter concludes the description of word co-occurrence as we currently conceive it.

... collocations, and the project is in hand (Sinclair et al.).

Two models of interpretation

It is contended here that in order to explain the way in which meaning arises from language text, we have to advance two different principles of interpretation.

One is not enough. No single principle has been advanced which accounts for the evidence in a satisfactory way. The two principles are ...

The open-choice principle

This is a way of seeing language text as the result of a very large number of complex choices.


Corpus, concordance, collocation / John Sinclair.

John Sinclair charts the emergence of a new view of language and the computer technology associated with it. Developments in computational linguistics over the past ten years are outlined. There is discussion of corpus creation and exemplification of corpus use.



