15 Language Corpora


The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (written, spoken, or a mixture of the two), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes, however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.

These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed ‘core’ language. A corpus may be made up of whole texts or of fragments or text samples. It may be a ‘closed’ corpus, or an ‘open’ or ‘monitor’ corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section 15.5 Recommendations for the Encoding of Large Corpora). For simplicity, therefore, our discussion largely concerns ways of encoding closed corpora, regarded as single but composite texts.

Language corpora are regarded by these Guidelines as composite texts rather than unitary texts (on this distinction, see chapter 4 Text Structure Module). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.

Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter vi Languages and Character Sets); nor do we attempt to summarize the variety of elements provided for encoding basic structural features such as quoted or highlighted phrases, cross references, lists, notes, editorial changes and reference systems (see chapter 3 Core Module). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter 6 Verse), drama (chapter 7 Performance Texts), transcriptions of spoken text (chapter 8 Transcriptions of Speech), etc. Chapter 1 TEI Infrastructure should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.

This chapter does, however, include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section 15.1 Varieties of Composite Text). Section 15.2 Contextual Information describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section 15.3 Associating Contextual Information with a Text discusses a mechanism by which individual parts of the TEI Header may be associated with different parts of a TEI-conformant text. Section 15.4 Linguistic Annotation of Corpora reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section 15.5 Recommendations for the Encoding of Large Corpora provides some general recommendations about the use of these Guidelines in the building of large corpora.

15.1 Varieties of Composite Text

Both unitary and composite texts may be encoded using these Guidelines; composite texts, including corpora, will typically make use of the following tags for their top-level organization.
  • teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
  • TEI (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
  • teiHeader (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
    type specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text.
  • text contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
  • group contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.
Full descriptions of these may be found in chapter 2 TEI Header (for teiHeader), and chapter 4 Text Structure Module (for teiCorpus, TEI, text, and group); this section discusses their application to composite texts in particular.
In these Guidelines, the word text refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term composite text refers to texts within which other texts appear; the following common cases may be distinguished:
  • language corpora
  • collections or anthologies
  • poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
  • otherwise unitary texts, within which one or more subordinate texts are embedded
The elements listed above may be combined to encode each of these varieties of composite text in different ways.

In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.

The teiCorpus element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate TEI elements, and the entire corpus is enclosed in a teiCorpus element. Each sample has the usual structure for a TEI document, comprising a teiHeader followed by a text element. The corpus, too, has a corpus-level teiHeader element, in which the corpus as a whole, and the encoding practices common to multiple samples, may be described. The overall structure of a TEI-conformant corpus is thus:
<teiCorpus>
 <teiHeader type="corpus"/>
 <TEI>
  <teiHeader type="text"/>
  <text/>
 </TEI>
 <TEI>
  <teiHeader type="text"/>
  <text/>
 </TEI>
</teiCorpus>

Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the teiHeader element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section 15.2 Contextual Information, as well as in chapter 2 TEI Header. Information of this type should in general be specified only once: a variety of methods are provided for associating it with individual components of a corpus, as further described in section 15.3 Associating Contextual Information with a Text.

In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The teiCorpus element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged TEI.

If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the group element described below and in section 4.3.1 Composite Texts. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section 15.2.1 The Text Description, and those for associating declarations with corpus components described in section 15.3 Associating Contextual Information with a Text. These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.

Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.

Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.

The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Text Structure Module.
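
As a minimal sketch of this use of group, a corpus with a fixed division into two subcorpora might be encoded as a single composite text containing nested groups. The division into reportage and editorials, and the values given for the n attribute, are assumptions made purely for illustration:

```xml
<TEI>
 <teiHeader>
<!-- header for the composite text as a whole -->
 </teiHeader>
 <text>
  <group>
   <group n="reportage">
    <text>
<!-- first reportage sample -->
    </text>
    <text>
<!-- second reportage sample -->
    </text>
   </group>
   <group n="editorial">
    <text>
<!-- first editorial sample -->
    </text>
   </group>
  </group>
 </text>
</TEI>
```

Note that in such an arrangement the samples share a single header, which is one reason why the classification mechanisms described in this chapter may often be preferable.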

Some composite texts, finally, are neither corpora, nor anthologies, nor cyclic works: they are otherwise unitary texts within which other texts are embedded. In general, they may be treated in the same way as unitary texts, using the normal TEI and body elements. The embedded text itself may be encoded using the text element, which may occur within quotations or between paragraphs or other chunk-level elements inside the sections of a larger text. For further discussion, see chapter 4 Text Structure Module.
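
The following sketch shows one way in which such an embedded text might appear between the paragraphs of a larger text, following the description given here. The content is schematic, and schemas derived from other versions of these Guidelines may provide a different element (such as floatingText) for this purpose:

```xml
<body>
 <div>
  <p>
<!-- narrative introducing the embedded text -->
  </p>
  <text>
   <body>
    <p>
<!-- the embedded text, for example a letter -->
    </p>
   </body>
  </text>
  <p>
<!-- narrative resumes -->
  </p>
 </div>
</body>
```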

All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts can be encoded using the same module, then no problem arises. If, however, they require different modules, then these must be included in the schema. This process is described in more detail in section 1.1 TEI Modules.
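
In an ODD customization, this might be sketched as a schemaSpec which lists the modules required. The value of ident is a hypothetical name, and the choice of the verse and spoken modules simply assumes a corpus which mixes poetry with transcribed speech:

```xml
<schemaSpec ident="myCorpus" start="teiCorpus TEI">
<!-- modules generally needed by any TEI schema -->
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
<!-- the module defined by this chapter -->
 <moduleRef key="corpus"/>
<!-- modules assumed to be needed by particular component texts -->
 <moduleRef key="verse"/>
 <moduleRef key="spoken"/>
</schemaSpec>
```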

15.2 Contextual Information

Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).

Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI Header, as described in chapter 2 TEI Header. In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section 15.3 Associating Contextual Information with a Text below.

Chapter 2 TEI Header, which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section 2.2 File Description); information about the encoding practices followed with the corpus, for example its design principles, editorial practices, reference system, etc. (see section 2.3 Encoding Description); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section 2.4 Profile Description); and version information documenting any changes made in the electronic text (see section 2.5 Revision Description).

In addition to the elements defined by chapter 2 TEI Header, several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being ‘for language corpora’.

When the module defined in this chapter is included in a schema, a number of additional elements become available within the profileDesc element of the TEI Header (discussed in section 2.4 Profile Description).
  • textDesc (text description) provides a description of a text in terms of its situational parameters.
  • particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
  • settingDesc (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.
These elements, members of the model.profileDescPart class, are discussed in the remainder of the chapter.
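
Schematically, and assuming that the module defined in this chapter has been included in the schema, a header using all three elements might be organized as follows (the comments stand in for content described in the sections indicated):

```xml
<teiHeader>
 <fileDesc>
<!-- file description -->
 </fileDesc>
 <profileDesc>
  <textDesc>
<!-- situational parameters: see section 15.2.1 -->
  </textDesc>
  <particDesc>
<!-- participants: see section 15.2.2 -->
  </particDesc>
  <settingDesc>
<!-- settings: see section 15.2.3 -->
  </settingDesc>
 </profileDesc>
</teiHeader>
```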

15.2.1 The Text Description

The textDesc element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any a priori theory of text-types. It is provided as an alternative or a supplement to the common use of descriptive taxonomies to categorize texts, which is fully described in section 2.4.3 Classification and section 2.3.6 Classification Declaration. The description is organized as a set of values and optional prose descriptions for eight situational parameters, each represented by one of the following eight elements:
  • channel (primary channel) describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc.
    mode specifies the mode of this channel with respect to speech and writing.
  • constitution describes the internal composition of a text or text sample, for example as fragmentary, complete, etc.
    type specifies how the text was constituted.
  • derivation describes the nature and extent of originality of this text.
    type categorizes the derivation of the text.
  • domain (domain of use) describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc.
    type categorizes the domain of use.
  • factuality describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world.
    type categorizes the factuality of the text.
  • interaction describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary etc.
    type specifies the degree of interaction between active and passive participants in the text.
    active specifies the number of active participants (or addressors) producing parts of the text.
    passive specifies the number of passive participants (or addressees) to whom a text is directed or in whose presence it is created or performed.
  • preparedness describes the extent to which a text may be regarded as prepared or spontaneous.
    type a keyword characterizing the type of preparedness.
  • purpose characterizes a single purpose or communicative function of the text.
    type specifies a particular kind of purpose.
    degree specifies the extent to which this purpose predominates.

These elements constitute a model class called model.textDescPart; new parameters may be defined by defining new elements and adding them to that class, as further described in 23.2 Personalization and Customization.
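
By way of illustration only, a new parameter might be declared in an ODD customization along the following lines. The element name formality and its namespace URI are hypothetical, and the content model is expressed here in RELAX NG, which is an assumption about the version of the ODD language in use:

```xml
<elementSpec ident="formality" ns="http://www.example.org/ns/corpus" mode="add">
 <desc>describes the degree of formality of the text (a hypothetical parameter).</desc>
 <classes>
<!-- membership of this class makes the element available within textDesc -->
  <memberOf key="model.textDescPart"/>
 </classes>
 <content>
  <rng:text xmlns:rng="http://relaxng.org/ns/structure/1.0"/>
 </content>
</elementSpec>
```

Placing the new element in a namespace of its own keeps it clearly distinct from the elements defined by these Guidelines.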

By default, a text description will contain each of the above elements, supplied in the order specified. Except for the purpose element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content ‘not applicable’ or an equivalent phrase.

Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of text-types (or the equally impossible one of enumerating all those which have been proposed previously), the closed set of situational parameters described above can be used in combination to supply useful distinguishing descriptive features of individual texts, without insisting on a system of discrete high-level text-types. Such text-types may however be used in combination with the parameters proposed here, with the advantage that the internal structure of each such text-type can be specified in terms of the parameters proposed. This approach has the following analytical advantages (see note 48):
  • it enables a relatively continuous characterization of texts (in contrast to discrete categories based on type or topic)
  • it enables meaningful comparisons across corpora
  • it allows analysts to build and compare their own text-types based on the particular parameters of interest to them
  • it is equally applicable to spoken and written texts

Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section 2.4.3 Classification, and elements for defining or declaring such classification schemes in section 2.3.6 Classification Declaration. A second approach is to develop an application-specific set of feature structures and an associated feature system declaration, as described in chapter 18 Feature Structures and section 18.11 Feature System Declaration.

Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a text-type in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a taxonomy. The mechanisms described in section 2.3.6 Classification Declaration may be used to define hierarchic taxonomies of such text-types, provided that the catDesc component of the category element contains a textDesc element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in section 2.4.3 Classification.
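
A minimal sketch of this arrangement follows. The identifiers texttypes and tt.conv are invented for the purpose, and the situational parameters themselves are represented only by a comment:

```xml
<encodingDesc>
 <classDecl>
  <taxonomy xml:id="texttypes">
   <category xml:id="tt.conv">
    <catDesc>
     <textDesc n="Informal domestic conversation">
<!-- values for the situational parameters -->
     </textDesc>
    </catDesc>
   </category>
  </taxonomy>
 </classDecl>
</encodingDesc>
<!-- ... in the header of each text of this type ... -->
<profileDesc>
 <textClass>
  <catRef target="#tt.conv"/>
 </textClass>
</profileDesc>
```

Defining the text-type once and pointing to it from each text avoids repeating the same set of parameter values in every header.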

Using these situational parameters, an informal domestic conversation might be characterized as follows:
<textDesc n="Informal domestic conversation">
 <channel mode="s">informal face-to-face conversation</channel>
 <constitution type="single">each text represents a continuously
   recorded interaction among the specified participants
 </constitution>
 <derivation type="original"/>
 <domain type="domestic">plans for coming week, local affairs</domain>
 <factuality type="mixed">mostly factual, some jokes</factuality>
 <interaction type="complete" active="plural" passive="many"/>
 <preparedness type="spontaneous"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>
The following example demonstrates how the same situational parameters might be used to characterize a novel:
<textDesc n="novel">
 <channel mode="w">print; part issues</channel>
 <constitution type="single"/>
 <derivation type="original"/>
 <domain type="art"/>
 <factuality type="fiction"/>
 <interaction type="none"/>
 <preparedness type="prepared"/>
 <purpose type="entertain" degree="high"/>
 <purpose type="inform" degree="medium"/>
</textDesc>

15.2.2 The Participant Description

The particDesc element in the profileDesc element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the namesdates module described in 13 Names, Dates, People, and Places are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a who attribute.

It should be noted that although the terms speaker or participant are used throughout this section, it is intended that the same mechanisms may be used to characterize fictional personæ or ‘voices’ within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written and spoken texts.

The element particDesc contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual list and name elements, or alternatively using the more specific and detailed listPerson element provided by the namesdates module described in 13 Names, Dates, People, and Places.

For example, a participant in a recorded conversation might be described informally as follows:
<particDesc xml:id="p2">
 <p>Female informant, well-educated, born in Shropshire UK, 12 Jan
   1950, of unknown occupation. Speaks French fluently.
   Socio-Economic status B2 in the PEP classification scheme.</p>
</particDesc>
Alternatively, when the namesdates module is included in a schema, information about the same participant might be provided in a more structured way as follows:
<person sex="2" age="mid">
 <birth when="1950-01-12">
  <date>12 Jan 1950</date>
  <name type="place">Shropshire, UK</name>
 </birth>
 <langKnowledge tags="en fr">
  <langKnown level="first" tag="en">English</langKnown>
  <langKnown tag="fr">French</langKnown>
 </langKnowledge>
 <residence>Long term resident of Hull</residence>
 <education>University postgraduate</education>
 <occupation>Unknown</occupation>
 <socecStatus scheme="#pep" code="#b2"/>
</person>
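To make use of the identifying code mentioned above, the person element may be given an xml:id and placed within a listPerson inside the particDesc. The following sketch assumes that the module for transcribed speech is also included in the schema, so that the u element is available; the identifier inf2 and the content of the utterance are invented for illustration:

```xml
<particDesc>
 <listPerson>
  <person xml:id="inf2" sex="2" age="mid">
   <occupation>Unknown</occupation>
  </person>
 </listPerson>
</particDesc>
<!-- ... elsewhere, in the transcription ... -->
<u who="#inf2">Shall we go on Saturday, then?</u>
```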
An identified character in a drama or a novel may also be regarded as a participant in this sense, and encoded using the same techniques (see note 49):
<particDesc>
 <p>The chief speaking characters in this novel are
  <list>
   <item xml:id="EMWOO">
    <name>Emma Woodhouse</name>
   </item>
   <item xml:id="DARCY">
    <name>Mr Darcy</name>
   </item>
<!-- ... -->
  </list>
 </p>
</particDesc>
Here, the characters are simply listed without the detailed structure which use of the listPerson element permits.

15.2.3 The Setting Description

The settingDesc element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.

Each distinct setting is described by means of a setting element.
  • setting describes one particular setting in which a language interaction takes place.
Individual settings may be associated with particular participants by means of the optional who attribute, which this element inherits as a member of the att.ascribed class; this is useful if, for example, participants are in different places. This attribute identifies one or more individual participants or participant groups, as discussed earlier in section 15.2.2 The Participant Description. If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. Note however that it is not possible to encode different settings for the same participant: a participant is deemed to be a person within a specific setting.
The setting element may contain either a prose description or a selection of elements from the classes model.nameLike.agent, model.dateLike, or model.settingPart. By default, when the module defined by this chapter is included in a schema, these classes thus provide the following elements:
  • name (name, proper noun) contains a proper noun or noun phrase.
    type indicates the type of the object which is being named by the phrase.
  • date contains a date in any format.
  • time contains a phrase defining a time of day in any format.
  • locale contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench, etc.
  • activity contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
Additional more specific naming elements such as orgName or persName may also be available if the namesdates module is also included in the schema.
The following example demonstrates the kind of background information often required to support transcriptions of language interactions, first encoded as a simple prose narrative:
<settingDesc>
 <p>The time is early spring, 1989. P1 and P2 are playing on the rug
   of a suburban home in Bedford. P3 is doing the washing up at the
   sink. P4 (a radio announcer) is in a broadcasting studio in
   London.</p>
</settingDesc>
The same information might be represented more formally in the following way:
<settingDesc>
 <setting who="#p1 #p2">
  <name type="city">Bedford</name>
  <name type="region">UK: South East</name>
  <date>early spring, 1989</date>
  <locale>rug of a suburban home</locale>
  <activity>playing</activity>
 </setting>
 <setting who="#p3">
  <name type="city">Bedford</name>
  <name type="region">UK: South East</name>
  <date>early spring, 1989</date>
  <locale>at the sink</locale>
  <activity>washing-up</activity>
 </setting>
 <setting who="#p4">
  <name type="place">London, UK</name>
  <time>unknown</time>
  <locale>broadcasting studio</locale>
  <activity>radio performance</activity>
 </setting>
</settingDesc>
Again, a more detailed encoding for places is feasible if the namesdates module is included in the schema. The above examples assume that only the general purpose name element supplied in the core module is available.

15.3 Associating Contextual Information with a Text

This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter 2 TEI Header. Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.

The TEI scheme allows for the following possibilities:
  • A given element may appear in the corpus header only, in the header of one or more texts only, or in both places.
  • There may be multiple occurrences of certain elements in either corpus or text header.

To simplify the exposition, we deal with these two possibilities separately in what follows; however, they may be combined as desired.

15.3.1 Combining Corpus and Text Headers

A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.

The titleStmt for a corpus text is understood to be prefixed by the titleStmt given in the corpus header. All other optional elements of the fileDesc should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.

For example, the following schematic shows the structure of a corpus comprising three texts, the first and last of which share the same encoding declaration. The second one has its own encoding declaration:
<teiCorpus>
 <teiHeader>
  <fileDesc>
<!-- corpus file description-->
  </fileDesc>
  <encodingDesc>
<!-- default encoding description -->
  </encodingDesc>
  <revisionDesc>
<!-- corpus revision description -->
  </revisionDesc>
 </teiHeader>
 <TEI>
  <teiHeader>
   <fileDesc>
<!-- file description for this corpus text -->
   </fileDesc>
  </teiHeader>
  <text>
<!-- first corpus text -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
   <fileDesc>
<!-- file description for this corpus text -->
   </fileDesc>
   <encodingDesc>
<!-- encoding description for this corpus text, over-riding the default -->
   </encodingDesc>
  </teiHeader>
  <text>
<!-- second corpus text -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
   <fileDesc>
<!-- file description for third corpus text -->
   </fileDesc>
  </teiHeader>
  <text>
<!-- third corpus text -->
  </text>
 </TEI>
</teiCorpus>
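
The prefixing rule for titleStmt can be made more concrete with a small sketch. The corpus, its titles, and all prose content below are invented for illustration. On the rule stated above, the second header is read as if its title were preceded by the corpus title, and the editionStmt given once in the corpus header is understood to apply to the text as well because the text header does not repeat it.

```xml
<teiCorpus>
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>A Hypothetical Corpus of Modern Letters</title>
   </titleStmt>
   <editionStmt>
    <p>First release</p>
   </editionStmt>
   <publicationStmt>
    <p>Unpublished illustration</p>
   </publicationStmt>
   <sourceDesc>
    <p>Transcribed from private correspondence</p>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <TEI>
  <teiHeader>
   <fileDesc>
    <titleStmt>
     <title>Letter 1</title>
    </titleStmt>
    <!-- no editionStmt here: the one in the corpus header applies -->
    <publicationStmt>
     <p>Unpublished illustration</p>
    </publicationStmt>
    <sourceDesc>
     <p>Single autograph letter</p>
    </sourceDesc>
   </fileDesc>
  </teiHeader>
  <text>
   <body>
    <p> ... </p>
   </body>
  </text>
 </TEI>
</teiCorpus>
```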

15.3.2 Declarable Elements

Certain of the elements which can appear within a TEI Header are known as declarable elements. These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a decls attribute on that element. This linkage is used to override the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable.

Declarable elements are all members of the class att.declarable; the corresponding declaring elements are all members of the class att.declaring.
  • att.declarable provides attributes for those elements in the TEI Header which may be independently selected by means of the special purpose decls attribute.
    default: indicates whether or not this element is selected by default when its parent is selected.
  • att.declaring provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.
    decls: identifies one or more declarable elements within the header, which are understood to apply to the element bearing this attribute and its content.
An alphabetically ordered list of declarable elements follows:
  • bibl (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
  • biblFull (fully-structured bibliographic citation) contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.
  • biblStruct (structured bibliographic citation) contains a structured bibliographic citation, in which only bibliographic subelements appear and in a specified order.
  • broadcast describes a broadcast used as the source of a spoken text.
  • correction (correction principles) states how and under what circumstances corrections have been made in the text.
  • editorialDecl (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
  • equipment provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text.
  • hyphenation summarizes the way in which hyphenation in a source text has been treated in an encoded version of it.
  • interpretation describes the scope of any analytic or interpretive information added to the text in addition to the transcription.
  • langUsage (language usage) describes the languages, sublanguages, registers, dialects etc. represented within a text.
  • listBibl (citation list) contains a list of bibliographic citations of any kind.
  • normalization indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form.
  • particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
  • projectDesc (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
  • quotation specifies editorial practice adopted with respect to quotation marks in the original.
  • recording (recording event) details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
  • samplingDecl (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
  • scriptStmt (script statement) contains a citation giving details of the script used for a spoken text.
  • segmentation describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
  • sourceDesc (source description) supplies a description of the source text(s) from which an electronic text was derived or generated.
  • stdVals (standard values) specifies the format used when standardized date or number values are supplied.
  • textClass (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
  • textDesc (text description) provides a description of a text in terms of its situational parameters.
All of the above elements may be multiply defined within a single header, that is, there may be more than one instance of any declarable element type at a given level. When this occurs, the following rules apply:
  • every declarable element must bear a unique identifier
  • for each different type of declarable element which occurs more than once within the same parent element, exactly one element must be specified as the default, by means of the default attribute
In the following example, an editorial declaration contains two possible correction policies, one identified as CorPol1 and the other as CorPol2. Since there are two, one of them (in this case CorPol1) must be specified as the default:
<editorialDecl>
 <correction xml:id="CorPol1" default="true">
  <p> ... </p>
 </correction>
 <correction xml:id="CorPol2">
  <p> ... </p>
 </correction>
 <normalization xml:id="n1">
  <p> ... </p>
  <p> ... </p>
 </normalization>
</editorialDecl>
For texts associated with the header in which this declaration appears, correction method CorPol1 will be assumed, unless they explicitly state otherwise. Here is the structure for a text which does state otherwise:
<text>
 <body>
  <div1 n="d1"/>
  <div1 n="d2" decls="#CorPol2"/>
  <div1 n="d3"/>
 </body>
</text>
In this case, the contents of the divisions d1 and d3 will both use correction policy CorPol1, and those of division d2 will use correction policy CorPol2.

The decls attribute is defined for any element which is a member of the class att.declaring. This includes the major structural elements text, group, and div, as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.

The identifier or identifiers specified by the decls attribute are subject to two further restrictions:
  • An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements.
  • Each element specified, explicitly or implicitly, by the list of identifiers must be of a different type.
To demonstrate how these rules operate, we now expand our earlier example slightly:
<encodingDesc>
 <editorialDecl xml:id="ED1" default="true">
  <correction xml:id="C1A" default="true">
   <p> ... </p>
  </correction>
  <correction xml:id="C1B">
   <p> ... </p>
  </correction>
  <normalization xml:id="N1">
   <p> ... </p>
   <p> ... </p>
  </normalization>
 </editorialDecl>
 <editorialDecl xml:id="ED2">
  <correction xml:id="C2A" default="true">
   <p> ... </p>
  </correction>
  <correction xml:id="C2B">
   <p> ... </p>
  </correction>
  <normalization xml:id="N2A">
   <p> ... </p>
  </normalization>
  <normalization xml:id="N2B" default="true">
   <p> ... </p>
  </normalization>
 </editorialDecl>
</encodingDesc>

This encoding description now has two editorial declarations, identified as ED1 (the default) and ED2. For texts not specifying otherwise, ED1 will apply. If ED1 applies, correction method C1A and normalization method N1 apply: C1A is the specified default among the corrections, and N1 is the only normalization within ED1. In the same way, for a text specifying decls as ‘#ED2’, correction C2A and normalization N2B will apply.

A finer-grained approach is also possible. A text might specify <text decls="#C2B #N2A"> to ‘mix and match’ declarations as required. A tag such as <text decls="#ED1 #ED2"> would (obviously) be illegal, since it includes two elements of the same type; a tag such as <text decls="#ED2 #C1A"> is also illegal, since in this context ED2 is synonymous with the defaults for that editorial declaration, namely C2A and N2B, resulting in a list that identifies two correction elements (C1A and C2A).
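
The three cases just described can be set side by side as follows. This is only a sketch written against the ED1 and ED2 declarations in the expanded example; the empty div1 is a placeholder, and the two rejected start-tags are given inside comments so that the fragment itself stays well-formed.

```xml
<!-- legal: one correction (C2B) and one normalization (N2A) -->
<text decls="#C2B #N2A">
 <body>
  <div1 n="d1"/>
 </body>
</text>
<!-- illegal: <text decls="#ED1 #ED2"> names two editorialDecl elements -->
<!-- illegal: <text decls="#ED2 #C1A"> expands to C2A, N2B and C1A,
     which identifies two correction elements -->
```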

15.3.3 Summary

The rules determining which of the declarable elements are applicable at any point may be summarized as follows:
  1. If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.
  2. If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.
  3. Where there are multiple occurrences of declarable elements within either corpus or text header,
    • each must have a unique value specified as the value of its xml:id attribute;
    • one only must bear a default attribute with the value true.
  4. It is a semantic error for an element to be associated with more than one occurrence of any declarable element.
  5. Selecting an element which contains multiple occurrences of a given declarable element is semantically equivalent to selecting only those contained elements which are specified as defaults.
  6. An association made by one element applies by default to all of its descendants.
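
A schematic corpus illustrating rules 1, 3, and 6 might look like the following. It is a sketch only: the identifiers SD1 and SD2 are invented, and the choice of projectDesc and samplingDecl as the declarable elements is arbitrary.

```xml
<teiCorpus>
 <teiHeader>
  <fileDesc>
<!-- corpus file description -->
  </fileDesc>
  <encodingDesc>
   <!-- rule 1: a single projectDesc applies to the whole corpus -->
   <projectDesc>
    <p> ... </p>
   </projectDesc>
   <!-- rule 3: two samplingDecl elements, each with its own xml:id,
        exactly one of them marked as the default -->
   <samplingDecl xml:id="SD1" default="true">
    <p> ... </p>
   </samplingDecl>
   <samplingDecl xml:id="SD2">
    <p> ... </p>
   </samplingDecl>
  </encodingDesc>
 </teiHeader>
 <TEI>
  <teiHeader>
   <fileDesc>
<!-- file description for first corpus text -->
   </fileDesc>
  </teiHeader>
  <!-- no decls attribute: the default, SD1, applies -->
  <text>
<!-- first corpus text -->
  </text>
 </TEI>
 <TEI>
  <teiHeader>
   <fileDesc>
<!-- file description for second corpus text -->
   </fileDesc>
  </teiHeader>
  <!-- rule 6: SD2 applies to this text and to all of its descendants -->
  <text decls="#SD2">
   <body>
    <div1 n="d1"/>
   </body>
  </text>
 </TEI>
</teiCorpus>
```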

15.4 Linguistic Annotation of Corpora

Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or ‘tagging’); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines. The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.

15.4.1 Levels of Analysis

By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters 3 Core Module and 4 Text Structure Module. The contextual properties of a TEI text are fully documented in the TEI Header, which is discussed in chapter 2 TEI Header, and in section 15.2 Contextual Information of the present chapter.

Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
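
As a minimal sketch of the simplest of these cases, the fragment below attaches a word-class code to each token. It assumes the s and w elements of the TEI analysis module, which are not described in this chapter, and uses the type attribute to carry the code. The codes themselves (AT0, NN1, VVD) are purely illustrative; a project would document its own code set in the header.

```xml
<s>
 <!-- one w element per token, each carrying a word-class code -->
 <w type="AT0">The</w>
 <w type="NN1">cat</w>
 <w type="VVD">sat</w>
</s>
```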

The manner in which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the interpretation element within the encoding description of the TEI Header, as described in section 2.3.3 Editorial Practice Declaration. Where different parts of a corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section 15.3 Associating Contextual Information with a Text.
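
For instance, a text whose annotation was checked by hand might be distinguished from automatically annotated material along the following lines. This is a sketch under stated assumptions: the identifiers INT-AUTO and INT-MANUAL and the prose descriptions are invented, and the header is reduced to the parts relevant here.

```xml
<TEI>
 <teiHeader>
  <fileDesc>
<!-- file description -->
  </fileDesc>
  <encodingDesc>
   <editorialDecl>
    <!-- two annotation methods; the automatic one is the default -->
    <interpretation xml:id="INT-AUTO" default="true">
     <p>Word-class codes were assigned automatically and not checked.</p>
    </interpretation>
    <interpretation xml:id="INT-MANUAL">
     <p>Word-class codes were assigned automatically and then corrected by hand.</p>
    </interpretation>
   </editorialDecl>
  </encodingDesc>
 </teiHeader>
 <!-- this text overrides the default annotation method -->
 <text decls="#INT-MANUAL">
  <body>
   <p> ... </p>
  </body>
 </text>
</TEI>
```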

An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section 17.4 Linguistic Annotation.

15.5 Recommendations for the Encoding of Large Corpora

These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter iv About These Guidelines. For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in chapter 23.2 Personalization and Customization; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter 23.3 Conformance.
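
By way of illustration, a customization that selects a handful of modules, excludes one element type, and renames another might be sketched as below. The notation (schemaSpec, moduleRef, elementSpec, altIdent) belongs to the TEI customization mechanism rather than to this chapter, and the schema name myCorpus, the choice of modules, and the particular elements deleted or renamed are all hypothetical.

```xml
<schemaSpec ident="myCorpus" start="teiCorpus TEI">
 <!-- modules the hypothetical project needs -->
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <moduleRef key="corpus"/>
 <!-- exclude an element type the project never encodes -->
 <elementSpec ident="sp" module="core" mode="delete"/>
 <!-- change the name of an existing element -->
 <elementSpec ident="p" module="core" mode="change">
  <altIdent>para</altIdent>
 </elementSpec>
</schemaSpec>
```

Whether renaming or deleting elements in this way affects conformance is a separate question, addressed in the conformance chapter cited above.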

Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following three categories:
required
texts included within the corpus will always encode textual features in this category, should they exist in the text.
recommended
textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note in the header should be made.
optional
textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.

15.6 Corpus Module

The module described in this chapter makes available the following components: The selection and combination of modules to form a TEI schema is described in 1.2 Defining a TEI Schema.


Notes
48.
Schemes similar to that proposed here were developed in the 1960s and 1970s by researchers such as Hymes, Halliday, and Crystal and Davy, but have rarely been implemented; one notable exception being the pioneering work on the Helsinki Diachronic Corpus of English, on which see Kytö and Rissanen (1988).
49.
It is particularly useful to define participants in a dramatic text in this way, since it enables the who attribute to be used to link sp elements to definitions for their speakers; see further section 7.2.2 Speeches and Speakers.


Copyright TEI Consortium 2007. Licensed under the GPL. Copying and redistribution is permitted and encouraged.
Version 1.0.