15 Language Corpora
The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (written, spoken, or a mixture of the two), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes, however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.
These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed ‘core’ language. A corpus may be made up of whole texts or of fragments or text samples. It may be a ‘closed’ corpus, or an ‘open’ or ‘monitor’ corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section 15.5 Recommendations for the Encoding of Large Corpora). For simplicity, therefore, our discussion largely concerns ways of encoding closed corpora, regarded as single but composite texts.
Language corpora are regarded by these Guidelines as composite texts rather than unitary texts (on this distinction, see chapter 4 Default Text Structure). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.
Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter vi Languages and Character Sets); nor do we attempt to summarize the variety of elements provided for encoding basic structural features such as quoted or highlighted phrases, cross references, lists, notes, editorial changes and reference systems (see chapter 3 Elements Available in All TEI Documents). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter 6 Verse), drama (chapter 7 Performance Texts), transcriptions of spoken text (chapter 8 Transcriptions of Speech), etc. Chapter 1 The TEI Infrastructure should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.
This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section 15.1 Varieties of Composite Text). Section 15.2 Contextual Information describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section 15.3 Associating Contextual Information with a Text discusses a mechanism by which individual parts of the TEI Header may be associated with different parts of a TEI-conformant text. Section 15.4 Linguistic Annotation of Corpora reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section 15.5 Recommendations for the Encoding of Large Corpora provides some general recommendations about the use of these Guidelines in the building of large corpora.
15.1 Varieties of Composite Text
- teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
- TEI (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
- teiHeader (TEI Header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
  type specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text.
- text contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
- group contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.
Composite texts may be of several kinds:
- language corpora
- collections or anthologies
- poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
- otherwise unitary texts, within which one or more subordinate texts are embedded
In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
<teiCorpus>
<teiHeader type="corpus"/>
<TEI>
<teiHeader type="text"/>
<text/>
</TEI>
<TEI>
<teiHeader type="text"/>
<text/>
</TEI>
</teiCorpus>
Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the teiHeader element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section 15.2 Contextual Information, as well as in chapter 2 The TEI Header. Information of this type should in general be specified only once: a variety of methods are provided for associating it with individual components of a corpus, as further described in section 15.3 Associating Contextual Information with a Text.
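The two-level header structure can be checked mechanically. The following Python fragment is a minimal sketch rather than part of these Guidelines: it assumes a miniature corpus written without the TEI namespace (real documents declare it), and the name header_levels is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus. The TEI namespace is omitted here
# (an assumption made only to keep the sketch short).
CORPUS = """
<teiCorpus>
  <teiHeader type="corpus"/>
  <TEI><teiHeader type="text"/><text/></TEI>
  <TEI><teiHeader type="text"/><text/></TEI>
</teiCorpus>
"""

def header_levels(xml_source):
    """Return (number of corpus headers, number of text headers)."""
    root = ET.fromstring(xml_source)
    corpus_headers = root.findall("teiHeader")    # direct children only
    text_headers = root.findall("TEI/teiHeader")  # one per component text
    return len(corpus_headers), len(text_headers)

print(header_levels(CORPUS))  # prints (1, 2)
```

A corpus organized as described above would be expected to report exactly one corpus-level header and one header per component text.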
In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The teiCorpus element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged TEI.
If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the group element described below and in section 4.3.1 Grouped Texts. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section 15.2.1 The Text Description, and those for associating declarations with corpus components described in section 15.3 Associating Contextual Information with a Text. These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.
Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.
Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.
The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Default Text Structure.
Some composite texts, finally, are neither corpora, nor anthologies, nor cyclic works: they are otherwise unitary texts within which other texts are embedded. In general, they may be treated in the same way as unitary texts, using the normal TEI and body elements. The embedded text itself may be encoded using the text element, which may occur within quotations or between paragraphs or other chunk-level elements inside the sections of a larger text. For further discussion, see chapter 4 Default Text Structure.
All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts can be encoded using the same module, then no problem arises. If however they require different modules, then these must be included in the schema. This process is described in more detail in section 1.1 TEI Modules.
15.2 Contextual Information
Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).
Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI Header, as described in chapter 2 The TEI Header. In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section 15.3 Associating Contextual Information with a Text below.
Chapter 2 The TEI Header, which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section 2.2 The File Description); information about the encoding practices followed with the corpus, for example its design principles, editorial practices, reference system, etc. (see section 2.3 The Encoding Description); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section 2.4 The Profile Description); and version information documenting any changes made in the electronic text (see section 2.5 The Revision Description).
In addition to the elements defined by chapter 2 The TEI Header, several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being ‘for language corpora’.
- textDesc (text description) provides a description of a text in terms of its situational parameters.
- particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
- settingDesc (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.
15.2.1 The Text Description
- channel (primary channel) describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc.
  mode specifies the mode of this channel with respect to speech and writing.
- constitution describes the internal composition of a text or text sample, for example as fragmentary, complete, etc.
  type specifies how the text was constituted.
- derivation describes the nature and extent of originality of this text.
  type categorizes the derivation of the text.
- domain (domain of use) describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc.
  type categorizes the domain of use.
- factuality describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world.
  type categorizes the factuality of the text.
- interaction describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary etc.
  type specifies the degree of interaction between active and passive participants in the text.
  active specifies the number of active participants (or addressors) producing parts of the text.
  passive specifies the number of passive participants (or addressees) to whom a text is directed or in whose presence it is created or performed.
- preparedness describes the extent to which a text may be regarded as prepared or spontaneous.
  type a keyword characterizing the type of preparedness.
- purpose characterizes a single purpose or communicative function of the text.
  type specifies a particular kind of purpose.
  degree specifies the extent to which this purpose predominates.
These elements constitute a model class called model.textDescPart; new parameters may be defined by defining new elements and adding them to that class, as further described in 23.2 Personalization and Customization.
By default, a text description will contain each of the above elements, supplied in the order specified. Except for the purpose element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content ‘not applicable’ or an equivalent phrase.
Describing texts in terms of situational parameters has several advantages:
- it enables a relatively continuous characterization of texts (in contrast to discrete categories based on type or topic)
- it enables meaningful comparisons across corpora
- it allows analysts to build and compare their own text-types based on the particular parameters of interest to them
- it is equally applicable to spoken and written texts
Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section 2.4.3 The Text Classification, and elements for defining or declaring such classification schemes in section 2.3.6 The Classification Declaration. A second approach is to develop an application-specific set of feature structures and an associated feature system declaration, as described in chapter 18 Feature Structures and section 18.11 Feature System Declaration.
Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a text-type in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a taxonomy. The mechanisms described in section 2.3.6 The Classification Declaration may be used to define hierarchic taxonomies of such text-types, provided that the catDesc component of the category element contains a textDesc element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in section 2.4.3 The Text Classification.
<textDesc>
<channel mode="s">informal face-to-face conversation</channel>
<constitution type="single">each text represents a continuously
recorded interaction among the specified participants
</constitution>
<derivation type="original"/>
<domain type="domestic">plans for coming week, local affairs</domain>
<factuality type="mixed">mostly factual, some jokes</factuality>
<interaction type="complete" active="plural" passive="many"/>
<preparedness type="spontaneous"/>
<purpose type="entertain" degree="high"/>
<purpose type="inform" degree="medium"/>
</textDesc>
<textDesc>
<channel mode="w">print; part issues</channel>
<constitution type="single"/>
<derivation type="original"/>
<domain type="art"/>
<factuality type="fiction"/>
<interaction type="none"/>
<preparedness type="prepared"/>
<purpose type="entertain" degree="high"/>
<purpose type="inform" degree="medium"/>
</textDesc>
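Because the situational parameters are carried mostly by attributes, they are easy to extract for use as selection criteria. The Python fragment below is a hedged sketch under stated assumptions: the sample is an abbreviated, un-namespaced copy of the first description above, and situational_parameters is a hypothetical helper name, not something defined by these Guidelines.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of a text description; the TEI namespace is
# omitted (an assumption made for brevity).
TEXT_DESC = """
<textDesc>
  <channel mode="s">informal face-to-face conversation</channel>
  <constitution type="single"/>
  <derivation type="original"/>
  <domain type="domestic"/>
  <factuality type="mixed"/>
  <interaction type="complete" active="plural" passive="many"/>
  <preparedness type="spontaneous"/>
  <purpose type="entertain" degree="high"/>
  <purpose type="inform" degree="medium"/>
</textDesc>
"""

def situational_parameters(xml_source):
    """Map each parameter name to its attributes.

    purpose is the only repeatable parameter, so it is collected
    as a list of (type, degree) pairs.
    """
    params = {}
    for child in ET.fromstring(xml_source):
        if child.tag == "purpose":
            pair = (child.get("type"), child.get("degree"))
            params.setdefault("purpose", []).append(pair)
        else:
            params[child.tag] = dict(child.attrib)
    return params

PARAMS = situational_parameters(TEXT_DESC)
print(PARAMS["channel"], PARAMS["purpose"])
```

A corpus tool could compare such dictionaries across texts, for example to select all spontaneous spoken samples.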
15.2.2 The Participant Description
The particDesc element in the profileDesc element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the namesdates module described in 13 Names, Dates, People, and Places are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a who attribute.
It should be noted that although the terms speaker or participant are used throughout this section, it is intended that the same mechanisms may be used to characterize fictional personæ or ‘voices’ within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written and spoken texts.
The element particDesc contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual list and name elements, or alternatively using the more specific and detailed listPerson element provided by the namesdates module described in 13 Names, Dates, People, and Places.
<particDesc>
<p>Female informant, well-educated, born in Shropshire UK, 12 Jan
1950, of unknown occupation. Speaks French fluently.
Socio-Economic status B2 in the PEP classification scheme.</p>
</particDesc>
<person>
<birth when="1950-01-12">
<date>12 Jan 1950</date>
<name type="place">Shropshire, UK</name>
</birth>
<langKnowledge tags="en fr">
<langKnown level="first" tag="en">English</langKnown>
<langKnown tag="fr">French</langKnown>
</langKnowledge>
<residence>Long term resident of Hull</residence>
<education>University postgraduate</education>
<occupation>Unknown</occupation>
<socecStatus scheme="#pep" code="#b2"/>
</person>
<particDesc>
<p>The chief speaking characters in this novel are
<list>
<item xml:id="EMWOO">
<name>Emma Woodhouse</name>
</item>
<item xml:id="DARCY">
<name>Mr Darcy</name>
</item>
<!-- ... -->
</list>
</p>
</particDesc>
15.2.3 The Setting Description
The settingDesc element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.
- setting describes one particular setting in which a language interaction takes place.
- name (name, proper noun) contains a proper noun or noun phrase.
  type indicates the type of the object which is being named by the phrase.
- date contains a date in any format.
- time contains a phrase defining a time of day in any format.
- locale contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench etc.
- activity contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
<settingDesc>
<p>The time is early spring, 1989. P1 and P2 are playing on the rug
of a suburban home in Bedford. P3 is doing the washing up at the
sink. P4 (a radio announcer) is in a broadcasting studio in
London.</p>
</settingDesc>
<settingDesc>
<setting who="#p1 #p2">
<name type="city">Bedford</name>
<name type="region">UK: South East</name>
<date>early spring, 1989</date>
<locale>rug of a suburban home</locale>
<activity>playing</activity>
</setting>
<setting who="#p3">
<name type="city">Bedford</name>
<name type="region">UK: South East</name>
<date>early spring, 1989</date>
<locale>at the sink</locale>
<activity>washing-up</activity>
</setting>
<setting who="#p4">
<name type="place">London, UK</name>
<time>unknown</time>
<locale>broadcasting studio</locale>
<activity>radio performance</activity>
</setting>
</settingDesc>
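The who attribute on setting links each setting to the participants it concerns. As an illustration only, the following Python sketch builds a lookup from participant identifier to locale; it assumes an abbreviated, un-namespaced copy of the structured description above, and locale_by_participant is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of a structured setting description, without the
# TEI namespace (an assumption made for brevity).
SETTING_DESC = """
<settingDesc>
  <setting who="#p1 #p2">
    <locale>rug of a suburban home</locale>
    <activity>playing</activity>
  </setting>
  <setting who="#p3">
    <locale>at the sink</locale>
    <activity>washing-up</activity>
  </setting>
  <setting who="#p4">
    <locale>broadcasting studio</locale>
    <activity>radio performance</activity>
  </setting>
</settingDesc>
"""

def locale_by_participant(xml_source):
    """Map each participant identifier to the locale of its setting."""
    result = {}
    for setting in ET.fromstring(xml_source).findall("setting"):
        locale = setting.findtext("locale")
        # who holds a space-separated list of pointers such as "#p1 #p2"
        for pointer in setting.get("who", "").split():
            result[pointer.lstrip("#")] = locale
    return result

print(locale_by_participant(SETTING_DESC)["p3"])  # prints at the sink
```

The same pattern could be applied to activity or date, depending on which aspect of the setting an analysis needs.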
15.3 Associating Contextual Information with a Text
This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter 2 The TEI Header. Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.
Two possibilities arise:
- A given element may appear in the corpus header only, in the header of one or more texts only, or in both places
- There may be multiple occurrences of certain elements in either corpus or text header.
To simplify the exposition, we deal with these two possibilities separately in what follows; however, they may be combined as desired.
15.3.1 Combining Corpus and Text Headers
A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.
The titleStmt for a corpus text is understood to be prefixed by the titleStmt given in the corpus header. All other optional elements of the fileDesc should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once and for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.
<teiCorpus>
<teiHeader>
<fileDesc>
<!-- corpus file description-->
</fileDesc>
<encodingDesc>
<!-- default encoding description -->
</encodingDesc>
<revisionDesc>
<!-- corpus revision description -->
</revisionDesc>
</teiHeader>
<TEI>
<teiHeader>
<fileDesc>
<!-- file description for this corpus text -->
</fileDesc>
</teiHeader>
<text>
<!-- first corpus text -->
</text>
</TEI>
<TEI>
<teiHeader>
<fileDesc>
<!-- file description for this corpus text -->
</fileDesc>
<encodingDesc>
<!-- encoding description for this corpus text, over-riding the default -->
</encodingDesc>
</teiHeader>
<text>
<!-- second corpus text -->
</text>
</TEI>
<TEI>
<teiHeader>
<fileDesc>
<!-- file description for third corpus text -->
</fileDesc>
</teiHeader>
<text>
<!-- third corpus text -->
</text>
</TEI>
</teiCorpus>
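The override rule illustrated by this corpus can be sketched in Python. This is a simplified illustration, not a normative algorithm: it works only at the level of the major header components, ignores the special prefixing behaviour of titleStmt, omits the TEI namespace, and uses the n attribute merely as a hypothetical label to show where each component comes from.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; header components are left empty and
# labelled with n only so that their origin is visible in the result.
CORPUS = """
<teiCorpus>
  <teiHeader>
    <fileDesc n="corpus"/>
    <encodingDesc n="corpus default"/>
    <revisionDesc n="corpus"/>
  </teiHeader>
  <TEI>
    <teiHeader><fileDesc n="text 1"/></teiHeader>
    <text/>
  </TEI>
  <TEI>
    <teiHeader><fileDesc n="text 2"/><encodingDesc n="text 2"/></teiHeader>
    <text/>
  </TEI>
</teiCorpus>
"""

def effective_headers(xml_source):
    """Return, per text, the header component that applies to it."""
    root = ET.fromstring(xml_source)
    corpus_level = {el.tag: el.get("n") for el in root.find("teiHeader")}
    results = []
    for tei in root.findall("TEI"):
        merged = dict(corpus_level)  # inherit everything from the corpus
        # a component given in the text header over-rides the corpus one
        merged.update({el.tag: el.get("n") for el in tei.find("teiHeader")})
        results.append(merged)
    return results

for header in effective_headers(CORPUS):
    print(header["encodingDesc"])
```

Under these assumptions the first text falls back on the corpus-level encoding description, while the second uses its own.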
15.3.2 Declarable Elements
Certain of the elements which can appear within a TEI Header are known as declarable elements. These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a decls attribute on that element. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable.
- att.declarable provides attributes for those elements in the TEI Header which may be independently selected by means of the special purpose decls attribute.
  default indicates whether or not this element is selected by default when its parent is selected.
- att.declaring provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.
  decls identifies one or more declarable elements within the header, which are understood to apply to the element bearing this attribute and its content.
The following elements are declarable:
- bibl (bibliographic citation) contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- biblFull (fully-structured bibliographic citation) contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.
- biblStruct (structured bibliographic citation) contains a structured bibliographic citation, in which only bibliographic subelements appear and in a specified order.
- broadcast describes a broadcast used as the source of a spoken text.
- correction (correction principles) states how and under what circumstances corrections have been made in the text.
- editorialDecl (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
- equipment provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text.
- hyphenation summarizes the way in which hyphenation in a source text has been treated in an encoded version of it.
- interpretation describes the scope of any analytic or interpretive information added to the text in addition to the transcription.
- langUsage (language usage) describes the languages, sublanguages, registers, dialects etc. represented within a text.
- listBibl (citation list) contains a list of bibliographic citations of any kind.
- normalization indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form.
- particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
- projectDesc (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- quotation specifies editorial practice adopted with respect to quotation marks in the original.
- recording (recording event) details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
- samplingDecl (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- scriptStmt (script statement) contains a citation giving details of the script used for a spoken text.
- segmentation describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
- sourceDesc (source description) supplies a description of the source text(s) from which an electronic text was derived or generated.
- stdVals (standard values) specifies the format used when standardized date or number values are supplied.
- textClass (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
- textDesc (text description) provides a description of a text in terms of its situational parameters.
Where declarable elements are repeated, the following rules apply:
- every declarable element must bear a unique identifier
- for each different type of declarable element which occurs more than once within the same parent element, exactly one element must be specified as the default, by means of the default attribute
<editorialDecl>
<correction xml:id="CorPol1" default="true">
<p> ... </p>
</correction>
<correction xml:id="CorPol2">
<p> ... </p>
</correction>
<normalization xml:id="n1">
<p> ... </p>
<p> ... </p>
</normalization>
</editorialDecl>
<text>
<body>
<div1 n="d1"/>
<div1 n="d2" decls="#CorPol2"/>
<div1 n="d3"/>
</body>
</text>
The decls attribute is defined for any element which is a member of the class att.declaring. This includes the major structural elements text, group, and div, as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.
The identifiers supplied on the decls attribute are subject to the following rules:
- An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements
- Each element specified, explicitly or implicitly, by the list of identifiers must be of a different type.
<encodingDesc>
<editorialDecl xml:id="ED1" default="true">
<correction xml:id="C1A" default="true">
<p> ... </p>
</correction>
<correction xml:id="C1B">
<p> ... </p>
</correction>
<normalization xml:id="N1">
<p> ... </p>
<p> ... </p>
</normalization>
</editorialDecl>
<editorialDecl xml:id="ED2">
<correction xml:id="C2A" default="true">
<p> ... </p>
</correction>
<correction xml:id="C2B">
<p> ... </p>
</correction>
<normalization xml:id="N2A">
<p> ... </p>
</normalization>
<normalization xml:id="N2B" default="true">
<p> ... </p>
</normalization>
</editorialDecl>
</encodingDesc>
This encoding description now has two editorial declarations, identified as ED1 (the default) and ED2. For texts not specifying otherwise, ED1 will apply. If ED1 applies, correction method C1A and normalization method N1 apply, since these are the specified defaults within ED1. In the same way, for a text specifying decls as ‘#ED2’, correction C2A and normalization N2B will apply.
A finer-grained approach is also possible. A text might specify <text decls="#C2B #N2A"> to ‘mix and match’ declarations as required. A tag such as <text decls="#ED1 #ED2"> would (obviously) be illegal, since it includes two elements of the same type; a tag such as <text decls="#ED2 #C1A"> is also illegal, since in this context ED2 is synonymous with the defaults for that editorial declaration, namely C2A and N2B, resulting in a list that identifies two correction elements (C1A and C2A).
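As an illustration only, the legal and illegal selections just described might be written out as follows. The identifiers are those of the encoding description above; the pointer syntax (a hash sign followed by the identifier) is assumed from the earlier decls="#CorPol2" example, and the division content is a placeholder.

```xml
<!-- Legal: one correction and one normalization, each of a different type -->
<text decls="#C2B #N2A">
  <body>
    <!-- no decls here: the division inherits C2B and N2A from the enclosing text -->
    <div n="d1">
      <p> ... </p>
    </div>
  </body>
</text>

<!-- Not legal: both identifiers point at editorialDecl elements -->
<!-- <text decls="#ED1 #ED2"> -->

<!-- Not legal: #ED2 stands for its defaults C2A and N2B, so adding #C1A names two correction elements -->
<!-- <text decls="#ED2 #C1A"> -->
```

The two rejected tags are left as comments so that the fragment stays well-formed while still showing the forms to avoid.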
15.3.3 Summary
- If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.
- If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.
- Where there are multiple occurrences of declarable elements within either corpus or text header,
  - each must have a unique value specified as the value of its xml:id attribute;
  - exactly one must bear a default attribute with the value true.
- It is a semantic error for an element to be associated with more than one occurrence of any declarable element.
- Selecting an element which contains multiple occurrences of a given declarable element is semantically equivalent to selecting only those contained elements which are specified as defaults.
- An association made by one element applies by default to all of its descendants.
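The rules summarized above can be sketched in a small corpus skeleton. This is a hedged illustration rather than a complete or validated document: the identifiers edA and edB and the n values are invented for the example, and mandatory header content such as fileDesc is omitted for brevity.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- corpus header; fileDesc omitted for brevity -->
    <encodingDesc>
      <!-- multiple occurrences: unique xml:id values, exactly one default -->
      <editorialDecl xml:id="edA" default="true">
        <p> ... </p>
      </editorialDecl>
      <editorialDecl xml:id="edB">
        <p> ... </p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <!-- text header with no editorialDecl of its own -->
    </teiHeader>
    <text>
      <body>
        <!-- nothing selected: the corpus default edA applies -->
        <div n="t1d1">
          <p> ... </p>
        </div>
        <!-- edB is selected here and applies to all descendants of this div -->
        <div n="t1d2" decls="#edB">
          <p> ... </p>
        </div>
      </body>
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <encodingDesc>
        <!-- single occurrence in a text header: applies throughout this text,
             irrespective of the corpus header -->
        <editorialDecl>
          <p> ... </p>
        </editorialDecl>
      </encodingDesc>
    </teiHeader>
    <text>
      <body>
        <div n="t2d1">
          <p> ... </p>
        </div>
      </body>
    </text>
  </TEI>
</teiCorpus>
```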
15.4 Linguistic Annotation of Corpora
Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or ‘tagging’); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines (see note 50). The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.
15.4.1 Levels of Analysis
By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters 3 Core Module and 4 Text Structure Module. The contextual properties of a TEI text are fully documented in the TEI Header, which is discussed in chapter 2 The TEI Header, and in section 15.2 Contextual Information of the present chapter.
Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
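A minimal sketch of the first two cases (a code on each token, and a code on a group of tokens) is given below. It is only one of several possible encodings: it assumes the w, phr, s, interp, and interpGrp elements and the global ana attribute of the general analytic facilities, and the codes AT, NN, VBZ, and NP are hypothetical tags invented for the example.

```xml
<s>
  <!-- a code attached to a group of tokens -->
  <phr ana="#NP">
    <!-- a code attached to each individual token -->
    <w ana="#AT">The</w>
    <w ana="#NN">corpus</w>
  </phr>
  <w ana="#VBZ">grows</w>
</s>

<!-- the codes themselves, here treated as discrete non-decomposable categories -->
<interpGrp type="wordclass">
  <interp xml:id="AT">article</interp>
  <interp xml:id="NN">singular common noun</interp>
  <interp xml:id="VBZ">verb, third person singular present</interp>
</interpGrp>
<interpGrp type="phrase">
  <interp xml:id="NP">noun phrase</interp>
</interpGrp>
```

Codes representing bundles of features, or relationships between distinct parts of a text, would need other mechanisms and are not shown here.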
The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the interpretation element within the encoding description of the TEI Header, as described in section 2.3.3 The Editorial Practices Declaration. Where different parts of a corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section 15.3 Associating Contextual Information with a Text.
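By way of a hedged example, a corpus in which most texts were tagged automatically but some were also corrected by hand might document the two methods as follows. The identifiers posAuto and posChecked and the wording of the prose descriptions are assumptions made for the illustration; the pattern otherwise mirrors the correction example in section 15.3.

```xml
<encodingDesc>
  <editorialDecl>
    <!-- the default: applies wherever nothing else is selected -->
    <interpretation xml:id="posAuto" default="true">
      <p>Word-class codes were assigned automatically and have not been checked.</p>
    </interpretation>
    <interpretation xml:id="posChecked">
      <p>Word-class codes were assigned automatically and then corrected manually.</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>
<!-- ... -->
<text>
  <body>
    <!-- uses the default, posAuto -->
    <div n="d1">
      <p> ... </p>
    </div>
    <!-- this division was hand-corrected -->
    <div n="d2" decls="#posChecked">
      <p> ... </p>
    </div>
  </body>
</text>
```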
An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section 17.4 Linguistic Annotation.
15.5 Recommendations for the Encoding of Large Corpora
These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter iv About These Guidelines. For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in section 23.2 Personalization and Customization; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in section 23.3 Conformance.
- required: texts included within the corpus will always encode textual features in this category, should they exist in the text.
- recommended: textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note in the header should be made.
- optional: textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.
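The subsetting mentioned at the start of this section is usually expressed as a TEI customization. The fragment below is a sketch under stated assumptions, not a recommended schema: the schema name corpusSubset, the choice of modules, the deletion of lg, and the alternative name para are all invented for the illustration, and the addition of new element types is not shown.

```xml
<schemaSpec ident="corpusSubset" start="teiCorpus TEI">
  <!-- select only the modules the project expects to need -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="corpus"/>

  <!-- exclude an element type the project will never encode -->
  <elementSpec ident="lg" module="core" mode="delete"/>

  <!-- change the name of an existing element -->
  <elementSpec ident="p" module="core" mode="change">
    <altIdent>para</altIdent>
  </elementSpec>
</schemaSpec>
```

A customization of this kind determines which elements are available at all; which of the remaining features a project treats as required, recommended, or optional is still a matter for its own documentation.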
15.6 The Corpus Module