The CGN Multi-word lexicon
1 March 2004
Richard Piepenbrock
Mila Groot
Raffaela Vlot
Maarten Jansonius
General information
The CGN Multi-word lexicon, as available in version 1.0 of the Spoken Dutch Corpus, is based on an inventory
of all multi-word expressions that occur in a number of sources (CELEX [1],
RBN [2], Woordenlijst Nederlandse Taal (Groene Boekje,
1995), Corpus Uit den Boogaart [3] and the Van
Dale Groot Woordenboek der Nederlandse Taal [4]),
complemented with all multi-word expressions that were encountered in the corpus.
Of this inventory, the lexicon only includes the multi-word expressions that actually occur in the corpus.
Format and contents of the CGN Multi-word lexicon
The lexicon is available in two formats:
- A standard text file (flat ASCII) with the
name cgnmlex.txt. The backslash ('\') is used as separator. Letters with diacritics
are represented in SGML format. This file can be opened with a simple
text editor, or with a database system such as Access, ORACLE or dBase.
- An XML file with the name cgnmlex.lex. This
file can be opened with any XML browser or editor, and then searched for certain
values. The associated DTD (Document Type Definition) mlex.dtd is also available.
The Multi-word lexicon comprises 11 columns. Both
lexicon files have been ordered according to the orthographic multi-word (Orthografie
Meerwoord) and then the multi-word part of speech (Woordsoort Meerwoord), the
ID number of the multi-word lemma (Id-Nummer Meerwoordslemma) and the rank number
of the members of the multi-word expression (Volgnummer van de leden binnen de
meerwoordsuitdrukking).
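As an illustration of the text format, the following minimal Python sketch splits one line of the text file into its 11 columns. It assumes that the columns appear in the order in which the fields are described in this document, which agrees with the example records shown further on. The names MlexRecord and parse_record, and the attribute names, are hypothetical and not part of the lexicon distribution.

```python
from typing import NamedTuple


class MlexRecord(NamedTuple):
    """One line of the text lexicon; attribute names are illustrative only."""
    orth_multiword: str   # Orthografie Meerwoord
    rank: int             # Volgnummer
    orth_wordform: str    # Orthografie Woordvorm
    pos_wordform: str     # Woordsoort Woordvorm
    pos_multiword: str    # Woordsoort Meerwoord
    lemma_id: int         # Id-Nummer Meerwoordslemma
    lemma: str            # Meerwoordslemma
    morphology: str       # Morfologie Meerwoordslemma
    definition: str       # Definitie Meerwoordslemma
    optional: bool        # Optioneel lid ("J" becomes True)
    continuous: bool      # Continu meerwoord ("J" becomes True)


def parse_record(line: str) -> MlexRecord:
    """Split a backslash-separated line into its 11 fields (assumed order)."""
    fields = line.rstrip("\r\n").split("\\")
    # A trailing separator yields one extra empty field; drop it.
    if len(fields) == 12 and fields[-1] == "":
        fields.pop()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    (orth_mw, rank, orth_wf, pos_wf, pos_mw,
     lemma_id, lemma, morph, definition, optional, continuous) = fields
    return MlexRecord(orth_mw, int(rank), orth_wf, pos_wf, pos_mw,
                      int(lemma_id), lemma, morph, definition,
                      optional == "J", continuous == "J")


# The trailing backslash is appended separately because a raw string
# literal cannot end in a single backslash.
line = r"pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J" + "\\"
record = parse_record(line)
print(record.orth_wordform, record.lemma, record.continuous)
# prints: pro pro_forma True
```

Converting the 'J'/'N' flags to booleans is a convenience of this sketch; the lexicon itself stores the letters.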
Number of unique multi-word expressions: 23,567
Number of unique multi-word lemmas: 18,593
Number of entries for multi-words: 53,704
Contents of the lexicon fields
- CGN_MLEXICON.Orthografie Meerwoord ::= ([0-9][A-Z][a-z][ &'*-;])+
Orthographic representation of the multi-word
expression. The inflection paradigm for the multi-word lemma has been included
here to the extent that inflected forms occur in the corpus. Diacritics are
represented in SGML format as follows:
"&" + capital/small letter + diacritic + ";"
Specifically:
letter: one of "a", "c", "e", "i", "n", "o", "u", "A", "C", "E", "I", "N", "O", "U"
diacritic: one of "grave", "acute" (= aigu), "circ" (= circonflexe), "uml" (= trema), "cedil" (= cedille), "tilde", "ring"
For example, '&agrave; la carte' for 'à la carte'
and 'Gustaf &Aring;kermans' for 'Gustaf Åkermans'.
The SGML symbol '&amp;' is used to represent
the ampersand ('&').
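The "&" + letter + diacritic + ";" notation coincides with the named character references that Python's standard html module understands, so a minimal sketch of the decoding step could look as follows. The function name decode_entities is hypothetical. Note that html.unescape also decodes references that fall outside the set listed above, which is assumed to be harmless here.

```python
import html


def decode_entities(text: str) -> str:
    """Replace references such as &agrave; or &amp; by the character they name."""
    return html.unescape(text)


print(decode_entities("&agrave; la carte"))
print(decode_entities("Gustaf &Aring;kermans"))
```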
- CGN_MLEXICON.Volgnummer ::= [1-9]+
This number indicates the position of the word
form in the sentence relative to the other parts of the multi-word expression.
- CGN_MLEXICON.Orthografie Woordvorm ::= ([0-9][A-Z][a-z][&'-;])+
Orthographic representation of the word form,
i.e. the individual parts of the multi-word expression. Diacritics are represented
as described above.
- CGN_MLEXICON.Woordsoort Woordvorm ::=
The part of speech of the word form, i.e. the
individual parts of the multi-word expression.
- "ADJ(" value ("," value)* ")" |
- "BW()" |
- "LID(" value ("," value)* ")" |
- "N(" value ("," value)* ")" |
- "SPEC(deeleigen)" |
- "SPEC(meta)" |
- "SPEC(onverst)" |
- "SPEC(vreemd)" |
- "TSW()" |
- "TW(" value ("," value)* ")" |
- "VG(" value ")" |
- "VNW(" value ("," value)* ")" |
- "VZ(" value ")" |
- "WW(" value ("," value)* ")"
Values for the open word classes conform to
the document Part of Speech Tagging en Lemmatisering (Van Eynde 2003):
- ADJ: adjectief (= adjective)
- BW: bijwoord (= adverb)
- LID: lidwoord (= article)
- N: substantief (= noun)
- SPEC(deeleigen): code for part of a compound proper name
- SPEC(meta): code for a mention
- SPEC(onverst): code for an incomprehensible utterance
- SPEC(vreemd): code for an utterance in a foreign language
- TSW: tussenwerpsel (= interjection)
- TW: telwoord (= numeral)
- VG: voegwoord (= conjunction)
- VNW: voornaamwoord (= pronoun)
- VZ: voorzetsel (= preposition)
- WW: werkwoord (= verb)
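The part-of-speech notation above consists of a word class code followed by a parenthesised, comma-separated list of values. A minimal sketch of how such a tag could be taken apart is given below. The function name parse_pos is hypothetical, and the sketch does not check the values against the tag set of Van Eynde (2003).

```python
import re

# Word class code in capitals, then zero or more comma-separated values
# between parentheses, e.g. "WW(pv,tgw,met-t)" or "BW()".
TAG_PATTERN = re.compile(r"([A-Z]+)\(([^()]*)\)")


def parse_pos(tag: str) -> tuple[str, list[str]]:
    """Return the word class code and the list of values of a tag."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a well-formed tag: {tag!r}")
    word_class, inner = match.groups()
    values = inner.split(",") if inner else []
    return word_class, values


print(parse_pos("WW(pv,tgw,met-t)"))
print(parse_pos("BW()"))
```

The same function also accepts the SPEC(...) codes, since these follow the same surface pattern.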
- CGN_MLEXICON.Woordsoort Meerwoord ::=
The part of speech of a multi-word expression
where from a grammatical point of view the full expression can be regarded as
one word. Values are the same as those found with the word form. In addition
we find
- COMB(eigen): code for compound proper name or title
Warning: this field has only been
included in the text version of the lexicon, viz. cgnmlex.txt (and not
in the XML version cgnmlex.lex). It is a provisional code that may be
subject to change in the future.
- CGN_LEXICON.Id-Nummer Meerwoordslemma ::= [0-9]+
Rank number (Id = 'identification') which
indicates which multi-word expressions belong to one and the same paradigm.
The distinction is only relevant for possibly discontinuous verbs. Where orthographically
identical (multi-word) lemmas occur with different ID numbers, this implies that
the lemmas differ in morpho-syntactic (e.g. strong or weak declension)
or phonetic (e.g. stress) characteristics, in combination with a difference in
meaning. The difference in meaning is indicated in the field Definitie
Meerwoordslemma.
- CGN_MLEXICON.Meerwoordslemma ::= ([0-9][A-Z][a-z][&'*-;_])*
The lemma of the multi-word expression, such as
'uitademen' for multi-word instances like '(ik) adem uit'. With continuous multi-word
expressions, viz. fully integrated foreign expressions, compound proper
names and titles, a dummy lemma form is postulated which is identical to the
expression (the parts are linked by means of underscores). For example,
pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\
pro forma\2\forma\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\
Kim Clijsters\1\Kim\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J\
Kim Clijsters\2\Clijsters\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J
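The example records can be used to sketch how the members of an expression might be collected and how the dummy lemma relates to them. The sketch below groups lines by expression, ID number and lemma, orders the word forms by rank number, and checks that joining them with underscores gives the lemma. The function name group_members is hypothetical, and the field positions (0, 1, 2, 5 and 6) are an assumption based on the example records.

```python
from collections import defaultdict


def group_members(lines):
    """Map (expression, lemma ID, lemma) to its word forms in rank order."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\\")
        expression, rank, word_form = fields[0], int(fields[1]), fields[2]
        lemma_id, lemma = int(fields[5]), fields[6]
        groups[(expression, lemma_id, lemma)].append((rank, word_form))
    return {key: [word for _, word in sorted(members)]
            for key, members in groups.items()}


# Trailing separators are left out here because a raw string literal
# cannot end in a backslash; the parsing is not affected.
lines = [
    r"pro forma\2\forma\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J",
    r"pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J",
    r"Kim Clijsters\1\Kim\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J",
    r"Kim Clijsters\2\Clijsters\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J",
]
groups = group_members(lines)
for (expression, lemma_id, lemma), words in groups.items():
    # For continuous expressions the dummy lemma is the parts joined by "_".
    print(expression, lemma_id, words, "_".join(words) == lemma)
```

For possibly discontinuous verbs such as 'uitademen' the underscore check does not apply, since their lemma is an ordinary verb lemma rather than a dummy form.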
- CGN_LEXICON.Morfologie Meerwoordslemma
Hierarchical morphological segmentation of the
multi-word lemma. This representation concerns the multi-word lemma and only
comprises derivational and compositional morphology (no characterisation is
given of the inflectional characteristics of the word form). The morphological
segmentation is only relevant for possibly discontinuous verbs. The representation
is redundant in the sense that for each word form the morphological representation
is repeated. The different levels of segmentation (from the full multi-word
lemma to its morphemes) are represented in the form of a bracketing. The part
of speech for each morpheme is indicated between square brackets. Bound morphemes
(affixes) have been indicated by means of periods, or the letter 'x' in case
the affix is discontinuous (together with a period before the other part).
Overview of word class codes:
- N = substantief (= noun)
- A = adjectief (= adjective)
- Q = telwoord (= numeral)
- V = werkwoord (= verb)
- D = lidwoord (= article)
- O = voornaamwoord (= pronoun)
- B = bijwoord (= adverb)
- P = voorzetsel (= preposition)
- C = voegwoord (= conjunction)
- I = tussenwerpsel (= interjection)
- X = restcategorie (= rest category)
- . = affix (= affix)
- x = deel van discontinu affix (= part of a discontinuous affix)
The role of the affix in the derivation or composition
is indicated by means of a vertical bar. The part of speech following
the bar refers to the parts of speech of the morphemes that serve as input for
the morphological process, and the part of speech preceding the bar indicates
the part of speech of the output of the morphological process (viz. the part
of speech of the complex morpheme that is formed from
the other morphemes). Thus '[V|.A]' in 'voorverwarmen' represents the process
of affixation in which an adjective is transformed into a verb by means of
the prefix 'ver-':
voorverwarmen ((voor)[B],((ver)[V|.A],(warm)[A])[V])[V]
Examples of morphological segmentation:
- dichtmaken:
- ((dicht)[A],(maak)[V])[V]
- navertellen:
- ((na)[P],((ver)[V|.V],(tel)[V])[V])[V]
- achteruitdeinzen:
- (((achter)[B],(uit)[B])[B],(deins)[V])[V]
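The bracketing can be read as a tree in which every node is written as "(" content ")" "[" label "]", where the content is either a stem or a comma-separated list of nodes. The following recursive-descent sketch is based only on the examples above; the names Node, parse_morphology and leaves are hypothetical, and the sketch does not interpret the labels beyond storing them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # e.g. "V", "B" or "V|.A"
    stem: str = ""                  # filled for leaf morphemes only
    children: list = field(default_factory=list)


def _parse_node(text: str, pos: int):
    """Parse one '(...)[label]' unit starting at pos; return (node, new pos)."""
    if pos >= len(text) or text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    stem, children = "", []
    if pos < len(text) and text[pos] == "(":
        # Complex morpheme: one or more nodes separated by commas.
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
    else:
        # Leaf morpheme: everything up to the closing parenthesis.
        end = text.index(")", pos)
        stem, pos = text[pos:end], end
    if pos >= len(text) or text[pos] != ")":
        raise ValueError(f"expected ')' at position {pos}")
    pos += 1
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    end = text.index("]", pos)
    return Node(text[pos + 1:end], stem, children), end + 1


def parse_morphology(text: str) -> Node:
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return node


def leaves(node: Node) -> list:
    """Return (stem, label) pairs of the morphemes from left to right."""
    if not node.children:
        return [(node.stem, node.label)]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


tree = parse_morphology("((voor)[B],((ver)[V|.A],(warm)[A])[V])[V]")
print(tree.label, leaves(tree))
```

A label that contains a vertical bar can be split further with str.partition("|") to separate the output part of speech from the input pattern.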
- CGN_LEXICON.Definitie Meerwoordslemma
For all multi-word lemmas that have been included
more than once with one and the same part of speech (because they had distinctive
formal characteristics such as morpho-syntactic characteristics, gender or derivational
morphology) together with a difference in meaning, a compact definition has
been included so as to distinguish between the lemmas. This field is only relevant
for possibly discontinuous verbs. Cases of such ambiguity will not occur within
the lexicon, but do occur in a comparison with the single word lexicon cgnlex.txt.
For example,
loopt door\WW(pv,tgw,met-t)\501446\doorlopen\((door)[B],(loop)[V])[V]\verder lopen, vermengen van kleuren\J\N\
- CGN_MLEXICON.Optioneel lid ::= "J" | "N"
If the word form (Woordvorm) is an optional part
of a multi-word expression, the value of this field is 'J'.
If the word form is an obligatory part of a multi-word expression,
the value of the field is 'N'. Thus 'ademt' as part of 'inademen' and
'uitademen' has the value 'J', while 'apen' as part of 'na-apen'
receives the value 'N'.
- CGN_MLEXICON.Continu meerwoord ::= "J" | "N"
If the multi-word expression cannot be interrupted
(by constituents other than hesitations or interjections), as for example
'Tien Voor Taal' or 'per se', the multi-word expression as a whole is given
the value 'J', or else 'N', as in the case of possibly discontinuous verbs.
[1] Centrum voor Lexicale Informatie. Interfacultaire Werkgroep Taal en Spraak, Universiteit van Nijmegen & Max Planck Instituut voor Psycholinguïstiek, Nijmegen.
[2] Referentiebestand Nederlands. Vakgroep Lexicologie, Vrije Universiteit Amsterdam & Instituut voor Nederlandse Lexicologie, Leiden & Departement Linguïstiek, Katholieke Universiteit Leuven & Vakgroep Nederlands, Universiteit Utrecht.
[3] Boogaart, P.C. Uit den (1975). Woordfrequenties: in Geschreven en Gesproken Nederlands. Utrecht: Oosthoek, Scheltema & Holkema. Electronic version available as part of the Eindhoven Corpus.
[4] Geerts, G. & T. den Boon (1999). Van Dale Groot Woordenboek der Nederlandse Taal. Utrecht/Antwerpen: Van Dale Lexicografie.