The CGN Multi-word lexicon

	1 March 2004

	Richard Piepenbrock
	Mila Groot
	Raffaela Vlot
	Maarten Jansonius

General information

The CGN Multi-word lexicon, as available in version 1.0 of the Spoken Dutch Corpus, is based on an inventory of all multi-word expressions that occur in a number of sources (CELEX ¹, RBN ², Woordenlijst Nederlandse Taal (Groene Boekje, 1995), Corpus Uit den Boogaart ³ and the Van Dale Groot Woordenboek der Nederlandse Taal ⁴), complemented with all multi-word expressions that were encountered in the corpus. Of these, the lexicon includes only the multi-word expressions that actually occur in the corpus.

Format and contents of the CGN Multi-word lexicon

The lexicon is available in two formats:

A standard text file (flat ASCII) with the name cgnmlex.txt. The backslash ('\') is used as the field separator. Letters with diacritics are represented in SGML format. This file can be opened with a simple text editor or loaded into a database system such as Access, ORACLE or dBase.
An XML file with the name cgnmlex.lex. This file can be opened in any XML browser or editor and then searched for specific values. The associated DTD (Document Type Definition) mlex.dtd is also available.

The Multi-word lexicon comprises 11 columns. Both lexicon files are ordered by the orthographic multi-word (Orthografie Meerwoord), then by the multi-word part of speech (Woordsoort Meerwoord), the ID number of the multi-word lemma (Id-Nummer Meerwoordslemma) and the rank number of the members of the multi-word expression (Volgnummer van de leden binnen de meerwoordsuitdrukking).
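
As a rough illustration of the text format, the sketch below splits one line into its 11 fields. The English attribute names, the field order (taken from the order in which the fields are described in this document) and the reading of "J"/"N" as yes/no are assumptions, not part of the official specification. The sample values come from the 'pro forma' example given further on.

```python
from typing import NamedTuple


class MwEntry(NamedTuple):
    # Attribute names are illustrative; the Dutch field names are in the comments.
    multiword: str         # Orthografie Meerwoord
    rank: int              # Volgnummer
    wordform: str          # Orthografie Woordvorm
    wordform_pos: str      # Woordsoort Woordvorm
    multiword_pos: str     # Woordsoort Meerwoord
    lemma_id: int          # Id-Nummer Meerwoordslemma
    lemma: str             # Meerwoordslemma
    morphology: str        # Morfologie Meerwoordslemma
    definition: str        # Definitie Meerwoordslemma
    optional_member: bool  # Optioneel lid (assumed: "J" = yes, "N" = no)
    continuous: bool       # Continu meerwoord (assumed: "J" = yes, "N" = no)


def parse_entry(line: str) -> MwEntry:
    """Split one backslash-separated line into an MwEntry."""
    fields = line.rstrip("\n").split("\\")
    # A line that ends in a separator yields one extra, empty item.
    if len(fields) == 12 and fields[-1] == "":
        fields.pop()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return MwEntry(
        fields[0], int(fields[1]), fields[2], fields[3], fields[4],
        int(fields[5]), fields[6], fields[7], fields[8],
        fields[9] == "J", fields[10] == "J",
    )


# Sample line, built with join() to avoid hard-to-read escaped backslashes.
SAMPLE_LINE = "\\".join(
    ["pro forma", "1", "pro", "SPEC(vreemd)", "BW()", "615782",
     "pro_forma", "", "", "N", "J"]
) + "\\"

entry = parse_entry(SAMPLE_LINE)
print(entry.wordform, entry.lemma_id, entry.continuous)
```

A named tuple keeps the positional nature of the file visible while still allowing access by name.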

Number of unique multi-word expressions	23,567
Number of unique multi-word lemmas	18,593
Number of entries for multi-words	53,704

Contents of the lexicon fields

CGN_MLEXICON.Orthografie Meerwoord ::= ([0-9][A-Z][a-z][ &'*-;])+

Letters with diacritics are encoded as:

"&" + capital/small letter + diacritic + ";"

Specifically:

letter:	one of "a", "c", "e", "i", "n", "o", "u", "A", "C", "E", "I", "N", "O", "U"
diacritic:	one of "grave", "acute" (= aigu), "circ" (= circumflex), "uml" (= diaeresis), "cedil" (= cedilla), "tilde", "ring"

e.g. '&agrave; la carte' for 'à la carte'

and

'Gustaf &Aring;kermans' for 'Gustaf Åkermans'

The SGML entity '&amp;' is used to represent the ampersand ('&').
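
The following sketch shows one possible way to convert the SGML notation back to ordinary Unicode text. It assumes that only the letter-plus-diacritic pattern described above occurs and that the ampersand itself is written as `&amp;`. The function name `decode_sgml` is made up for this example.

```python
import re
import unicodedata

# Unicode combining marks for the diacritic names used in the lexicon.
COMBINING = {
    "grave": "\u0300",
    "acute": "\u0301",
    "circ": "\u0302",
    "tilde": "\u0303",
    "uml": "\u0308",
    "ring": "\u030a",
    "cedil": "\u0327",
}

# Matches either "&amp;" or "&" + one letter + diacritic name + ";".
ENTITY = re.compile(
    r"&(?:(amp)|([A-Za-z])(grave|acute|circ|uml|cedil|tilde|ring));"
)


def decode_sgml(text: str) -> str:
    """Replace SGML entities by the corresponding Unicode characters."""
    def replace(match: re.Match) -> str:
        if match.group(1):  # the ampersand entity
            return "&"
        letter, diacritic = match.group(2), match.group(3)
        # Compose base letter and combining mark into a single character.
        return unicodedata.normalize("NFC", letter + COMBINING[diacritic])

    return ENTITY.sub(replace, text)


print(decode_sgml("&agrave; la carte"))
print(decode_sgml("Gustaf &Aring;kermans"))
```

Doing all replacements in a single pass avoids decoding a literal ampersand twice.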

CGN_MLEXICON.Volgnummer ::= [1-9]+

CGN_MLEXICON.Orthografie Woordvorm ::= ([0-9][A-Z][a-z][&'-;])+

CGN_MLEXICON.Woordsoort Woordvorm ::=

"ADJ(" value ("," value)* ")" |

"BW()" |

"LID(" value ("," value)* ")" |

"N(" value ("," value)* ")" |

"SPEC(deeleigen)" |

"SPEC(meta)" |

"SPEC(onverst)" |

"SPEC(vreemd)" |

"TSW()" |

"TW(" value ("," value)* ")" |

"VG(" value ")" |

"VNW(" value ("," value)* ")" |

"VZ(" value ")" |

"WW(" value ("," value)* ")"

Part of Speech Tagging en Lemmatisering

ADJ	adjectief (= adjective)
BW	bijwoord (= adverb)
LID	lidwoord (= article)
N	substantief (= noun)
SPEC(deeleigen)	code for part of a compound proper name
SPEC(meta)	code for a mention
SPEC(onverst)	code for an incomprehensible utterance
SPEC(vreemd)	code for an utterance in a foreign language
TSW	tussenwerpsel (= interjection)
TW	telwoord (= numeral)
VG	voegwoord (= conjunction)
VNW	voornaamwoord (= pronoun)
VZ	voorzetsel (= preposition)
WW	werkwoord (= verb)
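
A tag of this form can be split into its word class and its list of values with a small regular expression. The sketch below is only an illustration: the function name `parse_tag` is hypothetical, and the values are treated as opaque strings because their meaning is not described here.

```python
import re

# Word class in capitals, followed by a (possibly empty) value list in parentheses.
TAG = re.compile(r"^([A-Z]+)\(([^()]*)\)$")


def parse_tag(tag: str) -> tuple[str, list[str]]:
    """Split a tag such as 'WW(pv,tgw,met-t)' into ('WW', ['pv', 'tgw', 'met-t'])."""
    match = TAG.match(tag)
    if match is None:
        raise ValueError(f"not a well-formed tag: {tag!r}")
    category, inner = match.groups()
    values = inner.split(",") if inner else []
    return category, values


print(parse_tag("WW(pv,tgw,met-t)"))
print(parse_tag("BW()"))
```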

CGN_MLEXICON.Woordsoort Meerwoord ::=

COMB(eigen)

code for compound proper name or title
Warning: this field has only been included in the text version of the lexicon, viz. cgnmlex.txt (and not in the XML version cgnmlex.lex). It is a provisional code that may be subject to change in the future.

CGN_LEXICON.Id-Nummer Meerwoordslemma ::= [0-9]+

Definitie Meerwoordslemma

CGN_MLEXICON.Meerwoordslemma ::= ([0-9][A-Z][a-z][&'*-;_])*

pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\
pro forma\2\forma\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\

Kim Clijsters\1\Kim\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J\
Kim Clijsters\2\Clijsters\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J\
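
Because each member of a multi-word expression has its own line, the members usually need to be regrouped before use. The sketch below collects the word forms per expression and orders them by rank number. It assumes that the fields are at the positions shown in the examples above (rank number second, word form third, lemma ID sixth); `group_members` is a hypothetical helper name.

```python
from collections import defaultdict


def group_members(lines):
    """Map (multi-word, lemma ID) to the word forms ordered by rank number."""
    members = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\\")
        multiword = fields[0]
        rank, wordform, lemma_id = int(fields[1]), fields[2], int(fields[5])
        members[(multiword, lemma_id)].append((rank, wordform))
    # Sorting the (rank, word form) pairs restores the order of the members.
    return {
        key: [wordform for _, wordform in sorted(pairs)]
        for key, pairs in members.items()
    }


# Sample lines in scrambled order; the trailing separator is left out so
# that the lines can be written as raw strings.
SAMPLE = [
    r"pro forma\2\forma\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J",
    r"Kim Clijsters\1\Kim\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J",
    r"pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J",
    r"Kim Clijsters\2\Clijsters\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J",
]

grouped = group_members(SAMPLE)
print(grouped)
```

The orthographic form is part of the key because several multi-word expressions may share one lemma ID.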

CGN_LEXICON.Morfologie Meerwoordslemma

Overview of word class codes:

N = substantief (= noun)
A = adjectief (= adjective)
Q = telwoord (= numeral)
V = werkwoord (= verb)
D = lidwoord (= article)
O = voornaamwoord (= pronoun)
B = bijwoord (= adverb)
P = voorzetsel (= preposition)
C = voegwoord (= conjunction)
I = tussenwerpsel (= interjection)
X = restcategorie (= rest category)
. = affix (= affix)
x = deel van discontinu affix (= part of a discontinuous affix)

voorverwarmen:

((voor)[B],((ver)[V|.A],(warm)[A])[V])[V]

dichtmaken:

((dicht)[A],(maak)[V])[V]

navertellen:

((na)[P],((ver)[V|.V],(tel)[V])[V])[V]

achteruitdeinzen:

(((achter)[B],(uit)[B])[B],(deins)[V])[V]
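
The bracketed structures above are nested, so a small recursive parser is a natural way to read them. The sketch below is an assumption about the notation, based only on the four examples: every node is written as "(" content ")" "[" label "]", where the content is either a stem or a comma-separated list of nodes. Labels such as `V|.A` are kept as opaque strings. The function names are hypothetical.

```python
def parse_structure(text):
    """Parse e.g. '((dicht)[A],(maak)[V])[V]' into nested (label, content) tuples."""
    node, pos = _node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return node


def _expect(text, pos, char):
    """Check that text[pos] is char and return the next position."""
    if pos >= len(text) or text[pos] != char:
        raise ValueError(f"expected {char!r} at position {pos}")
    return pos + 1


def _node(text, pos):
    pos = _expect(text, pos, "(")
    if text.startswith("(", pos):
        # Complex node: one or more child nodes separated by commas.
        content = []
        while True:
            child, pos = _node(text, pos)
            content.append(child)
            if not text.startswith(",", pos):
                break
            pos += 1
    else:
        # Simple node: a stem running up to the closing parenthesis.
        end = text.find(")", pos)
        if end == -1:
            raise ValueError("unclosed stem")
        content, pos = text[pos:end], end
    pos = _expect(text, pos, ")")
    pos = _expect(text, pos, "[")
    end = text.find("]", pos)
    if end == -1:
        raise ValueError("unclosed label")
    return (text[pos:end], content), end + 1


def stems(node):
    """Return the stems of a parsed structure from left to right."""
    _, content = node
    if isinstance(content, str):
        return [content]
    return [stem for child in content for stem in stems(child)]


tree = parse_structure("((dicht)[A],(maak)[V])[V]")
print(tree)
print(stems(parse_structure("((na)[P],((ver)[V|.V],(tel)[V])[V])[V]")))
```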

CGN_LEXICON.Definitie Meerwoordslemma

loopt door\WW(pv,tgw,met-t)\501446\doorlopen\((door)[B],(loop)[V])[V]\verder lopen, vermengen van kleuren\J\N\

CGN_MLEXICON.Optioneel lid ::= "J" | "N"

CGN_MLEXICON.Continu meerwoord ::= "J" | "N"

¹ Centrum voor Lexicale Informatie. Interfacultaire Werkgroep Taal en Spraak, Universiteit van Nijmegen & Max Planck Instituut voor Psycholinguïstiek, Nijmegen.

² Referentiebestand Nederlands. Vakgroep Lexicologie, Vrije Universiteit Amsterdam & Instituut voor Nederlandse Lexicologie, Leiden & Departement Linguïstiek, Katholieke Universiteit Leuven & Vakgroep Nederlands, Universiteit Utrecht.

³ Boogaart, P.C. Uit den (1975). Woordfrequenties in Geschreven en Gesproken Nederlands. Utrecht: Oosthoek, Scheltema & Holkema. Electronic version available as part of the Eindhoven Corpus.

⁴ Geerts, G. & T. den Boon (1999). Van Dale Groot Woordenboek der Nederlandse Taal. Utrecht/Antwerpen: Van Dale Lexicografie.