Lexicon link-up

All word forms (tokens) in the corpus have been lemmatised. Lemmatisation followed the word-for-word principle, and this principle also applied where a token was part of a multi-word unit or of a discontinuous verb or preposition. Only later, in a separate phase that we refer to here as lexicon link-up, were these items lemmatised as multi-word units. Through these lemmas the tokens are linked to the lexicon, so that the parts of multi-word expressions can be searched.

Below we discuss in some detail the aim and motivation for the lexicon link-up. Attention is also given to the protocol and the procedure that were developed. We also provide information about the file types and formats. Finally, an overview is presented of the data that are available in version 1.0.

Aim and motivation

To comply with existing annotation standards, and for more practical reasons (e.g. the automation of the annotation procedures and the synchronisation of different annotations), the word was taken as the base unit for annotation. However, since various grammatical and lexicological theories recognise the multi-word unit as a separate entity, we decided to mark some types of multi-word units. These multi-word units have been included in the lexicon and can be used to specify complex searches.

The following multi-word expressions are distinguished:

discontinuous:

verbs that can occur as discontinuous strings (e.g. opnemen, ademhalen)

continuous:

common (originally) foreign expressions (e.g. et cetera, wishful thinking)
native and non-native proper names and titles (e.g. Berg En Dal, Avril Lavigne, De Morgen, De Pfaffs)

Other multi-word structures were considered for inclusion. These include compound prepositions (bij monde van, met het oog op) and idiomatic expressions that permit a certain variability (in mijn/je/haar/zijn ... nopjes zijn, een modderfiguur slaan). Because no set of criteria could be formulated by which these items can be delimited, we decided not to include them as multi-word items. For two other categories, discontinuous prepositions (tussen ... in, van ... af) and discontinuous pronominal adverbs (er ... doorheen, daar ... mee), the lexicon link-up was considered too difficult because of the complex orthographical and grammatical relationship with the possibly discontinuous verbs. Multi-word units showing contraction (in- en uitvoer, probleemformulering of -oplossing) should be linked up to the lexicon in a follow-up project.

Procedure

In order to facilitate the annotation process, we decided to transcribe and tag the samples first. Once the results had been verified, the orthographic transcription and the POS tags were used as input for the lexicon link-up.

An inventory was made of all the possibly discontinuous verbs and foreign expressions that occurred in various lexical resources and in the corpus. This inventory served to automatically mark instances in the corpus as possible multi-word expressions. The same procedure was adopted for all continuous sequences of words starting with a capital letter (these were candidates for being identified as compound proper names or titles). A test was conducted involving a subset of the corpus comprising some 1 million tokens. All multi-word units were checked. As a result, we were able to further improve the protocol.
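
The automatic marking step described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function names, the tuple-based inventory and the lower-casing of tokens before lookup are assumptions made for illustration only. The sketch flags continuous inventory matches and runs of capitalised tokens as candidates. Discontinuous verbs are left out, and every candidate would still need manual verification.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def capitalised_runs(tokens: List[str]) -> List[Span]:
    """Return spans of two or more consecutive tokens starting with a capital.

    Such runs are only candidates for compound proper names or titles.
    """
    runs: List[Span] = []
    start = None
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= 2:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the token list.
    if start is not None and len(tokens) - start >= 2:
        runs.append((start, len(tokens)))
    return runs


def inventory_matches(tokens: List[str], inventory: Set[Tuple[str, ...]]) -> List[Span]:
    """Return spans whose lower-cased tokens form an entry of the inventory.

    The inventory is a hypothetical set of token tuples, e.g. ("et", "cetera").
    """
    max_len = max((len(entry) for entry in inventory), default=0)
    lowered = [t.lower() for t in tokens]
    spans: List[Span] = []
    for i in range(len(tokens)):
        for n in range(2, max_len + 1):
            if i + n <= len(tokens) and tuple(lowered[i:i + n]) in inventory:
                spans.append((i, i + n))
    return spans


if __name__ == "__main__":
    tokens = ["we", "gaan", "naar", "Berg", "En", "Dal", "et", "cetera"]
    inventory = {("et", "cetera"), ("wishful", "thinking")}
    print(capitalised_runs(tokens))              # [(3, 6)]
    print(inventory_matches(tokens, inventory))  # [(6, 8)]
```

A sentence-initial capital followed by a lower-case word yields no candidate here, because only runs of at least two capitalised tokens are reported.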

Finally, all candidate multi-word expressions in the entire corpus were identified automatically. The output of the automatic process was then verified manually. In the POS annotation files (.tag) and the output files of the lexicon link-up (.lxk), numerical codes were introduced that relate to the lemmas of multi-word expressions.

Protocol

For the lexicon link-up a separate protocol was developed:

Piepenbrock, R. 2004. Taalkundig protocol voor de lexicologische koppeling. (Available in .ps and .pdf format; Dutch only.)

File types and formats

The information that was added in the lexicon link-up phase has been stored in the following files:

lexical files of type .lex. The format of these files is XML. Multi-word expressions can be found in the file cgnmlex.lex. These files can be found in the directory /data/lexicon/xml/ of the annotation DVD.
lexical files of type .txt. The format of these files is ASCII. Multi-word expressions can be found in the file cgnmlex.txt. These files can be found in the directory /data/lexicon/text/ of the annotation DVD.
text files of type .lxk. In these files multi-word expressions have been identified and reference is made to the lexicons. The format of these files is XML. These files can be found in the directory /data/annot/xml/lxk/ of the annotation DVD.
text files of type .tag. In these files multi-word expressions have been identified and reference is made to the lexicons. The format of these files is XML. These files can be found in the directory /data/annot/xml/tag/ of the annotation DVD.

For descriptions of the formats mentioned above, please see the descriptions of the lex format and the lxk format.
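
As a convenience, the directory layout listed above can be captured in a small lookup. The following Python sketch is illustrative only: the helper name linkup_path and the mount point /mnt/cgn are hypothetical, and the sketch assumes that a file sits directly in the directory named for its type, which the description above does not confirm for the .lxk and .tag files.

```python
from pathlib import PurePosixPath

# Directories as listed above, relative to the root of the annotation DVD.
LINKUP_DIRS = {
    ".lex": "data/lexicon/xml",
    ".txt": "data/lexicon/text",
    ".lxk": "data/annot/xml/lxk",
    ".tag": "data/annot/xml/tag",
}


def linkup_path(filename: str, dvd_root: str = "/") -> PurePosixPath:
    """Build the expected location of a lexicon link-up file.

    dvd_root is wherever the annotation DVD is mounted (an assumption).
    """
    suffix = PurePosixPath(filename).suffix
    if suffix not in LINKUP_DIRS:
        raise ValueError(f"not a lexicon link-up file type: {filename!r}")
    return PurePosixPath(dvd_root) / LINKUP_DIRS[suffix] / filename


if __name__ == "__main__":
    print(linkup_path("cgnmlex.lex", "/mnt/cgn"))
    # /mnt/cgn/data/lexicon/xml/cgnmlex.lex
```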

Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)

Component		Total number of words	VL	NL
a.	Spontaneous conversations ('face-to-face')	2,626,172	878,383	1,747,789
b.	Interviews with teachers of Dutch	565,433	315,554	249,879
c.	Spontaneous telephone dialogues (recorded via a switchboard)	1,208,633	465,096	743,537
d.	Spontaneous telephone dialogues (recorded on MD via a local interface)	853,371	343,167	510,204
e.	Simulated business negotiations	136,461	0	136,461
f.	Interviews/discussions/debates (broadcast)	790,269	250,708	539,561
g.	(political) Discussions/debates/meetings (non-broadcast)	360,328	138,819	221,509
h.	Lessons recorded in the classroom	405,409	105,436	299,973
i.	Live (e.g. sports) commentaries (broadcast)	208,399	78,022	130,377
j.	News reports/reportages (broadcast)	186,072	95,206	90,866
k.	News (broadcast)	368,153	82,855	285,298
l.	Commentaries/columns/reviews (broadcast)	145,553	65,386	80,167
m.	Ceremonious speeches/sermons	18,075	12,510	5,565
n.	Lectures/seminars	140,901	79,067	61,834
o.	Read speech	903,043	351,419	551,624
Total		8,916,272	3,261,628	5,654,644
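
The figures in Table 1 can be checked for internal consistency: for each component the VL and NL counts should add up to the component total, and the columns should add up to the Total row. The Python sketch below copies the figures from the table; the variable and function names are illustrative and not part of the corpus distribution.

```python
# (total, VL, NL) word counts per component, copied from Table 1.
ROWS = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}
TOTAL = (8_916_272, 3_261_628, 5_654_644)


def inconsistent_components(rows):
    """Return the components whose VL and NL counts do not sum to the total."""
    return [comp for comp, (total, vl, nl) in rows.items() if vl + nl != total]


def column_sums(rows):
    """Return the (total, VL, NL) sums over all components."""
    return tuple(sum(col) for col in zip(*rows.values()))


if __name__ == "__main__":
    print(inconsistent_components(ROWS))  # []
    print(column_sums(ROWS) == TOTAL)     # True
```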
