All word forms (tokens) in the corpus have been lemmatised. Lemmatisation followed the word-for-word principle: every token was lemmatised on its own, even when it was part of a multi-word unit or of a discontinuous verb or preposition. Only later, in a separate phase referred to here as the lexicon link-up, were these items lemmatised as multi-word units. These multi-word lemmas provide the reference to the lexicon that makes it possible to search for the parts of multi-word expressions.
Below we discuss in some detail the
aim and motivation for the lexicon link-up. Attention is also given to
the protocol and the procedure that were developed. We also provide information
about the file types and formats. Finally, an overview is presented of
the data that are available in version 1.0.
The following multi-word expressions are distinguished:
An inventory was made of all the possibly discontinuous verbs and foreign expressions that occurred in various lexical resources and in the corpus. This inventory served to automatically mark instances in the corpus as possible multi-word expressions. The same procedure was adopted for all continuous sequences of words starting with a capital letter (these were candidates for being identified as compound proper names or titles). A test was conducted involving a subset of the corpus comprising some 1 million tokens. All multi-word units were checked. As a result, we were able to further improve the protocol.
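The automatic marking step described above can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used for the corpus: the function names, the shape of the inventory (a mapping from a verb lemma to its particles and the resulting multi-word lemma) and the example sentences are all assumptions made for the sake of the example. Both functions deliberately over-generate, since their output is only a list of candidates for manual checking.

```python
from typing import Dict, List, Tuple


def capitalised_runs(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of two or more
    consecutive tokens that start with a capital letter."""
    spans = []
    start = None
    # The empty sentinel token closes a run that reaches the end of the input.
    for i, tok in enumerate(tokens + [""]):
        if tok[:1].isupper():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= 2:
                spans.append((start, i))
            start = None
    return spans


def discontinuous_candidates(
    lemmas: List[str],
    inventory: Dict[str, Dict[str, str]],
) -> List[Tuple[int, int, str]]:
    """Return (verb_index, particle_index, multiword_lemma) candidates.

    `inventory` is a hypothetical structure: verb lemma -> {particle: multi-word lemma}.
    Every later occurrence of a listed particle is reported as a candidate.
    """
    found = []
    for i, lemma in enumerate(lemmas):
        for particle, mw_lemma in inventory.get(lemma, {}).items():
            for j in range(i + 1, len(lemmas)):
                if lemmas[j] == particle:
                    found.append((i, j, mw_lemma))
    return found


# Illustrative input only; not taken from the corpus.
tokens = ["gisteren", "sprak", "Jan", "Peter", "Balkenende", "in", "Den", "Haag"]
name_candidates = capitalised_runs(tokens)

lemmas = ["ik", "bellen", "je", "morgen", "op"]  # word-for-word lemmas
verb_candidates = discontinuous_candidates(lemmas, {"bellen": {"op": "opbellen"}})
```

A sentence-initial capital followed by a name would also be flagged by `capitalised_runs`, which is why candidates of this kind need the manual verification described in the text.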
Finally, for the entire corpus, all candidate multi-word expressions were identified automatically. The output of the automatic process was then verified manually. In the POS annotation files and the output files of the lexicon link-up, the tag and the .lxk files, numerical codes were introduced that relate to the lemmas of multi-word expressions.
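To illustrate how a shared numerical code can tie separately lemmatised tokens to one multi-word lemma, here is a small sketch. It does not reproduce the actual layout of the tag or .lxk files, which is not described here: the `Token` record, the `mwu_code` field and the `mwu_lemmas` table are hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str
    lemma: str               # word-for-word lemma
    mwu_code: Optional[int]  # hypothetical code shared by the parts of one multi-word unit


def group_by_code(tokens: List[Token]) -> Dict[int, List[str]]:
    """Collect the word forms that carry the same multi-word code."""
    groups = defaultdict(list)
    for tok in tokens:
        if tok.mwu_code is not None:
            groups[tok.mwu_code].append(tok.form)
    return dict(groups)


def find_parts(
    tokens: List[Token], mwu_lemmas: Dict[int, str], wanted: str
) -> Dict[int, List[str]]:
    """Return the parts of every occurrence whose multi-word lemma is `wanted`."""
    codes = {code for code, lemma in mwu_lemmas.items() if lemma == wanted}
    return {c: forms for c, forms in group_by_code(tokens).items() if c in codes}


# Illustrative data only; the codes and the lemma table are assumptions.
sentence = [
    Token("ik", "ik", None),
    Token("bel", "bellen", 1),
    Token("je", "je", None),
    Token("morgen", "morgen", None),
    Token("op", "op", 1),
]
mwu_lemmas = {1: "opbellen"}
parts = find_parts(sentence, mwu_lemmas, "opbellen")
```

With a representation along these lines, each token keeps its own lemma while a search on the multi-word lemma still retrieves both of its parts.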
Piepenbrock, R. 2004. Taalkundig protocol voor de lexicologische koppeling. (Available in .ps and .pdf format; Dutch only.)
Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)

Component | Total number of words | VL | NL
---|---|---|---
a. Spontaneous conversations ('face-to-face') | 2,626,172 | 878,383 | 1,747,789
b. Interviews with teachers of Dutch | 565,433 | 315,554 | 249,879
c. Spontaneous telephone dialogues (recorded via a switchboard) | 1,208,633 | 465,096 | 743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface) | 853,371 | 343,167 | 510,204
e. Simulated business negotiations | 136,461 | 0 | 136,461
f. Interviews/discussions/debates (broadcast) | 790,269 | 250,708 | 539,561
g. (political) Discussions/debates/meetings (non-broadcast) | 360,328 | 138,819 | 221,509
h. Lessons recorded in the classroom | 405,409 | 105,436 | 299,973
i. Live (e.g. sports) commentaries (broadcast) | 208,399 | 78,022 | 130,377
j. News reports/reportages (broadcast) | 186,072 | 95,206 | 90,866
k. News (broadcast) | 368,153 | 82,855 | 285,298
l. Commentaries/columns/reviews (broadcast) | 145,553 | 65,386 | 80,167
m. Ceremonious speeches/sermons | 18,075 | 12,510 | 5,565
n. Lectures/seminars | 140,901 | 79,067 | 61,834
o. Read speech | 903,043 | 351,419 | 551,624
Total | 8,916,272 | 3,261,628 | 5,654,644