The entire corpus was lemmatised automatically with a lemmatiser. The output was then checked and, where necessary, corrected manually.
Below, the lemmatisation of the data in the Spoken Dutch Corpus is described in some detail. Attention is given to the aims pursued, the protocol that was developed, and the procedure that was adopted. Information is also provided on the file types and formats. Finally, an overview is given of the data available in version 1.0.
Read more:
Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Available in .pdf format; Dutch only.)
The lemmatisation is stored together with the POS tagging in the following files:
Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)
| | Component | Total number of words | VL | NL |
|---|---|---|---|---|
| a. | Spontaneous conversations ('face-to-face') | 2,626,172 | 878,383 | 1,747,789 |
| b. | Interviews with teachers of Dutch | 565,433 | 315,554 | 249,879 |
| c. | Spontaneous telephone dialogues (recorded via a switchboard) | 1,208,633 | 465,096 | 743,537 |
| d. | Spontaneous telephone dialogues (recorded on MD via a local interface) | 853,371 | 343,167 | 510,204 |
| e. | Simulated business negotiations | 136,461 | 0 | 136,461 |
| f. | Interviews/discussions/debates (broadcast) | 790,269 | 250,708 | 539,561 |
| g. | (political) Discussions/debates/meetings (non-broadcast) | 360,328 | 138,819 | 221,509 |
| h. | Lessons recorded in the classroom | 405,409 | 105,436 | 299,973 |
| i. | Live (e.g. sports) commentaries (broadcast) | 208,399 | 78,022 | 130,377 |
| j. | News reports/reportages (broadcast) | 186,072 | 95,206 | 90,866 |
| k. | News (broadcast) | 368,153 | 82,855 | 285,298 |
| l. | Commentaries/columns/reviews (broadcast) | 145,553 | 65,386 | 80,167 |
| m. | Ceremonious speeches/sermons | 18,075 | 12,510 | 5,565 |
| n. | Lectures/seminars | 140,901 | 79,067 | 61,834 |
| o. | Read speech | 903,043 | 351,419 | 551,624 |
| | Total | 8,916,272 | 3,261,628 | 5,654,644 |
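
The figures in Table 1 can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the corpus tooling: the names `COUNTS`, `REPORTED_TOTAL`, `inconsistent_rows` and `column_sums` are hypothetical, and it assumes that the two regional columns are VL and NL, in that order. It checks that each row's regional counts add up to the row total and that the columns add up to the reported totals.

```python
# Word counts copied from Table 1 as (total, VL, NL).
# Assumption: the first regional column is VL, the second is NL.
COUNTS = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}

# The "Total" row as reported in Table 1.
REPORTED_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def inconsistent_rows(counts):
    """Return the component letters whose VL + NL differs from the row total."""
    return [key for key, (total, vl, nl) in counts.items() if vl + nl != total]


def column_sums(counts):
    """Sum each column (total, VL, NL) over all components."""
    return tuple(sum(column) for column in zip(*counts.values()))


if __name__ == "__main__":
    print("Rows that do not add up:", inconsistent_rows(COUNTS))
    print("Column sums match reported totals:", column_sums(COUNTS) == REPORTED_TOTAL)
    vl_share = REPORTED_TOTAL[1] / REPORTED_TOTAL[0]
    print(f"Share of VL words: {vl_share:.1%}")
```

Keeping the counts as plain integers, rather than formatted strings, avoids any ambiguity between `.` and `,` as thousands separators.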