Part of the corpus was annotated syntactically, using the Annotate program developed at the University of Saarbrücken.
Below we discuss the syntactic annotation
of the Spoken Dutch Corpus and the aim and motivation for this type of
annotation. We also describe the protocol that was developed and the procedure
that was adopted. Information is also included about the file types and
formats that are available. Finally, an overview is presented of the data
that are available in version 1.0 of the corpus.
The syntactic annotation of the data
was based on the following ideas (cf. Hoekstra et al. 2003: 4):
Input: On the input side we wanted the
annotation schemes to be as simple as possible, so as to keep the workload
of annotating and correcting the data to a minimum.
Output: On the output side we wanted
to offer as rich an annotation as possible, in a format that could take
on different forms for various user groups.
To achieve this goal we aimed for a dependency analysis that was to a large extent theory-neutral. The primary annotation can be enriched with POS information and, through the lexicon link-up, with information from the CGN lexicon. Combining the information in these three resources makes it possible to produce output that meets the needs of various user groups.
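The idea of enriching the primary annotation with the two other resources can be sketched as follows. This is only an illustration under stated assumptions: the `Token` class, the `enrich` function, the field names, the simplified tags and the two-word example are all hypothetical and do not reflect the actual CGN file formats or tag sets.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One word of the primary (dependency) annotation. Hypothetical layout."""
    word: str
    head: int       # index of the governing token, -1 for the root
    relation: str   # dependency label (illustrative value only)


# Hypothetical stand-ins for the three resources.
primary = [Token("hond", 1, "su"), Token("blaft", -1, "hd")]
pos_tags = {0: "N", 1: "WW"}                      # POS tag per token index
lexicon = {"hond": {"lemma": "hond"},             # lexicon entry per word form
           "blaft": {"lemma": "blaffen"}}


def enrich(tokens, pos_tags, lexicon):
    """Merge the primary annotation with POS and lexicon information."""
    rows = []
    for i, tok in enumerate(tokens):
        entry = lexicon.get(tok.word, {})
        rows.append({
            "word": tok.word,
            "head": tok.head,
            "relation": tok.relation,
            "pos": pos_tags.get(i),        # None if no POS tag is known
            "lemma": entry.get("lemma"),   # None if the word is not in the lexicon
        })
    return rows


enriched = enrich(primary, pos_tags, lexicon)
for row in enriched:
    print(row)
```

Keeping the primary annotation small and joining the other resources only when output is produced mirrors the stated aim: a simple input scheme, with richer output derived on demand for different user groups.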
To facilitate and speed up the annotation process, the Annotate software
developed at the University of Saarbrücken was used. The Flemish data
were annotated in Leuven (CCL); the data from the Netherlands were annotated
by OTS, Utrecht. The data were annotated in multiple passes: after a first
annotation was produced, the data went through a number of correction
cycles. Checks were also made to ensure consistency.
For the syntactic annotation of the
corpus a protocol was developed:
Hoekstra, H., M. Moortgat, B. Renmans, M. Schouppe, I. Schuurman & T. van der Wouden. 2003. CGN Syntactische annotatie. (Available in .pdf format.)
The syntactic annotations have been stored in the following files:
In Table 1 an overview is given of
the data that are available in version 1.0 of the corpus. For a description
of the design of the corpus and its motivation, we refer you to the description
of the corpus design.
Table 1. Overview of the data
for which a syntactic annotation is available
(VL = data originating from Flanders;
NL = data originating from The Netherlands)
| Component | Total number of words | VL | NL |
|---|---|---|---|
| a. Spontaneous conversations ('face-to-face') | 447,113 | 146,745 | 300,368 |
| b. Interviews with teachers of Dutch | 59,751 | 34,064 | 25,687 |
| c. Spontaneous telephone dialogues (recorded via a switchboard) | 89,819 | 19,886 | 69,933 |
| d. Spontaneous telephone dialogues (recorded on MD via a local interface) | 6,257 | 6,257 | 0 |
| e. Simulated business negotiations | 25,485 | 0 | 25,485 |
| f. Interviews/discussions/debates (broadcast) | 100,250 | 25,144 | 75,106 |
| g. (political) Discussions/debates/meetings (non-broadcast) | 34,126 | 9,009 | 25,117 |
| h. Lessons recorded in the classroom | 36,064 | 10,103 | 25,961 |
| i. Live (e.g. sports) commentaries (broadcast) | 35,116 | 10,130 | 24,986 |
| j. News reports/reportages (broadcast) | 32,744 | 7,679 | 25,065 |
| k. News (broadcast) | 32,689 | 7,305 | 25,384 |
| l. Commentaries/columns/reviews (broadcast) | 32,502 | 7,431 | 25,071 |
| m. Ceremonious speeches/sermons | 7,077 | 1,893 | 5,184 |
| n. Lectures/seminars | 23,056 | 8,143 | 14,913 |
| o. Read speech | 44,144 | 44,144 | 0 |
| Total | 1,006,193 | 337,933 | 668,260 |
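
The figures in Table 1 can be checked for internal consistency with a few lines of code. The sketch below copies the word counts from the table and assumes that the two sub-columns follow the order given in the caption (VL first, then NL); the variable and function names are illustrative only.

```python
# (component, total, VL, NL) word counts copied from Table 1.
# Assumption: the two sub-columns are VL and NL, in the caption's order.
rows = [
    ("a", 447113, 146745, 300368),
    ("b", 59751, 34064, 25687),
    ("c", 89819, 19886, 69933),
    ("d", 6257, 6257, 0),
    ("e", 25485, 0, 25485),
    ("f", 100250, 25144, 75106),
    ("g", 34126, 9009, 25117),
    ("h", 36064, 10103, 25961),
    ("i", 35116, 10130, 24986),
    ("j", 32744, 7679, 25065),
    ("k", 32689, 7305, 25384),
    ("l", 32502, 7431, 25071),
    ("m", 7077, 1893, 5184),
    ("n", 23056, 8143, 14913),
    ("o", 44144, 44144, 0),
]
reported_totals = (1006193, 337933, 668260)  # the "Total" row of Table 1


def check_table(rows, reported):
    """Return components whose VL + NL differs from the row total,
    and whether the column sums match the reported totals."""
    mismatches = [comp for comp, total, vl, nl in rows if vl + nl != total]
    column_sums = tuple(sum(row[i] for row in rows) for i in (1, 2, 3))
    return mismatches, column_sums == reported


mismatches, totals_ok = check_table(rows, reported_totals)
print("rows with inconsistent totals:", mismatches)
print("column sums match the Total row:", totals_ok)
```

Every row satisfies VL + NL = total, and the column sums agree with the Total row, so roughly one third of the syntactically annotated words come from Flanders and two thirds from the Netherlands.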