Syntax

Part of the corpus has been annotated syntactically. The annotation was carried out with the Annotate program, which was developed at the University of Saarbrücken.

Below we discuss the syntactic annotation of the Spoken Dutch Corpus and the aim and motivation for this type of annotation. We also describe the protocol that was developed and the procedure that was adopted. Information is also included about the file types and formats that are available. Finally, an overview is presented of the data that are available in version 1.0 of the corpus.

Read more about

aim and motivation
procedure
protocol
file types and formats
overview of available data

Aim and motivation

The syntactic annotation of the data was based on the following ideas (cf. Hoekstra et al. 2003: 4):

Input: On the input side we wanted the annotation schemes to be as simple as possible, so as to keep the workload of annotating and correcting the data to a minimum.
Output: On the output side we wanted to offer as rich an annotation as possible, in a format that could take on different forms for various user groups.

To achieve these goals we opted for a dependency analysis that is to a large extent theory-neutral. The primary annotation can be enriched with POS information and, through the lexicon link-up, with information from the CGN lexicon. Combining the information from these three resources makes it possible to produce output that meets the needs of various user groups.

Procedure

To facilitate and speed up the annotation process, use was made of the Annotate software developed at the University of Saarbrücken. The Flemish data were annotated in Leuven (CCL); the data from the Netherlands were annotated by OTS, Utrecht. The data were annotated in multiple passes: after a first annotation had been produced, the data went through a number of correction cycles. Consistency checks were also carried out.

Protocol

For the syntactic annotation of the corpus a protocol was developed:

Hoekstra, H., M. Moortgat, B. Renmans, M. Schouppe, I. Schuurman & T. van der Wouden. 2003. CGN Syntactische annotatie. (Available in .pdf format.)

File types and formats

The syntactic annotations have been stored in the following files:

Files of type .syn. These are ASCII files, found in the directory /data/annot/text/syn/ of the annotation DVD.
Files of type .tig. These are XML files, found in the directory /data/annot/xml/tig/ of the annotation DVD.
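
As an illustration of how the two file types might be located, the following is a minimal Python sketch. The two directory paths are taken from the list above; the mount point /mnt/cgn-annot, the function name find_annotation_files and the recursive search are assumptions made for the example, since the layout below these directories is not described here.

```python
from pathlib import Path

# Directories as given above, relative to the root of the annotation DVD.
ANNOTATION_DIRS = {
    ".syn": Path("data/annot/text/syn"),  # ASCII files
    ".tig": Path("data/annot/xml/tig"),   # XML files
}


def find_annotation_files(dvd_root, suffix):
    """Return a sorted list of files ending in `suffix` below the matching directory."""
    base = Path(dvd_root) / ANNOTATION_DIRS[suffix]
    # Recursive search: an assumption, the subdirectory layout is not documented here.
    return sorted(p for p in base.rglob(f"*{suffix}") if p.is_file())


if __name__ == "__main__":
    root = "/mnt/cgn-annot"  # hypothetical mount point of the annotation DVD
    for suffix in ANNOTATION_DIRS:
        print(suffix, len(find_annotation_files(root, suffix)))
```

The function only matches file names that end in the given extension; files stored under any other name would need a different pattern.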

Separate descriptions are available for each of the formats mentioned above.

Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.

Table 1. Overview of the data for which a syntactic annotation is available
(VL = data originating from Flanders; NL = data originating from The Netherlands)

Component	Total number of words	VL	NL
a. Spontaneous conversations ('face-to-face')	447,113	146,745	300,368
b. Interviews with teachers of Dutch	59,751	34,064	25,687
c. Spontaneous telephone dialogues (recorded via a switchboard)	89,819	19,886	69,933
d. Spontaneous telephone dialogues (recorded on MD via a local interface)	6,257	6,257	0
e. Simulated business negotiations	25,485	0	25,485
f. Interviews/discussions/debates (broadcast)	100,250	25,144	75,106
g. (Political) discussions/debates/meetings (non-broadcast)	34,126	9,009	25,117
h. Lessons recorded in the classroom	36,064	10,103	25,961
i. Live (e.g. sports) commentaries (broadcast)	35,116	10,130	24,986
j. News reports/reportages (broadcast)	32,744	7,679	25,065
k. News (broadcast)	32,689	7,305	25,384
l. Commentaries/columns/reviews (broadcast)	32,502	7,431	25,071
m. Ceremonious speeches/sermons	7,077	1,893	5,184
n. Lectures/seminars	23,056	8,143	14,913
o. Read speech	44,144	44,144	0
Total	1,006,193	337,933	668,260
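
The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below is only an illustration: the numbers are copied from the table, while the names TABLE_1, REPORTED_TOTAL and check_table are introduced for the example. It tests whether VL and NL add up to the total in each row, and whether the columns add up to the reported totals.

```python
# (total, VL, NL) word counts per component, copied from Table 1.
TABLE_1 = {
    "a": (447_113, 146_745, 300_368),
    "b": (59_751, 34_064, 25_687),
    "c": (89_819, 19_886, 69_933),
    "d": (6_257, 6_257, 0),
    "e": (25_485, 0, 25_485),
    "f": (100_250, 25_144, 75_106),
    "g": (34_126, 9_009, 25_117),
    "h": (36_064, 10_103, 25_961),
    "i": (35_116, 10_130, 24_986),
    "j": (32_744, 7_679, 25_065),
    "k": (32_689, 7_305, 25_384),
    "l": (32_502, 7_431, 25_071),
    "m": (7_077, 1_893, 5_184),
    "n": (23_056, 8_143, 14_913),
    "o": (44_144, 44_144, 0),
}
REPORTED_TOTAL = (1_006_193, 337_933, 668_260)


def check_table(rows, reported_total):
    """Return (components where VL + NL != total, whether column sums match)."""
    bad_rows = [key for key, (total, vl, nl) in rows.items() if vl + nl != total]
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    return bad_rows, column_sums == reported_total


if __name__ == "__main__":
    print(check_table(TABLE_1, REPORTED_TOTAL))
```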
