Design and motivation

The project aimed to design a corpus that would constitute a plausible sample of contemporary standard Dutch as spoken in Flanders and the Netherlands. One third of the data were to be collected in Flanders and two thirds in the Netherlands. The entire corpus was to be transcribed orthographically, lemmatized and enriched with part-of-speech information. Users were to be able to access the speech recordings through pointers in the transcriptions. For a selection of one million words, an auditorily verified broad phonetic transcription was envisaged, and for this part of the corpus the automatic time alignment was to be checked manually at the word level. For most of the recordings that were not checked by hand, the pointers were expected to be accurate to within 100 ms. A syntactic annotation was likewise envisaged for one million words, and 250,000 words were to receive a prosodic annotation.
 

Original overall design (autumn 1998)

The design of the corpus was guided by a number of considerations. First of all, the corpus was to serve many and diverse interests. Different user groups have different requirements when it comes to the quality and quantity of the data, the number and type of speakers, and so on. Second, the total budget for the entire project was fixed at 4.6 million euros, which had to cover all costs involved in recording and collecting the data, transcribing and annotating them, and so on. Finally, the issue of copyright complicated matters. Since the corpus was to be distributed together with the speech files, consent was required from all speakers as well as from any parties that held rights to the recorded material.

The design of the corpus took into account the various dimensions underlying the variation that can be observed in language use. In the overall design of the corpus the principal parameter was taken to be the socio-situational setting in which language is used. This led us to distinguish a number of components, each of which could be characterized in terms of its situational characteristics such as communicative goal, medium, number of speakers participating, and the relationship between speaker(s) and hearer(s).
The specification of each of the components was given in terms of sample sizes, total number of speakers, range of topics, etc. Where this was considered to be of particular interest, speaker characteristics such as gender, age, geographical region, and socio-economic class were used as (demographic) sampling criteria; otherwise they were merely recorded as part of the meta-data. The overall design of the corpus is given in Table 1.

Table 1. Original overall design of the corpus (autumn 1998), in number of words

Component                                                   Total         Flanders  The Netherlands
dialogue / multilogue                                   8,110,000
  private                                               6,635,000
    unscripted                                          6,635,000
      direct                                            3,460,000
        conversations ('face-to-face')                  3,000,000        1,000,000        2,000,000
        interviews                                        460,000          230,000          230,000
      distanced                                         3,175,000
        telephone conversations                         3,000,000        1,000,000        2,000,000
        business transactions                             175,000                0          175,000
  public                                                1,475,000
    broadcast                                             750,000
      more or less scripted                               750,000
        interviews and discussions                        750,000          230,000          520,000
    non-broadcast                                         725,000
      unscripted                                          725,000
        discuss., debates, meetings                       375,000          130,000          245,000
        lectures                                          350,000          110,000          240,000
monologue                                               1,890,000
  private                                                  40,000
    more or less scripted                                  40,000
      descriptions of pictures                             40,000           40,000                0
  public                                                1,850,000
    broadcast                                             950,000
      unscripted                                          250,000
        spontaneous commentary                            250,000           70,000          180,000
      more or less scripted                               700,000
        newsreports, current affairs programmes           250,000           80,000          170,000
        news                                              250,000           80,000          170,000
        commentary                                        200,000           60,000          140,000
    non-broadcast                                         900,000
      more or less scripted                               900,000
        lectures, speeches                                275,000           95,000          180,000
        read aloud text                                   625,000          210,000          415,000
                                                       (+375,000)       (+125,000)       (+250,000)

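The figures in Table 1 can be cross-checked against the intended split of one third Flanders and two thirds the Netherlands. The following is only a sketch of such a check: the names COMPONENTS, regional_totals and inconsistent_rows are illustrative and not part of the corpus distribution, and the bracketed "(+...)" extras for read aloud text are deliberately left out.

```python
# Leaf components of Table 1 as (label, total, Flanders, The Netherlands).
# Assumption: the bracketed "(+...)" extras for read aloud text are excluded.
COMPONENTS = [
    ("conversations ('face-to-face')", 3_000_000, 1_000_000, 2_000_000),
    ("interviews", 460_000, 230_000, 230_000),
    ("telephone conversations", 3_000_000, 1_000_000, 2_000_000),
    ("business transactions", 175_000, 0, 175_000),
    ("interviews and discussions", 750_000, 230_000, 520_000),
    ("discuss., debates, meetings", 375_000, 130_000, 245_000),
    ("lectures", 350_000, 110_000, 240_000),
    ("descriptions of pictures", 40_000, 40_000, 0),
    ("spontaneous commentary", 250_000, 70_000, 180_000),
    ("newsreports, current affairs programmes", 250_000, 80_000, 170_000),
    ("news", 250_000, 80_000, 170_000),
    ("commentary", 200_000, 60_000, 140_000),
    ("lectures, speeches", 275_000, 95_000, 180_000),
    ("read aloud text", 625_000, 210_000, 415_000),
]


def regional_totals(components):
    """Return (total, Flanders, Netherlands) word counts summed over all rows."""
    total = sum(row[1] for row in components)
    flanders = sum(row[2] for row in components)
    netherlands = sum(row[3] for row in components)
    return total, flanders, netherlands


def inconsistent_rows(components):
    """Return the labels whose regional figures do not add up to the row total."""
    return [label for label, total, fl, nl in components if fl + nl != total]


if __name__ == "__main__":
    total, flanders, netherlands = regional_totals(COMPONENTS)
    print(f"total words:     {total:,}")
    print(f"Flanders:        {flanders:,}")
    print(f"The Netherlands: {netherlands:,}")
    print("rows that do not add up:", inconsistent_rows(COMPONENTS))
```

With the figures as listed, the fourteen components add up to 10,000,000 words, of which 3,335,000 are planned for Flanders and 6,665,000 for the Netherlands, which is close to the intended one third and two thirds.
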
In all, 14 different components were distinguished. The total number of words varied from component to component. Since a full specification was not yet available for all components, the total number of words per component remained somewhat arbitrary at this point. For the time being, however, we assumed that no adaptations would be necessary. Considerations that played a role in determining the present sizes of the components were the following:

Actual realisation (version 1.0)

While the project was ongoing, the design and considerations described above were taken as guidelines. However, as the project progressed, the collection of part of the data fell behind schedule. Therefore, halfway through the project, it was decided to adapt the design somewhat. Certain components that had not yet (fully) been realised were reduced or cancelled. Then, towards the end of the project, when the structure of the final release was being considered, it was found that a restructuring of the corpus would be in the interest of the user. The structure of the corpus as it is distributed in the present version is represented in Table 2.

Table 2. Components distinguished in the Spoken Dutch Corpus (version 1.0)
 
Components:
a. Spontaneous conversations ('face-to-face')
b. Interviews with teachers of Dutch
c. Spontaneous telephone dialogues (recorded via a switchboard)
d. Spontaneous telephone dialogues (recorded on MD via a local interface)
e. Simulated business negotiations
f. Interviews/discussions/debates (broadcast)
g. (political) Discussions/debates/meetings (non-broadcast)
h. Lessons recorded in the classroom
i. Live (e.g. sports) commentaries (broadcast)
j. Newsreports/reportages (broadcast)
k. News (broadcast)
l. Commentaries/columns/reviews (broadcast)
m. Ceremonious speeches/sermons
n. Lectures/seminars
o. Read speech

This is not the place to discuss in detail the sampling procedure that was employed with each component. Here we restrict ourselves to giving a short overview of the different sampling criteria and the (possible) ways in which they have been applied. Please note that not all sampling criteria apply to all components.

Sample size

Throughout the corpus, a sample is a stretch of connected discourse. The sizes of the different samples differ. In a number of instances, e.g. for the samples making up component o (read speech), a minimum size was specified so as to meet the requirements of users from a particular field. On the whole, natural break-off points such as changes of turn, changes of item (in a news broadcast), etc. have been used to delimit the samples.

Number of speakers per component

In principle the number of speakers may vary. For a number of components, viz. the spontaneous conversations (component a), the interviews (component b), the telephone dialogues (components c and d) and the read aloud text (component o), the number of speakers was specified beforehand.

Speaker characteristics

Speaker characteristics that have played a role as sampling criteria are sex, age, geographical region, socio-economic class and level of education.

Quality of the recording

The quality of the recordings varies. High quality was of course aimed for, but recording conditions varied considerably, so that the quality is not equally high in all cases. For an overview of the data that are available and their distribution over the various components, we refer to the overview of available data.
 

Selections for which more advanced annotations were envisaged (autumn 1998)

Once the overall design of the corpus had been established, it remained to be decided which part(s) of the corpus should be included in the selection of one million words (or 250,000 words in the case of prosodic annotation) for which more advanced annotations were envisaged. Preferably, the selection should in some way reflect the composition of the full corpus. While it would have been straightforward to simply select 10 per cent of each component, two strong arguments were raised against this procedure. First, some user groups required certain minimum amounts of data with specific higher-level (or more advanced) annotations that exceeded the 10 per cent norm. Second, not all types of data could be annotated with the same rate of success and/or at the same expense. Therefore, in the light of the quality standards that were upheld and the time and money available, certain types of data were given priority over other types. The selections that were decided upon for each type of advanced annotation are displayed in Table 3.

Table 3 gives an overview of the selections of parts of the corpus for which more advanced annotations were envisaged. The fourteen components that were distinguished here were the same as the ones referred to in the overall design. For each component it was indicated which part would be enriched with which types of annotations. Note that in the table only the size of each component is indicated (in number of words). The specific design of each component and the selection of samples depended on the quality of the speech signal, the distribution over various situational contexts, speakers, topics, etc.

Table 3. Selection of data for which more advanced transcriptions and annotations were envisaged (autumn 1998)
Amount of data per type of annotation, in number of words

Component                                       Total no. of words   phon. transcr.    syntactic     prosodic
                                                     in the corpus      + alignment   annotation   annotation
1. conversations ('face-to-face')                        3,000,000          150,000      550,000      100,000
2. interviews                                              460,000           50,000       50,000       20,000
3. telephone conversations                               3,000,000          300,000      100,000       50,000
4. business transactions                                   175,000           15,000       15,000       10,000
5. interviews and discussions                              750,000           75,000       75,000       10,000
6. discuss., debates, meetings                             375,000           35,000       35,000       10,000
7. lectures                                                350,000           35,000       35,000            0
8. descriptions of pictures                                 40,000            5,000        5,000            0
9. spontaneous commentary                                  250,000           27,500       27,500       10,000
10. newsreports, current affairs programmes                250,000           25,000       25,000       10,000
11. news                                                   250,000           27,500       27,500       10,000
12. commentary                                             200,000           25,000       25,000       10,000
13. lectures, speeches                                     275,000           30,000       30,000       10,000
14. read aloud text                                        625,000          200,000            0            0
                                                       (+ 375,000)
Total                                                   10,000,000        1,000,000    1,000,000      250,000
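
The departure from a flat 10 per cent selection can be made explicit by comparing the planned amounts with what a uniform 10 per cent of each component would have given. The sketch below is illustrative only: the names SELECTIONS, PHONETIC, SYNTACTIC and deviation_from_flat_rate are hypothetical, the figures are those listed for Table 3, and the bracketed extra for read aloud text is left out.

```python
# Table 3 rows as (label, corpus words, phonetic, syntactic, prosodic).
# Assumption: the bracketed "(+ 375,000)" for read aloud text is excluded.
SELECTIONS = [
    ("conversations ('face-to-face')", 3_000_000, 150_000, 550_000, 100_000),
    ("interviews", 460_000, 50_000, 50_000, 20_000),
    ("telephone conversations", 3_000_000, 300_000, 100_000, 50_000),
    ("business transactions", 175_000, 15_000, 15_000, 10_000),
    ("interviews and discussions", 750_000, 75_000, 75_000, 10_000),
    ("discuss., debates, meetings", 375_000, 35_000, 35_000, 10_000),
    ("lectures", 350_000, 35_000, 35_000, 0),
    ("descriptions of pictures", 40_000, 5_000, 5_000, 0),
    ("spontaneous commentary", 250_000, 27_500, 27_500, 10_000),
    ("newsreports, current affairs programmes", 250_000, 25_000, 25_000, 10_000),
    ("news", 250_000, 27_500, 27_500, 10_000),
    ("commentary", 200_000, 25_000, 25_000, 10_000),
    ("lectures, speeches", 275_000, 30_000, 30_000, 10_000),
    ("read aloud text", 625_000, 200_000, 0, 0),
]

# Positions of the one-million-word annotation columns within each row.
PHONETIC = 2
SYNTACTIC = 3


def deviation_from_flat_rate(rows, column, percent=10):
    """Planned words minus a flat `percent` share of each component's size.

    Positive values mean the component was given more than the flat share,
    negative values mean it was given less.
    """
    return {row[0]: row[column] - row[1] * percent // 100 for row in rows}


if __name__ == "__main__":
    for name, column in (("phonetic", PHONETIC), ("syntactic", SYNTACTIC)):
        print(name)
        for label, diff in deviation_from_flat_rate(SELECTIONS, column).items():
            print(f"  {label:<42}{diff:>+12,}")
```

Because both selections total one million words, which is exactly 10 per cent of the ten million words planned, the deviations in each column cancel out. Read this way, the face-to-face conversations receive less phonetic transcription but more syntactic annotation than a flat share would give, while the read aloud text shows the opposite pattern.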

Actual realisation (version 1.0)

Within the project the targets that were set for the core corpus (selections of data for which additional transcription/annotations were provided) were met. For an overview, we refer to the overview of data with additional transcriptions and annotations.