Design and motivation

The project aimed to design a corpus that would constitute a plausible sample of contemporary standard Dutch as spoken in Flanders and the Netherlands. One third of the data were to be collected in Flanders; two thirds were to originate from the Netherlands. The entire corpus was to be transcribed orthographically, lemmatized and enriched with part-of-speech information. Users should be able to access the speech recordings through pointers in the transcriptions. For a selection of one million words, an auditorily verified, broad phonetic transcription was envisaged, and for this part of the corpus the automatic time alignment would be checked manually at the level of the word. For most of the recordings that were not checked by hand, the pointers were expected to be accurate to within 100 ms. A syntactic annotation was also envisaged for one million words, and 250,000 words were to receive a prosodic annotation.

Original overall design (autumn 1998)

The design of the corpus was guided by a number of considerations. First of all, the corpus was to serve many and diverse interests. Different user groups have different requirements when it comes to the quality and quantity of the data, the number and type of speakers, and so on. Second, the total budget available for the entire project was fixed at 4.6 million euros, which had to cover all costs involved in recording and collecting data, transcribing and annotating these data, etc. And finally, the issue of copyright complicated matters. Since the corpus was to be distributed including the speech files, the consent of all speakers was required, as well as that of any parties that held rights to the recorded material.

The design of the corpus took into account the various dimensions underlying the variation that can be observed in language use. In the overall design of the corpus the principal parameter was taken to be the socio-situational setting in which language is used. This led us to distinguish a number of components, each of which could be characterized in terms of its situational characteristics such as communicative goal, medium, number of speakers participating, and the relationship between speaker(s) and hearer(s).
The specification of each of the components was given in terms of sample sizes, total number of speakers, range of topics, etc. Where this was considered to be of particular interest, speaker characteristics such as gender, age, geographical region, and socio-economic class were used as (demographic) sampling criteria; otherwise they were merely recorded as part of the meta-data. The overall design of the corpus is given in Table 1.

Table 1. Original overall design of the corpus (autumn 1998)

Component	Total	Flanders	The Netherlands
dialogue / multilogue	8,110,000
  private	6,635,000
    unscripted	6,635,000
      direct	3,460,000
        conversations ('face-to-face')	3,000,000	1,000,000	2,000,000
        interviews	460,000	230,000	230,000
      distanced	3,175,000
        telephone conversations	3,000,000	1,000,000	2,000,000
        business transactions	175,000	0	175,000
  public	1,475,000
    broadcast	750,000
      more or less scripted	750,000
        interviews and discussions	750,000	230,000	520,000
    non-broadcast	725,000
      unscripted	725,000
        discuss., debates, meetings	375,000	130,000	245,000
        lectures	350,000	110,000	240,000
monologue	1,890,000
  private	40,000
    more or less scripted	40,000
      descriptions of pictures	40,000	40,000	0
  public	1,850,000
    broadcast	950,000
      unscripted	250,000
        spontaneous commentary	250,000	70,000	180,000
      more or less scripted	700,000
        newsreports, current affairs programmes	250,000	80,000	170,000
        news	250,000	80,000	170,000
        commentary	200,000	60,000	140,000
    non-broadcast	900,000
      more or less scripted	900,000
        lectures, speeches	275,000	95,000	180,000
        read aloud text	625,000 (+375,000)	210,000 (+125,000)	415,000 (+250,000)
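The figures in Table 1 can be cross-checked against the intended split of one third Flanders and two thirds the Netherlands. The following is a minimal sketch in Python, not part of the original project documentation; the names `TABLE_1` and `regional_totals` are introduced here for illustration only. The amounts given in parentheses for read aloud text are left out, since the totals stated in the table add up without them.

```python
# Leaf components of Table 1 as (name, Flanders, The Netherlands), in words.
# Illustrative transcription of the table; the parenthesised extra amounts
# for "read aloud text" are deliberately excluded.
TABLE_1 = [
    ("conversations ('face-to-face')", 1_000_000, 2_000_000),
    ("interviews", 230_000, 230_000),
    ("telephone conversations", 1_000_000, 2_000_000),
    ("business transactions", 0, 175_000),
    ("interviews and discussions", 230_000, 520_000),
    ("discuss., debates, meetings", 130_000, 245_000),
    ("lectures", 110_000, 240_000),
    ("descriptions of pictures", 40_000, 0),
    ("spontaneous commentary", 70_000, 180_000),
    ("newsreports, current affairs programmes", 80_000, 170_000),
    ("news", 80_000, 170_000),
    ("commentary", 60_000, 140_000),
    ("lectures, speeches", 95_000, 180_000),
    ("read aloud text", 210_000, 415_000),
]


def regional_totals(rows):
    """Sum the planned number of words per region over all components."""
    flanders = sum(fl for _, fl, _ in rows)
    netherlands = sum(nl for _, _, nl in rows)
    return flanders, netherlands


flanders, netherlands = regional_totals(TABLE_1)
total = flanders + netherlands

print(f"Flanders:        {flanders:>10,} words")
print(f"The Netherlands: {netherlands:>10,} words")
print(f"Total:           {total:>10,} words")
print(f"Share collected in Flanders: {flanders / total:.3f}")
```

The regional sums come to 3,335,000 and 6,665,000 words, which together give the planned 10,000,000 words and match the intended proportions.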

In all, 14 different components were distinguished. The total number of words varied from component to component. Since a full specification was not yet available for all components, the total number of words per component remained somewhat arbitrary at this point. For the time being, however, we assumed that no adaptations would be necessary. Considerations that played a role in determining the present sizes of the components were the following:

there was a great demand for spontaneously spoken language data; this explained the overall bias towards unscripted language;
interaction was considered to be a typical characteristic of spoken communication; therefore it was felt that dialogues and multilogues should be amply represented in the data;
certain language varieties display a great deal more variation than others; in order to capture this variation, more heterogeneous components generally were represented in the corpus by a larger number of samples than the more homogeneous ones;
the sample size differed from component to component; while it was impossible to know what the optimum sample size was, intuitive judgements were brought into play when it came to deciding what constituted an appropriate sample. Here the 'natural' length of a spoken text also played a role: an item in a radio news broadcast is by definition shorter than the spoken commentary in a television documentary;
some types of data were easier to collect than other types of data;
in order to meet the needs of particular user groups some components required a certain minimum amount of data; this was especially true for components such as telephone conversations and read aloud text, which were envisaged to be used for the development of technological applications.

Actual realisation (version 1.0)

While the project was ongoing, the design and considerations described above were taken as guidelines. However, as the project progressed, the collection of part of the data fell behind schedule. Therefore, halfway through the project, it was decided to adapt the design somewhat. Certain components that had not yet (fully) been realised were reduced or cancelled. Then, as the project came to an end and the structure of the final release was being considered, it was found that a restructuring of the corpus would be in the interest of the user. The structure of the corpus as it is distributed in the present version is represented in Table 2.

Table 2. Components distinguished in the Spoken Dutch Corpus (version 1.0)

Components:

a.	Spontaneous conversations ('face-to-face')
b.	Interviews with teachers of Dutch
c.	Spontaneous telephone dialogues (recorded via a switchboard)
d.	Spontaneous telephone dialogues (recorded on MD via a local interface)
e.	Simulated business negotiations
f.	Interviews/discussions/debates (broadcast)
g.	(political) Discussions/debates/meetings (non-broadcast)
h.	Lessons recorded in the classroom
i.	Live (e.g. sports) commentaries (broadcast)
j.	Newsreports/reportages (broadcast)
k.	News (broadcast)
l.	Commentaries/columns/reviews (broadcast)
m.	Ceremonious speeches/sermons
n.	Lectures/seminars
o.	Read speech

This is not the place to discuss in detail the sampling procedure that was employed with each component. Here we restrict ourselves to giving a short overview of the different sampling criteria and the (possible) ways in which they have been applied. Please note that not all sampling criteria apply to all components.

Sample size

Throughout the corpus, a sample is a stretch of connected discourse. The sizes of the different samples differ. In a number of instances, e.g. for the samples making up component o (read speech), a minimum size was specified so as to meet the requirements specified by users from a particular field. On the whole, natural break-off points such as changes of turn, changes of item (in a news broadcast), etc. have been used to delimit the samples.

Number of speakers per component

In principle the number of speakers may vary. For a number of components, viz. the spontaneous conversations (component a), the interviews (component b), the telephone dialogues (components c and d) and the read aloud text (component o), the number of speakers was specified beforehand.

Speaker characteristics

Speaker characteristics that have played a role as sampling criteria are sex, age, geographical region, socio-economic class and level of education.

Quality of the recording

The quality of the recordings varies. High quality was aimed for throughout, but recording conditions varied considerably, so the quality is not equally high in all cases. For an overview of the data that are available and their distribution over the various components, we refer to the overview of available data.

Selections for which more advanced annotations were envisaged (autumn 1998)

Once the overall design of the corpus had been established, it remained to be decided which part(s) of the corpus should be included in the selection of one million words (or 250,000 words in the case of prosodic annotation) for which more advanced annotations were envisaged. Preferably, the selection should in some way reflect the composition of the full corpus. While it would have been straightforward to simply select 10 per cent of each component, two powerful arguments were raised against this procedure. First, some user groups required certain minimum amounts of data with specific higher level (or more advanced) annotations that exceeded the 10 per cent norm. Second, not all types of data could be annotated with the same rate of success and/or at the same expense. Therefore, in the light of the quality standards that were upheld and the time and money available, certain types of data were given priority over other types. The selections that were decided upon for each type of advanced annotation are displayed in Table 3.

Table 3 gives an overview of the selections of parts of the corpus for which more advanced annotations were envisaged. The fourteen components that were distinguished here were the same as the ones referred to in the overall design. For each component it was indicated which part would be enriched with which types of annotations. Note that in the table only the size of each component is indicated (in number of words). The specific design of each component and the selection of samples depended on the quality of the speech signal, the distribution over various situational contexts, speakers, topics, etc.

Table 3. Selection of data for which more advanced transcriptions and annotations were envisaged (autumn 1998)

Amount of data and types of annotation are given in number of words.

Component	Total number of words in the corpus	Phon. transcr. + alignment	Syntactic annotation	Prosodic annotation
1. conversations ('face-to-face')	3,000,000	150,000	550,000	100,000
2. interviews	460,000	50,000	50,000	20,000
3. telephone conversations	3,000,000	300,000	100,000	50,000
4. business transactions	175,000	15,000	15,000	10,000
5. interviews and discussions	750,000	75,000	75,000	10,000
6. discuss., debates, meetings	375,000	35,000	35,000	10,000
7. lectures	350,000	35,000	35,000	0
8. descriptions of pictures	40,000	5,000	5,000	0
9. spontaneous commentary	250,000	27,500	27,500	10,000
10. newsreports, current affairs programmes	250,000	25,000	25,000	10,000
11. news	250,000	27,500	27,500	10,000
12. commentary	200,000	25,000	25,000	10,000
13. lectures, speeches	275,000	30,000	30,000	10,000
14. read aloud text	625,000 (+ 375,000)	200,000	0	0
Total	10,000,000	1,000,000	1,000,000	250,000
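To see how far the envisaged selections depart from a flat 10 per cent of each component, the table can be turned into per-component shares. The following is a minimal sketch in Python, not part of the original project documentation; `TABLE_3`, `column_totals` and `shares` are names introduced here for illustration. The extra read aloud material given in parentheses is left out of the component total.

```python
# Rows of Table 3 as (component, total words, phonetic transcription +
# alignment, syntactic annotation, prosodic annotation), all in words.
# Illustrative transcription; "(+ 375,000)" for read aloud text is excluded.
TABLE_3 = [
    ("conversations ('face-to-face')", 3_000_000, 150_000, 550_000, 100_000),
    ("interviews", 460_000, 50_000, 50_000, 20_000),
    ("telephone conversations", 3_000_000, 300_000, 100_000, 50_000),
    ("business transactions", 175_000, 15_000, 15_000, 10_000),
    ("interviews and discussions", 750_000, 75_000, 75_000, 10_000),
    ("discuss., debates, meetings", 375_000, 35_000, 35_000, 10_000),
    ("lectures", 350_000, 35_000, 35_000, 0),
    ("descriptions of pictures", 40_000, 5_000, 5_000, 0),
    ("spontaneous commentary", 250_000, 27_500, 27_500, 10_000),
    ("newsreports, current affairs programmes", 250_000, 25_000, 25_000, 10_000),
    ("news", 250_000, 27_500, 27_500, 10_000),
    ("commentary", 200_000, 25_000, 25_000, 10_000),
    ("lectures, speeches", 275_000, 30_000, 30_000, 10_000),
    ("read aloud text", 625_000, 200_000, 0, 0),
]


def column_totals(rows):
    """Sum each numeric column: corpus total, phonetic, syntactic, prosodic."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


def shares(rows):
    """Per component, the fraction of its words selected for each annotation."""
    return {
        name: (phon / total, syn / total, pros / total)
        for name, total, phon, syn, pros in rows
    }


print("Column totals:", column_totals(TABLE_3))
for name, (phon, syn, pros) in shares(TABLE_3).items():
    print(f"{name:42s} {phon:6.1%} {syn:6.1%} {pros:6.1%}")
```

The column totals reproduce the bottom row of the table. The shares make the two arguments against a flat 10 per cent visible: read aloud text, for instance, has 200,000 of its 625,000 words selected for phonetic transcription, while conversations have 550,000 of 3,000,000 words selected for syntactic annotation.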

Actual realisation (version 1.0)

Within the project the targets that were set for the core corpus (selections of data for which additional transcription/annotations were provided) were met. For an overview, we refer to the overview of data with additional transcriptions and annotations.