The project aimed to design a corpus
that would constitute a plausible sample of contemporary standard Dutch
as spoken in Flanders and the Netherlands. One third of the data were to
be collected in Flanders, two thirds were to originate from the Netherlands.
The entire corpus was to be transcribed orthographically, lemmatized and
enriched with part-of-speech information. Users should be able to access
the speech recordings through pointers in the transcriptions. For a selection
of one million words it was envisaged that an auditorily verified, broad
phonetic transcription would be available, while for this part of the corpus
the automatic time alignment would be manually checked on the level of
the word. For most of the recordings which were not checked by hand the
pointers were expected to be accurate within less than 100 ms. Also for
one million words, a syntactic annotation was envisaged and 250,000 words
were to receive a prosodic annotation.
Original overall design (autumn 1998)
The design of the corpus was guided by a number of considerations. First of all, the corpus was to serve many and diverse interests. Different user groups have different requirements when it comes to the quality and quantity of the data, the number and type of speakers, and so on. Second, the total budget available for the entire project was fixed at 4.6 million euros, which had to cover all costs involved in recording and collecting data, transcribing and annotating these data, etc. Finally, the issue of copyright complicated matters. Since the corpus was to be distributed including the speech files, the consent of all speakers was required, as well as that of any parties holding rights to the recorded material.
The design of the corpus took into
account the various dimensions underlying the variation that can be observed
in language use. In the overall design of the corpus the principal parameter
was taken to be the socio-situational setting in which language is used.
This led us to distinguish a number of components, each of which could
be characterized in terms of its situational characteristics such as communicative
goal, medium, number of speakers participating, and the relationship between
speaker(s) and hearer(s).
The
specification of each of the components was given in terms of sample sizes,
total number of speakers, range of topics, etc. Where this was considered
to be of particular interest, speaker characteristics such as gender, age,
geographical region, and socio-economic class were used as (demographic)
sampling criteria; otherwise they were merely recorded as part of the meta-data.
The overall design of the corpus is given in Table 1.
Table 1. Original overall design of the corpus (autumn 1998)
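Type | Setting | Subdivision | Further subdivision | Component | Total | Flanders | The Netherlands |
---|---|---|---|---|---|---|---|
dialogue / multilogue (8,110,000) | private (6,635,000) | unscripted (6,635,000) | direct (3,460,000) | conversations ('face-to-face') | 3,000,000 | 1,000,000 | 2,000,000 |
dialogue / multilogue (8,110,000) | private (6,635,000) | unscripted (6,635,000) | direct (3,460,000) | interviews | 460,000 | 230,000 | 230,000 |
dialogue / multilogue (8,110,000) | private (6,635,000) | unscripted (6,635,000) | distanced (3,175,000) | telephone conversations | 3,000,000 | 1,000,000 | 2,000,000 |
dialogue / multilogue (8,110,000) | private (6,635,000) | unscripted (6,635,000) | distanced (3,175,000) | business transactions | 175,000 | 0 | 175,000 |
dialogue / multilogue (8,110,000) | public (1,475,000) | broadcast (750,000) | more or less scripted (750,000) | interviews and discussions | 750,000 | 230,000 | 520,000 |
dialogue / multilogue (8,110,000) | public (1,475,000) | non-broadcast (725,000) | unscripted (725,000) | discuss., debates, meetings | 375,000 | 130,000 | 245,000 |
dialogue / multilogue (8,110,000) | public (1,475,000) | non-broadcast (725,000) | unscripted (725,000) | lectures | 350,000 | 110,000 | 240,000 |
monologue (1,890,000) | private (40,000) | more or less scripted (40,000) | | descriptions of pictures | 40,000 | 40,000 | 0 |
monologue (1,890,000) | public (1,850,000) | broadcast (950,000) | unscripted (250,000) | spontaneous commentary | 250,000 | 70,000 | 180,000 |
monologue (1,890,000) | public (1,850,000) | broadcast (950,000) | more or less scripted (700,000) | newsreports, current affairs programmes | 250,000 | 80,000 | 170,000 |
monologue (1,890,000) | public (1,850,000) | broadcast (950,000) | more or less scripted (700,000) | news | 250,000 | 80,000 | 170,000 |
monologue (1,890,000) | public (1,850,000) | broadcast (950,000) | more or less scripted (700,000) | commentary | 200,000 | 60,000 | 140,000 |
monologue (1,890,000) | public (1,850,000) | non-broadcast (900,000) | more or less scripted (900,000) | lectures, speeches | 275,000 | 95,000 | 180,000 |
monologue (1,890,000) | public (1,850,000) | non-broadcast (900,000) | more or less scripted (900,000) | read aloud text | 625,000 (+375,000) | 210,000 (+125,000) | 415,000 (+250,000) |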
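As a quick check on the arithmetic of Table 1, the following Python sketch sums the component sizes per region. The dictionary name `TABLE_1` and the helper `regional_totals` are illustrative and not part of the corpus distribution; the additional read-aloud text given in brackets in the table is left out.

```python
# Component sizes (in words) copied from Table 1: (Flanders, The Netherlands).
# The extra read-aloud text shown in brackets in the table is not included.
TABLE_1 = {
    "conversations ('face-to-face')": (1_000_000, 2_000_000),
    "interviews": (230_000, 230_000),
    "telephone conversations": (1_000_000, 2_000_000),
    "business transactions": (0, 175_000),
    "interviews and discussions": (230_000, 520_000),
    "discussions, debates, meetings": (130_000, 245_000),
    "lectures": (110_000, 240_000),
    "descriptions of pictures": (40_000, 0),
    "spontaneous commentary": (70_000, 180_000),
    "newsreports, current affairs programmes": (80_000, 170_000),
    "news": (80_000, 170_000),
    "commentary": (60_000, 140_000),
    "lectures, speeches": (95_000, 180_000),
    "read aloud text": (210_000, 415_000),
}


def regional_totals(table):
    """Return (Flanders total, Netherlands total, grand total) in words."""
    flanders = sum(fl for fl, _ in table.values())
    netherlands = sum(nl for _, nl in table.values())
    return flanders, netherlands, flanders + netherlands


flanders, netherlands, total = regional_totals(TABLE_1)
print(f"Flanders:        {flanders:>10,}")
print(f"The Netherlands: {netherlands:>10,}")
print(f"Total:           {total:>10,}")
```

The totals come to 3,335,000 words for Flanders and 6,665,000 words for the Netherlands, which matches the intended split of one third to two thirds over a corpus of ten million words.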
While the project was ongoing, the design and considerations described above were taken as guidelines. However, as the project progressed, the collection of part of the data fell behind schedule. Therefore, halfway through the project, it was decided to adapt the design somewhat. Certain components that had not yet (fully) been realised were reduced or cancelled. Then, as the project drew to a close and the structure of the final release was being considered, it was found that restructuring the corpus would be in the interest of the user. The structure of the corpus as distributed in the present version is shown in Table 2.
Table 2. Components distinguished in the Spoken Dutch Corpus (version 1.0)
Components: | |
---|---|
a. | Spontaneous conversations ('face-to-face') |
b. | Interviews with teachers of Dutch |
c. | Spontaneous telephone dialogues (recorded via a switchboard) |
d. | Spontaneous telephone dialogues (recorded on MD via a local interface) |
e. | Simulated business negotiations |
f. | Interviews/discussions/debates (broadcast) |
g. | (political) Discussions/debates/meetings (non-broadcast) |
h. | Lessons recorded in the classroom |
i. | Live (e.g. sports) commentaries (broadcast) |
j. | Newsreports/reportages (broadcast) |
k. | News (broadcast) |
l. | Commentaries/columns/reviews (broadcast) |
m. | Ceremonious speeches/sermons |
n. | Lectures/seminars |
o. | Read speech |
This is not the place to discuss in detail the sampling procedure that was employed with each component. Here we restrict ourselves to giving a short overview of the different sampling criteria and the (possible) ways in which they have been applied. Please note that not all sampling criteria apply to all components.
Sample size
Throughout the corpus, a sample is a stretch of connected discourse. The sizes of the samples differ. In a number of instances, e.g. for the samples making up component o (read speech), a minimum size was specified so as to meet the requirements of users from a particular field. On the whole, natural break-off points such as changes of turn, changes of item (in a news broadcast), etc. have been used to delimit the samples.
Number of speakers per component
In principle the number of speakers may vary. For a number of components, viz. the spontaneous conversations (component a), the interviews (component b), the telephone dialogues (components c and d) and the read aloud text (component o), the number of speakers was specified beforehand.
Speaker characteristics
Speaker characteristics that have played a role as sampling criteria are sex, age, geographical region, socio-economic class and level of education.
Quality of the recording
The quality of the recordings varies.
High quality was aimed for throughout. However, recording conditions
varied considerably, so the quality is not equally high in all cases.
For an overview of the data that are
available and their distribution over various components, we refer to the
overview of available data.
Selections for which more advanced annotations were envisaged (autumn 1998)
Once the overall design of the corpus had been established, it remained to be decided which part(s) of the corpus should be included in the selection of one million words (or 250,000 words in the case of prosodic annotation) for which more advanced annotations were envisaged. Preferably, the selection should in some way reflect the composition of the full corpus. While it would have been straightforward to simply select 10 per cent of each component, two strong arguments were raised against this procedure. First, some user groups required certain minimum amounts of data with specific higher-level (or more advanced) annotations that exceeded the 10 per cent norm. Second, not all types of data could be annotated with the same rate of success and/or at the same expense. Therefore, in the light of the quality standards that were upheld and the time and money available, certain types of data were given priority over other types. The selections that were decided upon for each type of advanced annotation are displayed in Table 3.
Table 3 gives an overview of the selections of parts of the corpus for which more advanced annotations were envisaged. The fourteen components that were distinguished here were the same as the ones referred to in the overall design. For each component it was indicated which part would be enriched with which types of annotations. Note that in the table only the size of each component is indicated (in number of words). The specific design of each component and the selection of samples depended on the quality of the speech signal, the distribution over various situational contexts, speakers, topics, etc.
Table 3. Selection of data for which more advanced transcriptions and annotations were envisaged (autumn 1998)
No. | Component | Total number of words in the corpus | Phonetic transcription + alignment (no. of words) | Syntactic annotation (no. of words) | Prosodic annotation (no. of words) |
---|---|---|---|---|---|
1. | conversations ('face-to-face') | 3,000,000 | 150,000 | 550,000 | 100,000 |
2. | interviews | 460,000 | 50,000 | 50,000 | 20,000 |
3. | telephone conversations | 3,000,000 | 300,000 | 100,000 | 50,000 |
4. | business transactions | 175,000 | 15,000 | 15,000 | 10,000 |
5. | interviews and discussions | 750,000 | 75,000 | 75,000 | 10,000 |
6. | discuss., debates, meetings | 375,000 | 35,000 | 35,000 | 10,000 |
7. | lectures | 350,000 | 35,000 | 35,000 | 0 |
8. | descriptions of pictures | 40,000 | 5,000 | 5,000 | 0 |
9. | spontaneous commentary | 250,000 | 27,500 | 27,500 | 10,000 |
10. | newsreports, current affairs programmes | 250,000 | 25,000 | 25,000 | 10,000 |
11. | news | 250,000 | 27,500 | 27,500 | 10,000 |
12. | commentary | 200,000 | 25,000 | 25,000 | 10,000 |
13. | lectures, speeches | 275,000 | 30,000 | 30,000 | 10,000 |
14. | read aloud text | 625,000 (+ 375,000) | 200,000 | 0 | 0 |
Total | | 10,000,000 | 1,000,000 | 1,000,000 | 250,000 |
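To illustrate how the selections in Table 3 depart from a flat 10 per cent per component, the sketch below recomputes the column totals and lists the components whose selection exceeds one tenth of their size. The names `TABLE_3`, `column_total` and `above_ten_per_cent` are illustrative only, and the assignment of the three annotation columns (phonetic transcription with alignment, syntactic, prosodic) is an assumption based on the totals of 1,000,000, 1,000,000 and 250,000 words mentioned in the text.

```python
# Figures (in words) copied from Table 3. Each entry holds:
# (total in corpus, phonetic transcription + alignment, syntactic, prosodic)
TABLE_3 = {
    "conversations ('face-to-face')": (3_000_000, 150_000, 550_000, 100_000),
    "interviews": (460_000, 50_000, 50_000, 20_000),
    "telephone conversations": (3_000_000, 300_000, 100_000, 50_000),
    "business transactions": (175_000, 15_000, 15_000, 10_000),
    "interviews and discussions": (750_000, 75_000, 75_000, 10_000),
    "discussions, debates, meetings": (375_000, 35_000, 35_000, 10_000),
    "lectures": (350_000, 35_000, 35_000, 0),
    "descriptions of pictures": (40_000, 5_000, 5_000, 0),
    "spontaneous commentary": (250_000, 27_500, 27_500, 10_000),
    "newsreports, current affairs programmes": (250_000, 25_000, 25_000, 10_000),
    "news": (250_000, 27_500, 27_500, 10_000),
    "commentary": (200_000, 25_000, 25_000, 10_000),
    "lectures, speeches": (275_000, 30_000, 30_000, 10_000),
    "read aloud text": (625_000, 200_000, 0, 0),
}

# Illustrative column indices for the three annotation types.
PHONETIC, SYNTACTIC, PROSODIC = 0, 1, 2


def column_total(table, column):
    """Sum one annotation column over all components."""
    return sum(row[1 + column] for row in table.values())


def above_ten_per_cent(table, column):
    """Components whose selection exceeds 10 per cent of their total size.

    Integer arithmetic is used so that a selection of exactly 10 per cent
    is never misclassified through floating-point rounding.
    """
    return [name for name, row in table.items() if row[1 + column] * 10 > row[0]]


print(column_total(TABLE_3, PHONETIC))
print(above_ten_per_cent(TABLE_3, PHONETIC))
print(above_ten_per_cent(TABLE_3, SYNTACTIC))
```

On these figures the face-to-face conversations fall below the 10 per cent norm for phonetic transcription but well above it for syntactic annotation, while the read-aloud text is heavily over-represented in the phonetic selection and absent from the other two.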
Within the project the targets that were set for the core corpus (selections of data for which additional transcription/annotations were provided) were met. For an overview, we refer to the overview of data with additional transcriptions and annotations.