The entire corpus was automatically segmented at the word level: for each word in the corpus, an alignment with the speech signal is given in terms of points in time. Where the segmentation was based on an automatic phonetic transcription, it is also available at the phoneme level. The part of the corpus for which a manually verified broad phonetic transcription was available was also segmented at the word level, and this segmentation was manually verified. For this part, phoneme segmentations are not available.
Information is given below on the goals, the protocol, the procedure, and the file types and formats. Finally, an overview is presented of the data that are available in version 1.0 of the corpus.
The main goal of this annotation layer is to separate words acoustically by means of marks in the speech signal. These marks must be set in such a way that the stretch of speech they delimit contains the word and nothing but that word. Ideally, each separated word should be recognisable as such and should sound acoustically acceptable.
The word segmentation is useful for quick access to the data when one wants to hear the acoustic realisation of a certain word. In addition, the manually checked word segmentation can be considered a reliable source for training an automatic speech recognizer, for which an initial segmentation then already exists. Finally, the word segmentation establishes a one-to-one link between each orthographic word and its phonetic counterpart. The link is established in terms of marks in the speech signal. For the part of the corpus that was enriched with a manual verification of the word segmentation, a manual broad phonetic transcription was available as well.
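As a rough illustration of the quick access described above, the sketch below looks up a word in a list of word segments and cuts the corresponding samples out of a signal. The `WordSegment` class, the function names, the sample rate and the example values are all assumptions made for this illustration; they do not reflect the corpus file formats.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class WordSegment:
    word: str     # orthographic word
    start: float  # start mark in seconds
    end: float    # end mark in seconds


def find_word(segments: Sequence[WordSegment], word: str) -> List[WordSegment]:
    """Return every segment whose orthographic form equals `word`."""
    return [seg for seg in segments if seg.word == word]


def cut_samples(signal: Sequence[int], sample_rate: int, seg: WordSegment) -> Sequence[int]:
    """Return the samples lying between the two marks of `seg`."""
    first = round(seg.start * sample_rate)
    last = round(seg.end * sample_rate)
    return signal[first:last]


# Hypothetical data: one second of dummy samples at 16 kHz and two words.
signal = list(range(16000))
segments = [WordSegment("de", 0.0, 0.25), WordSegment("kat", 0.25, 0.5)]

hits = find_word(segments, "kat")
excerpt = cut_samples(signal, 16000, hits[0])
print(len(excerpt))  # number of samples between the two marks
```

Because every word carries its own pair of marks, playing or inspecting one word never requires scanning the rest of the recording.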
An automatic speech recognizer links each phoneme in the broad phonetic transcription, whether created manually or automatically, to the interval in the speech signal that corresponds to that phoneme. The word segmentations were derived from these phoneme segments. More information about the procedure can be found in Martens et al. (2002).
The original phoneme segmentations are available only where the phonetic transcriptions were created automatically (cf. Demuynck et al. 2002 and Cucchiarini et al. 2001). A part of the data received a manual phonetic transcription (see here). It is only for these data that manually verified word segmentations are available.
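The step from phoneme segments to word segments can be pictured with a minimal sketch. It assumes that each word comes with its sequence of phoneme symbols and that a word simply spans from the start of its first phoneme to the end of its last one; the actual procedure is the one described in Martens et al. (2002) and may differ in its details. The function name, the symbols and the time values are hypothetical.

```python
from typing import List, Sequence, Tuple

Phone = Tuple[str, float, float]  # (phoneme symbol, start in s, end in s)
Word = Tuple[str, float, float]   # (orthographic word, start in s, end in s)


def words_from_phones(
    words: Sequence[Tuple[str, Sequence[str]]],
    phones: Sequence[Phone],
) -> List[Word]:
    """Derive word intervals from consecutive phoneme intervals.

    `words` pairs each orthographic word with its phoneme symbols;
    `phones` holds the aligned phoneme segments in the same order.
    """
    result: List[Word] = []
    i = 0
    for spelling, symbols in words:
        if not symbols:
            raise ValueError(f"word {spelling!r} has no phonemes")
        chunk = phones[i:i + len(symbols)]
        if [p[0] for p in chunk] != list(symbols):
            raise ValueError(f"phoneme mismatch at word {spelling!r}")
        # The word starts with its first phoneme and ends with its last.
        result.append((spelling, chunk[0][1], chunk[-1][2]))
        i += len(symbols)
    if i != len(phones):
        raise ValueError("unused phoneme segments remain")
    return result


# Hypothetical aligned phonemes for the two words "de kat".
phones = [
    ("d", 0.00, 0.05), ("@", 0.05, 0.12),
    ("k", 0.12, 0.20), ("A", 0.20, 0.31), ("t", 0.31, 0.40),
]
words = [("de", ["d", "@"]), ("kat", ["k", "A", "t"])]
word_segments = words_from_phones(words, phones)
print(word_segments)
```

In this simplified picture adjacent words share a boundary, which is also what makes the link between orthographic words and their phonetic counterparts one-to-one.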
For the manual verification of the word segmentation PRAAT was used. PRAAT allows you to both see (and play) the speech signal and the transcription tiers in which, in this case, words are displayed separated by markers. These markers can easily be dragged to the right position (if necessary) by using the mouse.
References:
Martens, J.P., D. Binnenpoorte, K. Demuynck, R. van Parys, T. Laureys, W. Goedertier & J. Duchateau. 2002. Word Segmentation in the Spoken Dutch Corpus. In Proceedings of LREC 2002. Las Palmas de Gran Canaria, Spain.
Demuynck, K., T. Laureys & S. Gillis. 2002. Automatic Generation of Phonetic Transcriptions for Large Speech Corpora. In Proceedings International Conference on Spoken Language Processing. Vol. 1: 333-336. Denver, USA.
Cucchiarini, C., D. Binnenpoorte & S. Goddijn. 2001. Phonetic Transcriptions in the Spoken Dutch Corpus: how to combine efficiency and good transcription quality. In Proceedings Eurospeech 2001. Aalborg, Denmark. pp. 1679-1682.
A protocol (Binnenpoorte, 2002) was written to ensure that the manual verification of the word segmentation was carried out as consistently as possible. To this end, several guidelines were formulated. The most important of these were:
Binnenpoorte, D. 2002. Protocol voor manuele verificatie van automatisch gegenereerde woordsegmentaties. (Available here in .ps and .pdf format; Dutch only.)
The word segmentation files are stored in the following way:
An extensive description of the above-mentioned file formats can be found in the descriptions of the wrd format, the awd format, the bpt format and the skp format.
Table 1 presents an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, we refer to the corpus design and motivation.
Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)
Component | Total number of words | VL | NL |
---|---|---|---|
a. Spontaneous conversations ('face-to-face') | 177,127 | 70,945 | 106,182 |
b. Interviews with teachers of Dutch | 59,751 | 34,064 | 25,687 |
c. Spontaneous telephone dialogues (recorded via a switchboard) | 270,027 | 68,886 | 201,141 |
d. Spontaneous telephone dialogues (recorded on MD via a local interface) | 6,257 | 6,257 | 0 |
e. Simulated business negotiations | 25,485 | 0 | 25,485 |
f. Interviews/discussions/debates (broadcast) | 100,250 | 25,144 | 75,106 |
g. (political) Discussions/debates/meetings (non-broadcast) | 34,126 | 9,009 | 25,117 |
h. Lessons recorded in the classroom | 36,064 | 10,103 | 25,961 |
i. Live (e.g. sports) commentaries (broadcast) | 35,116 | 10,130 | 24,986 |
j. News reports/reportages (broadcast) | 32,744 | 7,679 | 25,065 |
k. News (broadcast) | 32,601 | 7,305 | 25,296 |
l. Commentaries/columns/reviews (broadcast) | 32,502 | 7,431 | 25,071 |
m. Ceremonious speeches/sermons | 7,077 | 1,893 | 5,184 |
n. Lectures/seminars | 23,056 | 8,143 | 14,913 |
o. Read speech | 135,071 | 64,848 | 70,223 |
Total | 1,007,254 | 331,837 | 675,417 |
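The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the table and verifies that the two regional counts of every component add up to its total, and that the columns add up to the reported grand totals. The variable names and the layout of the data are choices made for this illustration only.

```python
# Counts copied from Table 1: component -> (total, first regional count,
# second regional count), in the order in which the table lists them.
rows = {
    "a": (177127, 70945, 106182),
    "b": (59751, 34064, 25687),
    "c": (270027, 68886, 201141),
    "d": (6257, 6257, 0),
    "e": (25485, 0, 25485),
    "f": (100250, 25144, 75106),
    "g": (34126, 9009, 25117),
    "h": (36064, 10103, 25961),
    "i": (35116, 10130, 24986),
    "j": (32744, 7679, 25065),
    "k": (32601, 7305, 25296),
    "l": (32502, 7431, 25071),
    "m": (7077, 1893, 5184),
    "n": (23056, 8143, 14913),
    "o": (135071, 64848, 70223),
}
reported_totals = (1007254, 331837, 675417)

# Components whose regional counts do not add up to the component total.
mismatches = [
    name for name, (total, first, second) in rows.items()
    if first + second != total
]

# Column sums over all components, to compare with the "Total" row.
column_sums = tuple(sum(row[i] for row in rows.values()) for i in range(3))

print(mismatches)
print(column_sums == reported_totals)
```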
An automatic word segmentation, including the phoneme segmentation, is also available for all data in the corpus. Information about the amount of data and their characteristics can be found in the table under orthographic transcription.