Word segmentation

The entire corpus was automatically segmented at the word level: for each word in the corpus, an alignment with the speech signal is given in terms of points in time. Where the segmentation was based on an automatic phonetic transcription, it is also available at the phoneme level. The part of the corpus for which a manually verified broad phonetic transcription was available was also segmented at the word level, and this segmentation was manually verified. For this part, the phoneme segmentations are not available.

Information is given below on the goals, the procedure, the protocol, and the file types and formats. Finally, an overview is presented of the data that are available in version 1.0 of the corpus.

Read more about

aim and motivation
procedure
protocol
file types and formats
overview of available data

Aim and motivation

The main goal of this annotation layer is to separate words acoustically by means of marks in the speech signal. These marks must be set in such a way that the stretch of speech they enclose contains the word and nothing more than this word. Ideally, the separated words should be recognisable as such and should sound acoustically acceptable.

The word segmentation is useful for quick access to the data when one wants to hear the acoustic realisation of a certain word. In addition, the manually checked word segmentation can be considered a reliable source for training an automatic speech recognizer for which an initial segmentation already exists. Finally, the word segmentation establishes a one-to-one link between the orthographic words and their phonetic counterparts. The link is established in terms of markings in the speech signal. For the part of the corpus that was enriched with a manual verification of the word segmentation, a manual broad phonetic transcription was available as well.
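As an illustration of this quick access, the following minimal Python sketch cuts one word out of a signal given its two time marks. The function name, the 16 kHz sample rate and the dummy signal are assumptions made for the example; they are not taken from the corpus documentation.

```python
def word_samples(samples, sample_rate, start_s, end_s):
    """Return the samples enclosed by two time marks given in seconds."""
    if not 0 <= start_s <= end_s:
        raise ValueError("marks must satisfy 0 <= start <= end")
    first = round(start_s * sample_rate)  # index of the first sample of the word
    last = round(end_s * sample_rate)     # index just past the last sample
    return samples[first:last]


# Hypothetical input: one second of dummy samples at an assumed rate of 16 kHz.
signal = list(range(16000))
clip = word_samples(signal, 16000, 0.25, 0.5)
```

In practice the samples would come from the audio file that belongs to the segmentation file, and the marks from the word tier.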


Procedure

An automatic speech recognizer links each phoneme in the broad phonetic transcription, whether created manually or automatically, to the interval in the speech signal that corresponds to that phoneme. The word segmentations were derived from these phoneme segments. More information about the procedure can be found in Martens et al. (2002).
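The derivation of word boundaries from phoneme segments can be sketched as follows. This is a simplified illustration, not the procedure of Martens et al. (2002): it assumes that every word has at least one phoneme, that the phoneme intervals are given in order, and that each word simply spans from the start of its first phoneme to the end of its last. The function name, the symbols and the times are hypothetical.

```python
def words_from_phonemes(words, phonemes):
    """Derive (word, start, end) triples from aligned phoneme intervals.

    words    -- list of (orthographic word, list of phoneme symbols)
    phonemes -- list of (symbol, start, end), in temporal order
    """
    result = []
    pos = 0
    for orth, symbols in words:
        if not symbols:
            raise ValueError(f"word {orth!r} has no phonemes")
        segs = phonemes[pos:pos + len(symbols)]
        if [sym for sym, _, _ in segs] != list(symbols):
            raise ValueError(f"phoneme mismatch at word {orth!r}")
        # The word starts with its first phoneme and ends with its last.
        result.append((orth, segs[0][1], segs[-1][2]))
        pos += len(symbols)
    if pos != len(phonemes):
        raise ValueError("unused phoneme intervals remain")
    return result


# Illustrative input only; the times are invented for the example.
words = [("de", ["d", "@"]), ("kat", ["k", "A", "t"])]
phonemes = [("d", 0.00, 0.05), ("@", 0.05, 0.12),
            ("k", 0.12, 0.20), ("A", 0.20, 0.31), ("t", 0.31, 0.40)]
segmentation = words_from_phonemes(words, phonemes)
```

The sketch does not handle pauses between words or phonemes shared by two adjacent words.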

Only when the phonetic transcriptions were created automatically (cf. Demuynck et al. 2002 and Cucchiarini et al. 2001), are the original phoneme segmentations also available. A part of the data received a manual phonetic transcription (see here). It is only for these data that manually verified word segmentations are available.

PRAAT was used for the manual verification of the word segmentation. PRAAT lets you see and play the speech signal together with the transcription tiers, in which, in this case, the words are displayed separated by markers. Where necessary, these markers can easily be dragged to the right position with the mouse.

References:

Martens, J.P., D. Binnenpoorte, K. Demuynck, R. van Parys, T. Laureys, W. Goedertier & J. Duchateau. 2002. Word Segmentation in the Spoken Dutch Corpus. In Proceedings of LREC2002. Las Palmas de Gran Canaria, Spain.

Demuynck, K., T. Laureys & S. Gillis. 2002. Automatic Generation of Phonetic Transcriptions for Large Speech Corpora. In Proceedings International Conference on Spoken Language Processing. Vol. 1: 333-336. Denver, USA.

Cucchiarini, C., D. Binnenpoorte & S. Goddijn. 2001. Phonetic Transcriptions in the Spoken Dutch Corpus: how to combine efficiency and good transcription quality. In Proceedings Eurospeech 2001. pp. 1679-1682. Aalborg, Denmark.


Protocol

A protocol (Binnenpoorte, 2002) was written to make sure that the manual verification of the word segmentation was carried out as consistently as possible. To this end, several guidelines were formulated. The most important of these were:

do not drag the marks unnecessarily
respect the one-to-one relation between orthographic and phonetic words

The speech data in the corpus are continuous speech, meaning that words are not separated from each other by pauses, unlike words in written text, which are separated by spaces. This continuous stream of sounds sometimes causes problems when trying to separate words, for instance when the phonemes at the end of one word are shared with the beginning of the next. How to handle this and other problems is discussed extensively in the protocol.

Binnenpoorte, D. 2002. Protocol voor manuele verificatie van automatisch gegenereerde woordsegmentaties. (Available here in .ps and .pdf format; Dutch only.)


File types and formats

The word segmentation files are stored in the following way:

The manually verified word segmentations are saved in files with the extension .wrd. The files are in (short) TextGrid format as created by PRAAT. The .wrd files can be found in the directory /data/annot/text/wrd/ of the annotation DVD.
The word segmentations that were not manually verified contain an extra tier that displays the phoneme segmentations. These files have the extension .awd and are also in (short) TextGrid format. The files can be found at /data/annot/text/awd/ of the annotation DVD.
Both file types are also converted to an XML representation. The link between the orthographic and phonetic words, together with speaker information, is stored in files with the extension .bpt. For the manually checked segmentations these files are in the directory /data/annot/xml/bpt-fon/ of the annotation DVD; for the non-checked segmentations they are in the directory /data/annot/xml/bpt-auto/. The .bpt files in the /data/annot/text/bpt-fon/ directory of the annotation DVD are the same as those created for the broad phonetic transcriptions. For both file types, the time alignment with the speech signal is stored in XML files with the extension .skp. The manually checked segmentations are in the directory /data/annot/xml/skp-wrd/ of the annotation DVD, while the non-checked segmentations can be found in /data/annot/xml/skp-auto/.
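Since the .wrd and .awd files are in short TextGrid format, they can be read with a few lines of code. The sketch below is a minimal reader under stated assumptions: the file contains only interval tiers, every value sits on its own line, and no label contains a line break. The function names, the tier name "words" and the sample content are hypothetical; the actual tier names and the file encoding should be taken from the format descriptions.

```python
def _unquote(field):
    """Strip the surrounding quotes and undo Praat's doubled-quote escaping."""
    return field[1:-1].replace('""', '"')


def parse_short_textgrid(text):
    """Return {tier name: [(start, end, label), ...]} for interval tiers."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if "ooTextFile" not in lines[0] or lines[1] != '"TextGrid"':
        raise ValueError("not a short TextGrid file")
    if lines[4] != "<exists>":
        raise ValueError("file contains no tiers")
    pos = 5  # skip file type, object class, xmin, xmax and <exists>
    n_tiers = int(lines[pos])
    pos += 1
    tiers = {}
    for _ in range(n_tiers):
        kind, name = _unquote(lines[pos]), _unquote(lines[pos + 1])
        if kind != "IntervalTier":
            raise ValueError(f"unsupported tier type: {kind}")
        n_intervals = int(lines[pos + 4])  # follows the tier's xmin and xmax
        pos += 5
        intervals = []
        for _ in range(n_intervals):
            start, end = float(lines[pos]), float(lines[pos + 1])
            intervals.append((start, end, _unquote(lines[pos + 2])))
            pos += 3
        tiers[name] = intervals
    return tiers


# Hypothetical file content, invented for the example.
SAMPLE = '''File type = "ooTextFile short"
"TextGrid"

0
0.4
<exists>
1
"IntervalTier"
"words"
0
0.4
2
0
0.12
"de"
0.12
0.4
"kat"
'''
tiers = parse_short_textgrid(SAMPLE)
```

For an .awd file the same reader would return the extra phoneme tier alongside the word tier.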

An extensive description of these file formats can be found in the descriptions of the wrd format, the awd format, the bpt format and the skp format.


Overview of available data

In Table 1 an overview is presented of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, we refer to the corpus design and motivation.

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)

Component	Total number of words	VL	NL
a. Spontaneous conversations ('face-to-face')	177,127	70,945	106,182
b. Interviews with teachers of Dutch	59,751	34,064	25,687
c. Spontaneous telephone dialogues (recorded via a switchboard)	270,027	68,886	201,141
d. Spontaneous telephone dialogues (recorded on MD via a local interface)	6,257	6,257	0
e. Simulated business negotiations	25,485	0	25,485
f. Interviews/discussions/debates (broadcast)	100,250	25,144	75,106
g. (Political) discussions/debates/meetings (non-broadcast)	34,126	9,009	25,117
h. Lessons recorded in the classroom	36,064	10,103	25,961
i. Live (e.g. sports) commentaries (broadcast)	35,116	10,130	24,986
j. News reports/reportages (broadcast)	32,744	7,679	25,065
k. News (broadcast)	32,601	7,305	25,296
l. Commentaries/columns/reviews (broadcast)	32,502	7,431	25,071
m. Ceremonious speeches/sermons	7,077	1,893	5,184
n. Lectures/seminars	23,056	8,143	14,913
o. Read speech	135,071	64,848	70,223
Total	1,007,254	331,837	675,417
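
The figures in Table 1 can be cross-checked with a short script. The sketch below copies the word counts from the table and verifies that VL and NL add up to the total for each component and that the columns add up to the Total row. The variable names are chosen for the example.

```python
# (total, VL, NL) per component, copied from Table 1.
ROWS = {
    "a": (177127, 70945, 106182),
    "b": (59751, 34064, 25687),
    "c": (270027, 68886, 201141),
    "d": (6257, 6257, 0),
    "e": (25485, 0, 25485),
    "f": (100250, 25144, 75106),
    "g": (34126, 9009, 25117),
    "h": (36064, 10103, 25961),
    "i": (35116, 10130, 24986),
    "j": (32744, 7679, 25065),
    "k": (32601, 7305, 25296),
    "l": (32502, 7431, 25071),
    "m": (7077, 1893, 5184),
    "n": (23056, 8143, 14913),
    "o": (135071, 64848, 70223),
}
TOTAL = (1007254, 331837, 675417)

# Components whose VL and NL counts do not add up to the row total.
mismatches = [c for c, (total, vl, nl) in ROWS.items() if vl + nl != total]

# Column sums, to be compared with the Total row.
sums = tuple(sum(column) for column in zip(*ROWS.values()))
```

With the figures as printed, the list of mismatches is empty and the column sums equal the Total row.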

An automatic word segmentation, including the phoneme segmentation, is also available for all data in the corpus. Information about the amount of data and their characteristics can be found in the table under orthographic transcription.
