Part of the data in the corpus was enriched with a manually verified broad phonetic transcription. The manual work consisted of verifying and correcting an automatically generated phonetic transcription. The transcriptions are broad in the sense that the pre-defined phoneme set contains no symbols for allophonic variation and no diacritics.
This page describes the aims, the transcription procedure that was adopted, and the protocol that was used. It also refers to the relevant file types and formats. Finally, it gives an overview of the data that are available in version 1.0.
The main goal was to obtain verified broad phonetic transcriptions of the material by means of insertions, deletions and substitutions in the given automatically generated transcription. Gradual processes, such as voicing and devoicing of plosives and fricatives or diphthongization and monophthongization of vowels, are not transcribed in this broad phonetic transcription.
The phoneme set that was used is described here (in a .ps file or a .pdf file).
Starting from an automatically generated transcription not only made the transcription procedure faster, but also increased the consistency between the transcribers. The human transcribers’ task was to listen to the speech signal and decide for each symbol in the transcript whether it should be deleted or substituted by another phoneme, and whether one or more phonemes were missing from the given transcription.
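The correction task can be viewed as an edit script that turns the automatic transcription into the verified one. The following is a minimal sketch, assuming each transcription is available as a list of phoneme symbols; the function name `classify_edits` and the example symbols are illustrative only and are not taken from the corpus tooling or its phoneme set. It uses Python's standard `difflib` to label each difference as a substitution, deletion or insertion.

```python
from collections import Counter
from difflib import SequenceMatcher


def classify_edits(automatic, verified):
    """Count the substitutions, deletions and insertions that turn the
    automatic transcription into the verified one.

    Both arguments are lists of phoneme symbols (hypothetical format).
    """
    counts = Counter()
    matcher = SequenceMatcher(a=automatic, b=verified, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = i2 - i1   # symbols taken out of the automatic version
        added = j2 - j1     # symbols that appear only in the verified version
        if tag == "replace":
            # A replaced stretch of unequal length is split into
            # substitutions plus the leftover deletions or insertions.
            common = min(removed, added)
            counts["substitution"] += common
            counts["deletion"] += removed - common
            counts["insertion"] += added - common
        elif tag == "delete":
            counts["deletion"] += removed
        elif tag == "insert":
            counts["insertion"] += added
    return counts


# Illustrative symbols only, not real corpus data.
automatic = ["h", "E", "t", "I", "s"]
verified = ["@", "t", "I", "s", "@"]
print(classify_edits(automatic, verified))
```

Note that `difflib` does not guarantee a minimal edit script, so for evaluation purposes a proper Levenshtein alignment may be preferable; the sketch only illustrates the three kinds of correction.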
The PRAAT software was used to create the manual broad phonetic transcription. One of the advantages of PRAAT is that it can display both an oscillogram of the speech signal and the accompanying transcription, and can replay the speech signal if required. It was decided to display only the given phonetic transcription, that is, without the original orthographic transcription. The more conversational and extemporaneous types of speech, which are known to be more difficult to transcribe, were transcribed in two rounds to ensure that the transcription would meet a certain quality level: the transcription of one human transcriber was submitted to a second transcriber, who was asked to verify and correct it.
More information about the procedure and the transcription quality of the final result can be found in Goddijn and Binnenpoorte (2003).
Reference:
S. Goddijn & D. Binnenpoorte, ‘Assessing Manually Corrected Broad Phonetic Transcriptions in the Spoken Dutch Corpus’, in Proceedings of 15th ICPhS, Barcelona, Spain, pp. 1361-1364, 2003.
The transcription rules are laid down in a protocol (Gillis, 2001) in order to achieve greater consistency between the transcribers. The protocol describes the phoneme set and additional symbols, and gives many examples of how to use the phonemes. One of the main guidelines was not to place too much confidence in the given transcription, but to decide on a phoneme on the basis of one's own perception. Only in case of doubt could the original symbol be retained in the transcription.
Reference:
Gillis, S. 2001. Protocol voor brede fonetische transcriptie. (Available here in .ps and .pdf format; Dutch only.)
The broad phonetic transcriptions are stored in the following types of files:
Table 1 presents an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, see the documentation on corpus design and motivation.
Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)
Component | | Total number of words | VL | NL |
---|---|---|---|---|
a. | Spontaneous conversations ('face-to-face') | 177,127 | 70,945 | 106,182 |
b. | Interviews with teachers of Dutch | 59,751 | 34,064 | 25,687 |
c. | Spontaneous telephone dialogues (recorded via a switchboard) | 270,027 | 68,886 | 201,141 |
d. | Spontaneous telephone dialogues (recorded on MD via a local interface) | 6,257 | 6,257 | 0 |
e. | Simulated business negotiations | 25,485 | 0 | 25,485 |
f. | Interviews/discussions/debates (broadcast) | 100,250 | 25,144 | 75,106 |
g. | (political) Discussions/debates/meetings (non-broadcast) | 34,126 | 9,009 | 25,117 |
h. | Lessons recorded in the classroom | 36,064 | 10,103 | 25,961 |
i. | Live (e.g. sports) commentaries (broadcast) | 35,116 | 10,130 | 24,986 |
j. | News reports/reportages (broadcast) | 32,744 | 7,679 | 25,065 |
k. | News (broadcast) | 32,601 | 7,305 | 25,296 |
l. | Commentaries/columns/reviews (broadcast) | 32,502 | 7,431 | 25,071 |
m. | Ceremonious speeches/sermons | 7,077 | 1,893 | 5,184 |
n. | Lectures/seminars | 23,056 | 8,143 | 14,913 |
o. | Read speech | 135,071 | 64,848 | 70,223 |
Total | | 1,007,254 | 331,837 | 675,417 |
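The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the three numbers per component are, in order, the total, the VL count and the NL count (the column labels are not preserved with the figures, so this ordering is an assumption), and the names `ROWS`, `REPORTED_TOTAL` and `check_table` are hypothetical.

```python
# (component, total, VL, NL) word counts copied from Table 1.
# Assumption: the second figure is VL (Flanders), the third NL (Netherlands).
ROWS = [
    ("a", 177_127, 70_945, 106_182),
    ("b", 59_751, 34_064, 25_687),
    ("c", 270_027, 68_886, 201_141),
    ("d", 6_257, 6_257, 0),
    ("e", 25_485, 0, 25_485),
    ("f", 100_250, 25_144, 75_106),
    ("g", 34_126, 9_009, 25_117),
    ("h", 36_064, 10_103, 25_961),
    ("i", 35_116, 10_130, 24_986),
    ("j", 32_744, 7_679, 25_065),
    ("k", 32_601, 7_305, 25_296),
    ("l", 32_502, 7_431, 25_071),
    ("m", 7_077, 1_893, 5_184),
    ("n", 23_056, 8_143, 14_913),
    ("o", 135_071, 64_848, 70_223),
]
REPORTED_TOTAL = (1_007_254, 331_837, 675_417)


def check_table(rows, reported_total):
    """Return a list of inconsistencies; an empty list means all sums agree."""
    problems = []
    for component, total, vl, nl in rows:
        if vl + nl != total:
            problems.append(f"row {component}: {vl} + {nl} != {total}")
    # Column sums over all components: (total, VL, NL).
    column_sums = tuple(sum(row[i] for row in rows) for i in (1, 2, 3))
    if column_sums != reported_total:
        problems.append(f"column sums {column_sums} != {reported_total}")
    return problems


print(check_table(ROWS, REPORTED_TOTAL) or "all sums agree")
```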
A frequency list was derived from the manually verified data available in version 1.0. For each orthographic form in this part of the corpus, the list gives how often each phonetic transcription of that form occurs. A description can be found at ../../lexicon/freq_lst.htm. The frequency list itself (fonalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD.
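The idea behind such a list can be sketched as follows. This is not a reader for fonalph.frq, whose format is described in the documentation referred to above; it is a minimal illustration that assumes the verified data are available as (orthographic form, phonetic transcription) pairs. The function name `pronunciation_frequencies` and the example tokens are hypothetical.

```python
from collections import Counter, defaultdict


def pronunciation_frequencies(pairs):
    """Count, per orthographic form, how often each phonetic
    transcription occurs.

    `pairs` is an iterable of (orthographic, phonetic) tuples
    (an assumed input format, not the corpus file format).
    """
    table = defaultdict(Counter)
    for orthographic, phonetic in pairs:
        table[orthographic][phonetic] += 1
    return table


# Illustrative tokens only, not real corpus data.
tokens = [("het", "@t"), ("het", "hEt"), ("het", "@t"), ("is", "Is")]
frequencies = pronunciation_frequencies(tokens)
for orthographic, counter in frequencies.items():
    for phonetic, count in counter.most_common():
        print(orthographic, phonetic, count)
```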