Broad phonetic transcription

Part of the data in the corpus was enriched with a manually verified broad phonetic transcription. The manual work consisted of verifying and correcting an automatically generated phonetic transcription. The transcriptions are broad in the sense that the predefined phoneme set contains no symbols for allophonic variation and no diacritics.

This page describes the aims, the transcription procedure that was adopted, and the protocol that was used. It also points to the relevant file types and formats, and closes with an overview of the data that are available in version 1.0.

Read more about

aim and motivation
procedure
protocol
file types and formats
overview of available data
frequency information

Aim and motivation

The main goal was to obtain verified broad phonetic transcriptions of the material by applying insertions, deletions and substitutions to the given automatically generated transcription. Gradual processes, such as voicing and devoicing of plosives and fricatives or diphthongization and monophthongization of vowels, are not transcribed in this broad phonetic transcription.

The phoneme set that was used is described here (in a .ps file or a .pdf file).

Procedure

Starting from an automatically generated transcription not only made the transcription procedure more efficient in terms of time, it also increased the consistency between the transcribers. The human transcribers’ task was to listen to the speech signal and decide for each symbol in the given transcription whether it should be deleted or substituted by another phoneme, and whether one or more phonemes were missing.
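
The three kinds of correction described above (deletion, substitution, insertion) correspond to the edit operations of a standard sequence alignment. As a rough illustration only, the Python sketch below compares a given transcription with a verified one and lists the edits between them. It is not part of the corpus tooling: the function name `edit_operations` and the example symbols are assumptions made for this illustration, and plain Levenshtein alignment is used here as a stand-in, not as the method the transcribers or the corpus developers used.

```python
def edit_operations(auto, verified):
    """List the edits that turn the sequence `auto` into `verified`."""
    n, m = len(auto), len(verified)
    # cost[i][j] = minimum number of edits to turn auto[:i] into verified[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (auto[i - 1] != verified[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and auto[i - 1] != verified[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + differs:
            if differs:
                ops.append(("substitute", auto[i - 1], verified[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", auto[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, verified[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Invented example symbols: the final "n" of the given transcription
# was judged not to be realised in the speech signal.
print(edit_operations(["l", "o", "p", "@", "n"], ["l", "o", "p", "@"]))
# prints [('delete', 'n', None)]
```

When several alignments have the same cost, the traceback simply returns one of them, so the listed operations are one possible reading of the differences, not a unique one.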

The PRAAT software was used to create the manual broad phonetic transcription. One of the advantages of PRAAT is that it can display both an oscillogram of the speech signal and the accompanying transcription, and can replay the speech signal if required. It was decided to display only the given phonetic transcription, without the original orthographic transcription. The more conversational and extemporaneous types of speech, which are known to be more difficult to transcribe, were transcribed in two rounds to ensure that the quality of the transcription would meet a certain level: the transcription of one human transcriber was submitted to a second transcriber, who was asked to verify and correct it.

More information about the procedure and the transcription quality of the final result can be found in Goddijn and Binnenpoorte (2003).

Reference:

S. Goddijn & D. Binnenpoorte, ‘Assessing Manually Corrected Broad Phonetic Transcriptions in the Spoken Dutch Corpus’, in Proceedings of the 15th ICPhS, Barcelona, Spain, pp. 1361-1364, 2003.

Protocol

The transcription rules are laid down in a protocol (Gillis, 2001) in order to increase the consistency between the transcribers. The protocol describes the phoneme set and additional symbols, and gives many examples of how to use the phonemes. One of the main guidelines was not to place too much confidence in the given transcription, but to decide on a phoneme on the basis of one's own perception. Only in case of doubt could the original symbol be retained in the transcription.

Reference:

Gillis, S. 2001. Protocol voor brede fonetische transcriptie. (Available here in .ps and .pdf format; Dutch only.)

File types and formats

The broad phonetic transcriptions are stored in the following types of files:

The manual transcriptions created with PRAAT have the extension .fon. These files are (short) TextGrids and can be found in the directory /data/annot/text/fon/ of the annotation DVD.
An XML conversion of the .fon files is stored in files with the extension .bpt. These .bpt files show the link between each original orthographic word and its phonetic transcription, based on the manually verified word segmentation. The XML files can be found in the directory /data/annot/xml/bpt-fon/ of the annotation DVD.
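
For readers who want to process the .fon files outside PRAAT, the following Python sketch reads the interval tiers of a short-format TextGrid. It assumes that the files follow PRAAT's standard short text format (one value per line) and that every label fits on a single line. The tier name and labels in `SAMPLE` are invented, the helper names are hypothetical, and file encoding is not addressed, so treat this as a starting point and not as a description of the actual files on the DVD.

```python
def unquote(value):
    """Strip the surrounding quotes of a PRAAT string and undouble inner quotes."""
    return value[1:-1].replace('""', '"')


def parse_short_textgrid(text):
    """Return {tier name: [(start, end, label), ...]} for all interval tiers."""
    # Drop blank lines; every remaining line holds exactly one value.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != 'File type = "ooTextFile short"' or lines[1] != '"TextGrid"':
        raise ValueError("not a short-format TextGrid")
    pos = 4  # skip the two header lines and the global start and end times
    if lines[pos] != "<exists>":
        return {}  # the TextGrid contains no tiers
    n_tiers = int(lines[pos + 1])
    pos += 2
    tiers = {}
    for _ in range(n_tiers):
        tier_class = unquote(lines[pos])
        name = unquote(lines[pos + 1])
        count = int(lines[pos + 4])  # after the tier's own start and end times
        pos += 5
        if tier_class == "IntervalTier":
            intervals = []
            for _ in range(count):
                start, end = float(lines[pos]), float(lines[pos + 1])
                intervals.append((start, end, unquote(lines[pos + 2])))
                pos += 3
            tiers[name] = intervals
        else:
            pos += 2 * count  # point tiers store a time and a label per point
    return tiers


# Invented sample: one tier with two labelled intervals.
SAMPLE = '''File type = "ooTextFile short"
"TextGrid"

0
0.9
<exists>
1
"IntervalTier"
"speaker1"
0
0.9
2
0
0.4
"ja"
0.4
0.9
"nE"
'''

for start, end, label in parse_short_textgrid(SAMPLE)["speaker1"]:
    print(start, end, label)
```

For real files, read the content with the encoding that matches the DVD's files before passing it to `parse_short_textgrid`.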

Overview of available data

Table 1 presents an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and its motivation, see the corpus design and motivation page.

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)

Component	Description	Total number of words	VL	NL
a.	Spontaneous conversations ('face-to-face')	177,127	70,945	106,182
b.	Interviews with teachers of Dutch	59,751	34,064	25,687
c.	Spontaneous telephone dialogues (recorded via a switchboard)	270,027	68,886	201,141
d.	Spontaneous telephone dialogues (recorded on MD via a local interface)	6,257	6,257	0
e.	Simulated business negotiations	25,485	0	25,485
f.	Interviews/discussions/debates (broadcast)	100,250	25,144	75,106
g.	(political) Discussions/debates/meetings (non-broadcast)	34,126	9,009	25,117
h.	Lessons recorded in the classroom	36,064	10,103	25,961
i.	Live (e.g. sports) commentaries (broadcast)	35,116	10,130	24,986
j.	News reports/reportages (broadcast)	32,744	7,679	25,065
k.	News (broadcast)	32,601	7,305	25,296
l.	Commentaries/columns/reviews (broadcast)	32,502	7,431	25,071
m.	Ceremonious speeches/sermons	7,077	1,893	5,184
n.	Lectures/seminars	23,056	8,143	14,913
o.	Read speech	135,071	64,848	70,223
Total	1,007,254	331,837	675,417

Frequency information

A frequency list was derived from the manually verified data available in version 1.0. For each orthographic word form in this part of the corpus, the list gives the frequency with which each of its phonetic transcriptions occurs. A description can be found at ../../lexicon/freq_lst.htm. The frequency list itself (fonalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD.
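
To make the idea of such a list concrete, the Python sketch below counts how often each phonetic transcription occurs for a given orthographic word. It works on an in-memory list of (orthographic, phonetic) pairs and makes no assumption about the layout of fonalph.frq. The word forms and transcriptions are invented for illustration, and the function name `transcription_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict


def transcription_frequencies(pairs):
    """Map each orthographic word to a Counter of its phonetic transcriptions."""
    freq = defaultdict(Counter)
    for orth, phon in pairs:
        freq[orth][phon] += 1
    return freq


# Invented (orthographic, phonetic) pairs; not taken from the corpus.
pairs = [
    ("natuurlijk", "natyrl@k"),
    ("natuurlijk", "tyk"),
    ("natuurlijk", "natyrl@k"),
    ("het", "@t"),
]

freq = transcription_frequencies(pairs)
for orth in sorted(freq):
    # most_common() orders the variants from most to least frequent
    for phon, count in freq[orth].most_common():
        print(orth, phon, count)
```

In practice the pairs could be collected from the word-to-transcription links in the .bpt files, once their XML structure has been inspected.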
