All the recorded material was transcribed orthographically. The orthographic transcription is a verbatim record of what was actually said: repetitions, hesitations, false starts and the like were all transcribed. Background noise, on the other hand, was seldom represented in the transcriptions.
Below the role of the orthographic transcription is discussed in some detail, as are the aims that were pursued. Attention is also given to the protocol that was developed and the procedures that were followed. Information is included about the file types and formats that were used. Finally, an overview is presented of the data that are available in version 1.0.
The aim of the orthographic transcription of data in the Spoken Dutch Corpus was two-fold. First of all, it served to provide users with a simple symbolic representation of the audio file. This representation makes it easy to navigate through the corpus, to derive frequency information, etc. The orthographic transcription is one of the few transcriptions/annotations that are available for the entire corpus. Moreover, the transcription has been checked manually. Secondly, the orthographic transcription formed the basis for all other transcriptions and annotations.
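As an illustration of how frequency information can be derived from an orthographic transcription, a minimal Python sketch is given here. It is not the procedure used in the project: the function name `word_frequencies`, the whitespace tokenization, the lowercasing and the sample utterances are all assumptions made for the example.

```python
from collections import Counter


def word_frequencies(utterances):
    """Count word forms in an iterable of transcribed utterances.

    Assumption: tokens are separated by whitespace and case is ignored.
    The actual transcription files carry more structure than plain strings.
    """
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.lower().split())
    return counts


# Invented sample utterances, including a hesitation ("uh") and a repetition.
sample = ["ja dat is uh dat is goed", "dat is goed"]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Because hesitations and repetitions are transcribed verbatim, they are counted like any other word form unless they are filtered out beforehand.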
In view of the importance of the orthographic transcription, a great deal of thought was given at the beginning of the project to what the nature of the transcription ought to be and to formulating the principles underlying the transcription process. An account of the various considerations that were weighed can be found in the protocol for orthographic transcriptions. The following principles were adopted:
In order to facilitate the transcription process, use was made of the PRAAT software that was developed by Paul Boersma at the University of Amsterdam. In PRAAT it is not only possible to play the recording and to visualize the signal, it is also possible to produce and view orthographic transcriptions. For each speaker a separate tier is available.
During the process of transcription, short segments of approx. 3 seconds were marked off in the speech signal by means of markers. These markers were placed in naturally occurring pauses between words (please note that the places where these markers occur do not necessarily coincide with syntactic boundaries). Later on the markers were used as anchor points for the automatic segmentation.
In view of the principles that were adopted (see above) and the time and money available, a number of criteria were established that formed the basis for the Protocol voor orthografische transcriptie (Goedertier & Goddijn 2000; here available in .ps and .pdf format; Dutch only). These are
In order to obtain a transcription that would be as consistent as possible, the spelling of (known) words was checked on-line during the transcription process by means of a spell checker. If an error was detected, the transcriber was supposed to correct the error or to mark the word with one of the special symbols that had been specified in the protocol. Thus special symbols were defined for new (ie as yet unknown) words, but also for incomplete words, dialect words, etc. The marked words were validated by a lexicographer and then added to the lexicon.
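The lexicon check described above can be sketched roughly as follows. This is an illustration only, not the spell checker that was actually used: the function name `mark_unknown_words` and the marker `*?` are hypothetical, and the real special symbols are those specified in the protocol.

```python
def mark_unknown_words(tokens, lexicon, marker="*?"):
    """Return the tokens, with a marker appended to every token not in the lexicon.

    Assumptions: `lexicon` is a set of lowercase word forms, and `marker`
    is a hypothetical placeholder for the protocol's special symbols.
    """
    return [tok if tok.lower() in lexicon else tok + marker for tok in tokens]


# Invented mini-lexicon and utterance for demonstration purposes.
lexicon = {"dat", "is", "goed"}
marked = mark_unknown_words(["dat", "is", "goeie"], lexicon)
print(marked)
```

In the workflow described above, a transcriber would either correct a flagged word or replace the placeholder with the appropriate symbol, after which the word could be validated by the lexicographer.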
Accuracy
The procedure for producing an orthographic transcription was set up so as to yield as accurate a transcription as possible. After one transcriber had made a first transcription, a second transcriber would check this transcription. This would involve checking the correctness of the transcription (was everything that was said fully and correctly represented in the transcription, had the speech been attributed to the correct speaker(s), etc.).
The accuracy of the orthographic transcription was subjected to further checks as the data were passed on to receive further transcriptions and annotations. Whenever an error was detected, a bug report was filed. Then the transcription was checked once more and the error corrected.
Transparency
An attempt was made to keep the number of rules in the protocol to a minimum. This makes it easier for transcribers to memorize them and to apply them correctly. The protocol includes not only the rules for transcription but also a great many examples. As the protocol was being developed, the experiences gained by the transcribers were also taken into account. As a result the protocol has proven to be practicable.
The orthographic transcriptions are available in two formats:
Files in the TextGrid format are of the type .ort. These files can be found in the directory /data/annot/text/ort/ of the annotation DVD.
Files in the XML format can be found in the directories /data/annot/xml/pri/ and /data/annot/xml/skp/ of the annotation DVD.
Table 1 presents an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, see the documentation on the corpus design and motivation.
Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)
| | Component | Total number of words | VL | NL |
|---|---|---|---|---|
| a. | Spontaneous conversations ('face-to-face') | 2,626,172 | 878,383 | 1,747,789 |
| b. | Interviews with teachers of Dutch | 565,433 | 315,554 | 249,879 |
| c. | Spontaneous telephone dialogues (recorded via a switchboard) | 1,208,633 | 465,096 | 743,537 |
| d. | Spontaneous telephone dialogues (recorded on MD via a local interface) | 853,371 | 343,167 | 510,204 |
| e. | Simulated business negotiations | 136,461 | 0 | 136,461 |
| f. | Interviews/discussions/debates (broadcast) | 790,269 | 250,708 | 539,561 |
| g. | (political) Discussions/debates/meetings (non-broadcast) | 360,328 | 138,819 | 221,509 |
| h. | Lessons recorded in the classroom | 405,409 | 105,436 | 299,973 |
| i. | Live (eg sports) commentaries (broadcast) | 208,399 | 78,022 | 130,377 |
| j. | Newsreports/reportages (broadcast) | 186,072 | 95,206 | 90,866 |
| k. | News (broadcast) | 368,153 | 82,855 | 285,298 |
| l. | Commentaries/columns/reviews (broadcast) | 145,553 | 65,386 | 80,167 |
| m. | Ceremonious speeches/sermons | 18,075 | 12,510 | 5,565 |
| n. | Lectures/seminars | 140,901 | 79,067 | 61,834 |
| o. | Read speech | 903,043 | 351,419 | 551,624 |
| | Total | 8,916,272 | 3,261,628 | 5,654,644 |
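The figures in Table 1 can be cross-checked with a short Python sketch. The numbers are copied from the table; reading the two sub-total columns as VL followed by NL is an assumption based on the caption, and the variable names are chosen for the example only.

```python
# Per component: (total number of words, VL, NL), copied from Table 1.
# Assumption: the two sub-total columns are VL and NL, in that order.
table = {
    "a": (2626172, 878383, 1747789),
    "b": (565433, 315554, 249879),
    "c": (1208633, 465096, 743537),
    "d": (853371, 343167, 510204),
    "e": (136461, 0, 136461),
    "f": (790269, 250708, 539561),
    "g": (360328, 138819, 221509),
    "h": (405409, 105436, 299973),
    "i": (208399, 78022, 130377),
    "j": (186072, 95206, 90866),
    "k": (368153, 82855, 285298),
    "l": (145553, 65386, 80167),
    "m": (18075, 12510, 5565),
    "n": (140901, 79067, 61834),
    "o": (903043, 351419, 551624),
}

# Components whose two sub-totals do not add up to the stated total.
mismatches = [c for c, (total, vl, nl) in table.items() if vl + nl != total]

# Column sums, to compare with the "Total" row of the table.
grand_total, grand_vl, grand_nl = (sum(col) for col in zip(*table.values()))

print(mismatches)
print(grand_total, grand_vl, grand_nl)
```

An empty list of mismatches, together with column sums equal to the "Total" row, indicates that the table is internally consistent.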
On the basis of the data that are available in this release various word frequency lists were compiled. The different lists are the following: