Orthographic transcription

All the recorded material was transcribed orthographically. The orthographic transcription is a verbatim record of what was actually said: repetitions, hesitations, false starts and the like were all transcribed. Background noise, on the other hand, was seldom represented in the transcriptions.

Below, the role of the orthographic transcription is discussed in some detail, as are the aims that were pursued. Attention is also given to the protocol that was developed and the procedures that were followed. Information is included about the file types and formats that were used. Finally, an overview is presented of the data that are available in version 1.0.

Read more about

aim and motivation
procedure
protocol
file types and formats
overview of available data
word frequency lists

Aim and motivation

The aim of the orthographic transcription of data in the Spoken Dutch Corpus was two-fold. First of all, it served to provide users with a simple symbolic representation of the audio file. By means of this representation it is easy to navigate through the corpus, it is possible to derive frequency information, etc. The orthographic transcription is one of the few transcriptions/annotations that are available for the entire corpus. Moreover, the transcription has been checked manually. Secondly, the orthographic transcription formed the basis for all other transcriptions and annotations.

In view of the importance of the orthographic transcription, a great deal of attention was devoted at the beginning of the project to the nature of the transcription and to formulating the principles underlying the transcription process. An account of the various considerations that were weighed can be found in the protocol for orthographic transcriptions. The following principles were adopted:

the orthographic transcription was to follow the international standards that are/have been used with other large (spoken) corpora. The EAGLES guidelines and the documentation with the CHILDES project have influenced the specification of the Spoken Dutch Corpus protocol for orthographic transcription. The documentation with other large speech corpora (eg Switchboard which is available through the Linguistic Data Consortium - LDC) was also taken into consideration. Whenever the Spoken Dutch Corpus protocol deviates from recommendations/guidelines provided elsewhere, this decision is motivated.
the orthographic transcription should require a minimum of interpretation. Thus grammatical 'errors' were not to be corrected and broken-off words were written down as such (they remained incomplete). In line with the recommendations made in eg the documentation with the Switchboard and SpeechDat corpora, it was decided to adopt normal common spelling conventions.
the orthographic transcription should be linked to the speech signal. While the transcription was being produced, anchor points were introduced to mark off brief stretches of speech (approx. 3 seconds). Thus it became possible to identify words or phrases in the speech signal. Moreover, the short segments were convenient for the transcribers. While transcribing, they would usually play and then replay a segment (repeatedly) before completing the transcription.
the orthographic transcription should be of use to various types of user: language engineers, linguists, lexicographers, phoneticians, etc. Although many representatives of these different user groups were consulted in the developmental stage, it has not been possible to come to a unanimous decision regarding the transcription specifications and procedures.

Procedure

In order to facilitate the transcription process, use was made of the PRAAT software that was developed by Paul Boersma at the University of Amsterdam. In PRAAT it is not only possible to play the recording and to visualize the signal, it is also possible to produce and view orthographic transcriptions. For each speaker a separate tier is available.

During transcription, short segments of approx. 3 seconds were marked off in the speech signal by means of markers. These markers were placed in naturally occurring pauses between words (please note that the places where these markers occur do not necessarily coincide with syntactic boundaries). Later on the markers were used as anchor points for the automatic segmentation.

Protocol

In view of the principles that were adopted (see above) and the time and money available, a number of criteria were established that formed the basis for the Protocol voor orthografische transcriptie (Goedertier & Goddijn 2000; available in .ps and .pdf format; Dutch only). These are:

consistency
accuracy
transparency

Consistency
The experience gained in a number of other projects (eg Switchboard, SpeechDat) is that it is advisable to maintain standard spelling conventions. This is generally easier for the transcribers, while it also contributes to the degree of consistency. Therefore, in the Spoken Dutch Corpus project standard spelling conventions were used. However, in order to further increase consistency in the transcriptions, in a number of cases it was decided to deviate from standard conventions. This is for instance the case for the use of punctuation marks and the use of capital and small letters.

In order to obtain a transcription that would be as consistent as possible, the spelling of (known) words was checked on-line during the transcription process by means of a spell checker. If an error was detected, the transcriber was supposed to correct the error or to mark the word with one of the special symbols that had been specified in the protocol. Special symbols were defined for new (ie as yet unknown) words, but also for incomplete words, dialect words, etc. The marked words were validated by a lexicographer and then added to the lexicon.
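The check described above can be sketched roughly as follows. This is only an illustration in Python, not the tooling used in the project: the lexicon contents, the function names and the marker strings (*n for a new word, *a for an incomplete word, *d for a dialect word) are assumptions made for this example. The actual symbols are those specified in the protocol.

```python
# Hypothetical marker suffixes, chosen for illustration only;
# the real symbols are defined in the transcription protocol.
MARKERS = ("*n", "*a", "*d")


def find_unresolved(tokens, lexicon):
    """Return tokens that are neither in the lexicon nor explicitly marked."""
    unresolved = []
    for token in tokens:
        if token.endswith(MARKERS):
            continue  # already flagged by the transcriber for later validation
        if token.lower() not in lexicon:
            unresolved.append(token)  # must be corrected or marked
    return unresolved


def marked_words(tokens):
    """Collect (word, marker) pairs to pass on to the lexicographer."""
    return [(t[:-2], t[-2:]) for t in tokens if t.endswith(MARKERS)]


# Toy lexicon and a toy transcribed segment (invented data).
lexicon = {"dat", "is", "een", "mooi", "huis"}
tokens = ["dat", "is", "een", "mooi", "hui*a", "huis", "gezelligerd*n", "hius"]

unresolved = find_unresolved(tokens, lexicon)
to_validate = marked_words(tokens)
print(unresolved)
print(to_validate)
```

In this toy segment only the misspelt "hius" is left for the transcriber to correct or mark, while the two marked words are collected for validation.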

Accuracy
The procedure for producing an orthographic transcription was set up so as to yield as accurate a transcription as possible. After one transcriber had made a first transcription, a second transcriber would check this transcription. This would involve checking the correctness of the transcription (was everything that was said fully and correctly represented in the transcription, had the speech been attributed to the correct speaker(s), etc.).

The accuracy of the orthographic transcription was subjected to further checks as the data were passed on to receive further transcriptions and annotations. Whenever an error was detected, a bug report was filed. Then the transcription was checked once more and the error corrected.

Transparency
An attempt has been made to keep the number of rules in the protocol down to a minimum. This makes it easier for transcribers to memorize them and to apply them correctly. The protocol includes not only the rules for transcription but also a great many examples. As the protocol was being developed, the experiences gained by the transcribers were also taken into account. As a result the protocol has proven to be practicable.

References

Gibbon, D., R. Moore & R. Winski. 1997. Handbook of Standards and Resources for Spoken Language Systems. The Hague: Mouton.
MacWhinney, B. 1999. The CHILDES Project: Tools for Analyzing Talk (2nd ed.) Hillsdale: Lawrence Erlbaum Associates.
Switchboard: A User's Manual. LDC. 1994. http://www.ldc.upenn.edu/readme_files/switchboard.readme.html
Senia, F. & J. Van Velden. 1997. Specification of Orthographic Transcription and Lexicon Conventions. SpeechDat technical report. SD1.3.3. http://www.speechdat.org/SpeechDat.html, deliverables.
Verbmobil. http://www.phonetik.uni-muenchen.de/Verbmobil.html

File types and formats

The orthographic transcriptions are available in two formats:

the (short) TextGrid format as it is generated by the PRAAT software; files in this format can be imported into PRAAT again;
XML format. The orthographic transcription and speaker information (speaker ID/IDs) are stored in files of type .pri, while the time information is stored in files of type .skp.

For a detailed description of these formats, see the descriptions of the ort format, the pri format and the skp format.

Files in the TextGrid format are of the type .ort. These files can be found in the directory /data/annot/text/ort/ of the annotation DVD.
Files in the XML format can be found in the directories /data/annot/xml/pri/ and /data/annot/xml/skp/ of the annotation DVD.
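To illustrate how the TextGrid files can be processed outside PRAAT, the following Python sketch reads the interval tiers of a short-format TextGrid. It assumes the layout that PRAAT writes for short text files (one value per line, labels in double quotes) and that no label spans more than one line. The sample content, the speaker IDs SPK1 and SPK2 and the function names are invented for the example; the .ort files on the DVD may need to be decompressed and decoded first.

```python
def unquote(s):
    """Strip the surrounding quotes and un-double embedded quotes."""
    return s[1:-1].replace('""', '"')


def parse_short_textgrid(text):
    """Return {tier_name: [(start, end, label), ...]} for all interval tiers."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != 'File type = "ooTextFile short"' or lines[1] != '"TextGrid"':
        raise ValueError("not a short-format TextGrid")
    pos = 5  # skip file type, object class, xmin, xmax and the <exists> flag
    n_tiers = int(lines[pos])
    pos += 1
    tiers = {}
    for _ in range(n_tiers):
        tier_class = unquote(lines[pos])
        name = unquote(lines[pos + 1])
        if tier_class != "IntervalTier":
            raise ValueError(f"unsupported tier class: {tier_class}")
        n_intervals = int(lines[pos + 4])  # follows the tier's xmin and xmax
        pos += 5
        intervals = []
        for _ in range(n_intervals):
            start, end = float(lines[pos]), float(lines[pos + 1])
            intervals.append((start, end, unquote(lines[pos + 2])))
            pos += 3
        tiers[name] = intervals
    return tiers


# Invented sample: two speaker tiers, each with two segments.
sample = '''File type = "ooTextFile short"
"TextGrid"

0
6.0
<exists>
2
"IntervalTier"
"SPK1"
0
6.0
2
0
3.1
"ja dat klopt"
3.1
6.0
""
"IntervalTier"
"SPK2"
0
6.0
2
0
3.1
""
3.1
6.0
"uh nee"
'''

tiers = parse_short_textgrid(sample)
for speaker, intervals in tiers.items():
    for start, end, label in intervals:
        if label:
            print(speaker, start, end, label)
```

Each speaker has a separate tier, so the result is keyed by tier name; empty labels correspond to stretches in which that speaker is silent.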

Overview of available data

In Table 1 an overview is presented of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, we refer to the corpus design and motivation.

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)

Component	Total number of words	VL	NL
a. Spontaneous conversations ('face-to-face')	2,626,172	878,383	1,747,789
b. Interviews with teachers of Dutch	565,433	315,554	249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)	1,208,633	465,096	743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface)	853,371	343,167	510,204
e. Simulated business negotiations	136,461	0	136,461
f. Interviews/discussions/debates (broadcast)	790,269	250,708	539,561
g. (political) Discussions/debates/meetings (non-broadcast)	360,328	138,819	221,509
h. Lessons recorded in the classroom	405,409	105,436	299,973
i. Live (eg sports) commentaries (broadcast)	208,399	78,022	130,377
j. Newsreports/reportages (broadcast)	186,072	95,206	90,866
k. News (broadcast)	368,153	82,855	285,298
l. Commentaries/columns/reviews (broadcast)	145,553	65,386	80,167
m. Ceremonious speeches/sermons	18,075	12,510	5,565
n. Lectures/seminars	140,901	79,067	61,834
o. Read speech	903,043	351,419	551,624
Total	8,916,272	3,261,628	5,654,644
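The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below copies the word counts from the table (component letter, total, VL, NL) and verifies that VL and NL add up to the total for each component and that the columns add up to the Total row. The variable names are chosen for this example only.

```python
# Word counts copied from Table 1: component, total, VL, NL.
TABLE = """\
a 2,626,172 878,383 1,747,789
b 565,433 315,554 249,879
c 1,208,633 465,096 743,537
d 853,371 343,167 510,204
e 136,461 0 136,461
f 790,269 250,708 539,561
g 360,328 138,819 221,509
h 405,409 105,436 299,973
i 208,399 78,022 130,377
j 186,072 95,206 90,866
k 368,153 82,855 285,298
l 145,553 65,386 80,167
m 18,075 12,510 5,565
n 140,901 79,067 61,834
o 903,043 351,419 551,624
"""
TOTAL_ROW = (8_916_272, 3_261_628, 5_654_644)  # total, VL, NL


def to_int(s):
    """Convert a number with thousands separators, e.g. '2,626,172'."""
    return int(s.replace(",", ""))


rows = {}
for line in TABLE.splitlines():
    component, total, vl, nl = line.split()
    rows[component] = (to_int(total), to_int(vl), to_int(nl))

# Components for which VL + NL does not equal the stated total.
row_mismatches = [c for c, (total, vl, nl) in rows.items() if vl + nl != total]

# Column sums, to be compared with the Total row.
column_sums = tuple(sum(r[i] for r in rows.values()) for i in range(3))

print(row_mismatches)            # → []
print(column_sums == TOTAL_ROW)  # → True
```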

Word frequency lists

On the basis of the data that are available in this release various word frequency lists were compiled. The different lists are the following:

an alphabetical word frequency list which gives information about the frequency of occurrence of words (ie word forms) in the entire corpus (totalph.frq);
a word frequency list presented as a rank order list, again based on all the data in the corpus (totrank.frq);
an alphabetical word frequency list in which a distinction is made between the Flemish and the Dutch data (areaalph.frq);
a word frequency list presented as a rank order list, again making a distinction between the Flemish data and the Dutch data (arearank.frq);
an alphabetical word frequency list in which the different components in the corpus are distinguished (typealph.frq);
a word frequency list presented as a rank order list, in which a distinction is made between the different components in the corpus (typerank.frq).

A description of the different lists can be found at ../../lexicon/freq_lst.htm. The frequency lists themselves can be found in the directory /data/lexicon/ of the annotation DVD.
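The difference between the alphabetical lists, the rank order lists and the lists that distinguish subsets of the data can be illustrated with a small Python sketch. It does not reproduce the layout of the .frq files, and the tokens, the function names and the choice to break ties in rank alphabetically are assumptions made for the example.

```python
from collections import Counter


def alphabetical_list(counts):
    """Word forms in alphabetical order, each with its frequency."""
    return sorted(counts.items())


def rank_list(counts):
    """Word forms by descending frequency; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def counts_by_group(tokens):
    """Separate counts per group (eg region or component) from (group, word) pairs."""
    per_group = {}
    for group, word in tokens:
        per_group.setdefault(group, Counter())[word] += 1
    return per_group


# Invented tokens, each labelled with the region it originates from.
tokens = [("NL", "ja"), ("NL", "dat"), ("VL", "ja"),
          ("VL", "ja"), ("NL", "is"), ("VL", "dat")]

total = Counter(word for _, word in tokens)
by_region = counts_by_group(tokens)

print(alphabetical_list(total))
print(rank_list(total))
print(rank_list(by_region["VL"]))
```

The same grouping function would serve for the lists per component if each token were labelled with its component letter instead of its region.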
