<ftext> | text with a broad phonetic transcription, word segmentation and phone segmentation |
<fau> | an annotation unit. The boundaries of this element are determined by the punctuation mark. |
<fw> | a word within the annotation unit (<fau>). |
<fmu> | a mark-up unit that may comprise COMMENT or BACKGROUND information. |
<fm> | a marker within the mark-up unit (<fmu>). |
<fl> | a punctuation mark within the annotation unit (<fau>). |
ref | The reference code consists of one, two or three parts (depending on the element it is associated with), separated by full stops. The parts are, in order: <sample number>.<f[am]u rank number>.<f[wm] rank number> |
s | speaker ID. In the context of the <fau> element, possible values of this attribute are "Nxxxxx", "Vxxxxx" or "UNKOWN", where x denotes a digit. In the context of the <fmu> element, the s attribute takes one of two values: "COMMENT" or "BACKGROUND". |
w | the orthographic transcription of the word in the context of <fw>, or a punctuation mark (".", "..." or "?") in the context of <fl>. |
fon | the phonetic transcription of the word. Apart from the symbols of the phonetic symbol set (see the description of the .fon format), the percentage sign '%' is used to indicate a word-internal pause. |
left/right | the nature of the left/right boundary of the word. This attribute takes one of five possible values. |
marked | carries over the * coding of the original orthographic transcription (.ort format) as an optional attribute of the <fw> element. Possible values are: foreign, dialect, incomplete, mispr, regionalpr and uncertain. |
fq | quality of the time interval. It has one of the following three values: "man" (manually verified): time markers that have been inserted by a human; "auto" (automatically generated): time markers that have been generated by a machine and have not been validated; "auto_unrel" (automatically generated, unreliable): markers generated by a machine that are known to be unreliable. |
times | comprises the time stamps of the phone boundaries. The attribute always contains N+1 time stamps, where N = number of phonemes + any word-internal pauses ('%'). The first time stamp indicates the beginning of the first phoneme, the second one the beginning of the second phoneme, etc. The final time stamp indicates the end of the last phoneme. |
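The ref and times attributes described above can be unpacked along the following lines. This is only a minimal sketch: the names `Ref`, `parse_ref` and `phone_intervals` are hypothetical, the parts of ref are kept as plain strings, and the times value is assumed to be a whitespace-separated list of numbers, which this description does not state. Because the phonetic symbol set is defined elsewhere (.fon format), the caller is expected to supply the fon value already split into segments, with '%' counted as a segment.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Ref(NamedTuple):
    """Hypothetical container for the one to three parts of a ref value."""
    sample: str
    unit_rank: Optional[str] = None   # rank of the <fau> or <fmu>
    item_rank: Optional[str] = None   # rank of the word or marker inside it


def parse_ref(ref: str) -> Ref:
    """Split a ref value on full stops into one, two or three parts."""
    parts = ref.split(".")
    if not 1 <= len(parts) <= 3 or any(p == "" for p in parts):
        raise ValueError(f"expected one to three non-empty parts: {ref!r}")
    return Ref(*parts)


def phone_intervals(
    segments: Sequence[str], times: str
) -> List[Tuple[str, float, float]]:
    """Pair each segment (phone or '%') with its start and end time.

    ASSUMPTION: time stamps are whitespace-separated numbers.
    """
    stamps = [float(t) for t in times.split()]
    # N segments must come with exactly N+1 time stamps.
    if len(stamps) != len(segments) + 1:
        raise ValueError(
            f"{len(segments)} segments need {len(segments) + 1} "
            f"time stamps, got {len(stamps)}"
        )
    return [(seg, stamps[i], stamps[i + 1]) for i, seg in enumerate(segments)]
```

Each segment ends where the next one begins, so the N+1 time stamps yield exactly N contiguous intervals; a mismatch in the counts is reported rather than silently truncated.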
All characters from the ISO 8859-1 character set that were used in the transcription and fall outside the 7-bit range have been translated according to the character entity references for ISO 8859-1 characters. The set of special characters used is listed in the ttext.dtd, which can be found on the annotation DVD. The file entities.htm gives an overview of the various standards for this character (sub)set.
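When attribute values are read as raw strings rather than through an XML parser that loads the DTD, the entity references have to be resolved by hand. A minimal sketch, assuming the entity names match the standard ISO 8859-1 (Latin-1) names known to Python's `html` module; the helper name `decode_entities` is hypothetical:

```python
import html


def decode_entities(value: str) -> str:
    """Replace character entity references such as &eacute; by the character.

    ASSUMPTION: the entity names are the standard Latin-1 ones, which
    html.unescape recognises. It also resolves &amp;, &lt; and the like.
    """
    return html.unescape(value)


print(decode_entities("caf&eacute;"))
```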