Files of type .plk contain the part-of-speech tagging, the lemmatisation, the lexicon link-up and information about multi-word expressions.
In a .plk file two types of lines occur:
<au> | an annotation unit. The boundaries of this element are determined by the punctuation mark. |
<mu> | a mark-up unit that may contain COMMENT or BACKGROUND information. |
s | speakerID. In the context of the <au> element the possible values of this attribute are Nxxxxx, Vxxxxx or UNKNOWN, where x denotes a digit. In the context of the <mu> element the s-attribute may have either of two values: COMMENT or BACKGROUND. |
tb | time begin (in seconds) of the annotation unit. The time begin has been derived from the .ort file. A time marker may coincide with a sentence boundary, but this need not be the case. Therefore, the time begin may be somewhat earlier than the actual beginning of the sentence in the audio file. |
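The elements and attributes above could be read with a small helper such as the following. This is only a sketch: it assumes that attributes are written as name="value" inside the opening tag, which the description above does not state explicitly, and the function names parse_unit_tag and is_speaker_id are hypothetical.

```python
import re

# Assumption: an opening tag looks like <au s="N01001" tb="12.345">.
_TAG = re.compile(r'<(au|mu)\b([^>]*)>')
_ATTR = re.compile(r'(\w+)="([^"]*)"')
# Nxxxxx or Vxxxxx, where x denotes a digit.
_SPEAKER_ID = re.compile(r'^[NV]\d{5}$')


def parse_unit_tag(line):
    """Return (element, attributes) for an <au> or <mu> opening tag, else None."""
    match = _TAG.match(line.strip())
    if match is None:
        return None
    attrs = dict(_ATTR.findall(match.group(2)))
    if "tb" in attrs:
        # tb is the time begin in seconds.
        attrs["tb"] = float(attrs["tb"])
    return match.group(1), attrs


def is_speaker_id(value):
    """True for values of the form Nxxxxx or Vxxxxx."""
    return _SPEAKER_ID.match(value) is not None
```

For a made-up line such as <au s="N01001" tb="12.345">, parse_unit_tag returns the element name "au" together with a dictionary holding the speaker ID as a string and the time begin as a float. Lines that are not opening tags yield None.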
column1 | word form (token) as it occurs in the orthographic transcription (cf. data in the .ort files) |
column2 | part-of-speech tag that has been assigned to a token. For an overview of the tags that were used, see /data/annot/text/plk/tagset.txt on the annotation DVD. |
column3 | lemma of the token. The underscore ("_") indicates that a lemma is absent. |
column4 | lexicon-ID of the word form. The ID refers to the single-word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD) |
column5 | lexicon-ID of the lemma associated with the word form. The ID refers to the single-word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD) |
column6 | multi-word lemma (when different from column 3) |
column7 | lexicon-ID of the multi-word lemma. The ID refers to the multi-word lexicon (/data/lexicon/text/cgnmlex.txt on the annotation DVD) |
column8 | References to the different parts of the multi-word expression by means of the rank number of the word in the sentence. |
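As an illustration of the eight columns, a token line could be mapped onto named fields roughly as follows. The sketch assumes that the columns are separated by tabs and that trailing columns may be missing; neither is stated in the description above. The names PlkToken and parse_token_line are hypothetical.

```python
from typing import NamedTuple


class PlkToken(NamedTuple):
    word_form: str     # column1: token from the orthographic transcription
    pos_tag: str       # column2: part-of-speech tag
    lemma: str         # column3: lemma ("_" if absent)
    word_form_id: str  # column4: lexicon-ID of the word form
    lemma_id: str      # column5: lexicon-ID of the lemma
    mw_lemma: str      # column6: multi-word lemma
    mw_lemma_id: str   # column7: lexicon-ID of the multi-word lemma
    mw_parts: str      # column8: rank numbers of the parts of the expression


def parse_token_line(line, sep="\t"):
    """Split one token line into its eight columns.

    Assumption: columns are separated by `sep`; missing trailing
    columns are padded with empty strings.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) > 8:
        raise ValueError(f"expected at most 8 columns, got {len(fields)}")
    fields += [""] * (8 - len(fields))
    return PlkToken(*fields)
```

The IDs are kept as strings on purpose, because a single field may hold several IDs joined by separator characters.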
A lexicon-ID with the value "0" signifies that the lemma or the word form has not been linked up to the lexicon (i.e. is not considered to be part of a multi-word expression). Whenever an element of a multi-word expression has been omitted, as in ik deed (aandoen en uitdoen) het licht aan en uit, the lemmas that occur with the word form deed are separated by a forward slash ("/"). The same goes for the associated lexicon-IDs in the next column. When a lemma or word form is ambiguous (there are multiple references to the lexicon), the lexicon-IDs are separated by a vertical bar ("|").
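The separator conventions can be sketched as below. Note the assumption: the code treats "/" as the outer separator (one entry per lemma) and "|" as the inner one (alternative IDs for that lemma); the text does not say how the two combine. The function names and the IDs used in the usage note are made up for illustration.

```python
def split_ids(field):
    """Split a lexicon-ID field into one list of alternatives per lemma.

    Assumption: "/" separates the entries, "|" separates ambiguous
    alternatives within one entry.
    """
    return [part.split("|") for part in field.split("/")]


def is_linked(field):
    """False when every ID in the field is "0" (no link to the lexicon)."""
    return any(i != "0" for part in split_ids(field) for i in part)


def pair_lemmas(lemma_field, id_field):
    """Pair each "/"-separated lemma with its list of lexicon-IDs."""
    return list(zip(lemma_field.split("/"), split_ids(id_field)))
```

With the invented IDs 11 and 22, pair_lemmas("aandoen/uitdoen", "11/22") pairs aandoen with ["11"] and uitdoen with ["22"], while split_ids("11|12") gives a single entry with two alternatives.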