Frequency lists
On the basis of the data in the corpus a number
of frequency lists were derived that provide information regarding the frequency
of occurrence of word forms, POS tags and lemmas and combinations of these.
There is also a frequency list of all the word forms and their phonetic transcriptions
which is based on the data for which a manually verified phonetic transcription
is available. The frequency lists can be found in the directory /data/lexicon/
of the annotation DVD; all files can be identified by their extension, .frq.
Special codes may be attached to the word forms. A forward slash separates the
word form from the code (e.g. wonderful/foreign). The following codes
are used:
- 'dialect' for dialect words;
- 'foreign' for foreign words;
- 'incomplete' for incomplete words;
- 'mispr' for words that were mispronounced;
- 'regionalpr' for words that are pronounced
with a strong local/regional accent;
- 'uncertain' for words that are difficult
to hear.
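The code convention above can be handled with a small helper. The following is a minimal sketch, not part of the corpus tooling: the function name split_code is hypothetical, and it assumes that at most one code is attached and that it follows the last slash of the token.

```python
# Codes listed in this documentation.
KNOWN_CODES = {"dialect", "foreign", "incomplete", "mispr", "regionalpr", "uncertain"}


def split_code(token):
    """Split a token such as 'wonderful/foreign' into (word form, code).

    Returns (token, None) when no known code is attached, so that a
    slash which is part of the word form itself is left untouched.
    Assumption: at most one code, placed after the last slash.
    """
    word_form, slash, code = token.rpartition("/")
    if slash and code in KNOWN_CODES:
        return word_form, code
    return token, None
```

Checking the part after the slash against the list of known codes, rather than splitting on every slash, keeps tokens that merely contain a slash intact.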
The following types of frequency list are distinguished:
- totalph
an alphabetical word frequency list in which the frequency
of occurrence is listed for all the word forms in all the data in version 1.0;
the columns list the following information:
- the rank number of the word form;
- the total frequency of the word form in
the entire corpus;
- the word form.
- totrank
a word frequency list presented as a rank order
list, again based on all the data in the corpus; the columns list the following
information:
- the rank number of the word form, the
highest ranking item occurring at the top of the list;
- the total frequency of the word form in
the entire corpus;
- the word form.
- areaalph
an alphabetical word frequency list in which
a distinction is made between data originating from Flanders and data originating
from the Netherlands; the columns list the following information:
- the rank number of the word form;
- the total frequency of the word form in
the Dutch data;
- the total frequency of the word form in
the Flemish data;
- the total frequency of the word form in
the entire corpus;
- the word form.
- arearank
a word frequency list presented as a rank order
list in which a distinction is made between data originating from Flanders and
data originating from the Netherlands; the columns list the following information:
- the rank number of the word form, the
highest ranking item occurring at the top of the list;
- the total frequency of the word form in
the Dutch data;
- the total frequency of the word form in
the Flemish data;
- the total frequency of the word form in
the entire corpus;
- the word form.
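As an illustration of the five-column layout of the area lists, the sketch below reads one line into a named record. It assumes that the columns are separated by whitespace and appear in the order given above; the actual separator is not specified here. The names AreaEntry, parse_area_line and flemish_share are hypothetical.

```python
from typing import NamedTuple


class AreaEntry(NamedTuple):
    rank: int        # rank number of the word form
    nl_freq: int     # frequency in the Dutch data
    vl_freq: int     # frequency in the Flemish data
    total: int       # frequency in the entire corpus
    word_form: str


def parse_area_line(line):
    """Parse one line of an area list (assumed whitespace-separated)."""
    rank, nl, vl, total, word_form = line.split(None, 4)
    return AreaEntry(int(rank), int(nl), int(vl), int(total), word_form.strip())


def flemish_share(entry):
    """Fraction of all occurrences that come from the Flemish data."""
    return entry.vl_freq / entry.total if entry.total else 0.0
```

For example, parse_area_line("7 30 10 40 wonderful") (invented figures) yields an entry whose Flemish share is 0.25.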
- typealph
an alphabetical word frequency list in which
the 15 components (speech types) in the corpus are distinguished; the columns
list the following information:
- the rank order of the word form;
- the total frequency of the word form per
component (components a-o);
- (...)
- the total frequency of the
word form in the entire corpus;
- the word form.
- typerank
a word frequency list presented as a rank order
list in which the 15 components (speech types) in the corpus are distinguished;
the columns list the following information:
- the rank order of the word form, the highest
ranking item occurring at the top of the list;
- the total frequency of the word form per
component (components a-o);
- (...)
- the total frequency of the
word form in the entire corpus;
- the word form.
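The per-component layout of the type lists might be read as sketched below. This is only an illustration under stated assumptions: columns are taken to be whitespace-separated, with one frequency column for each of the 15 components in the order a to o, followed by the corpus total and the word form. The name parse_type_line is hypothetical.

```python
import string

# The 15 components (speech types) are labelled a to o.
COMPONENTS = string.ascii_lowercase[:15]


def parse_type_line(line):
    """Parse one line of a type list.

    Assumed layout: rank, 15 per-component frequencies, total, word form.
    Returns (rank, {component: frequency}, total, word form).
    """
    n = len(COMPONENTS)
    fields = line.split(None, n + 2)
    if len(fields) != n + 3:
        raise ValueError("unexpected number of columns: %d" % len(fields))
    rank = int(fields[0])
    per_component = dict(zip(COMPONENTS, (int(f) for f in fields[1:n + 1])))
    total = int(fields[n + 1])
    word_form = fields[n + 2].strip()
    return rank, per_component, total, word_form
```

Keeping the component frequencies in a dictionary keyed by component letter makes it possible to look up, say, the frequency in component c without counting columns.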
- tagalph
an alphabetical frequency list of all POS tags;
this list is structured as follows:
- [part-of-speech frequency] [part-of-speech]
- [tag frequency per part-of-speech] [POS tag]
- lemalph
a frequency list of lemmas with the associated
word forms and POS tags; this list is structured as follows:
- [NL-freq. lemma] [VL-freq. lemma] [tot. freq. lemma] [lemma]
- [NL-freq. word form-tag] [VL-freq. word form-tag] [tot. freq. word form-tag] [tag] [word form]
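The two-level structure of the lemma list (a lemma line followed by its word form-tag lines) could be rebuilt as in the sketch below. How the two kinds of line are told apart in the actual file is not stated here; this sketch assumes whitespace-separated fields that contain no spaces themselves, so that a lemma line has four fields and a word form-tag line has five. The name parse_lemma_list and the dictionary keys are hypothetical.

```python
def parse_lemma_list(lines):
    """Group word form-tag lines under the lemma line that precedes them.

    Assumption: lemma lines have 4 fields (NL, VL, total, lemma) and
    word form-tag lines have 5 (NL, VL, total, tag, word form).
    Lines with any other number of fields are skipped.
    """
    lemmas = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            nl, vl, total, lemma = fields
            lemmas.append({"lemma": lemma, "nl": int(nl), "vl": int(vl),
                           "total": int(total), "forms": []})
        elif len(fields) == 5 and lemmas:
            nl, vl, total, tag, word_form = fields
            lemmas[-1]["forms"].append({"word_form": word_form, "tag": tag,
                                        "nl": int(nl), "vl": int(vl),
                                        "total": int(total)})
    return lemmas
```

If the real files mark the second level by indentation instead, the test on the number of fields would need to be replaced by a test on leading whitespace.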
- fonalph
a frequency list of word forms and their phonetic
transcriptions; this list is structured as follows:
- [NL-freq. word form] [VL-freq. word form] [tot. freq. word form] [word form]
- [NL-freq. pron.] [VL-freq. pron.] [tot. freq. pron.] [pron.]
Please note that this list is based exclusively
on the data in the corpus for which a manually verified broad phonetic transcription
is available.