Lexicon link-up

All word forms (tokens) in the corpus have been lemmatised. Lemmatisation followed the word-for-word principle: each token was lemmatised on its own, even when it was part of a multi-word unit or of a discontinuous verb or preposition. Only later, in a separate phase that we refer to here as lexicon link-up, were these items lemmatised as multi-word units. These lemmas refer to the lexicon, so that the parts of multi-word expressions can be searched.

Below we discuss in some detail the aim and motivation for the lexicon link-up. Attention is also given to the protocol and the procedure that were developed. We also provide information about the file types and formats. Finally, an overview is presented of the data that are available in version 1.0.
 




Aim and motivation

To comply with existing annotation standards, and for more practical reasons (e.g. the automation of the annotation procedures and the synchronisation of different annotations), the word was taken as the base unit for annotation. However, since various grammatical and lexicological theories recognise the multi-word unit as a separate entity, we decided to mark some types of multi-word units. These multi-word units have been included in the lexicon and can be used to specify complex searches.

The following multi-word expressions are distinguished:

Other multi-word structures have been considered for inclusion. These include compound prepositions (bij monde van, met het oog op) and idiomatic expressions that permit a certain variability (in mijn/je/haar/zijn ... nopjes zijn, een modderfiguur slaan). Because it is impossible to formulate a set of criteria by which these groups of items can be delimited, we decided not to include them as multi-word units. For two other categories, discontinuous prepositions (tussen ... in, van ... af) and discontinuous pronominal adverbs (er ... doorheen, daar ... mee), the lexicon link-up was considered too difficult because of the complex orthographical and grammatical relationship with the possibly discontinuous verbs. Multi-word units showing contraction (in- en uitvoer, probleemformulering of -oplossing) should be linked up to the lexicon in a follow-up project.
 

Procedure

In order to facilitate the annotation process, we decided to transcribe and tag the samples first. Once the results had been verified, the orthographic transcription and the POS tags were used as input for the lexicon link-up.

An inventory was made of all the possibly discontinuous verbs and foreign expressions that occurred in various lexical resources and in the corpus. This inventory served to automatically mark instances in the corpus as possible multi-word expressions. The same procedure was adopted for all continuous sequences of words starting with a capital letter (these were candidates for being identified as compound proper names or titles). A test was conducted involving a subset of the corpus comprising some 1 million tokens. All multi-word units were checked. As a result, we were able to further improve the protocol.
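The automatic marking of candidates described above can be illustrated with a minimal sketch. It is not the tooling that was actually used for the corpus: the function names, the representation of the inventory as a set of lower-cased token tuples, and the minimum run length of two tokens are all assumptions made for the purpose of illustration. The first function finds continuous sequences of capitalised words (candidate compound names or titles); the second looks up continuous token sequences in an inventory such as a list of foreign expressions. Discontinuous verbs would need additional logic that is not shown here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def capitalised_runs(tokens: List[str], min_len: int = 2) -> List[Span]:
    """Return spans of consecutive tokens that each start with a capital letter.

    min_len is a hypothetical threshold: single capitalised tokens are not
    multi-word candidates, so runs shorter than min_len are dropped.
    """
    spans: List[Span] = []
    start = None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        if tok[:1].isupper():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


def inventory_matches(tokens: List[str], inventory: Set[Tuple[str, ...]]) -> List[Span]:
    """Return spans whose lower-cased token sequence occurs in the inventory.

    The inventory format (a set of token tuples) is an assumption of this sketch.
    At each position the longest matching entry is preferred.
    """
    lowered = [t.lower() for t in tokens]
    max_len = max((len(entry) for entry in inventory), default=0)
    spans: List[Span] = []
    for i in range(len(lowered)):
        for n in range(max_len, 1, -1):
            if i + n <= len(lowered) and tuple(lowered[i:i + n]) in inventory:
                spans.append((i, i + n))
                break
    return spans


tokens = "ik woon in Den Haag en zeg a priori niets".split()
print(capitalised_runs(tokens))                       # [(3, 5)]
print(inventory_matches(tokens, {("a", "priori")}))   # [(7, 9)]
```

The spans returned by such a step are only candidates; as described in this section, they still have to be verified manually.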

Finally, all candidate multi-word expressions in the entire corpus were identified automatically. The output of this automatic process was then verified manually. In the POS annotation files (the tag files) and the output files of the lexicon link-up (the .lxk files), numerical codes were introduced that relate to the lemmas of multi-word expressions.
 
 


Protocol

For the lexicon link-up a separate protocol was developed:

Piepenbrock, R. 2004. Taalkundig protocol voor de lexicologische koppeling. (Available in .ps and .pdf format; Dutch only.)
 
 


File types and formats

The information that was added in the lexicon link-up phase has been stored in the following files:

For descriptions of the formats mentioned above, please see the descriptions of the lex format and the lxk format.
 
 

Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.
 

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)
 
 
Component                                                                   Total number         VL         NL
                                                                                of words
a. Spontaneous conversations ('face-to-face')                                  2,626,172    878,383  1,747,789
b. Interviews with teachers of Dutch                                             565,433    315,554    249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)                1,208,633    465,096    743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface)        853,371    343,167    510,204
e. Simulated business negotiations                                               136,461          0    136,461
f. Interviews/discussions/debates (broadcast)                                    790,269    250,708    539,561
g. (political) Discussions/debates/meetings (non-broadcast)                      360,328    138,819    221,509
h. Lessons recorded in the classroom                                             405,409    105,436    299,973
i. Live (e.g. sports) commentaries (broadcast)                                   208,399     78,022    130,377
j. News reports/reportages (broadcast)                                           186,072     95,206     90,866
k. News (broadcast)                                                              368,153     82,855    285,298
l. Commentaries/columns/reviews (broadcast)                                      145,553     65,386     80,167
m. Ceremonious speeches/sermons                                                   18,075     12,510      5,565
n. Lectures/seminars                                                             140,901     79,067     61,834
o. Read speech                                                                   903,043    351,419    551,624
Total                                                                          8,916,272  3,261,628  5,654,644
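
As a quick sanity check on Table 1, the following sketch verifies that the VL and NL figures add up to the total for each component, and that the three columns sum to the reported totals. The figures are copied from the table; the names ROWS, REPORTED_TOTAL and check are hypothetical and exist only in this illustration.

```python
# (total, VL, NL) per component, copied from Table 1
ROWS = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}
REPORTED_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def check(rows, reported):
    """Return (components where VL + NL != total, whether column sums match)."""
    bad_rows = [key for key, (total, vl, nl) in rows.items() if vl + nl != total]
    column_sums = tuple(sum(col) for col in zip(*rows.values()))
    return bad_rows, column_sums == reported


print(check(ROWS, REPORTED_TOTAL))  # ([], True)
```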

 
