The goal of DutchSemCor is to deliver a one-million-word Dutch corpus that is fully tagged with senses and domain tags from the Cornetto database (STEVIN project STE05039). Of this corpus, 250K words will be tagged manually. The remainder will be tagged automatically using three different word-sense disambiguation (WSD) systems and then validated by human annotators. The corpus data will be drawn from existing material collected in the CGN, D-CoI and SoNaR projects. These corpora have already been automatically annotated with morpho-syntactic tags and structures. Where necessary, they will be extended to provide sufficient examples for less frequent word meanings that do not appear in them. The resulting corpus, for which we aim to offer the same balance of text types as these basic resources, will be extremely rich in lexical semantic information. Its availability will enable many new lines of research and technology development for the Dutch language. In particular, it will enable research into the relation between language form and language interpretation, and as such it will be applicable in the fields of cognitive science, (psycho-)linguistics, language learning and language teaching, semantic web applications, information retrieval, machine translation, text mining, and document interpretation (summarization, topic segmentation). We foresee that the corpus will create new directions of research and technology development on a par with current developments for English. (news VUA).
The goal of KYOTO is to develop a content-enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and the general public, covering a broad range of data from widespread sources in a number of culturally diverse languages. The project will target English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. The system crucially rests on an ontology linked to wordnets (lexical semantic databases) in a variety of languages. Concept extraction and data mining are applied through a chain of semantic processors that re-use the knowledge for different languages and for particular domains. The shared ontology guarantees a uniform interpretation for diverse types of information from different sources and languages. The system can be maintained by field specialists using a Wiki platform. KYOTO is a generic system offering knowledge transition for any domain of knowledge and information, across different target groups in society and across linguistic, cultural and geographic borders. KYOTO will be applied to the environmental domain and will span global information across European and non-European languages. For more information see: