Old Projects:
- EuroWordNet (LE2 4003 and LE 8328), 1996-2000
Project coordinator of the EuroWordNet project (LE2 4003 and LE 8328), site manager for the Dutch wordnet, and in charge of a team of 6 researchers and programmers. The aim of the EuroWordNet project was to develop a multilingual database with wordnets for Dutch, Italian, Spanish, French, German, Czech, Estonian and English. Each wordnet was also linked to a so-called Inter-Lingual-Index, based on WordNet 1.5. The database was tested in an Information Retrieval application by Novell Linguistic Development. For more information see:
- EuroWordNet;
- Powerpoint presentation by Vossen;
- EuroWordNet General Document;
- Piek Vossen (ed.), 1999
EuroWordNet, Kluwer Academic Publishers;
- Piek Vossen, 2001
Condensed Meaning in EuroWordNet, The Language of Word Meaning: studies in Natural Language Processing, Cambridge University Press.
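The linking of each wordnet to the Inter-Lingual-Index described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the synset identifiers, the ILI record identifiers, the sample words and the function names are hypothetical, and each synset is given a single direct link to one ILI record, which is simpler than what the actual database stores.

```python
from collections import defaultdict

# Hypothetical sample data: (language, synset id, word forms, ILI record id).
# None of these identifiers come from the actual EuroWordNet database.
SYNSETS = [
    ("nl", "nl-0001", ["hond"], "ili-0001"),
    ("es", "es-0001", ["perro", "can"], "ili-0001"),
    ("it", "it-0001", ["cane"], "ili-0001"),
    ("nl", "nl-0002", ["kat"], "ili-0002"),
    ("it", "it-0002", ["gatto"], "ili-0002"),
]


def build_ili_index(synsets):
    """Group synsets from all languages by the ILI record they link to."""
    index = defaultdict(list)
    for lang, synset_id, words, ili in synsets:
        index[ili].append((lang, synset_id, words))
    return index


def equivalents(word, source_lang, target_lang, synsets):
    """Return target-language words that share an ILI record with `word`."""
    index = build_ili_index(synsets)
    results = []
    for lang, _synset_id, words, ili in synsets:
        if lang == source_lang and word in words:
            for tgt_lang, _tgt_id, tgt_words in index[ili]:
                if tgt_lang == target_lang:
                    results.extend(tgt_words)
    return results


print(equivalents("hond", "nl", "es", SYNSETS))
```

Because every language links only to the shared index, adding a new wordnet in this sketch requires no pairwise mapping to the other languages.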
- Meaning: Developing Multilingual Web-scale Language Technologies (IST-2001-34460), 2002-2005
An international EU 5th Framework IST Project.
MEANING will be concerned with automatically collecting and analysing language data from the WWW on a large scale, and building more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation (WSD). MEANING will use state-of-the-art NLP techniques pioneered by the consortium to enhance EuroWordNet with mainly language-independent lexico-semantic (concept) information. We will use a combination of Machine Learning and Knowledge-Based techniques in order to enrich the structure of the wordnets in different domains (subsets of the web) in five European languages: English, Italian, Spanish, Catalan and Basque. The core technology used by MEANING will include tools to perform language identification, morphological analysis, part-of-speech tagging, named-entity recognition and classification, sentence boundary detection, shallow parsing and text categorization.
- Euroterm: Extending the EuroWordNet with Public Sector Terminology, 2001-2003
The main objective of the proposed preparatory action is to offer public sector information in different languages beyond the mere homogeneous linguistic community it
originates from. The extension of the EuroWordNet and the Inter-Lingual-Index records with public sector terminology aims at providing the infrastructure for facilitating access
of European citizens and enterprises to public sector information. The action’s actual purpose is to combine multilingual domain-specific information effectively into a common
lexical database through a Terminology Alignment System, thereby lowering the barriers across European languages. Another main goal of the preparatory action is to explore how
such a multilingual lexical database could facilitate access to environmental information and contribute to the development and use of European digital content. In the longer
term we expect that the domain-specific terminology incorporated into the EuroWordNet multilingual database will open up a whole new range of services in Europe at a trans-national level.
- BalkaNet: Design and Development of a Multilingual Balkan WordNet (IST-2000-29388), 2002-2004
The Balkan WordNet aims at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages. The most ambitious feature of BalkaNet
is its attempt to represent semantic relations between words in each Balkan language and link them together in order to develop an online multilingual semantic network. The main
objective is the development of each language's WordNet from available resources covering the general vocabulary of each language. Semantic relations will be classified in the
independent WordNets according to a shared ontology. Then, all individual WordNets will be organized into a common database providing linking across them. Each of the WordNets
will be structured along the same lines as the EuroWordNet through a WordNet Management System. This project is an excellent opportunity to explore the less studied Balkan languages
and combine and compare them cross-linguistically.
- Pidgin, 2002-2004
Financially supported by the Dutch government within the framework of the CIC-programme
(Compete with ICT-competences) of the Ministries of Economic Affairs and of Educational and Cultural Affairs. In the Netherlands a consortium of five organisations has started a project with
the aim of improving cross-lingual communication on the Internet. The name of the project is Pidgin. This name has been derived from the pidgin languages that developed between
people with different language backgrounds during colonial times. The Pidgin project is a unique project which combines the sciences of IT and linguistics. By using special techniques
of both sciences, Pidgin enables internet users all over the world:
1. to retrieve multi-lingual information from the internet (human-machine);
2. to communicate with their fellow users in their own native languages (human-human).
Pidgin combines advanced search techniques with translation strategies. To improve the searching, the computer has to be able to "understand" language by way of syntactic and semantic analysis.
The further development of these techniques can be stimulated by translation techniques that go beyond those currently available, which are based on machine translation or
translation memory. Pidgin is self-learning and makes use of an enormous semantic network. The idea behind Pidgin is not to deliver perfect translations, but to improve cross-lingual communication
between man and machine and between people. It is like real-life communication between two people with different mother tongues: by talking and writing to each other, each of them will learn
more about the language of the other person. Pidgin makes use of the same principle. Every time Pidgin is used it learns from the user and improves its own knowledge. The project will be executed
in two phases: at the end of the year 2002 the project will yield the first results of the cross-lingual communication between man and machine. The perfection of the cross-lingual communication
between people is planned for the year 2004.
- Sift: Selecting Information From Text (LRE 62030), 1994-1996
Project manager of the SIFT project (LRE 62030) for Amsterdam and in charge of a team of 4 researchers and programmers. The aim of the Sift project was to develop a text retrieval system that makes use of distributive semantic representations of words. Distributive representations consist of simple semantic features with weights indicating the relevance of these features for the concept associated with a word. Using these representations, equivalence can be measured in a flexible and computationally tractable way. The main result for Amsterdam was the development of the first version of the Amsterdam Lexicon System (ALS). This is a very efficient and fast object-oriented lexical database system, developed in C and running on Unix, Windows and Macs.
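As a rough illustration of how weighted semantic features allow equivalence to be measured, the sketch below compares words by the cosine of their feature vectors. This is only an assumption for the sake of the example: the description above does not say which measure the Sift system used, and the feature names, weights and function names are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine between two sparse feature vectors given as {feature: weight}."""
    shared = set(a) & set(b)
    dot = sum(a[f] * b[f] for f in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty representation is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature weights, invented for illustration only.
LEXICON = {
    "dog": {"animate": 1.0, "animal": 0.9, "domestic": 0.7},
    "cat": {"animate": 1.0, "animal": 0.9, "domestic": 0.8},
    "car": {"artifact": 1.0, "vehicle": 0.9},
}


def most_similar(word, lexicon):
    """Return the other lexicon entry whose features best match `word`."""
    candidates = [w for w in lexicon if w != word]
    return max(candidates, key=lambda w: cosine_similarity(lexicon[word], lexicon[w]))


print(most_similar("dog", LEXICON))
```

The comparison only touches features that both words share, so its cost grows with the number of features per word and not with the size of the feature inventory, which is one way such a measure stays computationally tractable.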
- Acquilex-2 (BRA 7315), 1992-1995
A sequel to Acquilex-1 in which other resources such as corpora were also considered for building up lexicons, and in which we cooperated with the publishers Van Dale in the Netherlands, Cambridge University Press in the UK and Bibliograf in Spain.
- Acquilex-1 (Esprit BRA-3030), 1989-1992
Vossen was senior researcher on the Acquilex-1 project (Esprit BRA-3030). Acquilex-1 was a joint enterprise of the Universities of Cambridge, Dublin, Pisa, Barcelona and Amsterdam in which we examined the feasibility of building a multilingual knowledge base usable in Natural Language Processing on the basis of information extracted from several Machine Readable Dictionaries. His work focused on LDOCE and a Van Dale monolingual Dutch dictionary, which he linked using Van Dale bilingual Dutch-English and English-Dutch dictionaries. He developed definition parsers for Dutch definitions and wrote programs for automatically converting the analyzed information into formal typed feature structure representations, which can be loaded into a lexical knowledge base. Finally, he developed some techniques and tools to (cross-linguistically) compare the semantic organization of different dictionaries.
- Links (NWO 300-169-007), 1983-1986
Researcher in the Links project, in which he developed a parser for the definitions of nouns, verbs and adjectives in the Longman Dictionary of Contemporary English and stored the output (65,000 parse trees) in the Nijmegen Linguistic Database. The Links project was funded by the Dutch Council of Research (NWO).