Old Projects of Prof. Dr. Piek Th.J.M. Vossen:
- KYOTO (Knowledge Yielding Ontologies for Transition-based Organization): 7th EC Framework project ICT-211423 (2008-2011)
Project coordinator of KYOTO. The goal of KYOTO was to develop a content-enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and the general public, covering a broad range of data from widespread sources in a number of culturally diverse languages. The project targeted English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. The system rests on an ontology linked to wordnets (lexical semantic databases) in a variety of languages. Concept extraction and data mining were applied through a chain of semantic processors that re-used the knowledge for different languages and for particular domains. The shared ontology guaranteed a uniform interpretation of diverse types of information from different sources and languages. The system can be maintained by field specialists using a Wiki platform. KYOTO is a generic system offering knowledge transition for any domain of knowledge and information, across different target groups in society and across linguistic, cultural and geographic borders. It was applied to the environmental domain and spanned global information across European and non-European languages. For more information see:
- EuroWordNet (EWN): 4th EC Framework Project LE2 4003 and LE 8328 (1996-2000)
Project coordinator of the EuroWordNet I and II projects and site manager for the Dutch wordnet. The aim of the EuroWordNet project was to develop a multilingual database with wordnets for Dutch, Italian, Spanish, French, German, Czech, Estonian and English. Each wordnet was also linked to a so-called Inter-Lingual-Index, based on WordNet 1.5. The database was tested in an Information Retrieval application by Novell Linguistic Development. For more information see:
- FLaReNet (Fostering Language Resources Network): 7th EU Framework project ECP-2007-LANG-617001 (2007-2011)
Project partner of FLaReNet, which addressed the development and exploitation of Language Resources (LRs) and Language Technology (LT) to enable easy and natural access to the wealth of information and knowledge encapsulated in (written, spoken, multimodal) digital documents.
FLaReNet expected to achieve:
- the largest network of LR and HLT players, ready to discuss information and knowledge on LRs and LT, and with appropriate mechanisms to share such information and disseminate it widely
- an extended picture of LRs and a recasting of its definition in the light of recent scientific, methodological, technological and social developments
- a consolidation of methods and approaches, common practices, frameworks and architectures
- a roadmap identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing are required, together with an indication of priorities
- a set of recommendations in the form of a plan of coherent actions for the EU and national organizations
- a European model for the LRs of the next years.
By bringing together the diverse approaches, efforts and technologies represented by its membership, FLaReNet aimed to enable even greater progress toward community consensus.
- Arabic WordNet (AWN): American Government (2005-2007)
Constructing an Arabic Wordnet (AWN) in Parallel with an Ontology. Project partner of this project, which aimed at the development of wordnets for the Arabic languages, was sponsored by the American government and was headed by Princeton University. Vossen was responsible for the European development and coordination of building a lexical resource in Standard Arabic. AWN was constructed according to the methods developed for EuroWordNet (EWN; Vossen 1998) and since applied to dozens of languages around the world. The EuroWordNet approach maximizes compatibility across wordnets and focuses on manual encoding of the most complicated and important concepts.
Arabic WordNet maps straightforwardly onto Princeton WordNet 2.0 and EuroWordNet, enabling translation on the lexical level to English and dozens of other languages. Several tools specific to this task were to be developed. AWN was designed as a linguistic resource with a deep formal semantic foundation. Besides the standard wordnet representation of senses, word meanings are defined with a machine-understandable semantics in first-order logic. The basis for this semantics is the Suggested Upper Merged Ontology (SUMO) and its associated domain ontologies.
- Meaning (Developing Multilingual Web-scale Language Technologies): 5th EC Framework project IST-2001-34460 (2002-2005)
Project partner of MEANING, which was concerned with automatically collecting and analysing language data from the WWW on a large scale, and building more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation (WSD). MEANING used state-of-the-art NLP techniques pioneered by the consortium to enhance EuroWordNet with mainly language-independent lexico-semantic (concept) information. We used a combination of Machine Learning and Knowledge-Based techniques to enrich the structure of the wordnets in different domains (subsets of the web) in five European languages: English, Italian, Spanish, Catalan and Basque. The core technology used by MEANING included tools to perform language identification, morphological analysis, part-of-speech tagging, named-entity recognition and classification, sentence boundary detection, shallow parsing and text categorization.
- Euroterm (Extending the EuroWordNet with Public Sector Terminology): EC Framework project (2001-2003)
Project partner of Euroterm. The main objective of this preparatory action was to offer public sector information in different languages, beyond the single linguistic community in which it originates. The extension of the EuroWordNet and the Inter-Lingual-Index records with public sector terminology aimed at providing the infrastructure for facilitating access of European citizens and enterprises to public sector information. The action's purpose was to combine multilingual domain-specific information effectively into a common lexical database through a Terminology Alignment System, thereby lowering the barriers across European languages. Another main goal of the preparatory action was to explore how such a multilingual lexical database could facilitate access to environmental information and contribute to the development and use of European digital content. In the longer term, the expectation was that the domain-specific terminology incorporated into the EuroWordNet multilingual database would open up a whole new range of services in Europe at a trans-national level.
- Balkanet (Design and Development of a Multilingual Balkan WordNet): EC Framework project IST-2000-29388 (2002-2004)
Project partner of Balkanet, which aimed at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages. The most ambitious feature of BalkaNet was its attempt to represent semantic relations between words in each Balkan language and link them together in order to develop an online multilingual semantic network. The main objective was the development of each language's WordNet from available resources covering the general vocabulary of that language. Semantic relations were classified in the independent WordNets according to a shared ontology. Then, all individual WordNets were organized into a common database providing linking across them. Each of the WordNets was structured along the same lines as the EuroWordNet through a WordNet Management System. This project was an excellent opportunity to explore the less studied Balkan languages and to combine and compare them cross-linguistically.
- Pidgin, 2002-2004
Financially supported by the Dutch government within the framework of the CIC-programme (Compete with ICT-competences) of the Ministries of Economic Affairs and of Educational and Cultural Affairs. In the Netherlands a consortium of five organisations started the project with the aim of improving cross-lingual communication on the Internet. The name Pidgin is derived from the pidgin languages that developed between people with different language backgrounds during colonial times. The Pidgin project is a unique project which combines the sciences of IT and linguistics. By using special techniques of both sciences, Pidgin enabled internet users all over the world:
1. to retrieve multi-lingual information from the internet (human-machine);
2. to communicate with their fellow users in their own native languages (human-human).
Pidgin combined advanced search techniques with translation strategies. To improve the searching, the computer had to be able to "understand" language by way of syntactic and semantic analysis. Further development of these techniques can be stimulated by translation techniques that go beyond the currently available ones, which are based on machine translation or translation memory. Pidgin is self-learning and makes use of an enormous semantic network. The idea behind Pidgin was not to deliver perfect translations, but to improve cross-lingual communication between man and machine and between people. It is like real-life communication between two persons with different mother tongues: by talking and writing to each other, each of them learns more about the language of the other person. Pidgin makes use of the same principle. Every time Pidgin is used it learns from the user and improves its own knowledge. The project was executed in two phases: at the end of 2002 it yielded the first results for cross-lingual communication between man and machine, and the perfection of cross-lingual communication between people was planned for 2004.
- SIFT (Selecting Information From Text): EC Framework project LRE 62030 (1994-1996)
Project manager of the SIFT project (LRE 62030) for Amsterdam and in charge of a team of 4 researchers and programmers. The aim of the SIFT project was to develop a text retrieval system that makes use of distributive semantic representations of words. Distributive representations consist of simple semantic features with weights indicating the relevance of these features for the concept associated with a word. Using these representations, equivalence can be measured in a flexible and computationally tractable way. The main result for Amsterdam was the development of the first version of the Amsterdam Lexicon System (ALS). This is a very efficient and fast object-oriented lexical database system, developed in C and running on Unix, Windows and Macs.
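The idea of weighted semantic features can be illustrated with a minimal sketch. It is not the SIFT or ALS implementation: the feature names, the weights and the choice of cosine similarity as the equivalence measure are all assumptions made for illustration only.

```python
import math

# Hypothetical lexicon: each word maps to semantic features with weights.
# Feature names and weights are invented for this example.
LEXICON = {
    "dog": {"animate": 1.0, "animal": 0.9, "domestic": 0.7},
    "cat": {"animate": 1.0, "animal": 0.9, "domestic": 0.8},
    "car": {"artifact": 1.0, "vehicle": 0.9},
}


def similarity(a, b):
    """Cosine similarity between two sparse weighted-feature dicts.

    Cosine is an assumed measure; the source does not name the one SIFT used.
    """
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(word, lexicon):
    """Return the other words in the lexicon, most similar first."""
    target = lexicon[word]
    others = [w for w in lexicon if w != word]
    return sorted(others, key=lambda w: similarity(target, lexicon[w]), reverse=True)


if __name__ == "__main__":
    print(rank("dog", LEXICON))
```

Because each representation is a small sparse set of features, a comparison only touches the features the two words share, which is one way such a measure can remain computationally cheap.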
- Acquilex-2: Esprit-project BRA 7315 (1992-1995)
A sequel to Acquilex-1 in which other resources, such as corpora, were also considered for building lexicons, and in which we cooperated with the publishers Van Dale in the Netherlands, Cambridge University Press in the UK and Bibliograf in Spain.
For more information see:
- Acquilex-1: Esprit-project BRA-3030 (1989-1992)
Vossen was a senior researcher on the Acquilex-1 project (Esprit BRA-3030). Acquilex-1 was a joint enterprise of the Universities of Cambridge, Dublin, Pisa, Barcelona and Amsterdam in which we examined the feasibility of building a multilingual knowledge base usable in Natural Language Processing on the basis of information extracted from several Machine Readable Dictionaries. His work focused on LDOCE and a Van Dale monolingual Dutch dictionary, which he linked using Van Dale bilingual Dutch-English and English-Dutch dictionaries. He developed definition parsers for Dutch definitions and wrote programs for automatically converting the analyzed information into formal typed feature structure representations, which can be loaded into a lexical knowledge base. Finally, he developed some techniques and tools to (cross-linguistically) compare the semantic organization of different dictionaries. For more information see:
- Links (NWO 300-169-007) 1983-1986
Researcher in the Links-project in which he developed a parser for the definitions of nouns, verbs and adjectives of the Longman Dictionary of Contemporary English and stored the output (65,000 parse-trees) in the Nijmegen Linguistic Database. The Links-project was funded by the Dutch Council of Research (NWO).