Dan Cristea
Research Activity


My main topics of research have been: discourse structure, incremental discourse parsing, the relationship between structure and reference, anaphora resolution, computational lexicography, annotated textual resources (including old Romanian documents), and applications involving language processing.

In the past I have also focused on problems related to lexical semantics, WordNet and applications of wordnets, aspects of modelling language evolution, and workflows for natural language processing.

Over time I have worked with two groups of people, ever renewed, since my students move on quickly once they graduate or obtain a master's or PhD degree. One is known as the NLP-Group@UAIC-FII, hosted by the Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iași; the other as the NLP-Group@ARFI-IIT, hosted by the Institute of Computer Science of the Romanian Academy, Iași branch.

Over the years, my group and I have been constantly concerned with building resources for the Romanian language as well as tools to process it. The Consortium for the Romanian Language: Resources & Tools (in Romanian: Consorțiul de Informatizare pentru Limba Română - ConsILR) is an initiative which aims to facilitate and augment the efforts of linguists and computer science researchers working on the Romanian language by promoting software tools and resources for linguistic processing. The ConsILR conferences, a series of events organised yearly since 2001, aim to promote research on resources and tools dedicated to natural language, with a special emphasis on Romanian.

Research projects

The following is a selection of my most representative projects.

Colour code:

  • Magenta - projects coordinated by me
  • Blue - projects for which I have been the person responsible, representing UAIC or ARFI-IIT
  • Black - projects in which I have been a team member
  • 2019-2022 (ongoing): DeLoRo - Deep Learning for Old Romanian Language. Full title: Artificial Intelligence Models (Deep Learning) Applied in the Analysis of Old Romanian Language - official site - a project financed by UEFISCDI, project code: PN-III-P2-2.1-PED-2019-3952. The project intends to develop a technology capable of deciphering old printed and uncial Cyrillic Romanian documents and transliterating them into the Latin script.
    Partners:
    • Institute of Computer Science, Romanian Academy - the Iași branch
    • Faculty of Mathematics and Computer Science, University of Bucharest

    Total budget: 589,414 RON. Value for ARFI: 445,414 RON (approx. 90,100 EUR).

  • 2021-2022 (ongoing): SemRO - Computational Representation in Linguistic Big Data Processing - official site - a project financed by UEFISCDI, project code: PN-III-P1-1.1-PD-2019-0660. The project, typical for the domain of Computational Linguistics, aims to develop a linguistically applicable semantic representation, with a special emphasis on Romanian data. Semantic representation has received growing attention in Computational Linguistics in the past few years, and many semantic schemes have been proposed. Combined with recent advances in semantic parsing, this brings machines closer than ever to understanding texts, and the approach has already demonstrated its applicability to human-like skills in summarisation, paraphrase detection, and semantic evaluation. The project will apply machine learning and word embedding methods to process 1 billion words, from 7 scientific fields and 70 subdomains, in order to identify the meanings of words in context and to quantify semantic changes, by evaluating word uses on supervised textual data.
    Partners:
    • Institute of Computer Science, Romanian Academy - the Iași branch
    Project coordinator: dr. Mihaela Onofrei.

  • 2021-2022 (ongoing): AiCrop - Artificial intelligence applications to empower biological predictions in precision crop breeding - official site - a project financed by UEFISCDI, project code: PN-III-P1-1.1-PD-2019-0619. As big data expands in plant breeding through high-throughput DNA sequencing, ‘omics’ technologies and digital phenotyping, new opportunities have emerged for understanding crop performance in different environments. However, all of these suffer from challenging bottlenecks in data analysis. Machine learning (ML) algorithms represent one appropriate solution for making sense of big data. This project aims to test and develop new ML models that can efficiently integrate various kinds of big data to improve quantitative genetic predictions for the selection and classification of superior crop varieties. Appropriate datasets will be integrated in ML models in order to accelerate breeding progress and to meet future human needs in this field. The project will generate a new knowledge-based platform to rapidly advance crop improvement in the coming century. The long-term vision of this project foresees self-evolving, pattern-based prediction models that adapt to breeding progress, to different crops, to expanding data complexities, and to ever-changing environmental constraints on agricultural production.
    Partners:
    • Institute of Computer Science, Romanian Academy - the Iași branch
    • Faculty of Agriculture, University of Agricultural Sciences and Veterinary Medicine "Ion Ionescu de la Brad", Iași
    Project coordinator: dr. Iulian Gabur.

  • 2013-2017: DRuKoLa: Comparing German and Romanian languages through corpus linguistics technologies - official site - a project of the Romanian Academy. The corpus linguistics querying interface KorAP, developed at the Institut für Deutsche Sprache, Mannheim, was used as the technological infrastructure to implement the Corpus of Contemporary Romanian Language (CoRoLa); linguists then used this interface to compare the linguistic and syntactic structures of the German and Romanian languages.
    Partners:
    • Institut für Deutsche Sprache, Mannheim
    • University of Bucharest
    • Institute of Computer Science, Romanian Academy - the Iași branch
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    Project coordinator: dr. Marc Kupietz.

  • 2017-2020: ReTeRom: Resources and technologies for developing human-machine interfaces in Romanian - official site - a complex project, with the component project COBILIRO - Bimodal corpus for the Romanian language annotated on multiple levels, financed by UEFISCDI, project code: PN-III-P1-1.2-PCCDI-2017-0818. The general objectives of the ReTeRom component projects have been:
    • Project 1: COBILIRO - create a thesaurus with audio and textual resources, annotated at different acoustic and linguistic levels, to become the most important reference for this type of resource for the Romanian language;
    • Project 2: TEPROLIN - develop a set of advanced technologies for the processing of natural language (text) in Romanian: morphological, syntactic and semantic analysis of texts, with annotation of the text collected in Project 1 (COBILIRO) on different linguistic levels (phoneme, syllable, word, part of speech, etc.);
    • Project 3: TADARAV - develop a set of advanced technologies for the automatic phonetic annotation of the voice signal collected in the corpus of Project 1 (COBILIRO), and for the creation of automatic speech recognition interfaces in Romanian using the language models generated in Project 2 (TEPROLIN);
    • Project 4: SINTERO - develop an advanced technology for the synthesis of high-quality and expressive speech in Romanian, based on the resources collected in Project 1 (COBILIRO) and the automatic annotations generated in Project 2 (TEPROLIN) for text and in Project 3 (TADARAV) for audio data.
    Partners:
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • "Alexandru Ioan Cuza" University of Iași
    • University POLITEHNICA of Bucharest
    • Technical University of Cluj-Napoca
    Project coordinator: acad. Dan Tufiș.

  • 2013-2017: CoRoLa: The COrpus of contemporary ROmanian LAnguage - official site - a project of the Romanian Academy. The project resulted in a corpus of approximately 1 billion Romanian words covering the period after the Second World War, together with 300 hours of voice recordings, all from 17 domains and several literary styles, acquired from novels, press articles, blogs, scientific writings, plays, etc. The texts have been cleaned, segmented at sentence and word boundaries, and the words have been automatically annotated with morpho-syntactic information. Multi-criteria search access is provided through 3 interfaces. The corpus can be used by people interested in learning the Romanian language from examples, in the classroom for educational purposes, and also by researchers interested in language studies, language modelling for automatic processing of the Romanian language, development of translation models, automatic speech recognition and synthesis, and many types of language-based applications. The data collection was based on protocols signed with the text providers, holders of the intellectual property rights. The texts are accompanied by metadata and have been subjected to a processing chain that combines manual computer-assisted preprocessing and fully automatic processing.
    Partners:
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • Institute of Computer Science, Romanian Academy - the Iași branch
    Project coordinator: acad. Dan Tufiș.

  • 2014-2018: E-READ: Evolution of Reading in the Age of Digitization - official site - a COST action IS1404 aiming to improve scientific understanding of the implications of digitization on reading and help individuals, disciplines, societies and sectors across Europe to cope optimally with these effects.


  • 2014-2016: MappingBooks - Let's jump in the book! - A project financed by the Romanian Ministry of Education and Research (UEFISCDI) under the Partnerships Programme (PN II Parteneriate, competition PCCA 2013), project code: PN-II-PT-PCCA-2013-4-1878. The project develops a new type of electronic product with a potential high impact in education. The technology makes heavy use of natural language processing, web cartography, web mapping, mixed reality techniques and ambient intelligence. Toponyms and other mentions of interest to the reader, contained in the book, are supplemented with different types of information, diagrams, graphical data, links to virtual sites.
    Partners:
    • "Alexandru Ioan Cuza" University of Iasi
    • SIVECO S.A. Bucharest
    • "Stefan cel Mare" University of Suceava
    Value for UAIC: 569,170 RON (129,357 EUR).

  • 2013-2017: ENeL - European Network of e-Lexicography - official site - a COST action aiming to establish a European network of lexicographers and computer scientists interested in giving users easier access to dictionaries in electronic form, organising a systematic exchange of expertise and common standards and solutions in the representation of lexicographic resources, and developing a common approach to e-lexicography that fully embraces the pan-European nature of much of the vocabularies of the languages spoken in Europe.

  • 2010-2013: ATLAS - Applied Technology for Language-Aided CMS - official site. A project funded by the European Commission under the ICT Policy Support Programme, Grant Agreement 250467. ATLAS' goals were to establish an innovative software platform providing three online services for heterogeneous multilingual content management, equipped with natural language processing capabilities, including automatic annotation, summarisation, categorisation and machine translation. The collaborative, user-oriented, shared and interoperable services are:
    • i-Publisher: Automatic processing of the Web content (categorisation, summarisation, annotation etc.);
    • i-Librarian: The ability to easily create, organise and publish various types of documents;
    • EUDocLib: A publicly accessible repository of EU documents, providing enhanced navigation and easier access to relevant documents in the user's language.
    The technology was developed for 6 languages: Bulgarian, English, German, Greek, Polish and Romanian.
    Partners:
    • Tetracom Interactive Solutions, Sofia - coordinator
    • German Institute for Artificial Intelligence, Saarbruecken
    • Atlantis Consulting SA, Athens
    • Institute for Bulgarian Language, Sofia
    • Institute of Computer Science, Polish Academy of Sciences, Warsaw
    • University of Hamburg
    • "Alexandru Ioan Cuza" University of Iasi
    • University of Zagreb
    • Institute of Technologies and Development Foundation, Sofia
    Project coordinator: Anelia Belogay.
    Value for UAIC: 98,341 EUR.

  • 2011-2013: METANET4U - official site - an ICT-PSP project, Grant Agreement 270893. METANET4U was part of the META-NET Network of Excellence, a cluster of projects aiming at fostering the mission of META (the Multilingual Europe Technology Alliance), dedicated to building the technological foundations of a multilingual European information society. The goal of METANET4U was to contribute to the establishment of a pan-European digital platform that makes available language resources and services, datasets and software tools, for speech and language processing, and to support a new generation of exchange facilities for them. All resources and tools gathered during the project, documented and updated, were delivered through the network of open digital exchange platforms META-SHARE. Partners:
    • Faculty of Sciences, University of Lisbon - coordinator
    • Instituto Superior Tecnico, Lisbon
    • University of Manchester
    • "Alexandru Ioan Cuza" University of Iasi
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • University of Malta
    • Technical University of Catalonia
    • Universitat Pompeu Fabra, Barcelona
    Project coordinator: prof. Antonio Branco.
    Value for UAIC: 242,338 EUR.

  • 2008-2011: CLARIN - Common Language Resources and Technology Infrastructure - official site. A European initiative committed to establishing an integrated and interoperable research infrastructure of language resources (all knowledge sources based on language, written or spoken) and its technology (tools to carry out operations on such language material). Features of this technology are:
    • integration: the resource and service centres are connected via Grid technology and form a virtually integrated domain;
    • interoperability: the resources and services will be based on Semantic Web technologies to overcome format, structure and terminological differences;
    • stability: the resources and services are offered with a high availability;
    • persistency: the resources and services are planned to be accessible for many years so that researchers can rely on them;
    • accessibility: the resources and services are accessible via the web; different access methods and training possibilities are offered tailored to the needs of the communities making use of them;
    • extendability: the infrastructure is open so that new resources and services can be added easily.
    Value for UAIC: 77,961 EUR.

  • 2008-2011: ALEAR - Artificial Language Evolution on Autonomous Robots - official site. An FP7 project aiming to achieve open-ended cognitive development and open-ended verbal dialogues among fully embodied situated agents (humanoid robots, mechanisms which include sensori-motor intelligence, scripts for establishing the turn-taking interaction among them, perceptual processes, processes that perform the conceptualisation of what to say, the expression of these conceptualisations in language, and processes that perform the parsing of sentences and their interpretation in sensori-motor experience). ALEAR proved that humanoid robots may evolve their own artificial languages adapted to the environment and task settings in which they are placed. Partners:
    • Humboldt University, Berlin - coordinator
    • SONY CSL - Paris
    • Osnabruck University
    • Autonomous University of Barcelona
    • Vrije Universiteit Brussel - Brussels
    • "Alexandru Ioan Cuza" University of Iasi
    Project coordinator: prof. Luc Steels.
    Value for UAIC: 222,781 EUR.

  • 2009-2011: ALEAR-RO (Artificial Language Evolution on Autonomous Robots) - a mirror project sponsored by the Romanian Ministry of Research.

    Value for UAIC: 278,155 RON (57,477 EUR).

  • Sept. 2007 - Dec. 2010: eDTLR - The Thesaurus Dictionary of Romanian Language in Digital Form - official site: building the electronic version of the biggest Romanian dictionary, edited and printed by the Romanian Academy between 1913 and 2010. The two series of the Dictionary, the Dictionary of the Academy (DA) and the Dictionary of the Romanian Language (DLR), include 36 volumes, more than 15,000 pages, about 175,000 entries and more than 1,300,000 examples. The creation of the electronic version went through the following steps: scanning, OCR, proofreading (first by volunteers in a collaborative effort, then by experts), parsing of the dictionary entries, building the database (as XML TEI-P5 files), scanning of sources (over 2,500 volumes) and building of indexing and browsing mechanisms. Partners:
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi - coordinator
    • Institute of Linguistics "Iorgu Iordan - Alexandru Rosetti", Romanian Academy, Bucharest
    • Institute of Romanian Philology "Alexandru Philippide", Romanian Academy, Iasi
    • Institute of Literary History "Sextil Puscariu", Romanian Academy, Cluj-Napoca
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • Research Institute of Computer Science, Romanian Academy, Iasi
    • Faculty of Letters of the "Alexandru Ioan Cuza" University, Iasi
    Value for UAIC: 346,065 RON (80,480 EUR).

  • 2007-2010: SIR-RESDEC - Open Domain Question Answering System for Romanian and English. The project developed an advanced, interlingual and parametrisable system for interpretation of questions expressed in Romanian and answering them in English and Romanian. Questions are relative to a dynamic collection of documents, containing an arbitrary number of texts. The domains chosen as case studies have been the legislative domain and the bioinformatics domain. Partners:
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest - coordinator
    • Central Institute for Informatics, Bucharest
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi
    Project coordinator: acad. Dan Tufiș.
    Value for UAIC: 246,520 RON (57,330 EUR).

  • Since 2008: Institutional Member of FLaReNet - Fostering Language Resources Network - official site: a network of excellence (in eContentPlus Programme, grant agreement no. ECP-2007-LANG-617001) with the mission to identify priorities, short, medium, and long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry.

    At the end of the project (31 August 2011) FLaReNet counted 38 partners, 99 institutional members, 25 support groups and 400 individual subscribers from all over the world.

  • 2007-2010: COST A31 - Stability and Adaptation of Classification Systems in a Cross-Cultural Perspective - official site. Coordinator: CNRS Paris.

  • 2006-2008: LT4eL - Language Technology for eLearning - official site: an FP6 project that aimed to apply multilingual language technology tools and semantic web techniques to improve the retrieval of learning material. The developed technology facilitates personalized access to knowledge within learning management systems and supports decentralisation and co-operation in content management. The LT4eL technology has been developed for 9 languages: Bulgarian, Czech, Dutch, English, German, Maltese, Portuguese, Polish and Romanian. Partners:
    • University of Utrecht - coordinator
    • "Alexandru Ioan Cuza" University of Iasi
    • University of Lisbon
    • Charles University, Prague
    • Institute of Parallel Processing of Information, Bulgarian Academy of Sciences, Sofia
    • Eberhard Karls University, Tuebingen
    • Institute of Computer Science, Polish Academy of Sciences, Warsaw
    • School of Communication, Winterthur, Switzerland
    • University of Malta
    • University of Koeln
    • The Open University, Milton Keynes, UK
    Project coordinator: dr. Paola Monachesi.
    Value for UAIC: 116,155 EUR.

  • 2006-2008: RolTech - Romanian Language Technologies - official site: an INTAS project aimed to acquire electronic resources for the Romanian language, to develop Romanian language processing tools, and to create applications based on these resources. Partners:
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi - coordinator
    • Institute for Computer Science of the Moldavian Academy of Sciences, Chisinau
    • University of Sheffield
    Value for UAIC: 7,572 EUR.

  • 2006-2008: ROTEL - Intelligent Systems for Semantic Web, Based on Ontology Logics and Natural Language Technologies. Applications for Romanian - official site. A project financed by the Romanian Ministry of Education and Research. Partners:
    • Central Institute for Informatics, Bucharest - coordinator
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi
    Value for UAIC: 330,339 RON (76,823 EUR).

  • 2006-2008: InterOb - creation of a three-dimensional model of the human head, capable of expressing emotions. The chosen approach was to simulate the physical properties of the anatomical components that are part of the human head: skeleton, muscles, skin. Partners:
    • "Stefan cel Mare" University of Suceava - coordinator
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi
    Value for UAIC: 192,666 RON (44,806 EUR).

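  • 2006-2008: e-MANAGE - Establishing and Adapting Classification Systems from a Cross-Cultural Perspective (original title: "Stabilirea si adaptarea sistemelor de clasificare dintr-o perspectiva cros-culturala").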

    Value for UAIC: 90,000 RON (20,930 EUR).

  • 2004-2007: Knowledge Web - a network of excellence aiming to foster the creation of the Semantic Web (UAIC-FII was invited as a non-financed member).

  • 2004-2005: Enhancing prosodic aspects in Romanian Text-To-Speech Synthesis (a CNCSIS grant). Partners:
    • Romanian National Institute of Inventics, Iasi - coordinator
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi - coordinator
  • 2004-2005: Studies regarding the acquisition of the Dictionary of Romanian Language in electronic form (a CNCSIS grant) Partners:
    • Institute of Romanian Philology "Alexandru Philippide", Iasi - coordinator
    • Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iasi
  • 2001-2004: IST-2000 29388 Balkanet - official site - a project that built a network of wordnets for 5 Balkan languages (Bulgarian, Greek, Romanian, Serbian and Turkish), aligned with the English WordNet and the Czech WordNet. When the project finished, the Romanian WordNet included 25,000 synsets (sets of synonymous senses of words). The partner RACAI (Romanian Academy) continued to finance the development of the Ro-WN after the end of the project (it now includes almost 60,000 synsets; visit the RACAI online browser). Partners:
    • Databases Laboratory, University of Patras - coordinator
    • Computer Technology Institute, Athens
    • "Alexandru Ioan Cuza" University of Iasi
    • Research Institute for Artificial Intelligence, Romanian Academy, Bucharest
    • Institute of Bulgarian language, Sofia
    • Sabanci University, Istanbul
    • Faculty of Informatics, Masaryk University, Brno
    • Memodata, Caen
    • University of Plovdiv
    • University of Athens
    Value for UAIC: approx. 200,000 EUR.

  • 2001-2004: BALKANET-CORINT Creation and development of a multilingual wordnet of the Balkan languages (a mirror project sponsored by the Romanian Ministry of Research). Partners:
    • RACAI, Bucharest
    • UAIC-FII, Iasi
  • 1998-2001: TELRI II - official site. Objectives (extracts from the official TELRI II web page): to strengthen the pan-European infrastructure for the multilingual language research and development community; to collect, promote, and make available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge; to offer a customized comprehensive service to academic and industrial users; to prepare and organize research and development projects focusing on translation aids, multilingual authoring systems, information retrieval, etc. Coordinator: University of Mannheim.

  • 1998-1999: ELAN (an inception of TELRI) - an early initiative aiming to create and reinforce international standards by conforming a significant part of its members' data to a common format, to design a common query language, and to operate on that basis an experimental service network that would make a large stock of electronic resources accessible and pursue awareness-raising policies. Among the partners: University of Liege, Institut für Deutsche Sprache, Mannheim, etc.

  • 1995-1998: TELRI I - Trans-European Language Resources Infrastructure - official site. Objectives (extracts from the official web page): an EC-funded initiative for creating a viable infrastructure involving European language and language technology centres, to provide a platform for industry, research institutes and universities, and to supply the NLP community with public domain monolingual and multilingual language resources, such as: corpora, machine readable dictionaries and lexica, lexical data bases, and software tools for the creation, re-use, maintenance, valorisation and exploitation of linguistic data. Coordinator: University of Mannheim.

  • 1995: PROSODICS - development of software for the analysis and visualisation of the prosody of spoken utterances in exercises assisting foreign language learning. A project sponsored by the University of Venice.

  • 1987-1989: QUERNAL (QUERy by NAtural Language) - development of an interactive and configurable dialogue system able to answer questions in Romanian addressing a database. Based on QUERNAL, a number of applications have been built:
    • For the Institute of Metallurgy, Bucharest: a dialogue system for their metallurgy database. Co-partner: ICI-Bucharest.
    • For the "Flamura Rosie" enterprise Sibiu: a dialogue system for their personnel and salaries database. Co-partner: ICI-Bucharest.
    • For the Moinesti Drilling-Production Trust: a dialogue system for their database of petroleum drilling and exploitation records.
  • 1984-1985: IURES (I Understand and Reply Eliminating Syntax) - a natural language dialogue system acting as a configurable interface to any semantic content expressed as a semantic network. Based on IURES, a number of applications have been built:
    • For ICI-Bucharest: a dialogue system accessing the National Software Library.
    • For the Research Institute on Hydrology Iasi: a dialogue system on Geography of Romania.
  • 1984: Contract 4709/27.04.1984: Human-computer communication intermediated by natural language. Financed by ICI Bucharest.

  • 1983: Contract 1906/22.02.1983: Human-computer communication intermediated by natural language. Financed by ICI Bucharest.

  • 1982: Contract 4774/28.04.1982: Human-computer communication intermediated by natural language. Financed by ICI Bucharest.

Last update: August 2013