Project details

 

Very Large Language Corpora and Their Automatic Analysis

Project Identification: GA405/03/0913
Project Period: 1/2003 - 12/2005
Investor: Czech Science Foundation
Programme / Project Type: Standard Projects
MU Faculty/Unit: Faculty of Informatics
MU Investigator: Assoc. Prof. PhDr. Karel Pala, CSc.
Cooperating Organization:
Faculty of Mathematics and Physics CU Praha
Responsible Person: Prof. RNDr. Jan Hajič, Dr.
Charles University Prague
Field: BD - Information theory (B - Physics and mathematics)
AI - Linguistics (A - Social sciences)
JD - Use of computers, robotics and its application (J - Industry)
Keywords: Very Large Corpora; Natural Language Processing; Statistical Methods in NLP
Annotation

Language corpora are an indispensable part of current linguistic research. They serve purposes ranging from simple lookup of particular words to training statistical language models and fully automatic analysis at various linguistic levels. The usability of monolingual, multilingual, and spoken language corpora is substantially enhanced if the language material they contain is linguistically analyzed (annotated). Annotation can reflect both the form and the function of linguistic units in their context. The primary goal of the project is to improve our understanding of the natural language system in general and of Czech in particular, and to develop or enhance statistical machine learning methods, symbolic methods, and their combinations, so that large quantities of naturally occurring text, whether written or spoken, can be analyzed automatically. Results of previous projects in the field will be used, especially existing data (texts) and methodology. Very large corpora play a twofold role: they are a source of automatically acquirable language information, and they are a target for applying the methods of automatic analysis developed in the course of the project (mainly for lexicographic purposes and for further linguistic studies). The results of the project, including software tools and data, will be published.