Project details
Very Large Language Corpora and Their Automatic Analysis
| Project Identification: | GA405/03/0913 |
| Project Period: | 1/2003 - 12/2005 |
| Investor: | Czech Science Foundation |
| Programme / Project Type: | Standard Projects |
| MU Faculty/Unit: | |
| Cooperating Organization: | |
| Field: | BD - Information theory (B - Physics and mathematics); AI - Linguistics (A - Social sciences); JD - Use of computers, robotics and its application (J - Industry) |
| Keywords: | Very Large Corpora; Natural Language Processing; Statistical Methods in NLP |
Language corpora are an indispensable part of current linguistic research. They serve purposes ranging from simple lookup of particular words to training statistical language models and fully automatic analysis at various linguistic levels. The usability of monolingual, multilingual, and spoken language corpora alike is substantially enhanced if the language material they contain is linguistically analyzed; such annotation can reflect both the form and the function of linguistic units in their context. The primary goal of the project is to enhance our understanding of the natural language system in general and of Czech in particular, and to develop and/or enhance statistical machine learning methods, symbolic methods, and their combinations so that large quantities of naturally occurring text, whether written or spoken, can be analyzed automatically. Results of previous projects in the field will be used, especially existing data (texts) and methodology. Very large corpora play a twofold role: they are a source of automatically acquirable language information, and they are a target for the methods of automatic analysis developed in the course of the project (mainly for lexicographic purposes and for further linguistic studies). The results of the project, including software tools and data, will be published.











