The Santiago University Learner of English Corpus (SULEC)

SULEC is a project managed by a group of researchers from the Department of English Philology at the University of Santiago de Compostela. The project was initiated in October 2002 and is financed by the Galician Department of Education. Our aim is to create a corpus of at least 1,000,000 words of spoken and written learner English, with materials collected from learners of all levels (elementary, intermediate and advanced). Spoken data will be collected through semi-structured interviews, short oral presentations and brief story descriptions, all of which will be recorded in audio and occasionally also in video format. The written part of the corpus will be gathered from compositions or argumentative essays following criteria similar to those of ICLE (International Corpus of Learner English). All the data provided by these research instruments will be transcribed, computerised, and tagged with the aid of computational tools already available on the market.


The subsequent comprehensive analysis of these data will allow us to carry out investigations at different levels:


•  Phonological level: main difficulties these students encounter when learning pronunciation (segmental and suprasegmental features), linguistic interference, and preferences for a specific model or linguistic variety.


•  Morphosyntactic level: word order, concord problems, length and syntactic structures, acquisition of given constructions (negative and interrogative constructions, relative clauses, existential sentences), and empty categories.


•  Lexical level: type and number of words used, frequencies of use, lexical collocations, and “false friends”.


•  Discourse level: organisation of information, use of cohesive devices, and communicative strategies.
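
As a rough illustration of the lexical-level measures listed above (number of word types and tokens, and frequencies of use), a minimal Python sketch might look like the following. The tokenising rule, the function name `lexical_profile` and the sample learner sentence are all assumptions made for illustration; they are not part of the SULEC toolchain, which is not described here.

```python
import re
from collections import Counter

def lexical_profile(text):
    """Return token count, type count, type/token ratio and a frequency table."""
    # Assumption: a token is a run of letters, optionally with internal
    # apostrophes (so "doesn't" counts as one word). Real corpora usually
    # rely on a dedicated tokeniser and tagger instead.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    freqs = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freqs)
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr, "freqs": freqs}

# Hypothetical learner sentence, invented for this example.
sample = "The teacher explain the lesson and the students doesn't understand the lesson."
profile = lexical_profile(sample)
print(profile["tokens"], profile["types"], round(profile["ttr"], 2))
print(profile["freqs"].most_common(3))
```

A type/token ratio computed this way is sensitive to text length, so comparisons across learner levels would normally use samples of similar size.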


In addition, we will explore the pedagogical applications of our corpus, incorporating this information into the materials used for English language teaching (dictionaries, glossaries, grammars and other reference books). Furthermore, we believe that the results of our analysis may have important implications for the fields of Translation and so-called Computer Assisted Language Learning (CALL).


The aim of the project is the compilation of a large and solid corpus of real language, both spoken and written, produced by Spanish learners of English. At present, no corpus with all these features exists, and SULEC would prepare the ground for a great number of subsequent studies in the various areas related to the acquisition and teaching of English, such as Translation and Contrastive Linguistics. Although influential linguists such as Chomsky have been sceptical of research based on corpora, corpus-based research has been used with great success in the study of English, leading to the creation of corpora such as the British National Corpus (BNC) and the International Corpus of English (ICE). This has had a great influence on the creation of corpora for the study of second language performance, and researchers have accordingly put together data collections such as the International Corpus of Learner English (ICLE), the Taiwanese Learner Corpus of English (TLCE) and the Japanese EFL Learner Corpus (JEFLL).


We believe in the importance of basing our research on a corpus. By looking at real second language performance, we avoid relying solely on theories and hypotheses. We therefore expect that the corpus will contain interesting data on the performance of Spanish speakers of English, and that it will be successfully applied to many different research purposes.