Corpora for the coming decade
How should we design and gather a corpus that will meet the needs of linguists and others over the next ten years? The most-cited model for corpus design over the last two decades has been the British National Corpus. While its design was excellent for its time and has served the field well, it is now approaching twenty years old. It comes from the pre-web world, when electronic text was in limited supply (and for many text types not available at all). We need new models for a world where electronic text is available in vast quantities for most text types, and where corpora can therefore be very large and very cheap to prepare.

I will talk about two current projects, both for English: one (the Big Web Corpus, or BiWeC) concentrating on size, and the other (the New Model Corpus) concentrating on corpus structure, markup, and a collaborative model. Our hope is that the two strands will converge, giving a very large corpus with many useful, large, and well-specified subcorpora, richly marked up, and supporting a wide range of research questions across the linguistics and language-technology worlds. The talk will include a demo of the Sketch Engine (a corpus query tool capable of handling multi-billion-word, richly marked-up corpora), as well as some comments on the relation between what we do in corpus linguistics and what Google and other commercial search engines do.