Bookmarks for Corpus-based Linguists 



Corpora, Collections, Data Archives*


* What are the differences among these terms? See below.

* Ok, so how do I actually get my hands on these corpora, & how can I search them? See below.

* For freely accessible, on-line corpora, see separate section below.

 Major English Language General Corpora

Kennedy (1998) suggests a three-way categorisation of corpora.

Pre-electronic Corpora: (biblical & literary studies, early dictionaries, etc.)

1st-generation Major Corpora:

         Brown, LOB, LLC, Kolhapur, Wellington, etc.

2nd-generation Corpora:

         Mega Corpora:

         British National Corpus (BNC);  Corpus of Contemporary American English (COCA);  COBUILD Bank of English

         Not-so-mega Corpora:

         ICE-GB, American National Corpus (ANC), etc.

[Some people call the major "general" corpora above "balanced corpora", but I personally avoid this term. To say something is "balanced" suggests that linguists have agreed on what proportions to assign to different genres (patently untrue) in order to achieve this "balance", while for others "balanced" just means an equal sample of spoken & written. For me, the best terms for the above are "wide-coverage" or "general language", though I personally avoid generalizations about "language in general".]

 

 

Other General Corpora for Written English

(excluding those already in the above lists; please also note other categories (for speech corpora, for instance) below; the same corpus may appear under more than one category, for easy access)

BE06 Corpus (British English 2006)

1-m words, published general written British English; same sampling frame as the LOB and FLOB corpora; consists of 500 files of 2,000-word samples taken from 15 genres of writing published between 2005 and 2008. Copyright restricted. Texts not available; can only be searched online here (registration required).

Corpus of Contemporary American English

(by Mark Davies, Brigham Young University)

c. 360 million wds, including 20m for each year from 1990 to the present. Each year (& therefore overall, as well), the corpus is evenly divided between spoken, fiction, popular magazines, newspapers, & academic. In addition, the corpus will be continually updated--20m wds each year. (Because of copyright & licensing issues, the texts themselves are not available for download—they can only be searched online.)

FLOB (Freiburg-LOB Corpus of British English)

1990s analogue to the LOB corpus (1 m wds, written British English); 2006 analogue to LOB/FLOB is the BE06 Corpus

FROWN (Freiburg-Brown Corpus of American English)

1990s analogue to the Brown corpus (1 m wds, written American English)

ICE-Project Corpora

International Corpus of English

(collects data from countries in which English is the 1st or 2nd lg, or a second official lg)

The main ICE web site has downloadable sample sound files from several ICE teams. Current ICE national varieties completed include Canada, East Africa, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, New Zealand, the Philippines, & Singapore. Others in preparation (as at Aug 2009) include Australia, Fiji, Malaysia, South Africa, Sri Lanka, Trinidad and Tobago, Kenya, Malawi, Tanzania, Nigeria, Pakistan,  Malta and the United States. See also the description under 2nd-generation Mega-corpora.  

LUCY (documentation is here)

structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English of three genres (edited published prose, the writing of young adults (e.g. A-level exam scripts, 1st-year undergraduate essays), spontaneous writing by 9- to 12-year-old children).

SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English)

130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked).

Longman Written American Corpus

[This blurb is from their web site. Availability is unknown, as with all proprietary corpora... no comment on the use of 'corpuses'...] 

A dynamic corpus of 100 m wds from newspapers, journals, magazines, best-selling novels, technical & scientific writing, & coffee-table books..composition constantly being refined & new material added.... based on the general design principles of the Longman Lancaster English Language Corpus & the written component of the British National Corpus. Like other corpuses[sic] in the Longman Corpus Network, wds can be concordanced, wordlists created, & statistical features analysed, allowing lexicographers to compare & contrast usage in British & American English.

Reuters Corpora

(registration required to get the CDs, or get the older Reuters-21578 here.)

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 [810,000 news stories]

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, & Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.]

 * Some of the above-mentioned corpora are conveniently bundled together on the new ICAME Corpus Collection on CD-ROM (click to find out more). It comes with the concordancers WordSmith, TACT & WordCruncher.


 


Spoken Corpora of English

For phonemic/acoustic/articulatory databanks (mainly isolated words, phonemes, or sentences), see separate list of links here (Kiel) or the ELRA/ELDA pages or the LDC. Some people make a distinction between 'speech corpora' (suitable for acoustic/phonetic studies) & 'spoken corpora' (containing transcriptions of any type of spoken language). I use 'spoken corpora' here as an umbrella term for both types.

ANDOSL

(Australian National Database of Spoken Language)

comprises spoken language as it occurs in a variety of major speaker groups in Australia (both native-born & overseas-born migrants); data was elicited either by written material which was read aloud (the "read speech" data) or by graphical material which was discussed by two speakers thereby generating spontaneous speech (the "map task" or "spontaneous" data). Speakers were rigorously selected within phonologically defined speaker groups, each group balanced for age ranges & gender. Recorded in a high quality environment at the National Acoustic Laboratories. Manual annotation at both word & phonemic levels using highly trained transcribers is being combined with automatic methods.

BACKBONE (European languages, incl. English) On-line Search here.

BACKBONE is a European project; web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English as a Lingua Franca (ELF).

BASE (British Academic Spoken English)

See separate entry under "Specialised Corpora" below

CANCODE

(Cambridge & Nottingham Corpus of Discourse in English)

Not generally available for research except at specific sites (annoying!). 5 m wds of spontaneous speech collected between 1995 & 2000. CANCODE has all the transcripts coded to reflect the relationship between the speakers–whether they are intimates (living together), casual acquaintances, colleagues at work, or unknown to each other. Speech events were recorded at hundreds of locations across the British Isles, covering a wide variety of situations: casual conversation, people working together, people shopping, people finding out information, discussions, etc. [see also the Centre for Research in Applied Linguistics, University of Nottingham ]

CHRISTINE

(Spoken version of SUSANNE Corpus)

SUSANNE-meets-spoken-English; Geoffrey Sampson's project 

CSLU Speech Corpora

(Center for Spoken Language Understanding)

several free speech corpora (telephone recordings, conversations with children, pronunciations of isolated digits & alphabets, etc.)

CUCASE

(City University Corpus of Academic Spoken English; forthcoming)

A multimedia corpus currently being compiled (Jan 2008-Sept 2009, initially) by David Y.W. Lee. Will mirror the design of MICASE & BASE; contains academic lectures and student presentations in English at a Hong Kong university (native & non-native speakers). English proficiency courses are excluded. Aim is for 2m words.

Diachronic Corpus of Present-day Spoken English (DCPSE)

800,000 wds (87,188 parse trees) of fully-parsed & annotated spoken British English from the 1950s to 1990s; composed of two 400,000-word samples of spoken English from the London-Lund Corpus (late 1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be consistent with ICE-GB & searchable using ICECUP (Survey of English Usage, University College London).

Dialogue Diversity 'Corpus' (DDC)

See separate entry under "Specialised Corpora" below

ELISA (English Language Interview Corpus as a Second-Language Application)

60,000 wds, 28 interviews with native speakers of English; multimodal (video files available). They talk about their professional career (e.g. in tourism, politics, the media or environmental education). Free for non-commercial use. Has an on-line concordancer. University of Tuebingen.

EUSTACE (Edinburgh University Speech Timing Archive & Corpus of English)

Free for non-commercial use; esp. useful for phonetics researchers & speech technologists working on synthesis & recognition. Comprises 4608 sentences spoken by six speakers of British English; sentences were designed to examine a number of durational effects in speech & are controlled for length & phonetic content. Subconstituents of key words in each sentence have been identified by labels in xlabel (ESPS) format & notes have been made about the prosodic realisation of the sentences. Example sentences available for playback. Speech waveform files are available in .wav (RIFF) format & .sd (ESPS) format.

FRED

 (Freiburg English Dialect Corpus)

Sampler version is here, along with a manual

A specialized corpus of British English dialects covering nine major dialect areas in Britain; 370 texts; c. 2.45 m wds; 300 hours of speech, excluding interviewer utterances (recorded between 1968 & 2000-- some recordings were taken from oral history interviews), 420 different informants (a majority are non-mobile old rural males who typically grew up before WW I.). Recordings will be made available.

Hansard

Parliamentary Proceedings from: the United Kingdom (UK) ;  Canada  ; Australia ;  New Zealand

(Not really 'corpora' in the sense of fixed, formatted texts, but collections of transcripts)

HCRC Map Task Corpus  

(by the Human Communication Research Centre at Edinburgh University) See also LDC Catalog entry

a set of 8 CD-ROMs containing linked audio & transcriptions of a total of about 18 hours (roughly 150,000 word tokens) of spontaneous (task-oriented) speech that was recorded from 128 two-person conversations according to a detailed experimental design. OR Download/ftp a gzipped tar file of the entire corpus (tar [compressed] file is 10MB, whole corpus is 80MB; 2562 XML files & a dtd directory containing 15 dtd files.)

ISLE corpus  

See separate entry under "Specialised Corpora" below

IViE (Intonational Variation in English) Older site here

created to investigate cross-varietal & stylistic variation in English intonation. Focus is on modern or mainstream dialects: nine urban varieties of English spoken in the British Isles, viz. Belfast, Bradford (bilingual Punjabi/English speakers), Cambridge, Cardiff (bilingual Welsh-English speakers), Dublin, Liverpool, Leeds, London, & Newcastle; approximately 36 hours of speech data in five different speaking styles

Lancaster/ IBM Spoken English Corpus (SEC)

52,000 wds of mostly prepared (& mostly monologic) southern British English speech (approximating to RP), collected in the period 1984-1987; orthographic & prosodic transcription & in two versions with grammatical tagging (like those for the LOB Corpus). For a detailed description, see the ICAME Corpus Collection's SEC manual; for the SEC tag-set, see the AMALGAM web site. Ref: A Corpus of Formal British English Speech (1996), Knowles, Gerald, Briony Williams & Lita Taylor (eds.), London: Longman. A collection of research papers based on the SEC has also been published as Working with Speech (1996), Knowles, Gerald, Anne Wichmann & Peter Alderson (eds.), London: Longman.

LeaP (Learning the Prosody of a foreign language)

(Bielefeld)

a large corpus of foreign language learners' speech (Target Languages are English & German, Native Languages span a wide range: German, Polish, Arabic, Chinese, Spanish, etc.). A multitude of data of various types is being collected: the corpus of spoken language will consist of at least 400 recordings of between 2 & 20 minutes in length. It comprises three different speech styles: (i) read speech (a story of 268 wds); (ii) prepared speech (the re-telling of the story); (iii) free speech from an interview context. The central aim of the project is to provide a detailed description of non-native prosody. The second line of research aims to explore whether & how it is possible for learners of a foreign language to acquire the prosody of the target language without having a distinct "foreign accent". In a longitudinal study, various methods of teaching prosody will be tested.

Limerick Corpus of Irish-English (L-CIE)

one-million word spoken corpus of Irish English discourse; conversations recorded in a wide variety of mostly informal settings throughout Ireland (excluding Northern Ireland); currently (accessed: Feb 2008) 375 transcripts; mainly casual conversation, but also over 200K wds of professional, transactional & pedagogic Irish English; not designed to be geographically representative (does not include data from every county); speakers range in age from 14 to 78; equal representation of both male & female speakers; designed to allow inter-corpus comparisons with CANCODE

London-Lund Corpus (LLC)

 See description here

Longman Spoken American Corpus

5 m wds, demographically sampled speech from 12 regions (30 states) across the continental US; coordinated by the University of California at Santa Barbara; everyday conversations of more than 1,000 Americans of various age groups, levels of education, & ethnicity.  PDF with more information is here:

Machine-Readable Spoken English Corpus
(MARSEC)

Some notes on MARSEC version 2 here (latest) or  here (outdated).

MICASE (Michigan Corpus of Academic Spoken English)

See separate entry under "Specialised Corpora" below

Nationwide Speech Project Corpus

a corpus of spoken language containing recordings of young male and female talkers (60 in total) from six regions of the United States. Speech samples include isolated words, sentences, passages, and interview speech. The purpose of the Nationwide Speech Project was to develop a corpus of spoken language that can be used in acoustic and perceptual studies of regional dialect variation in the United States

Newcastle Electronic Corpus of Tyneside English (NECTE)

a corpus of dialect speech from Tyneside in North-East England. It is based on two pre-existing corpora: the Tyneside Linguistic Survey (TLS) project (late 1960s), and the Phonological Variation and Change in Contemporary Spoken English (PVC) project (1994). NECTE amalgamates the TLS and PVC materials into a single Text Encoding Initiative (TEI)-conformant XML-encoded corpus and makes them available in a variety of aligned formats: digitized audio, standard orthographic transcription, phonetic transcription, and part-of-speech tagged.

Northern Ireland Transcribed Corpus

400,000 wds transcribed speech from 42 locations, across three age groups. Contact the Oxford Text Archive.

PROSICE Corpus

a collection of re-recorded ICE-GB texts with high technical specifications; syntactically analysed & temporally aligned. See here for more info.

Reading/Leeds Emotional Speech Corpus

prosodically & paralinguistically coded speech corpus for investigating suprasegmental & affective information in the speech signal. 4.5-hour database of machine-readable speech, of which 26 mins were transcribed using the extended ToBI system. Unfortunately, this corpus is NOT available for use by others, but you can find out more info from the people listed on the website, & also from here.

Saarbruecken Corpus of Spoken English (SCoSE)

See separate entry under "Specialised Corpora" below

Santa Barbara Corpus of Spoken American English (SBCSAE)

(University site is here)

recordings of people talking -- people from all over the United States, in all walks of life, talking about & doing all sorts of things; 249,000 wds; 60 discourse segments of between fifteen & thirty minutes each.

 

Transcripts & audio can be downloaded from the TalkBank site, and some can be heard & read at the same time (as multimedia presentations) through any browser from the TalkBank browser page (click on "CABank", then on "SBCSAE", then on one of the transcripts, then press the "play" button for Quicktime.)

Spoken Corpus of the Survey of English Dialects

See DRH (Digital Resources for the Humanities) Program

Switchboard Corpus (SWB)

(LDC version & (more recent) ISIP version)

a corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations averaging 6 minutes in length each) recorded in the early 1990s; c. 3 m wds (3,044,734) of text, spoken by 543 unique speakers (302 males & 241 females) from most major dialect groups of American English. Info on the speakers' age, sex, education & dialect region. On average, each speaker participates in about 9 calls (but it ranges from 1 to 32).

Talkback Radio Corpus  (Australian)

currently around 200,000 words; is an element of the Australian English Grammar project. Talkback programs from the ABC and commercial radio stations all over Australia are being collected & transcribed to provide examples of spontaneous public speech.

TRAINS Spoken Dialogue Corpus on CD-ROM (University of Rochester web site)

six & a half hours' worth of human-human dialogues; includes 55,000 wds & about 5,500 speaker turns. Audio files for the dialogues are available on the CD-ROM.

Translanguage English Database (TED)
or
the LDC equivalent

See separate entry under "Learner Corpora" below

Tyneside Linguistic Survey (TLS)

Not much info available, but some given on the NECTE page. The TLS corpus was compiled in the late 1960s, & consists of 86 loosely-structured 30-min interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, & were equally divided among various social class groupings of male & female speakers, with young, middle, & old-aged cohorts

Wellington Corpus of Spoken New Zealand English (WSC)

1 m wds of spoken New Zealand English collected from 1988 to 1994 (99% (545 out of 551 extracts) was collected between 1990 & 1994). Of the eight remaining files, four were collected in 1988 (4 oral history interviews) & four in 1989 (4 social dialect interviews). Consists of 2,000-word extracts (where possible) & comprises different proportions of formal, semi-formal & informal speech. Both monologue & dialogue categories are included & there is broadcast as well as private material collected in a range of settings. Access to recordings from the WSC is restricted to use at Victoria University of Wellington. A small number of the recordings which are shared with the ICE-NZ corpus will be made available on CD through ICE.

Wellington Language in the Workplace Project Corpus

Not generally available (?). Project aimed to analyse socio-pragmatic norms of interpersonal communication in a wide variety of NZ workplaces, with recordings done as unobtrusively as possible. Volunteers tape-recorded a range of their everyday work interactions over a period of time, collecting two-party & multiparty meetings, informal work-related conversations, telephone calls, & workplace small talk. Currently (2004) comprises 2000 interactions involving >500 participants, recorded in a number of government departments & commercial white-collar organizations, small businesses, & blue-collar factories. Social talk & business or task-oriented talk, ranging from short telephone calls of <1min to meetings >4 hrs long. Audio recordings are supplemented by detailed on-site ethnographic observations, written agendas & minutes, demographic & organizational info, & video recordings. Contact Janet Holmes at the Victoria University of Wellington, NZ.

British National Corpus (BNC)

Naturally, the spoken component of the British National Corpus is also a rich resource (although for phonetic/prosodic research you'll need to get the audio tapes from the British Library... these are now generally available, but the matching of tapes & actual BNC files is problematic).

The LDC also contains various resources which are not 'corpora' as such, but may be of interest. Example: the LDC American English Spoken Lexicon, which is a collection of pronunciations captured in individual audio files for more than 50,000 of the most common words in English (words were extracted from newswire & telephone conversation; description & links to audio files here), or the West Point Company G3 American English Speech Data, comprising 185 sentences read out by volunteers.

 

Diachronic Comparisons (recent changes in English)

Since the first major English corpora were collected in the 1960s, it is now possible to compare these earlier corpora with more contemporary (1990s) corpora. For written British English, LOB can now be compared with FLOB, while for American English, it's Brown v. Frown. For spoken British English, the Diachronic Corpus of Present-Day Spoken English (DCPSE) allows comparisons of the London-Lund Corpus (LLC, 1960s) with the British component of the International Corpus of English (ICE-GB, 1990s).

More recently, Mark Davies has compiled a Corpus of Contemporary American English (COCA) that is updated every 6-9 months--probably the only corpus of English that is suitable for looking at current, ongoing changes in the language.



Historical Corpora or Collections (English)

ARCHER Corpus
(A Representative Corpus of Historical English Registers)

1.8 m wds (so far --May 2009) of British & American English from written & "speech-based" genres sampled from 7 historical periods covering Early Modern English to the present (range: 1650-1990); 1,037 texts; 10 registers (e.g., drama, letters, science prose) representing speech-based, popular, & specialist/academic written registers. Complements the Helsinki corpus. On-going collaborative research efforts are underway to extend the coverage of the corpus with the Universities of Uppsala, Helsinki, Freiburg, Heidelberg, Lancaster, Manchester & Michigan. The corpus is not publicly available, but the several universities involved in the project are willing to host visits by interested scholars.

Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

a selection of texts from the Old English Section of the Helsinki Corpus of English Texts; contains 106,210 wds of Old Eng text; the samples from the longer texts are 5,000 to 10,000 wds in length; texts represent a range of dates of composition, authors, & genres. For a list of the texts included in the Brooklyn Corpus, click here. The texts are syntactically  & morphologically annotated, & each word is glossed. Size of the corpus: c.12 megabytes.

Century of Prose Corpus

half a m wds of literary & non-literary English; 1680-1780; 120 authors. (Not sure where the Web site is…)

Complete Corpus of Old English

3,022 texts representing all extant Old English texts, compiled at the University of Toronto.

Corpus of English Dialogues, 1560-1760 (CED)

1.2-m wds of Early Modern English speech-related texts (177 text files). The CED contains texts representative of five text types (plus a mixed bag of dialogues labelled 'Miscellaneous'), which divide into two categories: these are 'authentic dialogue', which is written records of real speech events (Trial Proceedings & Witness Depositions), &  'constructed dialogue', in which the dialogue is constructed by an author (Drama Comedy, Didactic Works, & Prose Fiction).

Corpus of Newsbooks

approximately 800,000 wds of running text drawn from all the newsbooks present in the Thomason Tracts that were published from December 1653 to May 1654.

Corpus of Middle English Prose & Verse (CME)

(or visit the parent site, the Middle English Compendium)

collection of Middle Eng texts assembled from works contributed by Univ of Michigan faculty & from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus (archive last updated in October 2000). All 61 texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, & converted to the TEI Lite DTD for wider use. Web-searchable.

Corpus of Early English Correspondence (CEEC) & the Parsed Corpus of Early English Correspondence (PCEEC)

2.7 m wds; 1410 to 1681 (CEECS = 450,000 wds); a supplement, the "Corpus of Early English Correspondence Supplement" (CEECSu; 0.44 m wds) extends the time range: 1402-1663, while the "Corpus of Early English Correspondence Extension" (CEECE; 2.2 m wds) covers the period 1681-1800. The project home page & the manual at ICAME give more details.

Corpus of Early English Medical Writing & Corpus of Middle English Medical Texts (MEMT)

a corpus of medical treatises from 1375-1800. Shorter texts are included in toto & longer treatises are represented by extracts of approximately 10-12 K wds. The medieval section contains about 500,000 wds

Corpus of Late 18c Prose

c.300,000 wds of local English letters on practical subjects, dated 1761-89, as a sample of the English language of the north-west of England in the late Modern English period. These letters, written to Richard Orford, a steward at Lyme Hall in Cheshire, are unselfconscious practical letters, often by uneducated people, on matters of business, farming, mining, & social relations. Available free for ftp download as a single text file or as three linked HTML files for maximum readability.

Corpus of Late Modern English Prose

A 100K-word corpus of informal private letters by British writers, covering the period 1861 to 1919. (Range of dates by birth-date of writer is narrower: 1837-67.) Available from the Oxford Text Archive & through the owner (David Denison).

Corpus of Late Modern English Texts (CLMET)

c.10 m wds of running text; a principled collection of texts drawn from Project Gutenberg & the Oxford Text Archive, divided over three 70-year sub-periods from 1710-1920.

Corpus of Early American English

English in America from the beginning of the 17th century; compiled in Helsinki.

Helsinki Corpus of Older Scots

830,000 wds; 1450-1700, from fifteen genres.

ETED

(& accompanying book)

Transcriptions of 905 depositions drawn from manuscripts collected from the North, South, East and West of England, and the London area; c. 267,000 words. Testimonies by men and women of different ages and walks of life. Five electronic formats (XML, resolved XML, HTML, TXT and PDF) & ETED Presenter, a data retrieval program.

Early English Books On-line (EEBO)

(subscription required)

(images of original print documents, with some now searchable as texts) "From the first book published in English through the age of Spenser & Shakespeare, this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) & Wing's Short-Title Catalogue (1641-1700) & their revised editions, as well as the Thomason Tracts (1640-1661) collection & the Early English Books Tract Supplement."

Helsinki Corpus of English Texts: Diachronic Part

c. 1.5 m wds; 242 files; covers the period from c. 750 to c. 1700 (Old English to Early Modern)

Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)

(1) The Prose Corpus of ICAMET: compilation of 129 texts (March 1999) of Middle Eng prose (1100-1500), digitalized from extant editions & constantly enlarged by further files. Since it is a full-text database, it particularly aims at target groups of users who, unlike those of the Helsinki Corpus, are not so much interested in extracts of texts, but in their complete versions. Thus allows literary, historical & topical analyses of various kinds, esp. studies of cultural history. It also invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text.

(2) The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources (written between 1386 & 1688). Particularly encourages pragmatic & sociolinguistic studies, & analyses concerning cultural life & lifestyle.

NEET (Network of Early Eighteenth-century English Texts)

c. 3-million-words, 18th Century English registers. No more information available, but contact Douglas Biber for more details.

Newdigate Newsletters

750,000 wds; manuscript newsletters from 1674-92.

Old Bailey Corpus

Contains the proceedings of the Old Bailey, London's central criminal court, 1674 to 1913. This constitutes a large body of texts from the beginning of Present Day English. The Proceedings contain about 200,000 trials, totalling c.134 million words, of which about 113 million is direct speech. Sociolinguistic mark-up based on sociobiographical speaker data found in the context for about half of the material identified as direct speech is under way (target: 57 million words).

Lampeter Corpus of Early Modern English Tracts

1m wds of English pamphlet literature covering 1640-1740. Text samples are taken from each decade within this century & several genres are represented. Contains the whole text of pamphlets, rather than fragments.

Leverhulme Corpus Project

(Under construction: 15 months from October 2003)

1-million-word corpus which matches as closely as possible the LOB & FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, & FLOB. This will enable tracking of grammatical change through a period of 60 years of the 20th century. Under construction & as yet unnamed (?)

Penn-Helsinki Parsed Corpus of Middle English

prose text samples of Middle Eng, annotated for syntactic structure. Designed for the use of students & scholars of the history of English, especially the historical syntax of the language

TIME Magazine Corpus

100-m wds from TIME magazine, 1923-2006. Allows you to see how wds & phrases have increased or decreased in usage &/or changed meaning over time.

Women Writers Online

The Brown University Women Writers Project's main undertaking is an SGML-encoded full-text database of pre-Victorian women's writing in English (at present, it covers 1400 to 1850). This collection currently includes nearly 200 texts representing a broad cross-section of the literate culture of pre-Victorian Britain.

York-Toronto-Helsinki Corpus of Old English Prose (YCOE)

1.5 million word syntactically-annotated corpus of Old English prose texts; sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (uses the same form of annotation & is accessed by the same search engine, CorpusSearch). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive. Free for non-commercial use.

York-Helsinki parsed corpus of Old English poetry

a selection of poetic texts from the Old English Section of the Helsinki Corpus of English Texts; 71,490 wds of Old English text; the samples from the longer texts are 4,000 to 17,000 wds in length. The texts represent a range of dates of composition & authors. For a list of the texts included in the York Poetry Corpus, click here. The texts are syntactically & morphologically annotated.

Zürich Corpus of English Newspapers (ZEN)

London newspapers from 1660s to the beginning of the 20th century. Contact: Udo Fries

* See also the Early Modern English Dictionaries Database (EMEDD description here)



Corpora for research on 1st language acquisition

Child Language Data Exchange System

(CHILDES)

XML database here

c.20 m wds (180m characters), 20 languages. The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, & systems for linking transcripts to digitized audio & video. Includes a language acquisition bibliography.

Lancaster Corpus of Children's Project Writing (LCPW)

a digitized collection of project work produced by children aged between 8 & 11; part of a larger research program (a longitudinal study of children's writing-for-learning, based on the writing of 8-12 year old children)

Polytechnic of Wales (POW) Corpus

100,000 wds spoken English by 120 children, aged 6-12; parsed according to Hallidayan Systemic-Functional Grammar. See the manual here. Distributed from two places: the Oxford Text Archive, organised by Lou Burnard, & ICAME in Bergen, Norway (icame@hd.uib.no), organised by Knut Hofland. The AMALGAM tagger emulates the POW tagset.

 

Learner Corpora, Lingua Franca Corpora (for various languages / 2nd Lg Acquisition research)

(Language produced by  non-native speakers/writers)

* See Yukio Tono's Learner Corpora Resources web page for a more comprehensive index to learner corpora web sites (e.g. the various ICLE projects for learner English, such as SWICLE (Sweden), BRICLE (Brazil) & PICLE (Poland)), plus a useful bibliography on learner corpora

International Corpus of Learners' English

(ICLE)

As of May 2009, over 3.7 m wds of writing by advanced/university learners of English (EFL, not ESL) from 25 different mother tongue backgrounds (e.g. Arabic, Brazilian Portuguese, Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Norwegian, Hungarian, Italian, Japanese, Polish, Russian, Spanish, South African, Swedish, Turkish). Two types of essay writing: (1) Argumentative essays (untimed); using language reference tools (dictionaries, grammars, etc.) but entirely the students' own work, i.e. no quoting, no native speaker help; (2) Literature examination papers (no more than 25% of each national corpus). Each essay: between 500 & 1,000 wds long. In May 2009, there were 5,554 argumentative essays & 531 literary or 'other' essays.

The ICLE corpus is now available for purchase (CD-ROM, version 1.1.) from i6doc.com here.

International Corpus of Crosslinguistic Interlanguage (ICCI)

an international effort headed by Yukio Tono at Tokyo University of Foreign Studies (TUFS). Two aims: (1) to compile corpora of the writing of young learners of English across different proficiency levels (from primary up to pre-university) and first language backgrounds (different mother tongues). There are currently 10 scholars from 8 different countries/regions (Hong Kong, Germany, Israel, Japan, Poland, Singapore, Spain, and Taiwan) actively contributing to this project. (2) to compile TUFS students’ L2 writing in the partner country’s languages (e.g., essays in Spanish written by Japanese learners of Spanish).

LINDSEI

(Louvain International Database of Spoken English Interlanguage)

a corpus of spoken learner English from learners from 11 different language backgrounds (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, & Swedish to date). Two types of speech: informal interviews (free talk on a given topic) and picture-prompted speech (based on a standard set of pictures). There is a comparable corpus of speech from English native speakers called LOCNEC (Louvain Corpus of Native English Conversation).

Chinese Learner English Corpus (CLEC)
(An associated web site with slightly outdated info is here)

1 m wds of Eng compositions collected from 5 different levels of Chinese learners of Eng, tagged according to an error tagging scheme of 61 types of error (excludes stylistic errors & error sources, which are difficult to tag objectively & consistently); consists of a book & CD-ROM. The book has an introduction (in Chinese) which gives an account of the corpus design, the methodology used in the statistical analysis of the corpus, and the major findings, + an Alphabetical List, a Lemmatized List, a Word-Frequency Distribution, a Summary Table of Errors, & a List of Spelling Mistakes. The CD-ROM consists of the error-tagged corpus with a simple concordancer, & all the lists & tables of the book. Another companion to CLEC known as Analysis of Chinese Learner Errors in Eng is forthcoming. Available by mail: Shanghai Foreign Language Education Press, 295 Zhong Shan Bei Yi Road, Shanghai 200083, PRC. Contact Mrs. Fan Jianying, email sflep@sflep.com.cn, fax (86)021-55512177. List price: in PRC, 76.00 plus 15% postage; overseas, US$60.80 (including postage). For further information, please contact Professor Gui Shichun (itscgui@gdvnet.com).

SWECCL (The Spoken & Written English Corpus of Chinese Learners)

 

Lancaster Corpus of Academic Written English (LANCAWE)

academic writing samples from non-native speakers of Eng taking study Skills/EAP pre-sessional & undergrad courses. There is also a small native speaker subcorpus that can be used for comparison. Some sub-corpora are organised according to writing task & topic, writer's L1, writing conditions & time at which the piece was produced; contains more than one piece of writing from each learner, & these comprise similar essays written by the same learner at different points in time (e.g., before, during & after the pre-sessional course), as well as different types of essays (e.g., descriptive, argumentative, etc.) written by the same learner at the same or different times. A  longitudinal sub-corpus of LANCAWE is called the Hinestroza-Kim Corpus (HKC).

MELD (Montclair Electronic Language Learners' Database)

English (ESL) text written by all levels of learners in North America; publicly available; timed & untimed writing of undergrad ESL students, dated so that progress can be tracked over time. Demographic data is also collected for each student, including age, sex, L1 background, & prior experience with English. The essays are continuously being tagged for errors in grammar & academic writing as determined by a group of annotators. The database currently (May 2009) consists of 44,477 wds of tagged text & another 53,826 wds of text. Allows various analyses of student writing, from assessment of progress over time to relation of error type & L1 background. Errors are annotated independently by two trained annotators without reference to a pre-determined list of error types. The error annotation is then adjudicated by the two annotators in consultation with one of the project directors.

ELFA
(English as a Lingua Franca in Academic Settings)

recordings & transcripts of spoken English used as a lingua franca in academic settings (Tampere University & Tampere Technological University in Finland). Sessions with speakers who all share an L1 are not included, neither are Eng lg courses. Coded for speech event type/genre, discipline/domain, interaction type (dialogic/monologic), age group, gender, nationality & mother tongue.

VOICE
(Vienna-Oxford International Corpus of English)

a corpus of English as a Lingua Franca (i.e., English as the means of communication regarded as the most convenient one by speakers from different first-language backgrounds). The focus is on unscripted, largely face-to-face communication among competent speakers from a wide range of L1 backgrounds whose primary & secondary education & socialization did not take place in Eng. Speech events include private & public dialogues, private & public group discussions & casual conversations, & one-to-one interviews. An on-line search for VOICE 1.0 Online is available (pre-registration required).

FRIDA (French Interlanguage Database)

a corpus of French as a foreign language, with a target size of 450,000 wds.

EnglishTLC

(English Taiwan Learner Corpus)

c. 2 m wds of unrestricted running text written by learners of English in Taiwan (majority by senior high school & university students). Essentially a self-propagating corpus: EnglishTLC is integrated with the writing component of a web-based English learning platform called IWiLL. Partially annotated for errors, consisting of comments made by teachers in their everyday process of correcting essays online using the IWiLL essay correction interface (the comments provide a window onto actual teacher feedback & teaching practice). The research interface provides a search function for extracting every error token marked by teachers on essays in the corpus. This function then lists all comments in descending order of the number of instances marked as tokens of that error type. Each comment in this list then links to a listing of all of the sentences in EnglishTLC that have been marked as that error type. Since teachers are selective in the errors which they mark in student writing, this sort of annotation in EnglishTLC should be regarded as partial annotation. Heuristics have been devised for bootstrapping from these partially annotated texts to the extraction of further error tokens that the teachers left unmarked (see Wible et al 2003 for details). Feedback effects are traceable: the errors that teachers have marked as feedback to the students are also indexed to any revisions the learner may have made to their essay after reading that teacher feedback. This makes it possible to uncover learners’ attentiveness to or grasp of comments given.
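[A rough illustration of the kind of lookup described for the EnglishTLC research interface: group marked sentences under the teacher comment attached to them, then rank the comments by how many instances each has. This is only a sketch under my own assumptions; the record layout, the sample comments & sentences, & the function names are all hypothetical & are not taken from IWiLL or EnglishTLC.]

```python
from collections import defaultdict

# Hypothetical annotation records: (teacher comment, sentence marked with it).
# These examples are invented for illustration, not drawn from EnglishTLC.
annotations = [
    ("subject-verb agreement", "He go to school every day."),
    ("article missing", "I bought book yesterday."),
    ("subject-verb agreement", "She like music."),
    ("word choice", "I very like it."),
    ("subject-verb agreement", "My friends is kind."),
    ("article missing", "She is teacher."),
]


def index_by_comment(records):
    """Map each comment to the list of sentences marked with it."""
    index = defaultdict(list)
    for comment, sentence in records:
        index[comment].append(sentence)
    return index


def ranked_comments(index):
    """Return (comment, count) pairs, most frequent first; ties alphabetical."""
    pairs = [(comment, len(sentences)) for comment, sentences in index.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


index = index_by_comment(annotations)
ranking = ranked_comments(index)

# Print the ranked list, then the sentences behind the top comment.
for comment, count in ranking:
    print(f"{count:>2}  {comment}")
for sentence in index[ranking[0][0]]:
    print("   ", sentence)
```

[Because teachers mark errors selectively, counts from a listing like this reflect what was marked, not what occurs in the texts.]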

Hong Kong University of Science & Technology (HKUST) Corpus

the biggest corpus of Chinese (Cantonese) learners of English (or, indeed, of any single group of learners of English). 25 m wds, with grammatical & discourse-feature tags. Texts consist of written undergrad assignments & 'A-level' scripts. Contact: Gregory James, Language Centre, Hong Kong University of Science & Technology, Clear Water Bay, HK. See Milton, John & K.S.T Tong (eds.) (1991). Text Analysis in Computer-Assisted Language Learning. Hong Kong: Hong Kong University of Science & Technology. See also: AUTOLANG & WORDPILOT (corpus-based tutor; shareware)

Learner Business Letters Corpus

209,461 word tokens in 1,464 letters written by Japanese business people. Searchable through a web concordancer here. More details about the collection, constitution, etc. of the corpus can only be found by browsing through Someya's M.A. dissertation available on-line here.

European Science Foundation Second Language Databank

a computerized archive of the spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, & their communication with native speakers in the respective host countries (France, Germany, Great Britain, The Netherlands & Sweden). For each target language, two source languages were selected.

Hungarian Learner English (JPU Corpus)

Hungarian university students' English 

ISLE database

(Interactive Spoken Language Education)

 [not really a "corpus" as such]; database of non-native English created to help train & test the ISLE automatic pronunciation tutor system; approx. 20 minutes of speech (per speaker) from 23 German & 23 Italian intermediate learners of English. Each speaker recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions). The prompts were of varying perplexities. About 2/3 of the data for each speaker was annotated by one of a team of linguists. The files were corrected first at the word level, & an automatic recognizer was then used to produce phone-level annotations. The annotator then re-annotated each sentence to mark phone & stress errors (e.g., substitutions, insertions, or deletions). It comprises: a total of 46 speakers (23 German & 23 Italian); 11,484 utterances; 1.92 gigabytes of WAV files; 17 hours, 54 minutes, & 44 seconds of speech data. It is distributed on 4 CD-ROMs. Contact ELRA for purchasing information.

PICLE

The Polish component of ICLE. This corpus, along with some comparable English (undocumented) & Polish corpora, can be searched on-line using various tools provided.

SILS Learner Corpus of English

(Waseda Univ)

essays by students at SILS, the School of International Liberal Studies at Waseda Univ, Japan; wide variety of backgrounds (majority Japanese); can be used to look at the effects of native lg and educational background on writing skills in English; will be collecting many essays from each individual student (longitudinal), and both 1st and 2nd drafts, with teachers' comments.

Thai English Learner Corpus (TELC)

No Web site I can find, but related site is here

written corpus of 1.3 m wds (on 23/5/2002), tagged for part of speech & lemma. Comprises writing samples of Thai EFL university students, starting 1997 & continues to grow. 700,000 wds of written Eng taken from university entrance exams at the Institute for English Language Education (IELE, Assumption University, Thailand) & 600,000 wds from essays written by fourth year Thai EFL learners at the Institute. Searchable on-line, but limited to 100 concordance lines. For full access, contact the owners.

Tswana Learner English Corpus (TLEC)

modelled on ICLE; corpus of 200K wds of argumentative essays from advanced learners of English in institutions of higher learning in South Africa.

BACKBONE (European languages, incl. English) On-line Search here.

BACKBONE is a European project; web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English as a Lingua Franca (ELF).

Translanguage English Database (TED) or the LDC equivalent

a corpus of recordings made of oral presentations given in English by non-native speakers of Eng at Eurospeech'93 in Berlin. 224 oral presentations providing about 75 hours of speech material. These recordings provide a large number of presenters, speaking multiple variants of English, over a relatively large amount of time (15 minutes for each presentation + 5 minutes of discussion), on a specific topic. This release of TED (6 CDROMs) includes 188 speeches, without the ensuing discussion periods. Associated text materials = ASCII versions of over 400 proceedings papers & oral preparations that were supplied by the authors, as well as 250 speaker questionnaires. There is also an associated TED Transcripts corpus (or LDC equivalent here) containing transcriptions of 39 of the 188 speeches, in Universal Transcription Format (UTF). All UTF files in the transcript publication were validated against an included utf.dtd. Tables containing speaker demographic information & a cross-reference of file names from the TED audio corpus are included.

Longman Learners' Corpus

Not generally available (except by arrangement with publishers). Students & teachers throughout the world sent in essays & exam scripts to help create the Longman Learners' Corpus, a 10-million word computerised database made up entirely of language written by students of English. Every nationality, every language level is represented in the corpus & this provides a unique insight into learner English.

VALICO (Varietà di Apprendimento della Lingua Italiana: Corpus Online)

(Online Corpus of the Learning Varieties of the Italian Language); texts encode a set of sociolinguistic data to determine the learners' profiles (learners' age, gender, proficiency in Italian, knowledge of other languages, mother tongue…). Learners were given a common stimulus to elicit the texts, to allow comparisons across countries.

International Corpus of Learner Finnish (ICLFI)

timeline: 2008-2011. Spontaneous texts produced by learners of Finnish around the world.

Cambridge Learner Corpus

Not generally available. A large collection of examples of English writing from learners of English all over the world; over 15 m wds & expanding all the time; part of the Cambridge International Corpus (CIC); comes from anonymised exam scripts written by students taking Cambridge ESOL English exams around the world; each script is coded with information about the student's first language, nationality, level of English, age, etc. Currently, it can only be used by authors & writers working for Cambridge University Press & by members of staff at Cambridge ESOL.


 


Specialised Corpora of English (specific dialects, genres, registers)

Many of these are suitable for ESP teaching, learning & research

Acknowledgements Corpus

four sub-corpora: (1) 32 examples of 'The author's acknowledgements' in published books; (2) 6 examples of 'The publishers' acknowledgements' in published books; (3) 5 examples of 'Acknowledgements in research articles' placed as footnotes in the papers; (4) 6 examples of 'Acknowledgements in research articles' placed just before the references. The associated guide to writing acknowledgments is here.

ACL/DCI CD-ROM disk

about 63 m wds of plain orthographic English collected by the Association for Computational Linguistics' Data Collection Initiative; consists of: the Collins English Dictionary; selections from the Wall Street Journal (40m wds); a database of scientific abstracts from the U.S. Department of Energy (23m wds); the `Penn Treebank' of skeleton-parsed data compiled by Mitch Marcus & his team at the University of Pennsylvania (Marcus & Santorini, 1992).

 Air Traffic Control (ATC) Corpus

70 hours of recorded conversation between controllers & aircraft at three major airports in the United States; 3 subcorpora, one for each of the three airports; each subcorpus consists of 20-25 hours of data, representing continuous recording without silence elimination. The speech files are fully transcribed, with time marking indicating beginning & end of transmission.

American Heritage Intermediate (AHI) Corpus

5.09 m wds; based on a 1969 survey of US schools; 10,043 samples, each 500 wds long, from publications which were widely read among American schoolchildren aged 7 to 15 years. See: Carroll, John, Peter Davies & Barry Richman (1971) (eds.) Word Frequency Book. New York: American Heritage. Web version (plus other data) via the Univ. of Michigan Digital Library Production Service (AHD, 3rd edition) Restricted Access

Asian Newspaper English

(Hong Kong University; old URL was here)

A web-based concordance is derived from a corpus of 114,502 wds (13,971 types) from English-language newspapers in 18 Asian countries, dated September-November 2000, inclusive. Compiled for teaching & demonstration purposes only; it should not be seen as a representative sample, & the texts may not be re-distributed in any form.

BASE  
(British Academic Spoken English)
(for developments since 2007 click here.)

The British analogue to MICASE. A corpus of university lectures & seminars developed at the Universities of Warwick & Reading, under the directorship of Hilary Nesi, with Paul Thompson. Recordings & transcriptions of 160 lectures & 39 seminars in a range of departments, at both undergraduate & postgraduate level (1,644,942 tokens in total). Transcriptions, video & audio recordings have been archived by the Arts & Humanities Data Service. One method of access is through the Open SketchEngine interface.

BAWE 
(British Academic Written English)
(for developments since 2007 click here.)

A corpus of good-quality student assignments across disciplines, from first year undergrad to masters level, developed at the Universities of Warwick, Reading, Oxford Brookes & Coventry, under the directorship of Hilary Nesi, with Paul Thompson, Sheena Gardner & Paul Wickens. 2,761 assignments from 627 student contributors in 33 university departments, totalling 2,896 independent texts (6,514,776 wds). Corpus development was funded by the Economic & Social Research Council (2004-2007). The corpus will be available to researchers from the Arts & Humanities Data Service & the ESRC Data Archive. One method of access is through the Open SketchEngine interface.

Blog Authorship Corpus

The collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts & over 140 million wds, or approximately 35 posts & 7,250 wds per person.

Boston Directions Corpus

American English, task-oriented monologues, both read & spontaneous; multiple non-professional speakers who were given written instructions to perform a series of increasingly complex direction-giving tasks. No known web site, & probably not generally available. More information available in publications based on this corpus, such as this one.

 Business Letters Corpus

Someya's corpus of Business Letters (1,020,060 word tokens of U.S. & U.K. samples, as of 1 March 2000). (Also searchable (separately): a non-native English corpus of business letters written by Japanese business people -- see the Learner Business Letters Corpus entry above.) More info & a web concordancer on this page: Online BLC Concordancer

Carnegie Mellon Communicator Corpus (details here)

a large corpus of speech produced by callers to a Travel Planning system; around 180,605 utterances (90.9 hours) in 2002.

CHAINS corpus 

(CHaracterizing INdividual Speakers)

a novel speech corpus which may be of interest to those looking at diverse speaking styles, & those seeking to characterize speaker identity; features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, & at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. Free for research purposes.

Circle Archive

(old site was here)

a collection of (mostly freely downloadable) transcripts of tutorial sessions. [Used mainly by researchers in education, psychology & cognitive science, I believe. All  were collected in the USA (as far as I know) from high school & college/university settings, so they represent different contextualised varieties of spoken American English. -- DL] The archive is grouped by "corpus", where a "corpus" is a set of transcripts collected by the same investigator, where the tutoring sessions were on roughly the same topic, with similar tutors & students. [None of the transcripts are formatted or structured in the way modern-day corpora are, & all are in either HTML or plain text format. -- DL]  (* comments in square brackets represent my own views/understandings of the data.)

Coconut Corpus

a collection of human-human computer-mediated dialogues in which two subjects collaborate on a simple task, buying furniture for the living & dining rooms of a house

COLT

(Bergen "Corpus Of London Teenage Language")

spoken language of 13 to 17-year-old teenagers from different boroughs of London; half a m wds, orthographically transcribed & word-class tagged, & is a constituent of the British National Corpus; A pilot-version consisting of 151 texts is now available on the Internet. For registered users, the search program can also show the distribution of an item in relation to factors such as age, sex, socioeconomic class, location etc. Search LOB/COLT with IMS Corpus Work Bench 

Corpus of High School writing (Australian)

developed to investigate the acquisition of writing skills among High School students in Australia; 3,686 essays by NSW high school students, written for the 1993 Sydney Morning Herald Young Writers Competition, totalling c. 160,000 words. Source: 227 schools scattered all over the state, and from all six years of secondary schooling (though fewer from Year 11 and 12 students). An extensive annotation system was devised, with four compartments: spelling, word forms, syntax and punctuation.

Corpus of Professional English (CPE)

a major research project of PERC (Professional English Research Consortium) currently underway that, when finished, will consist of a 100-million-word computerized database of Eng used by professionals in science, engineering, technology, law, medicine, finance & other fields.

Corpus of Spoken Professional American-English (CSPA)
(Commercial product by Athelstan)

2-m-word part-of-speech tagged corpus consisting of transcripts of American Eng spoken in professional settings (committee meetings, faculty meetings & White House press conferences); recorded from 1994-1998; consists primarily of short interchanges by approximately 400 speakers that are centered on professional activities broadly tied to academics & politics, including academic politics; seventeen files (12 MB).

Corpus of Written British Creole (CWBC)

Mark Sebba's project. Some introductory notes about the corpus here.

CorTec - Technical Corpus

(click on "CorTec" on the left menu)

a bilingual (English & Portuguese) comparable corpus of technical language (linked to the COMET project) in 5 areas: Cooking, Contracts, Computing, Environment & Hypertension. For copyright reasons, the corpora themselves cannot be accessed, but they can be searched with the tools provided: Concordancer, wordlist & N-gram extractor.

Dialogue Diversity 'Corpus' (DDC)

Not, technically speaking, a 'corpus' as such, but a collection of links to different dialogue texts (transcriptions &/or sound files), covering a very diverse collection of interactive situations--a data resource for studies of the breadth of coverage of particular dialogue models, & for studies that compare dialogue from different situations. Taken as a whole, this 'corpus' is irregular & not homogeneous in any way. It is generally unsuitable for drawing any conclusions about dialogue taken as a single category.

Enron Email Dataset/Corpus

collected & prepared by the CALO Project (A Cognitive Assistant that Learns & Organizes). Contains e-mails from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, & posted to the web, by the Federal Energy Regulatory Commission during its investigation. Does not include attachments, & some messages have been deleted "as part of a redaction effort due to requests from affected employees", & some email addresses were anonymized. Probably the only substantial collection of "real" email that is public (because of privacy concerns). In using this dataset, please be sensitive to the privacy of the people involved (& remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).

Jiao Tong University Corpus for English in Science & Technology (JDEST)

1 m wds (2,000 units of about 500 wds each) from written English texts from the physical sciences, engineering & technology (divided into the following ten subject areas: Computers, Metallurgy, Machine Building, Physics, Electrical Engineering, Civil Engineering, Chemical Engineering, Naval Architecture, Atomic Energy, Aircraft Manufacturing). Randomly selected from theses, textbooks, academic works, popular science & science digests, published in the UK, the US & other countries. [No web page that I know of.]

Hyland's Research Articles Corpus

Ken Hyland's personal corpus of published research articles, representing written academic English. Not available to the general public, but contact owner directly for more info. Consists of 30 texts each from 8 disciplinary areas (biology, engineering, mechanical engineering, linguistics, marketing, philosophy, sociology, physics), totalling 1.3m wds.

Lancaster Corpus of Academic Written English (LANCAWE)

academic writing samples from non-native speakers of Eng taking study Skills/EAP pre-sessional & undergrad courses. There is also a small native speaker subcorpus that can be used for comparison. Some sub-corpora are organised according to writing task & topic, writer's L1, writing conditions & time at which the piece was produced; contains more than one piece of writing from each learner, & these comprise similar essays written by the same learner at different points in time (e.g., before, during & after the pre-sessional course), as well as different types of essays (e.g., descriptive, argumentative, etc.) written by the same learner at the same or different times. A  longitudinal sub-corpus of LANCAWE is called the Hinestroza-Kim Corpus (HKC).

Leipzig Corpora Collection

(various languages)

corpora in different languages using the same format & comparable sources (identical in format & similar in size & content). Randomly selected sentences. Available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences, etc. The sources are either newspaper texts or texts randomly collected from the Web. All data (publicly accessible, copyrighted sources) have been processed automatically so that it is not possible to reconstruct the original source texts. Significant L1, R1 & "within sentence" collocates are computed for each word. Available as plain text files, or as MySQL database tables (ready to use with a supplied Corpus Browser).
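[For readers unfamiliar with the terms: an L1 collocate is the word immediately to the left of a node word, & an R1 collocate the word immediately to its right. The sketch below only counts raw L1/R1 neighbours in a few invented sentences; it does not apply any significance test, & it is not the Leipzig project's own procedure. The sample sentences & the function name are my assumptions.]

```python
from collections import Counter

# Invented sample sentences, one per line as in a sentence-based corpus file.
sentences = [
    "the old corpus was large",
    "the new corpus was small",
    "a new corpus is planned",
]


def neighbour_counts(sents, target):
    """Count immediate left (L1) and right (R1) neighbours of `target`."""
    left, right = Counter(), Counter()
    for sentence in sents:
        tokens = sentence.split()  # naive whitespace tokenisation
        for i, token in enumerate(tokens):
            if token != target:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right


left, right = neighbour_counts(sentences, "corpus")
print("L1:", left.most_common())
print("R1:", right.most_common())
```

[Raw counts like these favour frequent function words; a real collocation listing would weight each pair with a statistical association measure.]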

Leuven Drama Corpus

compiled at the Institute of Applied Linguistics of the Catholic University of Leuven (L. K. Engels & Dirk Geens). 62 British English plays first published during 1966-1972. 1 million words.

LOCNESS (Louvain Corpus of Native English Essays)

a corpus of native English essays made up of: British pupils' A level essays (60,209 wds), British university students' essays (95,695 wds), American university students' essays (168,400 wds). Total: 324,304 wds.

METER Corpus (MEasuring TExt Reuse)

collected from British PA (Press Association) archive & 9 British national newspapers; 528,563 wds from the two journalistic domains of 'Law & Courts' & 'Show Business'; project aim was to develop techniques for detecting & measuring text reuse (mapping derived texts to their source texts, indicating the probability of derivation). One CD-ROM (free)

MICASE (Michigan Corpus of Academic Spoken English) (American English).
(Compare with the BASE corpus above)

a free (& web-accessible) spoken American Eng corpus of c.1.7 m wds (190 hours of recordings) focusing on contemporary university speech within the microcosm of the Univ of Michigan. Has a free-to-use accompanying web concordancer/search engine that can search by speaker or speech event attributes. Speakers include faculty, staff, & all levels of students (mostly native, some non-native speakers) across several speech events (incl. monologic & interactive speech) from all of the major academic divisions (with the exception of the professional schools, i.e., medical, dental, business, & law). 15 different types of speech event: small/large lecture, public interdisciplinary or departmental colloquia, discussion sections, student presentations, seminars, undergraduate lab sessions, lab group & other meetings, one-on-one tutorials, office hours, advising consultations, dissertation defenses, study groups, interviews, campus/museum tours, & service encounters. Full transcripts can be ordered for a nominal fee (XML format). Some audio recordings of the original speech events are available here (streaming Realaudio), or in other formats by special arrangement to bona fide researchers. A manual giving more detailed information about the corpus is here.

MICUSP (Michigan Corpus of Upper-level Student Papers)
(under construction)

MICUSP (the Michigan Corpus of Upper-level Student Papers); 1.6 m wds; assessed genres of writing by senior undergraduate (4th year) & graduate students in the US (native & non-native speakers of English); length of the texts ranges from 500 to 10,000 wds; being developed at the University of Michigan’s English Language Institute.

MuchMore Springer Bilingual Corpus

a parallel corpus of English-German scientific medical abstracts; c. 1 million tokens for each language. Abstracts are from 41 medical journals, each of which constitutes a relatively homogeneous medical sub-domain (e.g. Neurology, Radiology, etc.). The corpus of downloaded HTML documents was normalized in various ways in order to produce a clean, plain text version consisting of a title, abstract and keywords. Additionally, the corpus was aligned on the sentence level.

NIE Corpus of Spoken Singapore English

(NIECSSE)

aims to provide high-quality recordings of Singaporean speakers. The aim of the corpus is to facilitate acoustic/phonetic analysis of Singapore English. In order to eliminate background noise & thereby facilitate acoustic/phonetic measurement, all recordings were made directly onto the computer in the NIE Phonetics Laboratory. Consists of interviews & a read text.

Nijmegen Corpus & TOSCA Corpus (Tools for Syntactic Corpus Analysis)

Nijmegen Corpus: 132,000-word syntactically analysed corpus of written (120,000 wds) & spoken (12,000 wds of sports commentaries) modern British English; 20,000-word samples of fiction & non-fiction from 1962-68. TOSCA Corpus: 1.5 m wds (75 samples x 20,000 wds each) syntactically analysed; texts from 1976-86. Used via the Linguistic DataBase (LDB), a database program created by the TOSCA corpus linguistics group at Nijmegen Univ for the storage & exploration of syntactically analysed texts. Features a tree viewer & an extensive query language.

Oxford Psycholinguistic Database

comprises 98,538 English wds & information on the spelling, syntactic category & number of letters for each of these as well as information on the phonetics, syllabic count, stress patterns & various criteria affecting comprehension. See also notes on the use of a psycholinguistic database by S. Devlin & the MRC psycholinguistic database (searchable web version)

Reading Academic Text corpus (RAT)

c. 1 m wds composed of twenty research articles written by Reading University academic staff, & a small set of PhD theses, written by, & contributed by, successful doctoral candidates in the Faculty of Agriculture. Restricted Access.

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19

See separate entry below under D-I-Y corpora

RST Discourse Treebank

[For Discourse Analysis, message understanding, etc.] 

a selection of 385 Wall Street Journal articles from the Penn Treebank which have been annotated with discourse structure in the framework of Rhetorical Structure Theory (RST). In addition, the corpus includes a number of humanly-generated extracts & abstracts associated with the original documents.

Saarbruecken Corpus of Spoken English (SCoSE)

Freely downloadable. Has seven parts: Part 1: Complete Conversations;  Part 2: Indianapolis Interviews; Part 3: Jokes; Part 4: Drawing Experiment; Part 5: Kassel Classroom Discourse; Part 6: Stories; Part 7: London Teenage Talk

Transcripts & audio can be downloaded from the TalkBank site, and some can be heard & read at the same time (as multimedia presentations) through any browser from the TalkBank browser page (click on "CABank", then on "SCoSE", then on one of the transcripts, then press the "play" button for Quicktime).

SCoRE

[Singapore  Corpus  of  Research  in  Education] 

a multimodal corpus database of "education discourse" in Singapore schools; classroom interactions, teaching materials & students' assignments in Singapore primary & secondary schools; attempts to annotate at different linguistic & discourse levels. The proposed deliverables include a speech subcorpus, a lexical subcorpus, & several multilevel annotated subcorpora at different development stages.

Scottish Corpus of Texts & Speech (SCOTS)

contains documents in Scottish Standard English, documents in several varieties of Scots, & everything in between. While Scottish Standard English has a standard written form, Scots does not. This means that the corpus contains a wide range of spelling variation (steps being made to offer a means of searching for all of the variant spellings automatically in a later stage of the project). SCOTS is a publicly available resource on the Internet.

SLX Corpus of Classic Sociolinguistic Interviews

8 sociolinguistic interviews, 9 speakers. William Labov & one of his students conducted the interviews in the 1960s & 70s. These interviews represent solutions to the problems of achieving cross-cultural contact, reducing the effect of the Observer's Paradox & approximating the vernacular of everyday life. Complete interview recordings plus time-aligned verbatim transcripts for each speaker. Also included: (i) a sociolinguistic variable survey that represents an overview of the intra- & inter-speaker variation attested in the corpus, highlighting a broad range of phonological, phonetic, grammatical, lexical & stylistic variables. (ii) a number of annotation tools that allow users to listen to each interview while browsing the corresponding transcripts, & to display & hear each token identified in the variable survey. The recordings demonstrate successful interviewing techniques, the sound quality is high, & the digitization, segmentation & transcription of the data represent best practice in these areas. The variable survey highlights over 150 sociolinguistic variables attested in the corpus & suggests avenues for further research. Most importantly, the SLX Corpus provides both an example of a digital speech corpus developed specifically to support sociolinguistic research, & a stable benchmark for training in sociolinguistic data collection, digitization, segmentation, transcription, analysis & publication. 17 speech files (22050Hz, 16 bit, single-channel in the MS WAV (RIFF) format), total of 575 minutes (~ 1.5GB); DVD-ROM.
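The file-size figure quoted above can be sanity-checked from the audio format alone. The following is only a back-of-the-envelope sketch in Python: it assumes uncompressed PCM, ignores the small WAV header on each file, and the function name pcm_size_bytes is my own invention, not part of any SLX tool.

```python
# Rough size check for uncompressed PCM audio (WAV header overhead ignored).
SAMPLE_RATE_HZ = 22050   # from the corpus description
BYTES_PER_SAMPLE = 2     # 16 bit
CHANNELS = 1             # single-channel
TOTAL_MINUTES = 575      # from the corpus description


def pcm_size_bytes(minutes, sample_rate=SAMPLE_RATE_HZ,
                   bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the size in bytes of uncompressed PCM audio of the given length."""
    seconds = minutes * 60
    return seconds * sample_rate * bytes_per_sample * channels


total_bytes = pcm_size_bytes(TOTAL_MINUTES)
print(f"{total_bytes / 1e9:.2f} GB")
```

The result comes out at roughly 1.5 GB, which agrees with the stated size of the speech files.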

Speech, Thought & Writing Presentation Corpus (STWP)

A corpus of around 250,000 wds annotated for categories of speech, thought & writing presentation; genres included: fiction, newspaper reports, biographies/autobiographies.  Available through the Oxford Text Archive.

SRI American Express travel agent dialogue corpus

a large corpus of actual travel agent interactions with client callers

Switchboard Corpus (LDC)

See separate entry above

TimeBank

a set of 186 news report documents annotated with the 1.1 version of the TimeML standard for temporal annotation. This release should also include a copy of the TimeML schema version 1.1.

Translational English Corpus (TEC)

contemporary translational English: written texts translated into English from a variety of source languages, European & non-European. Supports a broad range of studies in two main areas: the way in which the patterning of translated text might be different from that of non-translated text in the same language, & stylistic variation across individual translators. Set up & currently managed by Mona Baker.

TIMIT Acoustic-Phonetic Continuous Speech Corpus

"read speech" designed to provide speech data for the acquisition of acoustic-phonetic knowledge & for the development & evaluation of automatic speech recognition systems; contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences

T2K-SWAL Corpus (The TOEFL 2000 Spoken & Written Academic Language Corpus)

[Owned by the Educational Testing Service (ETS), USA. NOT publicly available.] 2.8 m wds; 490 texts; 8 spoken & written registers (e.g., classroom teaching, study groups, textbooks) taken from 6 academic disciplines at four US universities; designed to represent the range of spoken & written registers that students will regularly encounter in university life. Part-of-speech-tagged. No web page available, but Biber's NAU page gives the brief description reproduced here, & the following article gives more detailed info: Biber, D., S. Conrad, R. Reppen, P. Byrd, & M. Helt. 2002. Speaking & writing in the university: A multi-dimensional comparison. TESOL Quarterly, 36(1):9-48.

Wolverhampton Business English Corpus

[description & purchasing information from ELDA]

10,186,259 wds in the general domain of business, collected from 23 different web sites around the world (from six months within the period 1999-2000), covering a wide variety of categories including product descriptions, company press releases, annual financial reports, business journalism, academic research papers, political speeches & government reports. POS-tagged. Alternatively, on this site, you can see & compare frequency lists & ngrams for various subcorpora/text genres (including business texts).

PLUS hundreds of others available from the Linguistic Data Consortium (LDC) & ELRA/ELDA catalogue. However, the 'corpora' in these catalogues which are not listed on this site are mostly specialised collections/small corpora of isolated sentences (hence not really text corpora but collections of sentences). You could also try querying the OLAC archives.


 

Text Archives & Corpus Distribution Sites (various languages)

(see also the D-I-Y corpora section)

Alex Catalogue of Electronic texts

a collection of digital documents collected in the subject areas of English literature, American literature, & Western philosophy. Basic concordancing & browsable, downloadable full texts.

American Memory

a gateway to rich primary source materials relating to the history & culture of the United States. The site offers more than 7 million digital items from more than 100 historical collections (some as images of documents, some in text format).

Bavarian Archive for Speech Signals (BAS)

makes databases of spoken German accessible in a well structured form to the speech science community as well as to speech engineering

Electronic Text Center 

(University of Virginia)

combines an on-line archive of tens of thousands of SGML & XML-encoded electronic texts & images with a library service that offers hardware & software suitable for the creation & analysis of text. SGML texts are converted to HTML when you select them in your web browser. Has texts in English (Middle & modern), German, French, Latin, Apache, Japanese, Chinese, etc.

Oxford Text Archive (OTA)

"holdings include electronic editions of works by individual authors, standard reference works such as the Bible & mono-/bilingual dictionaries, & a range of language corpora"; "electronic texts & corpora of interest not only to literary textual scholars, but also those working in linguistics, history, law, modern & ancient languages, indeed almost any humanities discipline which relies upon a close reading of texts."

Project Gutenberg

(or mirror site here)

books published pre-1923, anything out of copyright; e.g. Shakespeare, Poe, Dante, Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan & Mars books of Edgar Rice Burroughs, Alice's Adventures in Wonderland as told by Lewis Carroll, & thousands of others.

String frequency reports for 5400+ books (400M wds) from Project Gutenberg available at Ronald Reck's site (but read this Corpora List message for details)

ELDA (European Language Resources Distribution Agency)

the distribution arm of  ELRA (European Language Resources Association). Has a searchable catalogue covering their speech resources, written corpora & terminological resources.

ICAME (International Computer Archive of Modern & Medieval English)

Collects & distributes information on English language material available for computer processing & on linguistic research completed or in progress on the material. The ICAME CD-ROM (20 different corpora, totalling > 17 m wds) contains most of the important English Language corpora used in research.

TRACTOR

TELRI Research Archive of Computational Tools & Resources (TELRI = Trans-European Language Resources Infrastructure); Corpora in 20 languages; Parallel corpora in a variety of pairings; Software for processing corpus evidence; Lexicons & other language-information resources.

Linguistic Data Consortium (LDC)

supports language-related education, research & technology development by creating & sharing linguistic resources: data, tools & standards. Has lots of specialised corpora for many languages (most of them, however, intended for NLP).

OLAC
(Open Language Archives Community)

has a search facility covering the resource catalogs of LDC, ELRA & the ACL/DFKI Natural Language Software Registry, & permits single searches to be applied to all catalogs simultaneously. The OLAC cross-archive search engine now harvests 11,000+ records from 12 OLAC archives. Try it out using the query box in the top right corner of the web page OR the more advanced search facility hosted by the Linguist List.

OLAC is an international partnership of institutions & individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, & (ii) developing a network of interoperating repositories & services for housing & accessing such resources. OLAC was founded at the Workshop on Web-Based Language Documentation & Description, held in Philadelphia in December 2000.

RELATOR (European Linguistic Resources Repository Network)

a CEC-funded initiative which addresses the vital area of linguistic resources for spoken & written language processing


Non-English, Parallel & Multilingual Corpora

(Click here for separate page )

 

 

The Web as a Corpus

(Concordancing current/'live' pages on the Web rather than your own local texts)

Warning: Quality control issues: (1) Not everything on the Web is the kind of language you will want to learn/emulate (many native speakers of English write (& type) English badly); (2) non-native speakers of English put up web pages too; (3) different varieties of language suit different genres & purposes, & most search engines are not genre-aware; (4) Search engines such as Google give different results on different days, & have gaps, omissions & inclusions that are hard to explain (due to copyrighted, proprietary technology).  So…use the tools listed below with caution. The positive point about searching the Web? It's one of the few places where many recently-coined words, jargon & slang can be found ‘in print’, & nowadays many innovations in the language, particularly technical or computing-related terms, appear on-line first/only on-line.

KWiCFinder & WebAsCorpus.org Web Concordancer 
(by Bill Fletcher)

KWiCFinder (Key Word in Context Finder) is a free stand-alone Web search concordancer optimized for multilingual searches. It builds on Yahoo! search engine support for complex Boolean searches. Displays the search words in their textual contexts. 

In contrast, Web Concordancer complements Google.com's popular search engine to simplify & accelerate the task of online research.  Both programs automate the process of evaluating documents matching your search terms. Each has strengths & weaknesses which reflect characteristics of the search engines they rely on & the reporting technology they implement.

WebCONC
(by Matthias Hüning)

a tool for generating KWIC-concordances based on webpages (KWIC = Keyword in context). There are two options for defining your corpus: let Google search the relevant webpages for you or specify a set of URLs yourself.

WebCorp

Concordances the Web. You enter a word or phrase, choose options from the menus provided & then press the 'Submit' button. WebCorp works 'on top of' the search engine of your choice, taking the list of URLs returned by that search engine & extracting concordance lines from each of those pages. All of the concordance lines are presented on a single results page, with links to the sites from which they came. * Also does a frequency listing of words on a web page (from your chosen URL).

The Linguist's Search Engine

can be used to perform syntactic searches (done graphically via parse trees) on Internet data. Currently available are a three-million-sentence corpus of sentences from the Internet Archive as well as facilities to build & search corpora based around search results from AltaVista queries.

Spaceless.com's Web Concordancer

takes the text of a web page you specify & creates a list of sentences that contain the search term. Selecting various options can also produce a concordance of all the words that appear on the page in either alphabetical or frequency order.
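To make the idea concrete, here is a minimal Python sketch of the two outputs described above: the sentences that contain a search term, and a word list in alphabetical or frequency order. It is not Spaceless.com's code. It assumes the page text has already been fetched and stripped of HTML, the sentence splitter is deliberately naive, and the names sentences_with and word_list are made up for illustration.

```python
import re
from collections import Counter


def sentences_with(text, term):
    """Return sentences containing `term` as a whole word (case-insensitive)."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


def word_list(text, by="frequency"):
    """Return (word, count) pairs in frequency or alphabetical order."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    if by == "alphabetical":
        return sorted(counts.items())
    # Most frequent first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample text standing in for the text of a web page.
sample = "The corpus is small. A small corpus can still be useful! Is it balanced?"
print(sentences_with(sample, "corpus"))
print(word_list(sample)[:3])
```

A real tool would need a better sentence splitter (abbreviations such as "e.g." will trip this one up) and a tokeniser that handles hyphens & non-English characters.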

GlossaNet

retrieves words or sequences of words from a pre-selected pool of daily newspapers (French, English, Spanish, Italian, Portuguese). If any match occurs, a concordance is sent to the user by email (this is a list of the retrieved occurrences presented in their context (by default, 40 characters to the right & 40 characters to the left) in text or HTML format). You can set up GlossaNet so that concordances are sent to you on a weekly basis.
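For readers who have not seen one, the concordance format described above (the search word with 40 characters of context on either side) can be sketched in a few lines of Python. This is only an illustration of the KWIC layout, not GlossaNet's implementation; the function name kwic, the square brackets around the keyword & the sample text are all my own assumptions.

```python
import re


def kwic(text, keyword, width=40):
    """Return KWIC lines: `width` characters of context either side of each hit."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Invented sample text, standing in for a newspaper article.
sample = ("A corpus is a collection of texts. Linguists search a corpus "
          "with a concordancer to see each word in context.")
for line in kwic(sample, "corpus"):
    print(line)
```

Padding the left context to a fixed width is what makes the keywords line up vertically, which is the main point of the KWIC display.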

HighBeam Library Research

Search an archive of more than 35 million documents from over 3,000 sources -- a vast collection of articles from leading publications, updated daily & going back as far as 20 years. Can restrict to: (1) Documents (from Newspapers, Magazines, Journals, Transcripts & Books), (2) Images & Maps , (3) Reference books (Encyclopedias, Dictionaries & Almanacs)

Grammar Safari

Tips on using the web as a corpus for lexical/grammatical (or lexicogrammatical) searches

World-Wide Web English Corpus  (Leeds)

200,000-word web-text samples of National Englishes, compiled from English-language websites in each WWW national domain

Parsed Corpora/Treebanks of English

This list excludes the parsed historical corpora listed above. For parsed corpora in languages other than English, please see this page

American Printing House for the Blind Treebank (APHB)

A skeleton-parsed corpus of a wide range of English texts. 200,000 wds. See description at the UCREL website.

Anaphoric Treebank

A subsample of the AP corpus (English), annotated to show the reference of pronouns & lexical cohesion. Approximately 100,000 wds. See description at the UCREL website.

Associated Press Treebank (AP)

A skeleton-parsed corpus of American newswire reports. 1m wds. See description at the UCREL website.

Canadian Hansard Treebank

A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 wds. See description at the UCREL website.

Diachronic Corpus of Present-day Spoken English (DCPSE)

800,000 wds (87,188 parse trees) of fully-parsed & annotated spoken British English from the 1950s to 1990s; composed of two 400,000-word samples of spoken English from the London-Lund Corpus (late 1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be consistent with ICE-GB & searchable using ICECUP (Survey of English Usage, University College London).

International Corpus of English (ICE)

ICE-GB, the British component of the International Corpus of English (ICE) Project, is the first of the ICE corpora to be completed. It consists of 1 m wds (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written & 300 spoken English texts. It is fully grammatically annotated & has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program), exploration software designed for parsed corpora.

IBM Manuals Treebank

A skeleton-parsed corpus of computer manuals. 800,000 wds. See description at the UCREL website.

Lancaster-Leeds Treebank

A manually parsed subsample of the LOB corpus of English showing the surface phrase structure of each sentence, prepared by Professor Geoffrey Sampson. Approximately 45,000 wds taken from all the genre categories of the LOB corpus. See description at the UCREL website.

Lancaster Parsed Corpus

(manual is here)

a parsed subcorpus of the LOB Corpus of English, parsed by computer & manually corrected by researchers (Roger Garside, Geoffrey Leech & Tamas Varadi). Available through ICAME. It is a treebank of over 133,000 wds, with material drawn from each of the 15 categories of the LOB Corpus. Each sentence is annotated with a phrase-structure parse in the form of labelled bracketing. The labels mark the boundaries of sentence, clause, phrase & coordinated word constituents. The word tags used in the tagged version of the LOB Corpus are also part of the annotation of the Lancaster Parsed Corpus.

Penn Treebank (I & II)

The Penn Treebank Project annotates naturally-occurring text for linguistic structure -- skeletal parses showing rough syntactic & semantic information (a bank of linguistic trees) in addition to part-of-speech tags, & for the Switchboard corpus of telephone conversations, also dysfluency annotation. The original CD-ROM contains over 1.6 m wds of hand-parsed material from the Dow Jones News Service, plus an additional 1 m wds tagged for part of speech; the first fully parsed version of the Brown Corpus, completely retagged using the Penn Treebank tag set; tagged & parsed data from Dept of Energy abstracts, IBM computer manuals, MUC-3 & ATIS. 

Release 2 CDROM features the new Penn Treebank II bracketing style, & contains, among other files, 1 m wds of 1989 Wall Street Journal material annotated in Treebank II style.

To search the corpus for parsed structures, try the Penn Treebank Online (you'll need to know how to use the software 'tgrep') or obtain TGrep2 for stand-alone machines (Linux + source code for other platforms).

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations.   It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. Contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies & errors in the original annotation. Can also be searched with Douglas Rohde's TGrep2, version 1.15 or higher.
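If you just want to see what a bracketed parse of the Penn Treebank kind looks like to a program, the Python sketch below reads one tree into nested lists & pulls out every constituent with a given label. It is a toy stand-in for illustration, not tgrep or TGrep2, & it makes several assumptions: every opening bracket is followed directly by a label, the input is well formed, & the example sentence is invented rather than taken from the Treebank. All function names are my own.

```python
def parse_brackets(text):
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        node = [tokens[pos]]  # the first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return node

    return read_node()


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def find_label(node, label):
    """Yield every constituent whose label equals `label`."""
    if node[0] == label:
        yield node
    for child in node[1:]:
        if isinstance(child, list):
            yield from find_label(child, label)


# Invented example sentence, not taken from the Treebank itself.
tree = parse_brackets(
    "(S (NP (DT the) (NN corpus)) (VP (VBZ contains) (NP (NNS trees))))")
print(leaves(tree))
print([" ".join(leaves(np)) for np in find_label(tree, "NP")])
```

Distributed treebank files may wrap each tree in an extra unlabelled pair of brackets & include trace elements, so a sketch like this would need adapting before it could read real data.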

Polytechnic of Wales Corpus (POW)

consists of approximately 65,000 wds in 11,396 (sometimes very long) lines, each containing a parse tree.

SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English)

130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 wds each from four Brown genre categories) syntactically analysed (treebanked).


 

D-I-Y (do-it-yourself) Corpora: sources of data for building your own corpus

(see also the text archive sites & the audio/visual archive sites)

ABU: la Bibliothèque Universelle  (French)

Free access since 1993 to the full text of public-domain French-language works on the Internet. To access the texts, consult the catalogue of authors or the catalogue of texts. You can also run word searches across the whole corpus. Several dictionaries are also available.

Bartleby.com: Great Books Online  

Enormously useful site covering much of the same ground as the OTA (but, refreshingly, without the considerable bother of endless copyright restrictions & legal threats). Besides plain texts of prose fiction & non-fiction, poetry & drama, the site includes: an encyclopaedia, gazetteer, world factbook, dictionary, thesaurus, style guides, books of quotations.

Bibliomania

more than 2000 free texts (mostly classics), study guides & reference resources (more for the literary/humanities scholar, but worth a look)

Cyber Classics

more than 200 titles available

EServer

42 collections on such diverse topics as contemporary art, race, Internet studies, sexuality, drama, design, multimedia, accessible publishing & current political & social issues. Also includes hypertexts, audio & video recordings.

Essays.se

a digital resource which enables you to search and download thousands of English-language university essays and theses from Sweden.

The English Server's Fiction collection

works of fiction & about fiction. Collection of texts in the public domain, classified into: Late Antique & Medieval Texts, Renaissance & Early Modern Texts, Modern Fiction, Modern Poetry, Historical Documents, Religious Texts & Other Texts.

Great Books

searchable (with basic concordancing) & browsable texts of English classics (More for the literary/humanities scholar, but worth a look. Whole texts not downloadable in one go.)

Hansard

Parliamentary Proceedings from: the United Kingdom (UK) ;  Canada  ; Australia  ;  New Zealand

(Not really 'corpora' in the sense of fixed, formatted texts, but collections of transcripts)

The sites also have minutes of meetings, bills, reports, bulletins, & other official publications.

Internet Classics Archive

441 works of classical literature by 59 different authors, including user-driven commentary & "reader's choice" Web sites. Mainly Greco-Roman works (some Chinese & Persian), all in English translation.

Movie Script sites

Drew's Scripts-O-Rama / The Movie Script Compendium / Script Central

Movie Subtitles

Watch out for typos and mis-translations

Newspaper sites  for English  (Sampler)

(broadsheets & tabloids)

(You will, of course, have your own links to hundreds of other newspapers, other varieties of English & other languages.)

British Broadsheets:

The Guardian, The Independent, The Telegraph, The Times, The Evening Standard, The Observer, The Sunday Times, The Scotsman, The Herald, The Irish Times

British Tabloids:

The Mirror, The Sun, Daily/Sunday Express, News of the World, The Daily Star, The Sunday Mirror

 

[* More newspaper & magazine links may be found here ]

 

American Newspapers:

The Washington Post, USA Today, The New York Times

Newspaper sites  for Other Languages

Try this  searchable database of Newspapers, Magazines & other media (radio, TV) on the Internet (Kidon Media Link, a meta-site with listings by language & country) or try this site (maintained by IMS Stuttgart).

or the selection below:

French: Le Monde

German: Die Zeit, Die Welt, Süddeutsche Zeitung

Russian: Nezavissimaya Gazeta

Spanish: ABC, El Pais, El Mundo

Renascence Editions (Oregon)

an online repository of works printed in English between 1477 & 1799; includes Shakespeare, Wordsworth, Bacon, Bunyan, Donne, Hume, Hobbes, Milton, Spenser

SketchEngine

a fee-based Corpus Query System incorporating word sketches, grammatical relations, & a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical & collocational behaviour. A 30-day free trial account is available. Web-based service using standard browsers: no software installation required.

Available Resources: (1) Pre-loaded corpora (60M-1.5B wds) for Chinese, English, French, German, Italian, Japanese, Portuguese, Spanish, & Slovene; (2) WebBootCaT (for building your own instant corpus from web pages, then extracting keywords, specialist terminology, etc.); (3) CorpusBuilder (upload & install your own corpora).

TV Transcripts Database

Transcripts of popular US television shows + some movie scripts too.

Transcripts of Spoken News reportage, Debates, Interviews

CNN Transcripts

On-Line Books Page

a directory of books that can be freely read on the Internet. The On-Line Books Page is now hosted by the University of Pennsylvania Library.

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19

about 810,000 Reuters English Language News stories, covering the period 20 August 1996 - 19 August 1997. Format: 806,791 XML files in NewsML format; 365 zip files, one per day, over 2 CDs (2.5 GB when uncompressed); the number of stories per day is not constant, but on weekdays there are an average of 2,880 per day & 480 on weekends. Most stories are around 6-7 paragraphs, 1000 wds. Forthcoming: news stories in other languages, covering the same time period.
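The daily averages quoted above can be checked against the total file count with a little date arithmetic. The Python sketch below is only a rough consistency check: it treats the weekday & weekend averages as exact & ignores public holidays, so some difference from the real total is to be expected.

```python
from datetime import date, timedelta

START = date(1996, 8, 20)   # first day covered, from the description
END = date(1997, 8, 19)     # last day covered, from the description
WEEKDAY_AVG = 2880          # average stories per weekday, from the description
WEEKEND_AVG = 480           # average stories per weekend day, from the description

days = [START + timedelta(days=n) for n in range((END - START).days + 1)]
weekdays = sum(1 for d in days if d.weekday() < 5)   # Monday=0 ... Friday=4
weekend_days = len(days) - weekdays

estimate = weekdays * WEEKDAY_AVG + weekend_days * WEEKEND_AVG
print(len(days), weekdays, weekend_days, estimate)
```

The period covers 365 days, matching the 365 zip files, & the estimate should land close to the 806,791 files reported.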

UK Parliament home page

on-line minutes of meetings (Lords & Commons); bills, reports, bulletins; Hansard & other publications.

United Nations (UN) web site

Good source for getting parallel texts (for a limited range of topics & genres) in Arabic, English, Chinese, French, Russian & Spanish.

EUR-LEX

Parallel texts concerning European law in several EU languages (Spanish, Danish, German, Greek, English, French, Italian, Dutch, Portuguese, Finnish & Swedish).

OR: Click on this link for an interminable list of free-text links (courtesy of http://www.bookwebsites.com ) or try the Directory of Electronic Text Centers Worldwide or the Library of Congress Internet Resource Page

Multimedia Corpora & Texts with Audio/Visual accompaniments

(includes historical digital library initiatives. Not all are structured/formatted as other standardized text corpora.)

Online audio recordings: UC Berkeley lectures & events

(not strictly a corpus, but...) Audio files of notable lectures & events held at UC Berkeley: interviews & lectures by famous critics, authors & cultural historians, including Aldous Huxley, James Baldwin, Malcolm X, Michel Foucault, Noam Chomsky, Umberto Eco & Claude Lévi-Strauss.

American Rhetoric/ Online Speech Bank

Index to some 400+ active links to 5000+ full text, audio & video (streaming) versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, other recorded media events, & a declaration or two.

BACKBONE (European languages, incl. English) On-line Search here.

BACKBONE is a European project; web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English as a Lingua Franca (ELF).

Conversations with History

a collection of interviews (edited/not-faithful-to-the-original transcripts + streaming videos (for most interviews)) with distinguished people from all over the world about their lives & their work (diplomats, statesmen, & soldiers; economists & political analysts; scientists & historians; writers & foreign correspondents; activists & artists). At the heart of each interview is a focus on individuals & ideas that make a difference. The series is produced at the Institute of International Studies at the University of California at Berkeley. Conceived in 1982 as a way to capture & preserve through conversation & technology the intellectual ferment of our times, Conversations with History includes over 300 interviews.

CUCASE

(City University Corpus of Academic Spoken English; forthcoming)

A c. 2-million-word (multimedia) corpus currently being compiled (Jan 2008-Sept 2009, initially) by David Y.W. Lee. Will mirror the design of MICASE & BASE; will contain English spoken at a Hong Kong university (native & non-native speakers).

GeM Corpus

(Genre & Multimodality)

The GeM project ran from 1999 until 2002 & was concerned with developing the first XML annotation scheme for multilayered description of illustrated documents with complex layout. The GeM framework allows layout, rhetorical structure, content & language of different text types to be represented & interrogated. A follow-up project is in the planning stage. Output: 

  • An annotated corpus of newspapers, illustrated bird guides, instruction manuals & websites
  • An XML annotation scheme for illustrated documents
  • A prototype generator (implemented with XSLT) that produces laid-out pages expressed in terms of XSL:FO.

Historical Voices

"The purpose of Historical Voices is to create a significant, fully searchable online database of spoken word collections spanning the 20th century - the first large-scale repository of its kind. Historical Voices will both provide storage for these digital holdings & display public galleries that cover a variety of interests & topics." Includes synchronised text-&-audio RealMedia presentations (see, e.g., the Flint Sit-Down Strike). Transcripts are not formatted like standardised corpora, but have the advantage of being linked to sound recordings.

Gesture Database (Max Planck Institute in Nijmegen)

consists of video recordings (no accompanying transcript/corpus texts, as far as I know) of speech & the gestures that spontaneously accompany it, plus annotations of the gesture & speech in each recording. The recordings were made in different cultures, including the Netherlands, Italy, the USA, Japan, Turkey, Australian Aboriginal communities, Mexico, Belize, & Ghana. The speech events recorded are ones that elicit spontaneous gestures, such as narration of traditional stories & autobiographical stories, description of the local environment, & route directions.

MICASE (Michigan Corpus of Academic Spoken English; American English)

See fuller description above. Selected audio recordings of the original speech events are available here (streaming RealAudio), or, for bona fide researchers, in other formats by special arrangement.

Multimedia Adult English Learner Corpus (MAELC)

a database of video of classroom interaction & associated written materials collected from university-level Intensive English Language Program classes at Portland State University; adult ESL classes from beginning to upper-intermediate proficiency; more than 3,600 hours of classroom interaction recorded by six cameras and multiple microphones. Through the ClassAction Query program, users can search the database for clips of media illustrating particular points of second language acquisition or pedagogy. Query returns a playlist of matching clips that can be viewed and refined using the Toolbox program. Playlists made by Query or Toolbox can be viewed with the ClassAction Viewer program which is freely downloadable as a web browser plug-in.

Multimedia Movie Corpus on the Web

Read about this corpus of American movies created using subtitles in five languages (English, French, German, Italian & Spanish)

New South Voices

a project of the Special Collections Unit, J. Murrey Atkins Library, Univ of North Carolina at Charlotte. Provides online access to a unique collection of over 800 interviews, narratives & conversations collected by UNC Charlotte faculty & students & several community organizations documenting the Charlotte region in the 20th century (transcriptions, audio & video files & supplementary materials (primarily photographs)). The interviews cover a wide range of historical subjects, from African American churches & Billy Graham crusades to women's basketball & World War II. Other interviews, narratives & conversations document the experiences & language of new arrivals to the area. Part of Project MORE

Wellington Language in the Workplace Project Corpus

see entry under "Spoken Corpora" above.

 

Dictionary Data/Lexicons

See Software page here.

 

 

* Freely-accessible, On-line Corpora of English

Many language teachers & learners just want to know one simple thing: where are the free, web-accessible corpora that we can search right away, without any fuss? There are not many! Here are the major ones. I have left out literary works, newspaper collections & blogs because you can easily find these yourselves & there are millions of them out there.

1.   British National Corpus (BNC) [100m wds; 1990s British English, spoken & written]: There are many different web sites giving free (but limited) access to the corpus--limited due to copyright: i.e. you cannot expand the concordance context to read more of the surrounding text, & you cannot read the entire source texts (only snippets).

  • JustTheWord: The most accessible site for students (& most pedagogically useful) because it straightaway gives you a list of collocations for your search word/phrase, instead of concordances; results are categorized by POS-based patterns & by approximate sense clusters, & bar graphs give an indication of how common each combination is. Results are based on an 80m-word subset of the BNC.
  • BYU-BNC (formerly called "VIEW"): allows word-, phrase- or part-of-speech-based searches of the BNC with genre-restrictions; allows wildcards & "fuzzy matches"; can list collocations. Requires registration (free) after about 20 searches.
  • Phrases in English (PIE; make sure you're on the "N-grams" page)-- allows word/phrase searches of the BNC, returning a maximum of 50 random hits (enter words or phrases (up to 8 words), one word per box).
  • BNC online: There is almost no good reason to use this site because it has so many limitations: limited to max. of 50 random hits; limited left/right contexts; only sentence view (the search term is not highlighted & not in the center of the screen); cannot search by POS alone; cannot restrict to specific genres; etc. [Hint for EFL users: Be aware of which genres your concordance examples come from (e.g., teen magazines & informal speech may not always provide the best models of language for writing academic essays).]

2.   Corpus of Contemporary American English (COCA): [360 m wds; c. 150,000 texts, 20 m wds each year from 1990-2007.] For each year (& therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, & academic journals. Searchable on-line only; the texts themselves are not available for download.

3.   TIME Magazine Corpus: [100 m words American English, 1923-present; More than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.] Nice search interface (essentially the same as that of the BYU-BNC and COCA).

4.   MICASE [1.7m wds of current, spoken academic American English, as produced by faculty, students & staff in formal & informal settings around the university]: fully searchable & browseable via a custom web interface (no limits), & now has selected playable sound files to accompany some transcripts. Homepage is here.

5.   Word Neighbors [by John Milton. Corpora = a mix of spoken & written English genres (user-selectable); some texts are from the BNC]: Quite similar to JustTheWord in terms of giving lists of collocational patterns first (which are then linked to actual corpus examples), but the text database is bigger (not limited to BNC texts) and you can restrict by medium (spoken/written) and by specific genres. It's a fairly comprehensive learning environment: the collocational/colligational patterns and corpus samples are integrated with on-line dictionaries, thesauri, encyclopaedia, Chinese translations, "Answers.com", JustTheWord, and even audio/video examples containing the phrase/pattern.

6.   Business Letters Corpus [U.S. & U.K. letters, 1m wds as of 1 March 2000; alternative site here]

7.   LOB & Brown [1m wds each; 1960s written British English (LOB) & American English (Brown)]: The Brown Corpus is searchable & browseable via LDC On-line here (no limits). The Brown & LOB are both searchable via the Virtual Language Centre (VLC, or alternative edict site); limited to 2001 hits, without any warning of this maximum. Allows simple searches as well as searches for [word + contextual/associated word] (i.e. only instances where the search word occurs near another specified word). The VLC/edict sites also have other collections of text -- see here for a description & breakdown of these more specialized corpora.

8.   Hong Kong Financial Services Corpus (HKFSC) [7.6 m wds; spoken & written texts collected with the help of professional associations & private organisations from across the financial services sector in Hong Kong: e.g., insurance/investment product descriptions; agreements; media releases; ordinances; procedures; prospectuses; rules; standards; speeches]

9.   CorpusEye: Search various corpora (for many languages). The English corpora include texts culled from Wikipedia and the Enron e-mails.

10. OPUS : [Computer manuals & European parliament speeches] an open-source collection of freely searchable/downloadable parallel corpora (texts with sentence-aligned translations). Not terribly useful texts unless you're teaching Technical English or researching parliamentary speeches.

11. VOA's Special English Program Scripts (by Charles Kelly) [c.14K wds; sentence-view concordances of scripts from Voice of America's "Special English" broadcasts, which use a limited vocab of 1,500 wds (not necessarily the "easiest" English words, but most are simple)] The scripts represent a kind of "written-to-be-spoken" English; useful for less-proficient English learners.

12. CorTec: a bilingual (English & Portuguese) comparable corpus of technical language (linked to the COMET project) in 5 areas: Cooking, Contracts, Computing, Environment & Hypertension. The texts themselves cannot be downloaded, but can be searched via the web tools provided: concordancer, wordlist & N-gram extractor.

13. SACODEYL includes a small corpus of English language teenager talk. Contains structured video interviews with students 13-18 yrs old (seven European languages in total). Annotated and enriched for language learning purposes. Free multimedia access (videos).

14. BACKBONE (European languages, incl. English): a European project offering web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish, as well as non-native speakers of English as a Lingua Franca (ELF).
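
Several of the interfaces listed above offer the same few operations, among them a search for a word occurring near another specified word (as in the VLC) and an N-gram list (as in CorTec). For readers who end up with plain-text files of their own, here is a minimal Python sketch of those two operations. The function names, the crude tokenisation rule and the sample sentences are illustrative assumptions only; this is not how any of the sites above is implemented.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude assumption: a token is a run of letters, digits or apostrophes, lower-cased.
    return re.findall(r"[a-z0-9']+", text.lower())

def near(tokens, word, context_word, window=3):
    """Return positions of `word` that have `context_word` within `window` tokens on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo = max(0, i - window)
        neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        if context_word in neighbours:
            hits.append(i)
    return hits

def ngrams(tokens, n=2):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample text, invented for this sketch.
sample = "The bank raised interest rates. She sat on the river bank. The bank cut rates again."
toks = tokenize(sample)
print(near(toks, "bank", "rates"))      # positions of "bank" with "rates" nearby
print(ngrams(toks, 2).most_common(3))   # the most frequent two-word sequences
```

Note that the window here ignores sentence boundaries, so a wide window will pick up associated words from neighbouring sentences; a more careful version would split the text into sentences first.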

=============

There are many other corpora which are free, but not on-line, including most of the ICE corpora (just sign a licence & download the files). If you're interested in non-native English, the PICLE Corpus (argumentative essays & literature exam scripts by Polish learners of English) is searchable on-line.

=============

·      See also the section on Using the Web as a corpus (many of these web concordancing search engines allow you to restrict searches to particular countries, institutions, URLs/web sites, etc., thus reducing the amount of junk/unwanted hits), & scrutinize the above section on D-I-Y Corpora for newspapers, out-of-copyright literary texts, & Bibles in various languages. Most of these are not 'corpora' in the strict sense of being structured & formatted according to contemporary corpus standards, but are starting points if you want to have your own free texts to run concordances on.

·      XX WordbanksOnline (from the Bank of English) [NO LONGER FREE?]: search a 56-million-word subset of the Bank of English (sub-dividable into 3 broad categories); also usefully allows you to specify a following word by part of speech, & gives collocations (limited to 40 hits, & the total number of hits is not reported :-(  )

** I've left out something? If you know of other web-searchable corpora, do let me know.

Footnotes

* Question: Ok, so how do I actually get my hands on these hundreds of corpora, & how can I search them?

   Short Answer: Depends on the corpus. If the corpus you want is not publicly available & not searchable on-line, then you'll have to get the necessary licence, pay any fees, & use the appropriate concordancer/tool for the task (see "Software" section), bearing in mind the mark-up scheme, annotation tags, etc. used in that corpus. Some concordancers (e.g., WordSmith) can ignore mark-up.
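
   To illustrate what "ignoring mark-up" amounts to in practice, here is a minimal Python sketch of a key-word-in-context (KWIC) search that first strips anything in angle brackets. The tag format in the sample (word elements with a "pos" attribute), the function names and the regular expressions are assumptions made for illustration; real corpora differ in their mark-up, and this is not a description of how WordSmith or any other concordancer works internally.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumption: mark-up is anything between < and >

def strip_markup(text):
    # Delete the tags, then collapse any whitespace left behind.
    return re.sub(r"\s+", " ", TAG.sub("", text)).strip()

def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword between fixed-width left and right contexts."""
    lines = []
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(0), right.ljust(width)))
    return lines

# Hypothetical tagged sentence, invented for this sketch.
sample = '<s n="1"><w pos="AT0">The </w><w pos="NN1">corpus </w><w pos="VBZ">is </w><w pos="AJ0">small</w>.</s>'
plain = strip_markup(sample)
for line in kwic(plain, "corpus", width=10):
    print(line)
```

   Stripping tags throws away the annotation, of course; if you want to search by part of speech you need a tool (or a script) that reads the tags rather than discarding them.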

* Question: What are the differences among the terms "corpus", "collection" & "data archive"?

    Some definitions (from Atkins, S., Clear, J., & Ostler, N. (1992) Corpus design criteria. Literary & Linguistic Computing, 7(1), pp.1-16)

·       Archive: a repository of readable electronic texts not linked in any coordinated way, e.g. the Oxford Text Archive

·       Electronic Text Library (or ETL, Fr. 'textothèque'): a collection of electronic texts in standardized format with certain conventions relating to content, etc., but without rigorous selectional constraints.

·       Corpus: a subset of an ETL, built according to explicit design criteria for a specific purpose, e.g. the Corpus Révolutionnaire (Bibliothèque Beaubourg, Paris), the Cobuild Corpus, the Longman/Lancaster corpus, the Oxford Pilot corpus.


If you need help with file formats for some of the downloads, [click here]

Have you found this web site/page useful? Do let me know if you want to encourage me to keep updating the site. If you have a new corpus or resource (or something I've missed) for me to link to, please drop me a line.



Last Updated: 30 April 2010 16:53:20

© David Lee