The permanent URL for this page is: http://tiny.cc/bncindex
Resources for Sub-corpus/Genre Analysis
The BNC Index (for the BNCWorld Edition)
The BNC Index spreadsheet I have prepared is similar to the plain text ones prepared by Adam Kilgarriff (available here), which I have benefited from and found rather useful. However, Kilgarriff’s list did not contain all the details I needed for compiling my own sub-corpus (e.g. author type, author age, author sex, audience type, audience sex, section of text sampled, speaker age, speaker sex, region, etc.). Sebastian Hoffmann’s files (which used to be available here) were useful too, in a complementary way, but they do not include (i) keywords or (ii) the full bibliographical details of files. The “bncfinder.dat” file which comes with the standard distribution of the BNC has most of the header information, but uses highly abbreviated numeric codes, and also does not include bibliographical information about the files. The BNC Version 1 release also contains a large number of errors (affecting hundreds of files and millions of words of text).
What I did, therefore, was consolidate the information from the above 3 sources and, in most cases, scan the contents of files and check various classification details. In addition, I have included:
(i) BNC-supplied Keywords (as entered by the BNC compilers);
(ii) COPAC Keywords for published non-fiction texts (these were manually cut and pasted);
(iii) full bibliographical details (including title, year of publication and publisher for written texts, and number of participants for spoken files). ‘Year of publication’ and ‘Author Age’ have been extensively corrected and revised, based on my COPAC searches;
(iv) an extra level of text categorisation, Genre, where each text is assigned to one of the 70 genres/sub-genres (24 spoken and 46 written) developed for the purposes of this Index (see documentation included in the zipped file below, or view the pdf-format version here; for a fuller description of and justification for the genre categorisation scheme, see my LLT paper available on-line here).
This Index should be a comprehensive (‘one-stop’!), user-friendly databank on what is in the BNC. Note, however, that if you want to deal with sub-text-level phenomena, e.g. the speaker-related extra-linguistic variables of dialect, sex, age, social class, education, etc., you’ll need to use SGML-aware software like SARA.
The BNC Index aims to be easy-to-use and ‘browseable’, and I have used clear abbreviations for most of the fields instead of cryptic code numbers. For example, the medium of texts is a separate field, with labels such as ‘m_pub’ (for ‘miscellaneous published’) instead of a number, and domains are likewise indicated by abbreviated strings (e.g. ‘W_ac_soc_science’, ‘W_leisure’), again in a separate field, to ease searching.
Reason for creating this new resource: I needed all the available information I could get on the BNC texts in order to choose my own sub-set of the BNC for my research. I needed, for example, to separate children’s fiction from adult fiction, and sometimes details like the author’s name or the name of the publisher can give you more information about how to classify a text (e.g. to decide if something is ‘academic science’ or a popularisation of science). SARA and BNCWeb allow subcorpus queries. People using stand-alone PC concordancers can now specify subcorpora at the file level by first using the BNC Index to obtain relevant file IDs.
1. Use the BNC Index to get the file IDs of the BNC files which match your criteria. These file IDs can then be fed into concordancers such as WordSmith or MonoConc which can use a list of filenames to specify a sub-corpus to which future queries are to be restricted. Note that individual files can always be deleted from the output list if so desired, so users do not have to accept the classification decisions wholesale but can vet individual texts before allowing them into a sub-corpus. Indeed, I would strongly advise checking each file individually where time permits rather than relying totally on any of the classifications (domain, genre, spontaneity, etc.).
2. Please read the documentation included and use with caution
Entries which I have changed unilaterally (i.e. those which are different from the official BNC Version 1 file header entries, or, in some cases, from even the BNC World Edition headers) are marked in red in the spreadsheet (e.g. see entry for file JSE, where the official BNC Version 1 header did not have a domain classification, whereas in the spreadsheet it is classified as 'business'). Unfortunately, there is no way of telling whether something is different with respect to version 1 or the World Edition (I did not make a distinction), so red just signals a change.
Some of these changes have been incorporated into the official World Edition (e.g. many newspaper files are now no longer classified as "domain=imaginative"), but some have not (e.g. file HJL is a business letter soliciting donations for a cause (saving badgers), but is wrongly classified as "domain=imaginative" in both BNC v.1 and the World Edition). Hence, if getting the domain (or genre) right is important for your research, my advice is that you should always use the latest BNC Index posted on this web site, rather than relying on the official headers.
Many (but not all) mistakes in the original BNC text categorisations (e.g. in terms of Domain, Author Age, etc.) have been corrected, but end-users will undoubtedly find many more. Genre categorisations are also subject to change (please e-mail me if you spot any text which has definitely been categorised wrongly). The good thing about this BNC Index project is that mistakes can be corrected instantly and updated in the spreadsheet. So please feel free to report to me if you spot anything at all which needs correcting.
3. Do the referencing: If you make use of the BNC Index, you should acknowledge the fact by referring to the following article: Lee, David Y.W. 2001. Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, Vol.5(3): 37-72. [Available at: http://llt.msu.edu/vol5num3/lee/default.html]
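As a rough illustration of the workflow described above (use the Index to obtain file IDs, vet them, then hand the list to a concordancer), here is a minimal Python sketch. It assumes the Index has been exported as tab-delimited text; the column names ("FileID", "Genre", "Medium"), the file IDs (X01 to X04) and the value "periodical" are all invented for the example and are not the spreadsheet's actual headers or contents.

```python
import csv
import io

# Invented sample rows standing in for a tab-delimited export of the Index.
# Column names and file IDs are hypothetical, not the real spreadsheet's.
SAMPLE = (
    "FileID\tGenre\tMedium\n"
    "X01\tW_ac_soc_science\tperiodical\n"
    "X02\tW_commerce\tperiodical\n"
    "X03\tW_ac_soc_science\tperiodical\n"
    "X04\tW_ac_soc_science\tm_pub\n"
)


def select_file_ids(rows, criteria, exclude=()):
    """Return the IDs of rows matching every field=value pair in `criteria`,
    leaving out any IDs the user has vetted and rejected (`exclude`)."""
    ids = []
    for row in rows:
        matches = all(row.get(field) == value for field, value in criteria.items())
        if matches and row["FileID"] not in exclude:
            ids.append(row["FileID"])
    return ids


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# Combine two criteria, then drop one file after checking it individually.
wanted = {"Genre": "W_ac_soc_science", "Medium": "periodical"}
ids = select_file_ids(rows, wanted, exclude={"X03"})

# One ID per line is a common shape for a file list; the exact format a
# given concordancer expects should be checked in its own documentation.
print("\n".join(ids))
```

The `exclude` argument is there to reflect the advice above: the classifications give a first cut, and individual files can still be removed after inspection.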
- 22 Dec 2003:
HHY, HJ0, HJ1, and HJ2 were wrongly classified as non-academic. In fact, each of these files contains a collection of successful ESRC grant proposal abstracts, so they may be considered academic English. They have thus been reclassified as W_ac_soc_science.
- 24 Nov 2003:
One change: file EED was wrongly assigned to the genre W_advert. This was because the first few texts were pamphlets by British Rail on various services. In fact, this BNC file is a composite of many different types of text, including brief customer-information leaflets/brochures, "Roald Dahl's Guide to Railway Safety" (written for children/teens), and fact sheets (more formal English) from
- 16 June 2003:
One change: file KM6 has a wrong description in the title field. It reads "
- 30 Oct 2002: [More BNC World Edition errors spotted by me]
- Changed 'Interaction Type' from 'monologue' to 'dialogue' for the following files: HEW and KJT ('live' sports commentaries). They predominantly consist of casual chat and banter between the commentators.
- HPS probably does not belong to the genre 'W_essay_univ' as it is unlikely to have been written by a university student. I have now placed it under 'W_misc' because it is unpublished. Alternatively, it may be considered under 'W_non_ac_humanities
Other earlier changes are logged in the BNCWIndexNotes document which accompanies the BNC Index.
[Other updates: 4/5/02 (Changed explanation of the genre labels 'S_speech_scripted' and 'S_speech_unscripted' to make it clearer that 'speech' does not necessarily refer to prepared monologues, but to any spoken discourse (whether dialogue or monologue)); 25/11/01 (Changed all documentation to PDF format); 20 Oct 2001 (spelling errors in ‘Keywords’ field [made by BNC compilers] now corrected in spreadsheet); 26/6/2001 (minor changes to documentation only)]
Index for first release/BNC version 1 (*not* recommended, due to large no. of classification and tagging errors in this version of the corpus), available in MS Excel format: BNC_INDEX.ZIP (old, BNC version 1; this will no longer be updated)
The advantage of putting all this information in MS Excel format is that there is a quick way of displaying only the texts you want through the use of the ‘Autofilter’ (go to Data, Filter, Autofilter) or ‘Custom Filter’ (for more complex filtering using wildcards). Another bonus with the Excel format is that you can easily copy and paste the lines displayed into your word processor for referencing, retaining the tabular format.
· Autofilter: With the Autofilter switched on, every field (column) on the top line now has a drop-list button, which you can use to filter the view to only the texts you want (e.g. by selecting the genre ‘W_commerce’, all the non-commerce, non-written files will be hidden from view). Fields are combinable (so you can first restrict the display to only ‘social science’ texts, then further restrict this to only ‘periodicals’).
· Custom Filter: (requires Autofilter to be turned on first) To make more advanced searches, and to make use of the fields with hierarchical codes (e.g. to search for all lectures (S_lect*) instead of specific sub-genres like S_lect_soc_science or S_lect_commerce), choose ‘Custom’ filter from the relevant drop-list menu for the field you want to query. This kind of advanced search is easy to learn and relatively fast. I have used it, for example, to search for all BNC files with the word “legal” in the Title column/field. [Steps: Choose ‘Custom’ from the drop-list in the Title column, specify “=” or “contains” (depending on your version of Excel), and type “*legal*” (or just “legal” if you chose “contains”) in the neighbouring box.]
The Excel file I have made available for download was saved with the ‘Autofilter’ switched on, so you will see the drop-list arrows on the first line of the spreadsheet upon opening it.
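For anyone working from a plain-text export rather than in Excel, the two kinds of filtering described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the rows, file IDs and titles below are invented, the column names are hypothetical, and Python's `fnmatch` wildcards are used as a stand-in for Excel's (they also treat square brackets specially, which Excel does not). Ignoring case is a convenience choice made here.

```python
from fnmatch import fnmatchcase

# Invented rows; "FileID", "Genre" and "Title" are hypothetical column names.
rows = [
    {"FileID": "X01", "Genre": "S_lect_soc_science", "Title": "Lecture on legal history"},
    {"FileID": "X02", "Genre": "S_lect_commerce", "Title": "Accounting lecture"},
    {"FileID": "X03", "Genre": "W_commerce", "Title": "Legal notes for small firms"},
]


def custom_filter(rows, field, pattern):
    """Keep rows whose `field` matches a wildcard pattern (* and ?), ignoring case."""
    lowered = pattern.lower()
    return [row for row in rows if fnmatchcase(row[field].lower(), lowered)]


# Hierarchical code: every lecture sub-genre at once.
lectures = custom_filter(rows, "Genre", "S_lect*")

# "Contains" search on the Title field.
legal = custom_filter(rows, "Title", "*legal*")

# Filters are combinable: apply the second to the output of the first.
legal_lectures = custom_filter(lectures, "Title", "*legal*")

print([row["FileID"] for row in lectures])        # ['X01', 'X02']
print([row["FileID"] for row in legal])           # ['X01', 'X03']
print([row["FileID"] for row in legal_lectures])  # ['X01']
```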
The bibliographical spreadsheets below for the ICE-GB corpus were compiled by combining the information available in the separate ‘internals’ and ‘sources’ files which came on the CD-ROM. I found it really annoying that the bibliographical and detailed information about the files was not only in separate files, but also in a format which is not user-friendly or conducive to casual browsing. Also, some of the fields contained rather non-useful information, and very many of the fields were redundant or wasteful in the sense that they contained information which was relevant for only a small number of the texts (e.g. ‘Channel’ for the spoken texts only applied to files which were captured from radio or TV, but including it as a separate field meant an extra column for which all the other files received a blank entry!). I therefore examined the separate files and carefully consolidated all the information into one big database, conflating some fields so that some of them can now contain multiple types of information which do not clash (e.g. Channel, Subject and Audience were mutually exclusive categories in the original files, so the 3 fields could be merged into 1 without losing any information). By restricting the number of fields included and by careful formatting, I have also managed to ensure that the database fits nicely on an A4 page (landscape orientation) and can thus be read easily when printed out.
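The idea of conflating mutually exclusive fields can be sketched in Python as follows. This is an illustration only, not the procedure actually used for the spreadsheets: the record layout, the text IDs ("T1", "T2"), the sample values, the merged column name "Details" and the "label: value" format are all assumptions made for the example.

```python
def merge_exclusive(record, fields, merged_name):
    """Collapse several fields, of which at most one is filled in,
    into a single field holding 'label: value' (or '' if none is filled)."""
    filled = [(name, record[name]) for name in fields if record.get(name)]
    if len(filled) > 1:
        # The merge is only lossless if the fields really are mutually exclusive.
        raise ValueError(f"fields are not mutually exclusive: {filled}")
    merged = {key: value for key, value in record.items() if key not in fields}
    merged[merged_name] = f"{filled[0][0]}: {filled[0][1]}" if filled else ""
    return merged


FIELDS = ("Channel", "Subject", "Audience")

# Hypothetical records: each has a value in only one of the three fields.
broadcast = {"ID": "T1", "Channel": "radio", "Subject": "", "Audience": ""}
lecture = {"ID": "T2", "Channel": "", "Subject": "law", "Audience": ""}

print(merge_exclusive(broadcast, FIELDS, "Details"))  # {'ID': 'T1', 'Details': 'Channel: radio'}
print(merge_exclusive(lecture, FIELDS, "Details"))    # {'ID': 'T2', 'Details': 'Subject: law'}
```

Keeping the original field name as a label inside the merged value is one way of making sure no information is lost; raising an error when two fields are filled guards against merging categories that turn out not to be exclusive after all.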
Download the ICE-GB Bibliographical Spreadsheets
MS Excel format: ssources.zip (110,593 bytes)
Tab-delimited, plain text format: ice_s.zip (39,424 bytes)
MS Excel format: wsources.zip (115,712 bytes)
Tab-delimited, plain text format: ice_w.zip (43,003 bytes)
A comprehensive web site of annotated links for corpus-based linguistics
This web site started as a resource for the students learning about corpus-based methods, but was then extensively expanded and has since been visited by people from all over the world. There are already a number of sites out there with similar content, but here are a few key things about my site:
· it's up-to-date (I've checked practically all the links to make sure there are no dead ones, except those which are permanently dead, i.e. no known new URL!)
· it focuses on links for linguists and lg teachers (not NLP/lg engineering)
· my listings are mostly annotated (i.e. have descriptions of the links, so you don't always have to click a link to find out what it's about)
· it brings together in one place all the information on corpora, software tools, bibliographies, references, electronic papers, mailing lists, on-line courses, conferences, etc. that people doing corpus work will possibly need.
In effect, this web site provides a solid basis for any course in corpus-based linguistics, and includes lots of ideas for extensions in many directions.
The web site is, I believe, fairly exhaustive (for English corpora, tools, and references, at least), but I would make the usual plea for people to contact me with links and resources that I've missed, to report any mistakes they spot (dead links, non-existent sites, wrong information, etc.), and especially to tell me about any papers/notes/squibs they have written which are available on-line for downloading.
Please take the trouble to let me know about anything you'd like to share with the rest of the research community (links, papers and resources... e.g. if you've collected a (small) corpus or collection of materials which is available on-line or could be made available). The usefulness of the site will be increased if people actively participate to keep it current and complete. Please bookmark the URL alias for my site, http://tiny.cc/corpora, rather than anything else, as this is permanent, whereas other page/frame addresses may well change their names without warning. The downside of using this mnemonic alias is that a little advert window pops up... just ignore it and close it down immediately.
I would appreciate it if people could have a look and give me feedback (e.g. "I think it's great!" or if you have any suggestions on how to improve the structure/organisation of information, or if there are any glaring omissions), bearing in mind that this site is meant primarily for *linguists* or lg teachers (and, secondarily, for humanities scholars) who happen to work with corpora, not speech technologists or NLP people (although I've also provided the most important 'technical' links, so that people who wish to get more info in that direction may do so).
I think that this site is more organised and more complete than most of the other sites that I've seen, which are geared more towards NLP/language technology, and also tend to lump everything on one page (so that you have to wade through lots of undifferentiated stuff to find what you want). I've tried not to replicate other bookmark sites (e.g. Mike Barlow's and Manuel Barbera's (whose links for non-English corpora I have not even tried to duplicate)), but at the same time I've deliberately repeated some of the main links for the sake of convenience (there is no point in continually sending people to other people's web sites!), so that as much info as possible is provided on-site. Hope this will be of use to some.
Go to my Academic Home page
Please send brickbats/bouquets/bug reports/comments to me here.
 COPAC is an on-line system for unified access to the (combined) catalogues of some of the largest university research libraries in the
 BNC Web Indexer is the result of a collaboration between Paul Rayson (UCREL,
 Or using the web-based concordancer for the BNC developed at Zürich, BNCweb, at http://bncweb.lancs.ac.uk/ (restricted usage).