David Lee
The permanent URL
for this page is: http://tiny.cc/bncindex
Resources for Sub-corpus/Genre Analysis
The BNC Index (for the BNC World Edition)
The BNC Index spreadsheet I have prepared is
similar to the plain text ones prepared by Adam Kilgarriff (available here)
which I have benefited from, and found rather useful. However, Kilgarriff’s
list did not contain all the details I needed for compiling my own sub-corpus
(e.g. author type, author age, author sex, audience type, audience sex, section
of text sampled, speaker age, speaker sex, region, etc.). Sebastian Hoffmann’s
files (which used to be available here)
were useful too, in a complementary way, but they do not include (i) keywords,
and (ii) the full bibliographical details of files. The
“bncfinder.dat” file which comes with the standard distribution of
the BNC has most of the header
information, but uses highly abbreviated numeric codes, and also does not
include bibliographical information about the files. The version supplied with BNC Version 1
also contains a large number of errors (affecting hundreds of files and
millions of words of text).
What I did, therefore, was to consolidate the
information from the above 3 sources and, in most cases, scan the contents of files and check various classification
details. In addition, I have included: (i) BNC-supplied Keywords (as
entered by the BNC compilers); (ii) COPAC Keywords
[1] for published non-fiction texts (these were
cut and pasted by hand); (iii) full bibliographical details (including
title, year of publication and publisher for written texts, and number of
participants for spoken files; ‘Year of publication’ and
‘Author Age’ have been extensively corrected and revised, based on
my COPAC searches); (iv) an extra level of text categorisation, Genre,
where each text is assigned to one of the 70 genres/sub-genres (24 spoken and
46 written) developed for the purposes of this Index (see documentation
included in the zipped file below, or view the pdf-format version here; for a fuller
description of and justification for the genre categorisation scheme, see my LLT
paper available on-line here).
This Index should be a comprehensive
(‘one-stop’!), user-friendly databank on what is in the BNC. Note,
however, that if you want to deal with sub-text-level phenomena, e.g. the
speaker-related extra-linguistic variables of dialect, sex, age, social class,
education, etc., you’ll need to use SGML-aware software like SARA.
The BNC Index aims to be easy-to-use and
‘browseable’, and I have used clear abbreviations for most of the
fields instead of cryptic code numbers. For example, the medium of texts
is a separate field, with labels such as ‘m_pub’ (for
‘miscellaneous published’) instead of a number, and domains
are likewise indicated by abbreviated strings (e.g.
‘W_ac_soc_science’, ‘W_leisure’), again in a separate
field, to ease searching.
Reason for creating this new resource: I
needed all the available information I could get on the BNC texts in order to
choose my own sub-set of the BNC for my research. I needed, for example, to
separate children’s fiction from adult fiction, and sometimes
details like the author’s name or the name of the publisher can give you
more information about how to classify a text (e.g. to decide if something is
‘academic science’ or a popularisation of science). SARA and BNCweb allow subcorpus queries.
People using stand-alone PC concordancers can now also specify subcorpora at the
file level: first use the BNC Index to obtain the file IDs of the BNC files which match the criteria.
These file IDs can then be fed into concordancers such as WordSmith or MonoConc[3], which can use a list of filenames to
specify a sub-corpus to which future queries are to be restricted. Note
that individual files can always be deleted from the output list if so
desired, so users do not have to accept the classification decisions wholesale
but can vet individual texts before allowing them into a sub-corpus. Indeed, I
would strongly advise checking each file individually where time permits
rather than relying totally on any of the classifications (domain, genre,
spontaneity, etc.).
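For readers who would rather script this step than filter by hand, the following is a minimal sketch only, not part of the BNC Index distribution. It assumes the spreadsheet has first been exported to tab-delimited text; the column names ("FileID", "Medium", "Genre") and the sample rows are invented placeholders, so check them against the actual spreadsheet headers before use.

```python
import csv
import io


def select_file_ids(index_tsv, criteria):
    """Return the file IDs of all rows whose fields equal every value in `criteria`.

    `index_tsv` is any iterable of lines from a tab-delimited export of the
    spreadsheet. The column names ("FileID", "Genre", ...) are assumptions.
    """
    reader = csv.DictReader(index_tsv, delimiter="\t")
    return [
        row["FileID"]
        for row in reader
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Invented placeholder rows; real file IDs and labels come from the Index itself.
SAMPLE = (
    "FileID\tMedium\tGenre\n"
    "X01\tm_pub\tW_ac_soc_science\n"
    "X02\tbook\tW_commerce\n"
    "X03\tperiodical\tW_ac_soc_science\n"
)

file_ids = select_file_ids(io.StringIO(SAMPLE), {"Genre": "W_ac_soc_science"})

# One ID per line, ready to be vetted by hand and then pasted into a file list.
print("\n".join(file_ids))
```

The resulting list is only a starting point: individual IDs can be removed after inspection, and the exact file-list format expected (file extension, path prefix) depends on the concordancer being used.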
(all versions
are dated: see below for details; date of first release was 20/10/2001)
1. Click here to download the BNC_WORLD_INDEX.ZIP (zip file containing
MS Excel file and PDF documentation)
2. Please read the documentation included and use with caution
Entries which I have changed unilaterally (i.e. those which are different from
the official BNC Version 1 file
header entries, or, in some cases, from even the BNC World Edition
headers) are marked in red in the spreadsheet (e.g. see entry for file
JSE, where the official BNC Version 1 header did not have a domain
classification, whereas in the spreadsheet it is classified as
'business'). Unfortunately, there is no way of telling whether something
is different with respect to version 1 or the World Edition (I did not make a
distinction), so red just signals a
change.
Some of these changes
have been incorporated into the official World Edition (e.g. many newspaper
files are now no longer classified as "domain=imaginative"), but some have not (e.g. file HJL is a
business letter soliciting donations for a cause (saving badgers), but is
wrongly classified as "domain=imaginative" in both BNC v.1 and the
World Edition). Hence, if getting the domain (or genre) right is important for
your research, my advice is that you should always use the latest
BNC Index posted on this web site, rather than relying on the official headers.
Many (but not all) mistakes in the original BNC text categorisations (e.g. in terms of Domain, Author Age, etc.) have been corrected, but end-users will undoubtedly find many more. Genre categorisations are also subject to change (please e-mail me if you spot any text which has definitely been categorised wrongly). The good thing about this BNC Index project is that mistakes can be corrected instantly and updated in the spreadsheet. So please feel free to report to me if you spot anything at all which needs correcting.
3. Do the
referencing: If you make use of the BNC Index, you should acknowledge the fact by referring
to the following article: Lee, David Y.W. 2001. Genres, registers, text
types, domains and styles: clarifying the concepts and navigating a path
through the BNC jungle. Language Learning & Technology, Vol.5(3):
37-72. [Available at: http://llt.msu.edu/vol5num3/lee/default.html]
- 22 Dec 2003:
HHY, HJ0, HJ1, and HJ2 were wrongly classified as
non-academic. In fact, each of these files contains a collection of successful
ESRC grant proposal abstracts, so they may be considered academic English. They
have thus been reclassified as W_ac_soc_science.
- 24 Nov 2003:
One change: file EED was wrongly assigned to the
genre W_advert. This was because the first few texts were pamphlets by British
Rail on various services. In fact, this BNC file is a composite of many
different types of text, including brief customer-information leaflets/brochures,
"Roald Dahl's Guide to Railway Safety" (written for children/teens),
and fact sheets (more formal English) from
- 16 June 2003:
One change: file KM6 has a wrong description in the
title field. It reads "
- 30 Oct 2002: [More BNC World
Edition errors spotted by me]
- Changed 'Interaction Type' from 'monologue' to
'dialogue' for the following files: HEW and KJT ('live' sports
commentaries). They predominantly consist of casual chat and banter between the
commentators.
- HPS probably does not belong to the genre
'W_essay_univ' as it is unlikely to have been written by a university student.
I have now placed it under 'W_misc' because it is unpublished. Alternatively,
it may be considered under 'W_non_ac_humanities
Other earlier changes are logged in the BNCWIndexNotes document which
accompanies the BNC Index.
[Other updates: 4/5/02
(Changed explanation of the genre labels 'S_speech_scripted' and
'S_speech_unscripted' to make it clearer that 'speech' does not necessarily
refer to prepared monologues, but to any spoken discourse, whether
dialogue or monologue); 25/11/01 (Changed all documentation to PDF format); 20 Oct
2001 (spelling errors in ‘Keywords’ field [made by BNC compilers]
now corrected in spreadsheet); 26/6/2001 (minor changes to documentation only)]
Index for first release/BNC version 1
(*not* recommended, due to large no. of classification and tagging errors
in this version of the corpus), available in MS Excel format: BNC_INDEX.ZIP (old, BNC version 1; this will no longer be updated)
The
advantage of putting all this information in MS Excel format is that there is a
quick way of displaying only the texts you want through the use of the
‘Autofilter’ (go to Data, Filter, Autofilter) or ‘Custom
Filter’ (for more complex filtering using wildcards). Another bonus with
the Excel format is that you can easily copy and paste the lines displayed into
your word processor for referencing, retaining the tabular format.
· Autofilter: With the
Autofilter switched on, every field (column) on the top line now has a
drop-list button, which you can use to filter the view to only the texts you
want (e.g. by selecting the genre ‘W_commerce’, all the
non-commerce, non-written files will be hidden from view). Fields are combinable
(so you can first restrict the display to only ‘social science’
texts, then further restrict this to only ‘periodicals’).
· Custom Filter: (requires Autofilter to be turned on first) To
make more advanced searches, and to make use of the fields with
hierarchical codes (e.g. to search for all lectures (S_lect*) instead
of specific sub-genres like S_lectures_soc_science or S_lect_commerce),
choose ‘Custom’ filter from the relevant drop-list menu for
the field you want to query. This kind of advanced search is easy to learn and
relatively fast. I have used it, for example, to search for all BNC files with
the word “legal” in the Title column/field. [Steps:
Choose ‘Custom’ from the drop-list in the Title column, specify
“=” or “contains” (depending on your version of Excel), and type “*legal*” (or just “legal” if you chose “contains”)
in the neighbouring box.]
The Excel file I have made available for download
was saved with the ‘Autofilter’ switched on, so you will see the
drop-list arrows on the first line of the spreadsheet upon opening it.
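The same kind of wildcard search can be approximated outside Excel. The sketch below is an illustration under stated assumptions, not a description of how Excel itself works: it uses Python's standard `fnmatch` module, matches case-insensitively, and runs on invented placeholder rows with hypothetical column names. Note that `fnmatch` also treats square brackets as special characters, which is harmless for labels such as S_lect* but may differ from Excel's behaviour for other patterns.

```python
from fnmatch import fnmatchcase


def custom_filter(rows, field, pattern):
    """Keep the rows whose `field` matches a wildcard pattern, ignoring case."""
    lowered = pattern.lower()
    return [row for row in rows if fnmatchcase(row[field].lower(), lowered)]


# Invented placeholder rows; column names are assumptions for illustration.
rows = [
    {"FileID": "X01", "Genre": "S_lect_commerce", "Title": "Lecture on legal aspects of trade"},
    {"FileID": "X02", "Genre": "S_lect_soc_science", "Title": "Sociology lecture"},
    {"FileID": "X03", "Genre": "W_commerce", "Title": "Legal handbook for small firms"},
]

# All lecture sub-genres at once, via the hierarchical prefix.
lectures = custom_filter(rows, "Genre", "S_lect*")

# Filters are combinable: restrict the lecture rows further by title.
legal_lectures = custom_filter(lectures, "Title", "*legal*")

print([row["FileID"] for row in lectures])
print([row["FileID"] for row in legal_lectures])
```

Chaining two calls mirrors applying a filter on one column and then a second filter on another, as described for the Autofilter above.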
The bibliographical spreadsheets below for the ICE-GB
corpus were compiled by combining the information available in the separate
‘internals’ and ‘sources’ files which came on the
CD-ROM. I found it really annoying that the bibliographical and detailed
information about the files was not only in separate files, but also in a
format which is not user-friendly or conducive to casual browsing. Also, some
of the fields contained rather non-useful information and very many of the
fields were redundant or wasteful in the sense that they contained information
which was relevant for only a small number of the texts (e.g.
‘Channel’ for the spoken texts only applied to files which were
captured from radio or TV, but including it as a separate field meant an extra
column for which all the other files received a blank entry!). I therefore
examined the separate files and carefully consolidated all the information
into one big database, conflating some fields so that some of them can now
contain multiple types of information which do not clash (e.g. Channel, Subject
and Audience were mutually exclusive categories in the original files, so the 3
fields could be merged into 1 without losing any information). By restricting
the number of fields included and by careful formatting, I have also managed to
ensure that the database fits nicely on an A4 page (landscape orientation) and
can thus be read easily when printed out.
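The consolidation idea described above (joining two per-text tables and folding mutually exclusive fields into a single column) can be sketched as follows. This is a hypothetical illustration rather than the procedure actually used to build the spreadsheets: the text IDs, the sample values, the merged column name "Details", and the assumption about which file each field came from are all invented.

```python
def consolidate(internals, sources,
                exclusive=("Channel", "Subject", "Audience"),
                merged_name="Details"):
    """Join two per-text tables on text ID and fold exclusive fields into one column."""
    merged = []
    for text_id in sorted(internals.keys() | sources.keys()):
        row = {"TextID": text_id,
               **internals.get(text_id, {}),
               **sources.get(text_id, {})}
        # Pull the exclusive fields out of the row, keeping only non-empty ones.
        filled = [(name, row.pop(name, "")) for name in exclusive]
        filled = [(name, value) for name, value in filled if value]
        if len(filled) > 1:
            # The merge is lossless only if at most one of the fields is set.
            raise ValueError(f"{text_id}: fields are not mutually exclusive: {filled}")
        row[merged_name] = f"{filled[0][0]}: {filled[0][1]}" if filled else ""
        merged.append(row)
    return merged


# Invented placeholder data standing in for the 'internals' and 'sources' files.
internals = {"T001": {"Channel": "radio"}, "T002": {"Subject": "physics"}}
sources = {"T001": {"Title": "News bulletin"}, "T002": {"Title": "Lecture notes"}}

for row in consolidate(internals, sources):
    print(row)
```

Prefixing the merged value with the original field name keeps the information recoverable, and the explicit check guards against silently losing data if two of the supposedly exclusive fields are ever both filled in.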
Download the ICE-GB Bibliographical Spreadsheets
SPOKEN ICE-GB
MS Excel
format: ssources.zip (110,593 bytes)
Tab-delimited,
plain text format: ice_s.zip (39,424 bytes)
WRITTEN ICE-GB
MS Excel
format: wsources.zip (115,712 bytes)
Tab-delimited,
plain text format: ice_w.zip (43,003 bytes)
Bookmarks
for Corpus-based Linguists (
This web site started as a resource for the
students learning about corpus-based methods, but was then extensively expanded
and has since been visited by people from all over the world. There are already
a number of sites out there with similar content, but here are a few key things
about my site:
· it's up-to-date (I've checked practically
all the links to make sure there are no dead ones, except those which are
permanently dead, i.e. no known new URL!)
· it focuses on links for linguists and lg
teachers (not NLP/lg engineering)
· my listings are mostly annotated (i.e. have
descriptions of the links, so you don't always have to click a link to find out
what it's about)
· it brings together in one place all the
information on corpora, software tools, bibliographies, references, electronic
papers, mailing lists, on-line courses, conferences, etc. that people doing
corpus work will possibly need.
In effect, this web site provides a solid basis for
any course in corpus-based linguistics, and includes lots of ideas for
extensions in many directions.
The web site is, I believe, fairly exhaustive (for
English corpora, tools, and references, at least), but I would make the usual
plea for people to contact me with more links and resources that I've missed,
if they spot any mistakes (dead links, non-existent sites, wrong information,
etc.), and especially if they have written papers/notes/squibs which are
available on-line for downloading.
Please take the trouble to let me know about
anything you'd like to share with the rest of the research community (links,
papers and resources... e.g. if you've collected a (small) corpus or collection
of materials which is available on-line or could be made available). The
usefulness of the site will be increased if people actively participate to keep
it current and complete. Please bookmark the URL alias for my site, http://tiny.cc/corpora, rather than
anything else, as this is permanent, whereas other page/frame addresses may
well change their names without warning. The downside of using this mnemonic
alias is that a little advert window pops up... just ignore it and close it
down immediately.
I would appreciate it if people could have a look
and give me feedback (e.g. "I think it's great!" or if you have any
suggestions on how to improve the structure/organisation of information, or if
there are any glaring omissions), bearing in mind that this site is meant
primarily for *linguists* or lg teachers (and, secondarily, for humanities
scholars) who happen to work with corpora, not speech technologists or NLP
people (although I've also provided the most important 'technical' links, so
that people who wish to get more info in that direction may do so).
I think that this site is more organised and more
complete than most of the other sites that I've seen, which are geared more
towards NLP/language technology, and also tend to lump everything on
one page (so that you have to wade through lots of undifferentiated stuff to
find what you want). I've tried not to replicate other bookmark sites (e.g.
Mike Barlow's and Manuel Barbera's, whose links for non-English corpora
I have not even tried to duplicate), but at the same time I've deliberately
repeated some of the main links for the sake of convenience (there is no point
in continually sending people to other people's web sites!), so that as much
info as possible is provided on-site. Hope this will be of use to some.
Please send brickbats/bouquets/bug
reports/comments to me here.
[1] COPAC is an on-line system for unified access
to the (combined) catalogues of some of the largest university research
libraries in the
[2] BNC Web Indexer is the result of
a collaboration between Paul Rayson (UCREL,
[3] Or using the web-based concordancer for
the BNC developed at Zürich, BNCweb, at http://bncweb.lancs.ac.uk/
(restricted usage).