Bookmarks for Corpus-based Linguists 

Concordancers | Concordancing Complements | Text Coding/Annotation Programs/Text-analysis Tools | Word/Frequency Lists | Tools & Resources for Speech Data | On-line Dictionaries/Lexicons | NLP resources (inc. taggers) | Format Conversion Tools | HTML code strippers | Web snaggers | Fonts for Multilingual Computing | [Bookmarks HOME]


Software, Tools, Frequency Lists, etc.


If you need help with file formats for some of the downloads, [click here]

Concordancers, Search Engines, Text-analysis Tools (some are free)

Tools that are free and/or (in my opinion) the most generally useful are at the top; otherwise the tools are more or less randomly ordered, not alphabetical.

AntConc
(by Laurence Anthony)

(For discussion/ support group, click here)

The best free concordancer for Windows, Mac OS X and Linux that I know of. Some commercial programs may have a couple more features, but this one's free, so don't complain! Pros: works with all languages (fully Unicode compliant); allows full regular expressions (for very complex searches); does word lists and keywords (by comparing against a reference corpus); does distribution plots of occurrences within each file; can handle lemma lists; can handle XML-type and underscored_tag-type part-of-speech tags; the developer continually improves it and is open to feedback (and may I emphasize that it's free?). Cons: (at the moment...) very minimal support for SGML/XML/HTML corpora (it simply ignores rather than intelligently mines structural tags), but that's a problem common to most concordancers.

WordSmith Tools

Mike Scott's excellent Windows-based concordancer (has a 'Keywords' function to compare corpora); the latest version includes "concgrams". Limitations: no real SGML/XML awareness (but there are workarounds).

MonoConc Pro (v.2.2)

concordancer for Windows with powerful (regular expression) search facilities. Good points: ability to show/hide tags; colour-codes collocates within the main concordance window itself; handles many languages (including Chinese, Japanese and Korean); the Advanced Collocations feature (similar to WordSmith's 'clusters' feature, but does other things too) is great. Not-so-good points: not as flexible/customisable as WordSmith.

UAM Corpus Tool 

Mainly used as an annotation tool, but can also be used for corpus analysis. See the entry under "Text Coding/Manual Annotation Programs"

NooJ
(by Max Silberztein)

A free corpus-processing tool and linguistic engineering development platform/environment. Can be used as: corpus processor, information extraction system, terminological extractor, Machine Translation development tool, tool for teaching linguistics & computational linguistics.

 

Allows linguists to formalize several levels of linguistic phenomena: orthography and spelling, lexicons for simple words, multiword units and frozen expressions, inflectional, derivational and productive morphology, local, structural syntax and transformational syntax. As a corpus processing tool, NooJ allows users to apply sophisticated linguistic queries to large corpora in order to build indices and concordances, annotate texts automatically, perform statistical analyses, etc. Linguistic modules can already be freely downloaded for many languages.

Characteristics: (1) can process texts in over 100 file formats, including HTML, PDF, MS-OFFICE, and all variants of Unicode. (2) can import information from, and export its annotations back to, XML documents. (3) an annotation system that allows all levels of grammars to be applied to texts without modifying them; this allows linguists to formalize various phenomena independently, and to apply the corresponding grammars in cascade. For instance, by combining inflection, derivation and syntactic data, NooJ can perform Harris-type transformations.

Dexter
(by Gregory Garretson)

free suite of software tools that enable you to perform qualitative coding of corpus texts; Java-based, so it is cross-platform (Windows, Mac, Linux). Dexter is written specifically with three things in mind: spoken language data, researcher-collected data, and analysis of discourse-level phenomena. Dexter Coder displays your document in a window and allows you to define and add annotations to the document. You can perform complex searches of the text and codes, and certain quantitative analyses, with the Coder. All annotations are saved in a separate standoff XML file. The input data may be in various formats; it is converted to XML to enable stand-off markup, which in turn enables an unlimited number of analyses without affecting the source data. 
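
To illustrate the stand-off idea (a made-up sketch, not Dexter's actual file format): the source text is stored once and never modified, and each analysis lives in a separate file that points back into it, e.g. by character offsets:

    <!-- base-text.xml: the source data, never touched -->
    <text id="t1">well I suppose so</text>

    <!-- one of any number of separate annotation files referring back to it -->
    <annotations base="t1">
      <code start="0" end="4" label="discourse-marker"/>
      <code start="5" end="17" label="hedged-agreement"/>
    </annotations>

Because nothing is ever written into the source file, any number of overlapping analyses can coexist.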

aConCorde (by Andrew Roberts)

a free multi-lingual concordance tool. Originally developed for native Arabic concordancing, it possesses basic concordancing functionality, as well as English and Arabic interfaces. Java-based, so will run on any platform that has the Java Runtime Environment installed.

Qwick

a Java concordancer by Oliver Mason, included with the BNC Sampler. Available through TRACTOR.

Concapp (Chris Greaves)

free and rather nifty concordancer for Windows which also supports Chinese (Big5 and GB formats) and Japanese texts (I haven't tested this feature myself). A web version is here -- limited to using a fixed selection of texts, e.g. the Brown corpus.

CasualConc

(for Mac with Mac OS X 10.5.5 (Leopard) or 10.6 (Snow Leopard))

KWIC concordance lines, word clusters, collocation analysis, and word counts.

Conc (John Thomson / SIL)

free concordancer for Macintosh computers.

Multiconcord

Multilingual parallel concordancer for Windows. It uses truly parallel texts, that is, texts which relate to the same source.

Conc (Mario Saraceni)

basic but free concordancer for Linux. (or try the web versions here -- limited to using a fixed selection of texts).

Concordancer for Windows (WConcord)

basic but free concordancer for Windows. Used to be available for download here, but it's disappeared, so I'm making it available on my site here.

Concorder/Le Concordeur (D. W. Rand)

a free concordancer for Macintosh computers; either in a PPC version (which runs only on PowerMacs) or in a 68K version (for the old generation of Macintoshes, but which runs also on PowerMacs); handles a variety of languages such as English, French, Russian, etc., i.e. any language whose alphabet can be encoded as one byte per character, written from left to right, and for which an appropriate font is available.

SCP

free concordancer for Windows

TextStat (Matthias Hüning)

freeware concordancer; reads ASCII/ANSI texts (in different encodings), HTML files (directly from the internet), and MS Word and OpenOffice files (no conversion needed). Produces word frequency lists & concordances (uses regular expressions). Includes a web-spider which reads as many pages as you want from a particular web site and puts them in a TextSTAT corpus. The news-reader puts news messages in a TextSTAT-readable corpus file.

Multilingual interface and uses Unicode internally: can cope with many different languages & file encodings. Written in Python: should run everywhere where Python runs (Windows XP, Linux, MacOS X).

Multilingual Concordancer (mltc) (Scott Piao)

free Java-based concordancer that supports Unicode for multilingual concordancing. Works on the Windows platform with the Java Runtime Environment (JRE) installed.

See also Piao's Java suite of tools/package for multilingual computing (specifically, Chinese, Korean & English), MultiLingProc, available on the same page.

NoSketch Engine,

Manatee, Bonito2

NoSketch Engine is an open-source project combining Manatee and Bonito into a powerful and free corpus management system. It is essentially a limited version of the software powering the Sketch Engine service, a commercial variant offering word sketches, a thesaurus, keyword computation, user-friendly corpus creation and other features.

Manatee is a corpus management tool covering corpus building and indexing, fast querying, and basic statistical measures. It uses a fast indexing library called Finlib. Bonito is a graphical user interface to corpora maintained by Manatee. It is available as a standalone graphical application in Tcl/Tk (Bonito1, no longer developed/supported) and as a web interface in Python (Bonito2, under constant development).

SketchEngine

a fee-based Corpus Query System (based on Manatee) incorporating word sketches, grammatical relations, and a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour. A 30-day free trial account is available. Web-based service using standard browsers: no software installation required.

Available Resources: (1) Pre-loaded corpora (60M-1.5B words) for Chinese, English, French, German, Italian, Japanese, Portuguese, Spanish, & Slovene; (2) WebBootCaT (for building your own instant corpus from web pages, then extracting keywords, specialist terminology, etc.); (3) CorpusBuilder (upload and install your own corpora)

TACT

a system of 15 free programs for MS-DOS (& up to Windows 2000) designed to do text-retrieval and analysis on small- to medium-sized literary works (not for large corpora).

TACTweb 

TACTweb connects the TACT concordancer to the World Wide Web, making TACT databases (.TDB) accessible on the web via browsers.

LinguaSy Power Concordancer

free concordancer for Windows (via free-esl.com)

TAPoRware 0.2

a set of free text analysis tools that enable users to perform text analysis on XML, HTML and plain text files over the Web.

Text Analysis

I have not tried this out personally, but this suite of free DOS programs (from SIL) includes a concordancer, among other things.

MicroConcord (Scott & Johns)

free concordancer for IBM PCs running DOS. DOS is faster than Windows but the number of concordance lines is limited to around 1,500, and you can't save a concordance except as a text file. It is very useful for a quick analysis, and may be easier for students to use than WordSmith Tools.

Ultrafind

free tool for Macintosh computers. Not really a concordancer, but can be used as such.

EXMARaLDA
("Extensible Markup Language for Discourse Annotation")

See description under "Text Coding/Annotation Programs/ Text-analysis Tools" below. EXMARaLDA includes facilities for concordancing: SQUIRREL ("Search and Query Instrument for EXMARaLDA") and ZECKE ("Ziemlich einfaches Konkordanzwerkzeug für EXMARaLDA", i.e. a "fairly simple concordance tool for EXMARaLDA"), for searching transcribed and annotated phenomena in an EXMARaLDA corpus; but I've not yet used these myself, so can't comment.

Xaira

(download page here)

a general purpose XML-aware search engine (Windows platform) that will operate on any corpus of well-formed XML documents as well as plain text files (best used with TEI-conformant documents); Unicode-compliant, so works with any language provided the relevant Unicode font is installed on the system. Originally developed at OUCS for use with the British National Corpus.

Emdros 

a text database engine for analyzed or annotated text; supports storage and retrieval of any kind of text plus annotations/analyses of that text. Linguistic analyses are its primary target, and here syntactic analyses are in focus (although other linguistic levels are supported, too). It excels in storing and querying structured data, supporting multiple hierarchies of embedding over the same text. Its powerful query language is built around sequence and embedding as the primary structuring operations. It implements the EMdF database model and the MQL query language.

LEXA

corpus-processing software for DOS (tagging, lemmatising, concordancing)

Xkwic/CQP or IMS Corpus Workbench (CWB)

Excellent corpus query system (my personal favourite) for SunOS 4.1.x, Solaris 2.x/Linux; powerful (full regular expression searches). Fast (indexed) concordancer with both command-line (including batch mode) & X Windows interfaces; free for educational use. [Query Syntax & Examples here]
Drawbacks: steep learning curve for beginners and non-UNIX/Linux initiates; corpora need to be pre-indexed (can be complicated for marked-up texts); very limited SGML awareness.

Try a web-interfaced demo: Search LOB/COLT with IMS Corpus Work Bench (N.B. web interfaces to CQP/CWB limit its full functionality. You'll need the stand-alone version to get the whole enchilada).
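
For a taste of the query syntax (a sketch only; attribute names like lemma/pos, and the tagset, depend on how your corpus was indexed -- CLAWS5-style BNC tags assumed here):

    [lemma="make"] [pos="AT0"]? [pos="NN.*"] within s

i.e. any form of "make", optionally followed by an article, then a noun, all within one sentence.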

Phrase Context (Hans J. Klarskov Mortensen)

Commercial software. Another concordancer. Can output to XML.

Concordance (R.J.C. Watt)

concordancer for Windows; has facility for publishing concordances to the web; supports non-European character sets (inc. Chinese, Japanese & Korean).

TXTANA 

Commercial software. Concordancer that handles Japanese texts as well as English

Corpus Wizard (or here)

[Doesn't seem to be maintained anymore] Commercial software. Concordancer for Windows with regular expression searches and Japanese language support.

SARA (SGML-Aware Retrieval Application) 

old search engine that was distributed with the first version of the British National Corpus (BNC). Not user-friendly at all (you'd use the interface only if you didn't have access to BNCWeb, which is much more user-friendly and has more functionalities too).

Web-browser-based Concordancers (some are linked to a specific corpus (e.g. the BNC), some can be used with your own texts)

BNCweb (CQP edition)

(Free access via Lancaster University's server (restricted if you don't have a licence))

 

The most powerful and user-friendly interface to the British National Corpus (XML World Edition): a browser-based tool for exploring the BNC; formerly linked to the SARA engine, but now uses CQP and MySQL as the back-end systems. Incorporates genre categories as set out in my BNC Index.

There is a manual/textbook that accompanies this tool: Hoffmann, Sebastian, Evert, Stefan, Smith, Nicholas, Lee, David & Ylva Berglund Prytz. (2008). Corpus Linguistics with BNCweb: A Practical Guide. Frankfurt am Main: Peter Lang. (Publisher's site is here.)

You'll need to have a licence to a full installation of the BNC World Edition (best installed on a Unix/Linux/Mac OS X system) to get full access, but limited access is available to the general public at the Lancaster site. (Zürich's restricted login page is here; an (out-of-date) on-line manual is also available.)

BYU-BNC

(Mark Davies)

allows word-, phrase- or part-of-speech-based searches of the British National Corpus (BNC) with genre restrictions; allows wildcards and "fuzzy matches". (Formerly called VIEW: Variation In English Words And Phrases)

Compleat Lexical Tutor

web-based suite of tools for data-driven self-learning (mainly for vocabulary). The online tools allow any reader with an Internet connection to transform any text of interest into a self-teaching text linked to speech, dictionary, concordance, and self-test resources. You paste a text/corpus into one of the tools provided and get results via your browser. Tools include a concordancer, a phrase (n-gram) extractor, VocabProfile (tells you how many words in the text come from the following four frequency levels: (1) the list of the most frequent 1000 word families, (2) the second 1000, (3) the Academic Word List, and (4) words that do not appear on the other lists), a vocab-level-based cloze passage generator and a traditional nth-word cloze builder.

Just The Word (Sharp)

Simplest and most pedagogically accessible tool for ESL/EFL learners based on the British National Corpus (BNC). Enter a word and get back a bunch of collocations & colligations, sorted into similarity groups. (Based on an 80-million-word subset of the BNC.)

Phrases in English (PIE)

PIE incorporates a database of all 1-6-grams (phrases 1 to 6 "words" long) with part-of-speech (POS) codes occurring three or more times in the 100-million-word British National Corpus (BNC). You can explore English phraseology either through lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work", to find the most frequent adjectives describing work. PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram identical except for one word (wildcard symbol *), e.g., "the * of the", with variants such as "the end of the", "the rest of the", "the top of the", "the nature of the". Over the next year PIE will add: (i) Click on an n-gram in the query results to see concordances from the BNC (ii) POS-grams and POS-frames for studying the relative productivity of phrase structures (iii) Filtering by text type (domain, genre, target audience) for contrastive studies (iv) Query by regular expression (currently only wildcards are supported).

SACODEYL Search

A web-based search tool that can be loaded directly with corpora created using SACODEYL Annotator.

Stringnet

(formerly LexChecker)

"a web-based corpus query tool that shows how English words are used. Users submit a word into the query box (like a Google search) and StringNet returns a list of the patterns in which the word is typically used. Each pattern listed for a word is linked to sentences from the British National Corpus (BNC) that show the word occurring in that pattern. The patterns are dubbed 'hybrid n-grams'."

Turbo Lingo  (Danko Sipka)

free web-browser-based concordancer. You can get concordances and frequency lists of entire Web pages (by entering a URL), or by pasting a text into the input box. Also features "1x1phonotactics" and "1x1 lex. combinatorics".

* The above represent just a personal selection. There are many more out there. Kennedy (1998: 258-267) lists and describes quite a number of them. 

* See also: Using the Web as a corpus


 

Concordancing Complements

(including linguistic database programs & tools for treebanked corpora)

Log-likelihood Calculator

(For other more general statistical tools, click here)

Use this to compare results (e.g. linguistic frequencies) from data sets of unequal sizes for statistical significance. Input your frequency figures & out come the LL values (measures of significance). Kindly made available by Paul Rayson of UCREL. Includes references to related technical papers.
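
The calculation itself is easy to reproduce if you want it offline; here is a minimal Python sketch of the standard log-likelihood formula (after Rayson & Garside) for one word's frequency in two corpora of unequal size:

    import math

    def log_likelihood(freq1, size1, freq2, size2):
        # expected frequencies under the null hypothesis that the word
        # is equally common in both corpora
        e1 = size1 * (freq1 + freq2) / (size1 + size2)
        e2 = size2 * (freq1 + freq2) / (size1 + size2)
        ll = 0.0
        if freq1:
            ll += freq1 * math.log(freq1 / e1)
        if freq2:
            ll += freq2 * math.log(freq2 / e2)
        return 2 * ll

    # e.g. 40 hits in a 1M-word corpus vs. 25 hits in a 2M-word corpus;
    # LL > 3.84 corresponds to p < 0.05 (1 degree of freedom)
    print(log_likelihood(40, 1_000_000, 25, 2_000_000))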

Georgetown's "Web Chi Square Calculator" (For other more general statistical tools, click here)

[N.B. For smaller data sets, it is theoretically better (i.e. statistically more accurate) to use the log-likelihood ratio (see the Log-likelihood Calculator above).] This site allows you to perform the chi-square test for statistical significance on-line. Enter the number of rows and columns for your test, and a table will be generated in which you can enter your data.
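
If you would rather script the test than use the web form, here is a minimal sketch using SciPy (my choice of library, not the site's):

    from scipy.stats import chi2_contingency

    # 2x2 contingency table: [occurrences of the word, all other words]
    # in each of two corpora (same example figures as the LL sketch above)
    table = [[40, 1_000_000 - 40],
             [25, 2_000_000 - 25]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)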

kfNgram

a free stand-alone Windows program for linguistic research which generates lists of n-grams in text and HTML files. Here an n-gram is understood as a sequence either of n words, where n can be any positive integer (such n-grams are also known as lexical bundles, chains, wordgrams, or clusters in WordSmith), or of n characters (chargrams).

kfNgram also produces and displays lists of "phrase-frames", i.e. groups of wordgrams identical except for a single word.
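
Both ideas are easy to illustrate; a minimal Python sketch (not kfNgram's own code) of wordgrams and phrase-frames:

    from collections import Counter

    def ngrams(tokens, n):
        # all contiguous n-word sequences in a token list
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "at the end of the day and at the end of the week".split()
    wordgrams = Counter(ngrams(tokens, 4))

    # phrase-frames: group wordgrams identical except for one slot ("*")
    frames = Counter()
    for gram, freq in wordgrams.items():
        for i in range(len(gram)):
            frames[gram[:i] + ("*",) + gram[i + 1:]] += freq

    print(frames.most_common(3))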

WebAsCorpus.org Web Concordancer

Search the Web directly for concordances of words and phrases in 34 different languages. Has support for selecting which documents to include in the zipfile, preselection based on document metrics, combining all textfiles into a single document for importing into kfNgram or a concordancer, and conversion from UTF-8 into more widely-supported encodings.

Gsearch

(Edinburgh University)

a tool for searching corpora by syntactic criteria (even where there is no prior syntactic markup) by processing a query based on a user-definable context-free grammar where the terminals may be regular expressions over elements in the corpus (e.g. words, lemmas, POS-tags) as well as calls to external databases and resources (e.g., WordNet). Tested on Solaris, Linux, MacOS X, and Cygwin under Windows.

LDC's Collocation calculator

Enter two words and get information on their frequencies, collocational mutual information and t-score, based on corpora you choose from a limited set available. [Non-LDC members are further restricted to the Brown (written) corpus, and the TIMIT and Switchboard speech corpora.]
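
The measures themselves are the standard ones (Church & Hanks-style mutual information and the t-score), so they are easy to compute over your own counts; a quick Python sketch:

    import math

    def mutual_information(pair_freq, w1_freq, w2_freq, n):
        # how much more often the pair occurs than chance predicts
        expected = w1_freq * w2_freq / n
        return math.log2(pair_freq / expected)

    def t_score(pair_freq, w1_freq, w2_freq, n):
        # confidence that the pair's frequency exceeds the chance expectation
        expected = w1_freq * w2_freq / n
        return (pair_freq - expected) / math.sqrt(pair_freq)

    # e.g. a pair seen 30 times; its words 1000 and 800 times; 1M-word corpus
    print(mutual_information(30, 1000, 800, 1_000_000))
    print(t_score(30, 1000, 800, 1_000_000))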

Web-based/on-line word frequency indexers

generate your own frequency lists for your own texts; see list below.

MiniJudge: Judgment Collecting Software

a free, open-source software tool designed to allow syntacticians without any training in psycholinguistics or statistics to perform quick and reliable tests of empirical hypotheses on native-speaker judgments (may also be used for collecting judgments relating to pragmatics, semantics, morphology or phonology). 

For pedagogical software and vocabulary analysis programs, see the "Teaching and Miscellaneous Links" page.

Text Coding/Manual Annotation Programs/Text-analysis Tools & Search Engines

Dexter
(Gregory Garretson)

See this entry above.

EXMARaLDA
("Extensible Markup Language for Discourse Annotation")

a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language; XML-based data formats; Java-based tools; interoperable with software like Praat, ELAN or the TASX Annotator; based on the annotation graph framework (Bird/Liberman 2001); supports several important transcription systems (HIAT, DIDA, GAT, CHAT) through a number of parameterised functions.

Grammar Explorer (grexplorer)

OR the legacy link here.

OR the KPML link here

a tool for learning about the coverage of large generation grammars, aimed currently at grammars written in the systemic-functional style. The tool operates essentially as a coder: you, as the user, select some sentence or other grammatical unit and attempt to 'code' that unit using the terms of a grammar. The tool leads you through the grammar, presenting the options that are available (& you can ask for examples exhibiting the relevant grammatical choices); it also tells you the syntagmatic consequences of those choices (i.e., what structure is generated). If your coding is correct, then it should be possible to relate the structure you have generated to the original target unit. The Explorer differs from coding, or text annotation/markup, in that it provides access to the structural consequences of coding. This provides a natural check on the accuracy of any coding carried out. The Explorer also differs from an annotated corpus of examples in that the examples it shows are all generated with the grammar that it contains.

Kura

(legacy page here)

a multilingual, multi-user, multi-project, open-source linguistic database program especially geared towards language description/linguists working with fieldwork or manuscript data. Supports the entry, analysis and presentation of linguistic data, be it recordings or manuscripts. All linguistic data is stored in parsed form in a relational database, facilitating quick analysis, and the relations between data can also be stored. Kura consists of 3 main parts: the database, with a set of relatively sophisticated components that represent linguistic notions such as text or lexeme; the desktop client, used to enter data and analyses; and the special-purpose webserver, which can present the linguistic data to the outside world. Uses Unicode (currently Basic Multilingual Plane only). Platforms: Windows and Unix/X11 (the Windows version might have some limitations, and while still free software, some runtime components could lose that status in the future).

The Linguist's Search Engine

web-based search engine which can be used to perform syntactic searches (done graphically via parse trees) on Internet data. Currently available are a three-million-sentence corpus of sentences from the Internet Archive as well as facilities to build and search corpora based around search results from AltaVista queries.

PERL/grep scripts

a summary on the CORPORA list, giving various Perl/grep scripts for producing concordances on the command line, may be found here.
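
For a flavour of how little code a basic concordance needs, a comparable sketch in Python (rather than Perl):

    import re
    import sys

    def kwic(text, pattern, width=40):
        # print a keyword-in-context line for every regex match
        for m in re.finditer(pattern, text):
            left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
            right = text[m.end():m.end() + width].replace("\n", " ")
            print(f"{left:>{width}}  {m.group()}  {right}")

    # usage: python kwic.py mytext.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        kwic(f.read(), r"\bcorpus\b")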

RSTTool

or the related Systemic Coder

RSTTool is a graphical interface for marking up the structure of text. While primarily intended for Rhetorical Structure (cf. Rhetorical Structure Theory (RST): Mann & Thompson 1988), the tool also allows the mark-up of constituency-style analysis, as in the Generic Structure Potential (GSP -- cf. Hasan 1984; Halliday & Hasan 1985). Windows, Macintosh, UNIX and Linux operating systems (requires the pre-installation of Tcl/Tk, a scripting language engine). The tool consists of four interfaces: Text Segmentation, for marking the boundaries between text segments; Text Structuring, for marking the structural relations between these segments; Relation Editor, for maintaining the set of discourse relations and schemas; and Statistics, for deriving simple descriptive statistics based on your analysis.

 

Systemic Coder is a tool that facilitates the linguistic coding of corpus material by prompting the user for relevant categories. Linguistic features are organised in terms of a systemic network -- an inheritance hierarchy -- to reduce the amount of coding effort. You first define your feature hierarchy, and are then prompted to code the segments of the text according to the hierarchy. These codings can then be statistically analysed, either using the built-in comparative statistics programs, or by exporting the codings in a form readable by statistical packages.

 

[My comment: neither of these tools seems to create output or accept input in an exportable format such as XML. If you know otherwise, please let me know.]

SignStream

a database tool for analysis of linguistic data captured on video. Although designed specifically for working with data from American Sign Language, the tool may be applied to any kind of language data captured on video. A SignStream database consists of a number of utterances, where each utterance associates a segment of video with a detailed transcription of that video.

Systemics 

(Kevin Judd & Kay O'Halloran)

a tool designed to allow efficient and comprehensive discourse analysis of text from the perspective of Systemic Functional Linguistics (SFL); however, as the pre-programmed grammar in Systemics can be modified, this software can incorporate other theoretical perspectives.

SACODEYL Annotator

free (GNU licence) tool for XML-annotating language corpora in a user-friendly way while complying with TEI guidelines. SACODEYL Annotator can: manage multiple corpora; manage the definition of the tags that can be annotated; extend the annotation tags; annotate at different levels (tree-based); work with oral and written texts; show or hide selected annotations.

SysAm

(Macquarie University)

computational tools for managing linguistic systems, analysing texts, and extracting linguistic patterns from a large corpus of text

TATOE

free concordancer and text-analysis/text-markup tool for Windows (TATOE = Text Analysis Tool with Object Encoding). [I can't seem to get this program to work properly for me, but maybe other people will have more luck.]

Tgrep2 (for searching parsed corpora/treebanks)

a search engine for finding structures in a corpus of trees. Used for extracting data from the Penn Treebank corpora of parsed sentences. (Linux program + source code for other platforms)

Serge Sharoff's Perl scripts

scripts for creating XML concordances (monolingual or parallel multilingual) from plain text files (the text is broken into sentences and sentences into words, and every sentence and word is given a unique identifier; two files are produced: the concordance and an HTML file with anchors for sentences) and for annotating and searching/querying them.

TIGERSearch treebank query tool

a specialized search engine for syntactically annotated corpora (treebanks). Features: * linguistically motivated query language (similar to typed feature-based grammar formalisms); * sophisticated graphical user interface (TIGERGraphViewer) for browsing query results; * corpus samplers from PennTreebank, NEGRA, TIGER, DEREKO, Susanne, Christine, Penn-Helsinki Parsed Corpus of Middle English, VerbMobil; * graphical registry tool (TIGERRegistry) for easy corpus administration; * XML import of corpora (import filters); * XML and SVG-animation export of query results; sample XSLT stylesheets for the creation of other formats are included.
* available for all major platforms which support Java 1.3: Microsoft Windows, Solaris, Linux, and Mac OS X.

UAM Corpus Tool 

Annotation & corpus analysis tool using stand-off XML markup. Features: (1) Annotation of multiple texts using the same annotation schemes, of your design. (2) Annotation of each text at multiple levels (e.g., NP, Clause, Sentence, whole document) (3) Searching for instances across levels, e.g., finite-clause containing company-np, or future-clause in introduction. (4) Comparative statistics across subsets, e.g., contrasting conversational patterns used by male and female speakers. (5) All annotation is stored in stand-off XML files, which means your annotations can more easily be shared with other applications and allows for multiple overlapping analyses of the same text.

XTrans

a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings

Xlex/www 

see entry for the MTP Xlex/www below.

QDA  Miner (commercial software)

qualitative data analysis software package for coding textual data, annotating, retrieving and reviewing coded data and documents; can work in conjunction with related software (SIMSTAT and WORDSTAT) for quantitative analyses. For those who do a lot of manual coding in corpus research.

 


Word Lists, Frequency Lists

(freely downloadable) Please mail me if you have lists which you can share with others.

English - Frequency Lists ("word lists" are below)

BNC Frequency lists 

(from the companion web site to the book: Leech, Geoffrey, Paul Rayson & Andrew Wilson. (2001). Word Frequencies in Written and Spoken English: based on the British National Corpus. London: Longman.)

Frequency lists for the whole BNC (version 1), for the spoken versus written components, for the conversational (i.e. demographic) versus task-oriented (i.e. context-governed) parts of the spoken component, and for the imaginative versus informative parts of the written component. Also:  ranked frequency word lists according to parts of speech (e.g. all nouns, all conjunctions) based on the whole BNC corpus (version 1), as well as frequencies for individual part-of-speech tags (e.g. NN1, VDG) based on the BNC Sampler.
Although the frequency lists for this book were based on all 4,124 files of the original BNC version 1 corpus, the text classifications and POS tags used were the updated and more accurate ones implemented in the BNC World Edition.

** For those who want a user-friendly word list (i.e. without frequency figures) based on the entire BNC, I am making one available here (all word forms occurring at least 10 times per million words, alphabetically arranged)

BYU-BNC's Frequency Lists

select any of 70+ registers/genres, and then get a frequency listing for that genre. Just enter "*" (without quotation marks) for a general frequency listing for the selected genre, "[nn1]" for singular nouns in that genre, etc. You can also easily compare word frequency in one genre (or set of genres) against another, e.g. sermons vs. spoken, tabloids vs. broadsheets, medical vs. academic, etc.

Kilgarriff's BNC Frequency Lists

lemmatised frequency list (various formats); unlemmatised, or 'raw', frequency lists (various formats); variances of word frequencies.

Largely superseded by the Leech, Rayson & Wilson (2001) book/web pages, due to text classification & word-tagging errors in the version of the BNC corpus used for these lists. See this note.

Kučera & Francis word list

(from Kučera, Henry & Francis, W. N. (1967). Computational Analysis of Present-Day American English. Providence, RI: Brown University Press)

word+frequency lists based on the Brown corpus (not disambiguated by parts of speech) may be found at the Brandeis University Computational Memory Lab or at the Psycholinguistic database at Rutherford Appleton Laboratory.

A POS-differentiated version may be found at the ICAME word lists page.

ICAME word lists

word+frequency lists based on the part-of-speech-tagged LOB (1960s, BrE), untagged FLOB (1990s, BrE), part-of-speech-tagged Brown (1960s, AmE) and untagged Frown (1990s, AmE) corpora

Map Task Corpus Frequency List (this links to a Corpora List archived mail message, posted by Henry Thompson on 23 Nov 95)

word frequency list for task-oriented dialogue (from the HCRC Map Task Corpus, c. 150,000 tokens of spoken English). (N.B. the link given in that e-mail to the HCRC Map Task Corpus page is wrong. Use this instead.)

Wolverhampton project on 'Language, Computers and Style'

View and compare frequency lists and ngrams for various subcorpora/text genres:

Newspaper articles (from the three domains of business/commerce, science, and politics); Scientific papers (discussing methods; reviewing/evaluating); Business reports and letters to shareholders; Advertising and instructional leaflets; and Instructional texts and manuals (e.g. DIY). No access to actual texts.

English Trigram Frequencies

English trigram frequency table, based on frequency per 10,000 words of the Brown Corpus

OR, for ngrams based on the British National Corpus (BNC), try the Phrases in English (PIE) web site (also includes "character-grams")

Word Lists for Project Gutenberg texts 

(by Ronald Reck)

String frequency reports for 5400+ books (400M words) from Project Gutenberg (read this Corpora List message for details and specific URL pointers)

Various Word Lists and Freq Lists for ESL/TESOL/pedagogical purposes

a set of links collected by the Internet TESL Journal

English - Word Lists/ Stop Lists (no frequencies)

The Academic Word List (AWL)

570 word families reflecting the shared vocabulary of written academic English as used in a wide variety of disciplines (28 in total, 125K words from each) in an Academic Corpus of 3.5m words; selection was based on the principles of range, frequency and dispersion, using a specially compiled academic corpus of journal articles, book chapters, course workbooks, laboratory manuals, and course notes; compiled by Averil Coxhead as a replacement for/update to Xue & Nation's University Word List (UWL).

For Exercise-making tools based on the AWL, see Sandra Haywood's site or Tom Cobb's Compleat Lexical Tutor site.

Basic English word lists

Electronic version of Charles Kay Ogden's classic (1968) basic vocabulary list: Basic English: International Second Language. New York: Harcourt, Brace & World Inc./Orthological Institute.

Billuroglu Neufeld List (BNL)

the Billuroglu and Neufeld List of the most commonly used words in English, defining an improved critical lexical mass from the old GSL and AWL lists. Related site is here.

Lemma List for English  (by Yasumasa Someya)

40,569 words (tokens) in 14,762 lemma groups (Format: worry -> worries,worrying,worried)
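
A list in that format is trivial to load for lemmatising your own frequency counts; a Python sketch (the filename is hypothetical):

    lemma_of = {}
    with open("e_lemma.txt", encoding="utf-8") as f:
        for line in f:
            lemma, sep, forms = line.strip().partition(" -> ")
            if not sep:
                continue  # skip lines not in the "lemma -> forms" format
            lemma_of[lemma] = lemma
            for form in forms.split(","):
                lemma_of[form] = lemma

    # lemma_of["worries"] -> "worry"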

Function Words/Stop Lists for English

-       stop list for info-retrieval purposes, from Cornell

-       closed-class words/stop list by Doug Rohde. [Copy and paste from this CORPORA-list mail message]

Longman Defining Vocabulary

I am making available an expanded & slightly extended word list from the back of the Longman Dictionary of Contemporary English (1987, 2nd ed.). This represents the controlled vocabulary of 2000+ words used by the dictionary in its definitions. I have expanded all the stems (e.g. walk) to include inflected forms (walks, walked, walking) and a few uncontroversial derived forms (e.g. awkwardly, from awkward). The list is available as a Microsoft Word document, with my notes at the top of the file. If you use the list, please reference my article: Lee, David. (2001). Defining core vocabulary and tracking its distribution across spoken and written genres: Evidence of a gradience of variation from the British National Corpus. Journal of English Linguistics, 29(3), pp. 250-278.

Moby project resources

Grady Ward's free word lists & texts. Moby Hyphenator: 185,000 entries fully hyphenated; Moby Language: Word lists in five languages; Moby Part-of-Speech: 230,000 entries fully described by part(s) of speech, listed in priority order; Moby Pronunciator: 175,000 entries fully International Phonetic Alphabet coded; Moby Shakespeare, the complete unabridged works of Shakespeare; Moby Thesaurus: 30,000 root words, 2.5 million synonyms and related words; Moby Words (English): 610,000+ words and phrases.

Ogden's Basic English word list

everything to do with Charles Kay Ogden's 1930s classic Basic English.

West's (1953) General Service List (GSL)

in electronic format, as entered by Bauman & Culligan; Michael West's (in)famous (+ outdated and skewed) set of 2,000+  words selected to be of the greatest "general service" to learners of (written) English. This version ranks the words by their frequency in the Brown Corpus (1960s written American English).

See also the Extended Version of A General Service List of English Words (Excel format) by James Dickins.

Word Lists from Oxford

(or, alternatively, from CLR)

Lists for quite a number of languages, including Afrikaans, Chinese (pinyin syllables), Croatian, Czech, Danish, Dutch, Esperanto, Finnish, French, German, Hindi, Hungarian, Italian, Japanese, Latin, Norwegian, Polish, Russian, Spanish, Swahili, Turkish, Yiddish. No frequencies given.

The Internet TESL Journal's eclectic word lists/ frequency lists for TESL/TEFL/TESOL

assorted lists for pedagogical purposes (e.g. lists of prefixes, colour-related idioms, irregular verbs, common names, homonyms)

Miscellaneous Word lists

(1) Outpost9: eclectic collection of useful and not-so-useful word lists: surnames, given names, dictionary word lists, etc.

(2) Kevin's Word Lists Page: Words+inflections list, Part-of-speech database, jargon word lists, lists for spell checkers, etc.

Various Word Lists and Freq Lists for ESL/TESOL/pedagogical purposes

a set of links collected by the Internet TESL Journal, including Frank Daulton's List of High-frequency Baseword Vocab for Japanese EFL students (a nearly exhaustive list of high-frequency English words [from the BNC] that correspond to commonly-known Japanese loanwords).

Other Languages: Frequency & Word lists/ Stop lists  (if you have lists for other languages which you can share, please let me know)

See also the section below on "On-line Dictionaries, Machine-readable Lexicons & Related Resources"

** Stop Lists for various languages (e.g. Danish, Dutch, English, Finnish, French, Italian, Norwegian, German, Portuguese, Russian, Spanish)

The Snowball web page has stop lists (and stemmers) for various languages. On the web site, just click on the language of interest and look for the link to the stop list.

Alternatively, the ASPSEEK search engine comes with various stop-word lists (for French and about 15 other languages). Once you download and unpack the source code, look in etc/stopwords/

Chinese Frequency Lists & Word lists

Choose from the many links listed on this site.

French Frequency Lists & Stoplist (by Jean Véronis)

as it says. ('stoplist' = a list of very common/high-frequency words that you may want to exclude from searches for some purposes)

German Frequency Lists (samples only, but full lists freely available for non-commercial use by arrangement)

frequency lists of word forms, word-tag pairs, lemma-tag pairs, etc. for German (similar in content and style to those by Adam Kilgarriff for the BNC). Obtained using a lexicalised statistical grammar model, trained on 35 million words of newspaper data.

Other German lists: from About.com, Leipzig, Mannheim.

Morph-it! (Italian)

free lexicon of Italian inflected forms with their lemma and morphological features. 568,771 entries, 28,500 lemmas.

Portuguese (Brazilian) Frequency Lists

2000 most frequent words in written (newspapers, academic writing, business texts & fiction) and spoken (conversation, interviews, classrooms, meetings) Brazilian Portuguese

Russian Frequency List
(by Serge Sharoff)

based on a corpus of modern Russian fiction and political texts (more than 35 million words). The list includes about 33,000 words whose frequency is greater than 1 ipm (instances per million words). A shorter selection of the 5,000 most frequent words is also available. The list provides word rank, frequency (per million) and part of speech. Some analytical information about the lexical stock is provided, such as coverage of the total language use by word bands, e.g. the first 3,000 lemmas cover 76.6824% of the total number of word forms. The corpus, tools for working with it, as well as an aligned parallel English-Russian corpus, are discussed in: Sharoff, Serge. (2002). Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics. Proc. of the Language Resources and Evaluation Conference (LREC'02), May 2002, Las Palmas, Spain.

Cronfa Electroneg o Gymraeg (CEG): lexical database and frequency counts for Welsh

a word frequency analysis of 1,079,032 words of written Welsh prose, based on 500 samples of approximately 2000 words each, selected from a representative range of text types to illustrate modern (mainly post 1970) Welsh prose writing (parallel to the Kucera and Francis analysis for American English, and the LOB corpus for British English); has frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.

Word Lists of English, Spanish, Basque, and French translated to Occitan

as it says.

Word Frequency generators and Vocabulary Analysis software

For a quick-and-easy frequency listing/index of words in your own texts, try the following programs. For pedagogical software and vocabulary analysis programs, see the "Teaching and Miscellaneous Links" page. Youman's VMP (Vocabulary Management Profile) used to work, but perhaps no longer.

AntWordProfiler

A freeware word profiling program (for Windows and Macintosh OS X), similar to Paul Nation's Range program. It compares one or more target texts with vocabulary level lists (e.g. Range baseword lists of the most frequent 1000, 2000, 3000 words of English), and produces tables showing which words in the target file(s) appear in the level lists and which do not. It also generates a set of statistics about the target file(s), including number of types and tokens. AntWordProfiler can also display target files with the words in each level list color coded. These can then be edited, for example, to produce simplified texts used for classroom materials.
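
The underlying comparison is simple to sketch in Python (just the general idea, not AntWordProfiler's or Range's own code; the level lists here are placeholder sets):

    def vocab_profile(tokens, level_lists):
        # count tokens per frequency band; unmatched tokens go off-list
        counts = {name: 0 for name in level_lists}
        counts["off-list"] = 0
        for tok in tokens:
            for name, words in level_lists.items():
                if tok.lower() in words:
                    counts[name] += 1
                    break
            else:
                counts["off-list"] += 1
        return counts

    levels = {"1st 1000": {"the", "of", "day"}, "2nd 1000": {"suppose"}}
    print(vocab_profile("The day I suppose perplexity".split(), levels))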

Compleat Lexical Tutor

Tools include a concordancer, a phrase (n-gram) extractor, VocabProfile (tells you how many words in the text come from the following four frequency levels: (1) the list of the most frequent 1000 word families, (2) the second 1000, (3) the Academic Word List, and (4) words that do not appear on the other lists), a vocab-level-based cloze passage generator and a traditional nth-word cloze builder.


Tools & Resources for Transcribing, Annotating or Analysing texts (inc. speech or audio-visual)

(N.B. Visit this LDC site for a survey of annotation tools and formats/standards relevant to (speech) corpora, or see the section on standards here)

Alembic Workbench

(Unix & Windows)

a natural language engineering environment for the development of tagged corpora; a suite of tools for the analysis of a corpus, along with the Alembic system to enable the automatic acquisition of domain-specific tagging heuristics; makes it easy to annotate textual data with fully customizable tagsets; among the various methods used to expedite the tagging process is the application of machine learning to bootstrap the human annotation process; provides evaluation tools to analyze annotated data -- whether for assessing machine information extraction performance, or for measuring inter-annotator agreement for a particular corpus or task

AGTK (Annotation Graph Toolkit)

work pioneered by Steven Bird; 'annotation graphs' are a formal framework for representing linguistic annotations of time series data. Applications included in this toolkit are: MultiTrans, for transcribing multi-party conversation; TableTrans, for observational coding of audio; TreeTrans, for syntactic annotation; InterTrans, for interlinear text transcription

ATLAS (Architecture and Tools for Linguistic Analysis Systems)

an architecture targeted at facilitating the development of linguistic applications. The principal goal of ATLAS is to provide an abstraction over the diversity of linguistic annotations. The abstraction, which expands on Bird and Liberman's Annotation Graphs, is able to represent complex annotations on signals of arbitrary dimensionality.

CECIL, Speech Analysis Tools, Signalyze, and FindPhone

Four programs mentioned in the Lawler & Dry book

CLaRK 

an XML-based system for corpora development; it includes a Unicode XML editor, the XPath language for navigation in XML documents, an XSLT engine for transformation of XML documents, cascaded regular grammars, constraints over XML documents, tokenizers, a concordance tool, extract/remove and other tools. The system is implemented in Java.

CMU Pronouncing Dictionary

a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their ASCII phonemic transcriptions.

Cool Edit 2000 

Software for digitizing speech (not free)

CSLU Speech Toolkit 
from  the Center for Spoken Language Understanding (CSLU) 

a complete set of free tools for collecting and transcribing speech. Includes an interactive speech display program (Speech View) which allows the user to align transcripts with the sound files; a complete course in spectrogram reading and acoustic phonetics; a speech recognition engine; a speech synthesizer based on the Festival architecture; ASCII encoding for phonetic transcriptions.

ELAN (EUDICO Linguistic Annotator)

an annotation tool that allows you to create, edit, visualize and search annotations for video and audio data. It was developed at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands, with the aim to provide a sound technological basis for the annotation and exploitation of multimedia recordings. ELAN is specifically designed for the analysis of languages, sign languages, and gesture, but it can be used by everybody who works with media corpora, i.e., with video and/or audio data, for purposes of annotation, analysis and documentation.

 

Emu Speech Database System 

A system for managing collections of speech data which supports hierarchical labelling of utterances. Emu is freely available and supports a range of file formats.

GATE (General Architecture for Text Engineering)

is an architecture, framework and development environment for language engineering which can also be used to annotate texts. GATE is a domain-specific software architecture and development environment (SDK) that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. It supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation.

Guidelines for ToBI Labelling

ToBI (Tones and Break Indices) is a system for transcribing the intonation patterns and other aspects of the prosody of English utterances

MATE Workbench

a Java program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for arbitrary sets of hyperlinked XML encoded files.

NITE XML Toolkit

aimed at software developers, to allow them to build the more specialized displays, interfaces, and analyses that are required by end users when working with highly structured or cross-annotated XML data and multimedia data.

PRAAT

(alternative URL here)

free, comprehensive speech analysis, synthesis, and manipulation package; includes general numerical and statistical stuff, is built on a general-purpose GUI (graphical user interface) shell for handling objects, and produces publication-quality graphics. Runs on virtually all platforms (Windows, Macintosh, Unix/Linux, etc.) Mirror sites here and here.

See also: SpeCT - The Speech Corpus Toolkit for Praat

SACODEYL Transcriptor

Transcription tool that can: * manage multiple video formats: DivX, XviD, AVI, MPEG, QuickTime, RM; * manage multiple audio formats: MP3, WAV, ASF; * use multi-language support (Unicode); * import transcriptions from other formats, such as the Transana format; * support metadata information; * support transcription of spoken language: cuts, comments, truncated words, foreign words, etc.; * support timestamp linking between video/audio and text.

The output of SACODEYL Transcriptor is used by SACODEYL Annotator.

SignStream

SignStream is a database program for MacOS that facilitates the annotation and analysis of visual language data. It has been designed for study of signed languages and the gestural component of spoken languages, but may be of use for analysis of any video-based data. SignStream is not currently available for Windows or UNIX platforms, but version 3 is being ported to Java to address this issue.

SIL tools

Lots of software relevant to speech data (& field linguistics), including Speech Analyzer (recording & editing speech, pitch tracking and spectrograms).

SoundScriber (Eric Breck, University of Michigan)

free Windows program (associated with the MICASE corpus) that aids in the transcription of digitized sound files. Includes features specifically for transcription: keystrokes to control the program while working in another window (e.g. word processor, SGML editor, etc.); variable speed playback; and a feature called "walking". Walking plays a small stretch of the file several times, then advances to a new piece, overlapping slightly with the previous one (thus facilitating continuous transcription without having to manually pause or rewind). Opens any file Media Player can, including wave audio files (.WAV), Video for Windows files (.AVI), and MPEG Layer 3 (.MP3). An alternative download link is here.

TalkBank software

(Links to various tools supporting different aspects of the process of transcription and analysis)

(i) Transcriber (alternative site here): a tool for assisting the segmenting, labeling and transcribing of speech signals (labeling speech turns, topic changes and acoustic conditions). Requires prior installation of Tcl/Tk

(ii) CLAN (suite of programs aimed at child language analysis)

(iii) AGTK: Annotation Graph Toolkit (toolkit designed to allow programmers to quickly create small applications that conform with the TalkBank Annotation Graph model)

(iv) XML-based Tools (e.g. xCode: a Unicode text editor; able to validate and filter XML through an XSLT sheet and display the editable result as a flat text)

TASX-Annotator

(Bielefeld)

free, cross-platform program (Java-based, released under GNU licence) for the annotation and transcription of video (multi-channel) and audio data. Video and audio playback can be controlled by a foot switch. Different data views are provided (time-aligned partiture, word-aligned partiture, sequential text view). The system integrates an XSLT processor (Saxon), making it easy to perform on-the-fly data transformations; TASX thus functions as an interlingua. Importing an XML file is split into two steps: one simply defines two XSLT stylesheets -- the first transforms the XML format into TASX, the second transforms TASX back into the XML format.

Transana

free program (Windows & Macintosh) for the transcription and analysis of video data. It provides a way to view video, create a transcript, and link places in the transcript to frames in the video. It provides tools for identifying and organizing analytically interesting portions of videos, as well as for attaching keywords to those video clips. It also features database and file manipulation tools that facilitate the organization and storage of large collections of digitized video. Features: import and view MPEG-1 video and MP3 and WAV format audio files; automatically highlights the relevant portion of the transcript while the video plays; a multi-user version, Transana-MU, allows users to share their data and analyses with other research team members via a LAN.

Transcriber
(or LDC mirror here)

free program. A tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical user interface for segmenting long duration speech recordings, transcribing them, and labeling speech turns, topic changes and acoustic conditions. It is more specifically designed for the annotation of broadcast news recordings, for creating corpora used in the development of automatic broadcast news transcription systems, but its features might be found useful in other areas of speech research.

UCSB Discourse Transcription Software

VoiceWalker (software for stepping through recordings, for easier transcription) and SoundWriter (VoiceWalker + facility for aligning transcripts with sound files via SMPTE time codes). Free downloads.

VOCALE

A tool for the automatic annotation of vocalic and consonantal intervals, based on the probabilistic measurement of relative entropy and a number of phonetic measurements. Vocale takes a wav file as input, then automatically calls up some Praat functions such as creating a spectrogram and gives a Praat label file as output. This can then be used for the calculation of the speech rhythm. The entire programme is open source and can be downloaded.

wavesurfer

free, Open Source tool for sound visualization and manipulation. Runs on virtually all platforms (Windows, Macintosh, Unix/Linux, etc.).

Winpitch

Windows software. Speech analysis and annotation tool, with fundamental frequency and spectrographic display. Prosodic morphing capability through re-synthesis of natural speech.

XED

free XML-text editor for fast keyboarding of well-formed XML.

XML Spy

tool for XML application developers, schema designers, and XSL style sheet creators; XML Schema driven document and content editing for both developers and end-users. See XCES (XML Corpus Encoding Standard) for XML-encoded corpora here

* For extensive speech-technology-related links and technical stuff, visit the Speech at CMU Web Page

On-line Dictionaries, Machine-readable Lexicons & related resources

* See also the Language Archives index at the LDC and my listing of on-line searchable dictionaries on my Teaching & Misc Links page

Roget's Thesaurus as an Electronic Lexical Knowledge Base

Roget's Thesaurus (1911 edition) in Java, designed for Natural Language Processing; includes four examples of NLP applications: (1) detecting lexical chains in text, (2) determining semantic distance between words and phrases, (3) clustering words based on their meaning, and (4) solving a word quiz.

Yourdictionary.com

comprehensive listing of on-line dictionaries. Hundreds of dictionaries for more than 260 languages

ACL SIGLEX Resource Links

A bookmarks page by the Special Interest Group on the Lexicon of the Association for Computational Linguistics. Pretty good listing of lexicons and electronic dictionaries.

ACL NLP/CL Universe List of Dictionaries

links (many dead ones!) to on-line dictionaries, including parallel/multilingual ones

American English Spoken Lexicon (LDC)

a collection of pronunciations captured in individual audio files for more than 50,000 of the most common words in English (words were extracted from newswire and telephone conversation)

Cambridge Dictionary Data (Commercial)

SGML-encoded text files: The text of the Cambridge International Dictionary of English CD-ROM, English Pronouncing Dictionary, the Cambridge Dictionary of American English, the Cambridge International Dictionary of Idioms, the Cambridge International Dictionary of Phrasal Verbs and the Word Routes/Selector series of parallel bilingual mini-thesauri in French, Spanish, Portuguese, Italian, Greek and Catalan, and sound files from the CIDE CD-ROM.

CELEX Database

lexical data stored in three separate databases for Dutch, English, and German. The Dutch database, version N3.1, was released in March 1990 and contains information on 381,292 present-day Dutch wordforms, corresponding to 124,136 lemmata. The latest release of the English database (E2.5), completed in June 1993, contains 52,446 lemmata representing 160,594 wordforms. The German database (D2.5), made accessible in February 1995, currently holds 51,728 lemmata with 365,530 corresponding wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4 m. Dutch INL and the 17.9 m. English Collins/COBUILD text corpora. Furthermore, information has been collected on syntactic and semantic subcategorisations for Dutch.

CLR Catalog 

Consortium for Lexical Research -- links to tools and resources.

Early Modern English Dictionaries Database (EMEDD)

an on-line searchable database of entries from sixteen early dictionaries, dating from between 1530 and 1657. The sources include bilingual lexicons as well as specialist and 'hard-word' dictionaries. Combining full texts of early dictionaries written over 160 years by lexicographers with varying purposes, the EMEDD serves as a reference work for English of the Renaissance period. It is designed to make accessible the English-language content of bilingual (English and other languages) and monolingual (English-only) dictionaries, glossaries, grammars, and encyclopedias published in England from 1500 to 1660.

FrameNet

The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence (the BNC). The project has produced two types of data, a collection of approximately 50,000 hand-annotated sentences and a database containing information about frames, frame elements, lemmas and lexical entries. All of this data is distributed as ASCII files with markup that is compatible with both SGML and XML, with accompanying DTDs.

FreeDictionary.com

English, Medical, Legal, and Computer Dictionaries, Thesaurus, Encyclopaedia, Literature Reference Library, and Search Engine all in one.

Hebrew lexicons

Hebrew WordNet aligned with the English WordNet 1.6 (GNU General Public License).

Morph-it! (Italian)

free lexicon of Italian inflected forms with their lemma and morphological features. 568,771 entries, 28,500 lemmas.

Svenska ord (LEXIN)

(from Språkdata and Språkbanken (The Bank of Swedish), Department of Swedish, University of Göteborg)

A Swedish dictionary containing approx. 20,000 lexical units, with information on pronunciation, part of speech, inflection, definition, valency, and usage examples. Available in two formats: (1) web version (access only for Swedish universities): http://spraakbanken.gu.se/lb/lexin/   (2) XML version for language technology purposes: ftp://ftp.spraakbanken.gu.se/pub/reskit/LEXIN.zip

(GNU) Collaborative International Dictionary of English (G)CIDE

an electronic dictionary-in-the-making derived from the Webster's Revised Unabridged Dictionary (1913), with some words supplemented with definitions from WordNet. Caveats: it still contains typing errors and is being proof-read and supplemented by volunteers from around the world; and definitions are >100 yrs old.

WordNet  
(and related databases)

a lexical database for English; an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
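
As a quick way to get a feel for the synset structure, WordNet can also be queried programmatically. Below is a minimal Python sketch using NLTK's WordNet reader (one of several third-party interfaces, not WordNet's own software; it assumes NLTK is installed and its WordNet data has been downloaded):

    # pip install nltk ; then: python -m nltk.downloader wordnet
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets("bank"):
        # Each synset represents one underlying lexical concept.
        print(syn.name(), "-", syn.definition())
        print("  synonyms: ", [l.name() for l in syn.lemmas()])
        print("  hypernyms:", [h.name() for h in syn.hypernyms()])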

See also EuroWordNet: a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian), MultiWordNet (Italian WordNet is strictly aligned with Princeton WordNet 1.6.), and BalkaNet (for six Balkan languages: Greek, Turkish, Bulgarian, Romanian, Czech and Serbian).

An alphabetic version of WordNet 2.0 is available at http://www.clres.com/WordNet.html. There are 143,991 entries in this dictionary, with a sense for each occurrence of an entry in a distinct synset. Virtually all information in WordNet has been captured, including the new domain relations, verb groups, and derivational forms.

worldlanguage.com (commercial)

an Internet store specialising in language products for practically all the world's languages.


NLP/Computational Linguistics Resources (incl. taggers, parsers, SGML/XML stuff)

Most of these descriptions are taken from the respective web sites and do not represent my views. For an introduction to parsing methods and types of parser, click here.

ACL NLP-CL Universe (Association for Computational Linguistics)

bookmark site; pointers to more than 1,500 computational linguistics resources on the Web

Apple Pie Parser

a probabilistic syntactic parser (for UNIX and Windows) developed by Satoshi Sekine at NYU.

Bow 

a toolkit for statistical language modeling, text retrieval, classification and clustering -- a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.

CLaRK 

an XML-based system for corpus development. It includes a Unicode XML editor, an XPath implementation for navigating XML documents, an XSLT engine for transforming XML documents, cascaded regular grammars, constraints over XML documents, tokenizers, a concordance tool, Extract, Remove and other tools. The system is implemented in Java.

N-gram Statistics Package (NSP)

an easy-to-use suite of Perl tools for counting and analyzing word n-grams in text. It provides a number of standard tests of association that can be used to identify word n-grams in large corpora, and also allows users to easily implement other tests without knowing very much about Perl at all. Supports user-defined tokenization using regular expressions, stop lists, and an extensive collection of test/sample scripts.
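
To make the underlying computation concrete, here is a minimal Python sketch (not NSP's own Perl code) that counts bigrams and ranks them with the log-likelihood (G-squared) association measure; the corpus filename is a placeholder:

    import math, re
    from collections import Counter

    def log_likelihood(n11, n1p, np1, npp):
        """G-squared score for a bigram from its 2x2 contingency table
        (observed counts vs. counts expected under independence)."""
        def term(obs, exp):
            return obs * math.log(obs / exp) if obs > 0 and exp > 0 else 0.0
        n12, n21 = n1p - n11, np1 - n11
        n22 = npp - n1p - np1 + n11
        e11 = n1p * np1 / npp
        e12 = n1p * (npp - np1) / npp
        e21 = (npp - n1p) * np1 / npp
        e22 = (npp - n1p) * (npp - np1) / npp
        return 2 * (term(n11, e11) + term(n12, e12)
                    + term(n21, e21) + term(n22, e22))

    text = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder file
    tokens = re.findall(r"[a-z']+", text)
    unigrams = Counter(tokens)                 # marginals (an approximation)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())

    ranked = sorted(((log_likelihood(f, unigrams[w1], unigrams[w2], total), w1, w2)
                     for (w1, w2), f in bigrams.items() if f >= 5), reverse=True)
    for g2, w1, w2 in ranked[:20]:
        print(f"{g2:10.2f}  {w1} {w2}")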

Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit

a suite of UNIX software tools to facilitate the construction and testing of statistical language models. The tools process general textual data into:

  • word frequency lists and vocabularies
  • word bigram and trigram counts
  • vocabulary-specific word bigram and trigram counts
  • bigram- and trigram-related statistics
  • various backoff bigram and trigram language models

Dan Melamed's NLP tools

An assortment of tools, including XTAG morphological-analyzer ('morpholyzer') post-processors for English stemming, 170 general text-processing tools (mostly in Perl 5) and 75 text-statistics tools (mostly in Perl 5)

Downloads at CSTR

Centre for Speech Technology Research, University of Edinburgh

EdinburghTools

mainly XML/SGML and computational resources, including LT POS, a Part-of-Speech (POS) tagger. See also the Edinburgh XML Workshop

EngCG Parser

Constraint Grammar Parser of English; performs morphosyntactic analysis (tagging) of running English text. The parser employs a morphological ('part-of-speech') disambiguator that makes 93-97% of all running-text words in Written Standard English unambiguous, while 99.7% of all words retain the correct analysis. The corresponding figures for the shallow syntactic parser are 75-85% and 97-98%. Available from Lingsoft, Inc.

Euralex 2000 Tutorial

Lots of useful links.

Freeling
(TALP Research Center, Universitat Politècnica de Catalunya).

an open-source C++ library providing language analysis services (such as morphological analysis, date recognition, POS tagging, etc.). Provides tokenizing, sentence splitting, morphological analysis, NE detection, date/number/currency recognition and PoS tagging. Future versions will improve performance of existing functionalities, as well as incorporate new features, such as chunking, NE classification, document classification, etc.

Functional Grammar Workbench (by Juan C. Ruiz-Antón)

language-generation/grammar-writing software which allows the user to write grammars for different languages, using rules of the type devised in Simon Dik's Functional Grammar (expression rules, morphological templates and morphological rules), and to test these grammars on predicate-argument formulas introduced by the user. The abstract semantic structure of a sentence is represented as a logico-semantic predication; the surface form for a particular language is then produced from that underlying predication by applying a set of rules (expression rules, placement rules and morphological rules). [Windows program] Not, strictly speaking, directly relevant to corpus-based linguistics, but something nice to have anyway.

Infomap NLP Semantic Learning Software

uses a variant of Latent Semantic Analysis (LSA) on free-text corpora to learn vectors representing the meanings of words in a vector space known as WordSpace. It indexes the documents in the corpora it processes, and can perform information retrieval and word-word semantic similarity computations using the resulting model. Performs two basic functions: building models by learning them from a free-text corpus using learning parameters specified by the user, and searching an existing model to find the words or documents that best match a query according to that model. After a model has been built, it can also be installed to make searching it more convenient, and to allow other users to search it conveniently.
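
Infomap itself is a standalone package; the LSA idea behind it can be sketched generically in Python with scikit-learn (an assumption-laden toy illustration, not Infomap's API): build a term-document matrix, reduce it with a truncated SVD, and compare the resulting word vectors by cosine similarity.

    # A generic LSA toy (not Infomap's own code); assumes numpy and scikit-learn.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the bank approved the loan",
            "the river bank was muddy",
            "the loan had high interest"]        # placeholder corpus

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                  # document-term matrix
    svd = TruncatedSVD(n_components=2, random_state=0)
    svd.fit(X)
    word_vecs = svd.components_.T                # one latent vector per word

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    vocab = list(vec.get_feature_names_out())
    i, j = vocab.index("loan"), vocab.index("interest")
    print("similarity(loan, interest) =", round(cosine(word_vecs[i], word_vecs[j]), 3))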

Linguistica

a Windows program which can be used to explore the unsupervised learning of natural language, with a primary focus on morphology. Given an input corpus, it figures out where the morpheme breaks are in the words, and what the stems, suffixes, and so forth are, based on no knowledge whatsoever of the language from which the words are drawn.

Link Grammar Parser

a free syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. Works on a variety of platforms, including Windows.

LT CHUNK 

a syntactic chunk parser from the Language Technology Group at Edinburgh

GATE (General Architecture for Text Engineering)

is an architecture, framework and development environment for language engineering which can also be used to annotate texts. GATE is a domain-specific software architecture and development environment (SDK) that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. It supports the full lifecycle of language processing components, from corpus collection and annotation through to system evaluation.

Minipar

a free broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. On a Pentium II 300 with 128MB memory, it parses about 300 words per second.

Morphix-NLP (CD-ROM)

a Live CD Linux distribution with a rich collection of Natural Language Processing (NLP) applications, all on a single CD. Includes: Tokenizers (Qtoken, MXTERMINATOR, Chinese word segmenters); POS Taggers (Brill's TBL Tagger, MXPOST, fnTBL tagger, QTag, Tree-Tagger, Memory-based Tagger); Parsers (Collins' Parser, Link Parser, LoPar); Language Modeling Tools (CMU SLM toolkit, Trigger Toolkit, Ngram Statistics Package); Speech Software (Festival Speech Synthesis); System Development Tools (SVM-light, Maxent, SNoW, TiMBL, fnTBL); Other software (WordNet Browser 2.0, Word Concordance program (antconc), unaccent, and others).

Multext  tools

Multext is developing a series of tools for accessing and manipulating corpora, including corpora encoded in SGML, and for accomplishing a series of corpus annotation tasks, including token and sentence boundary recognition, morphosyntactic tagging, parallel text alignment, and prosody markup. Annotation results may also be generated in SGML format. Upon completion, all tools will be publicly available for non-commercial, non-military use.

Natural Language Software Registry

gives a concise summary of the capabilities and sources of a large amount of natural language processing (NLP) software available to the NLP community.

NPtool

a fast and accurate system for extracting noun phrases from English texts e.g. for the purposes of information retrieval, translation unit discovery and corpus studies

Paai's Text Utilities

collection of programs and Unix scripts for doing things with text files (lists, bigrams, various statistical measures to do with Information Retrieval).

QToken

free Java-based tokeniser ("a piece of software that splits a text into its component elements (tokens). These are typically individual words, but also punctuation marks and other symbols which are not normally considered to be words.... This is usually done by inserting separators, either blank spaces or linebreaks, so that subsequent programs (like a parts-of-speech tagger) can easily read in the tokens and process them further")
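
The tokenising step itself is easy to illustrate (a toy Python sketch of the general technique, not QToken's own code):

    import re

    def tokenize(text):
        """Split text into word tokens plus punctuation and other symbols,
        one token per list element (a toy illustration only)."""
        # Words (allowing internal apostrophes/hyphens), or any single
        # non-space, non-word symbol.
        return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

    print(tokenize("Don't tag me, bro -- it's 9.30!"))
    # ["Don't", 'tag', 'me', ',', 'bro', '-', '-', "it's", '9', '.', '30', '!']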

RASP (Robust Accurate Statistical Parsing)

Part-of-speech tagging and parsing; XML input and output; free for non-commercial use.

Recode

converts files between character sets and usages. It recognises or produces more than 300 different character sets and transliterates files between almost any pair. When exact transliteration is not possible, it gets rid of offending characters or falls back on approximations.
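
The core job (minus recode's transliteration fallbacks) is easily scripted in most languages; a minimal Python sketch with placeholder filenames:

    # Convert a file from Latin-1 to UTF-8; a minimal stand-in for what
    # recode does, without its transliteration machinery.
    with open("input_latin1.txt", encoding="latin-1") as src:
        text = src.read()
    with open("output_utf8.txt", "w", encoding="utf-8") as dst:
        dst.write(text)

    # For lossy targets, the errors= argument is a crude analogue of
    # recode's fallback behaviour (replace characters that cannot be encoded):
    ascii_bytes = text.encode("ascii", errors="replace")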

SgmlQL

Full query and programming language for SGML documents. Command-line tools, no GUI available. Source code available (tested on Sparc/Solaris and i386/Linux).

Sgrep

sgrep (structured grep) is a tool for searching and indexing text, SGML, XML and HTML files and filtering text streams using structural criteria

Shalmaneser

(SHALlow seMANtic parSER)

a system for automatic sense assignment and semantic role labeling; comes with pre-trained FrameNet classifiers for English and German. Performs word sense disambiguation for predicates plus semantic role labeling; input is plain text, with syntactic processing integrated. The available classifiers are trained on FrameNet data for English and German (the system is also applicable to other frameworks), and system output can be viewed graphically in the SALTO viewer. The system is realized as a toolchain of independent modules communicating through a common XML format, and is hence extensible by further modules; interfaces exist for the addition/exchange of parsers, learners and features.

SHARES (System of Hypermatrix Analysis, Retrieval, Evaluation and Summarisation)

an intertextual mechanism for the identification and ranking of documents in terms of their relatedness to one or more exemplar texts. The SHARES approach is novel in taking the degree of Lexical Cohesion (Hoey, 1991) between texts as the primary criterion for document similarity. A hypermatrix structure has been created, which identifies links between repeated words, and bonds between two closely linked sentences, in two texts. According to our hypothesis, links and bonds will be strong between texts which are similar in content, and weak or non-existent between dissimilar texts.

SIGLEX (Special Interest Group on the Lexicon)

 an umbrella for a variety of research interests ranging from lexicography and the use of online dictionaries to computational lexical semantics. Part of ACL. Lexical resources here or here

 SPARSE (Student PARSing Environment) by Michael Covington

intended audience is syntax or NLP students unfamiliar with Prolog (the language in which SPARSE II is written)

 Speech at CMU Web Page

extensive speech-technology-related links and technical stuff

SRI Language Modeling Toolkit

 a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation

Statistical natural language processing and corpus-based computational linguistics

Chris Manning's annotated list of resources; good collection of bookmarks for tools, corpora, etc.

Software for Systemic-Functional Linguistics

a set of tools, including sentence generators, based on Systemic-Functional Grammar.

TASX-environment (Time Aligned Signal data eXchange)

a set of tools forming an XML-based environment which enables scientists to set up multimodal corpora. The technical basis of TASX is an XML-based annotation format. TASX is based on the idea that functionality already available in other speech-processing software need not be re-implemented: established speech software such as Praat or ESPS/waves+ does not need to be duplicated. The TASX-environment therefore focuses only on the development of transcoding filters from and into various formats. These include: Praat/freq, Praat/label, ESPS/waves+, ESPS/F0-analysis, Transcriber, annotation graphs stored in XML, SyncWriter and basic text formats. In addition, filters for data import and export of the Exmaralda system are available. Most of these components are implemented in Java, transformations are defined in XSLT, and a small number of additional tools are written in Perl (mainly to transform non-XML data).

Ted Pedersen's software page

includes N-gram Statistics Package (NSP), perl scripts for identifying and statistically testing/ranking n-grams (recurring phrases/collocations) in texts. Plus Senseval and WordNet-related packages.

TextTiling by Marti Hearst

(Java implementation by Freddy Choi is here; Perl version by David James is here)

"TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics; a method for partitioning full-length text documents into coherent multi-paragraph units that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well. The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles."

TigerSearch (Treebank search tool)

specialized search engine for syntactically annotated corpora (treebanks), developed for the Tiger Project (German treebank), but in theory can be used on other treebanks. Query language very similar to that for the IMS CorpusWorkBench/Xkwic/CQP. (Windows, Linux, Solaris and Mac OS X)

Unicode web site

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode enables a single software product or a single web site to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

Verbmobil

a mobile translation system for the translation of spontaneous speech (German, English, and Japanese)

XML Corpus Encoding Standard (XCES)

schemas for XML-encoded corpora, covering annotations, aligned data, etc.

XML Spy

tool for XML application developers, schema designers, and XSL style sheet creators; XML Schema driven document and content editing for both developers and end-users. See also XCES (XML Corpus Encoding Standard) for XML-encoded corpora (above).

* Want free XML tools (editors, parsers, browsers, etc.)? Try Free XML tools, maintained by Lars Marius Garshol


Taggers (and tools for other types of annotation---for various languages; mostly free)

Alembic Workbench

(Unix & Windows)

a natural language engineering environment for the development of tagged corpora; a suite of tools for the analysis of a corpus, along with the Alembic system to enable the automatic acquisition of domain-specific tagging heuristics; makes it easy to annotate textual data with fully customizable tagsets; among the various methods used to expedite the tagging process is the application of machine learning to bootstrap the human annotation process; provides evaluation tools to analyze annotated data -- whether for assessing machine information-extraction performance, or for measuring inter-annotator agreement for a particular corpus or task

Arabic language tools

     a set of Arabic processing tools utilizing the Yamcha SVM tools to tokenize, POS-tag and base-phrase-chunk Arabic text (for Linux); can be found on Mona Talat Diab's page here.

     Xerox Arabic Morphological Analyser

     Buckwalter Morphological Analyser

     Sebawai and Al-Stem (for Arabic) -- an Arabic Morphological Analyzer and light Arabic stemmer

AMALGAM Tagger by email

(For English)

Free e-mail tagging service. You have a choice among several tagsets (e.g. Brown, LOB, LLC, SEC, POW, ICE). Emulates several taggers and their tagsets. The program is effectively a wrapper for Eric Brill's rule-based tagger, retrained at Leeds with 8 alternative tagging schemes. The tagger works by reading in the lexicon, bigram lists and rules from external files.

AUTASYS

(For English)

a menu-driven automatic tagging and lemmatising system that analyses English texts at word-class level with the Lancaster-Oslo-Bergen (LOB) tagset, the International Corpus of English (ICE) tagset, and the skeleton tagset (SKELETON), which is the set of base tags from ICE without features. The tagged text can be subsequently lemmatised (reduced to base forms).

Birmingham's E-mail Tagging Service

Free e-mail tagging service for short texts. Send text as mail to: tagger@clg.bham.ac.uk. Tagset used (similar to the Brown/LOB/Penn set) is listed here.

Brill Tagger

(trainable for any language)

One of the earliest free taggers. Windows versions here or here. On-line/web implementation for German available from Zurich site here.

ChaSen

(For Japanese)

a free Japanese Morphological analyser/POS-tagger from the Nara Institute of Science and Technology (NAIST)

CLAWS

(For English)

'Constituent Likelihood Automatic Word-tagging System', developed at UCREL, Lancaster University. Not free, but has a web front end (demo) that allows up to 300 words to be tagged for free

dat ('dialogue annotation tool' from the University of Rochester)

a free tool for discourse-level annotation in the DAMSL format (requires Perl version 5.002 or higher and the Perl/Tk package)

fnTBL

(trainable for any language; UNIX and Windows(Cygwin) platforms)

- free, public domain software designed for large, dynamic classification tasks, such as part-of-speech tagging, base noun phrase chunking or word sense disambiguation, but can be used to perform any classification task with symbolic features. fnTBL improves the running time dramatically compared with the original TBL algorithm proposed by Eric Brill, obtaining a speed-up of up to 2 orders of magnitude, while maintaining the same performance.

- basic NLP tasks for English (part-of-speech tagging, base noun phrase and text chunking) are already trained and are part of the distribution; others (e.g. Swedish part-of-speech) can be downloaded from the web site.
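
To make the TBL idea concrete, here is a deliberately tiny conceptual sketch in Python (an illustration of Brill-style learning with toy data, not fnTBL's code): start from a most-frequent-tag baseline, then greedily learn rules of the form "change tag X to Y when the previous tag is Z" for as long as each new rule yields a net reduction in errors on the hand-tagged training data.

    from collections import Counter

    # Toy hand-tagged training data: lists of (word, gold_tag) pairs.
    train = [[("the", "DET"), ("can", "N"), ("rusts", "V")],
             [("you", "PRO"), ("can", "MOD"), ("run", "V")],
             [("the", "DET"), ("run", "N"), ("ended", "V")]]

    # Step 1: baseline tagger = most frequent tag for each known word.
    word_tags = {}
    for sent in train:
        for w, t in sent:
            word_tags.setdefault(w, Counter())[t] += 1
    best_tag = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}

    def baseline(sent):
        return [best_tag.get(w, "N") for w, _ in sent]   # unknown words -> N

    def errors(state):
        return sum(guess != gold for sent, guesses in state
                   for (_, gold), guess in zip(sent, guesses))

    def apply_rule(guesses, rule):
        x, y, z = rule                  # change x -> y when previous tag is z
        out = list(guesses)
        for i in range(1, len(out)):
            if out[i] == x and out[i - 1] == z:
                out[i] = y
        return out

    # Step 2: greedily learn whichever rule fixes the most remaining errors.
    state = [(s, baseline(s)) for s in train]
    rules = []
    while True:
        cands = Counter()
        for sent, guesses in state:
            for i in range(1, len(sent)):
                if guesses[i] != sent[i][1]:
                    cands[(guesses[i], sent[i][1], guesses[i - 1])] += 1
        if not cands:
            break
        rule = cands.most_common(1)[0][0]
        new_state = [(s, apply_rule(g, rule)) for s, g in state]
        if errors(new_state) >= errors(state):           # require a net gain
            break
        rules.append(rule)
        state = new_state

    print("learned rules (X -> Y after Z):", rules)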

ICTCLAS (POS tagger)

and ICTPROP (parser)

(for Chinese)

Chinese Lexical Analysis System -- a Chinese word segmenter and POS tagger developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. An open-source version of ICTCLAS is called freeICTCLAS (source code and Linux port at http://www.nlp.org.cn).

ICTPROP is a probabilistic Chinese parser, trained using Penn Chinese Treebank (ver1.0); precision & recall both 77%. 

JUMAN

(for Japanese)

User-Extensible Morphological Analyzer for Japanese.

LEMMA3

(for German)

wordclass tagger and lemmatizer for unrestricted German texts. To obtain a free copy of the system, please send a request to Dr. Gerd Willée, IKP, University of Bonn, Germany (willee "AT" uni-bonn.de)

LT POS

(for English)

LT POS is a part-of-speech tagger which can handle plain ASCII text and SGML/XML marked-up text. LT POS incorporates a tokeniser which will determine sentence and word boundaries. The LT POS tagger uses a Hidden Markov Model disambiguation strategy. It achieves 95 to 97% accuracy. Indicates Noun Groups and Verb Groups. Has a Demo here.

Machinese Phrase Tagger/ Machinese Syntax (Connexor)

(For English, French, German, Italian, Spanish, Swedish, Finnish, soon also Dutch)

Commercial product based on FDG (functional dependency grammar). The Machinese Syntax parser enriches text (plain text, xml, sgml, html) with functional dependencies that show sentence-level relations and functions between words and linguistic structures. Machinese Phrase Tagger is for light morphosyntactic markup (base forms, morphology and phrasal tags).

NOTE: Company was formerly called "Conexor" (with one "n" instead of two) and the base product was initially called EngCG-2 Tagger. That evolved and became embedded in the English version of FDG Lite and the full FDG. (FDG Lite = EngCG-2 + shallow phrasal tags (starting with "&"); the full FDG also produced functional tags and functional dependencies between words.) EngCG-2 was an extended version of the original ENGCG tagger, which assigned morphological and part-of-speech tags to words in English text. It was based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland.

MBT

(trainable for any language)

a free(?) memory-based part-of-speech tagger-generator and tagger. Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL, version 4.3.1. The MBT software package makes use of TiMBL to implement a Part of Speech (POS) tagger-generator. The software consists of two executables: Mbtg to generate a tagger, and Mbt to use a generated tagger on text data. The package contains the code (C++), the Reference Guide, and some demo data. MBT has been applied to Dutch, English, Spanish, Swedish, and German
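
The memory-based idea in miniature (a conceptual Python sketch only; MBT/TiMBL use much richer features, feature weighting and efficient tree-based retrieval): store each training word with its context as a feature vector, then tag new words by the majority tag among the most similar stored instances.

    from collections import Counter

    train = [[("the", "DET"), ("can", "N"), ("rusts", "V")],
             [("you", "PRO"), ("can", "MOD"), ("run", "V")]]

    memory = []          # ((prev_word, word, next_word), tag) instances
    for sent in train:
        words = [w for w, _ in sent]
        for i, (w, t) in enumerate(sent):
            prev_w = words[i - 1] if i > 0 else "<s>"
            next_w = words[i + 1] if i + 1 < len(words) else "</s>"
            memory.append(((prev_w, w, next_w), t))

    def tag(prev_w, w, next_w, k=3):
        # Similarity = number of matching features (a crude overlap metric).
        def sim(inst):
            return sum(a == b for a, b in zip(inst[0], (prev_w, w, next_w)))
        nearest = sorted(memory, key=sim, reverse=True)[:k]
        return Counter(t for _, t in nearest).most_common(1)[0][0]

    print(tag("you", "can", "run"))   # the nearest neighbours vote: MOD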

MORPHY (for German)

Free tool for German tagging and morphological analysis.

Morfette

a tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas & morphological labels, & optionally a lexicon, Morfette learns how to morphologically analyse new sentences, assigning morphological tags & lemmas to words

MTP (Münster Tagging Project) and Xlex/www (for any language)

Xlex is a suite of tools (mostly Unix command-line tools written in Perl) for linguistic data processing, with a web-based, graphical front-end, Xlex/www. Free licence for non-commercial purposes. Xlex/www includes: tokenizer, segmenter, POS tagger, index tool, concordance tools (regexp) and collocation tools. The Xlex suite is easily portable to any platform with Perl and a web server. Any browser with frames, CSS, and JavaScript capability can be used as an Xlex/www client. The tools are written in Perl (except the POS tagger, implemented in C++) and are normally started from a command-line interface, intended for use as filters in Unix-style piped commands. Currently trained for German and English.

MMAX /MMAX2 (or the SourceForge page here)

(Multi-Modal Annotation in XML)

annotation tools that allow stand-off annotation, an arbitrary number of levels of annotation, etc.

MXPOST

(MaXimum Entropy POS-Tagger)

Downloadable Java Version (compatible with JDK1.3). Also: MXTERMINATOR (Sentence Boundary Detector). 

nb

(Nota Bene)

an SGML-based discourse annotation tool written by Giovanni Flammia in Tcl7.0/Tk4.0 (runs under Windows with Tcl/Tk interpreter)

Persian POS tagger

On-line tagger for Persian (input your own text) based on the Peykare corpus tagset.

PC-KIMMO (for any language)

designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure, in which a word is represented as a correspondence between its lexical-level form and its surface-level form. PC-KIMMO includes descriptions for English, Finnish, Japanese, Hebrew, Kasem, Tagalog, and Turkish. Several related utilities:

  • KGEN: a rule compiler for PC-KIMMO, written by Nathan Miles of Ohio State University.
  • KTEXT: a text processor that uses the PC-KIMMO parser to produce a morphological parse of each word in the text.
  • Englex: a 20,000-entry morphological parsing lexicon of English intended for use with PC-KIMMO and/or KTEXT.

Pizza Chef

The TEI Guidelines define several hundred elements and associated attributes, which can be combined to make many different DTDs, suitable for many different purposes, either simple or complex. With the aid of the Pizza Chef (free), you can build a DTD that contains just the elements you want, suitable for use with any XML processing system.

The Perl-script version by Sebastian Rahtz, maketeidtd, is available here

POSTAG (for Korean)

Morphological Analyzer / POS tagger for Korean, with generalized unknown morpheme handler.

Qtag

(trainable for any language)

a free, portable, stochastic, language-independent word-class/POS tagger implemented in Java (can be used under any operating system). Qtag is language-independent, but there is currently only an English version available; to use Qtag with other languages, you will need to create your own resource file (instructions given).

Stanford POS tagger

the Java-based Stanford Log-linear Part-Of-Speech Tagger

TATOE

free text-analysis/text-markup tool and concordancer for Windows (TATOE = Text Analysis Tool with Object Encoding)

TnT tagger

(by Thorsten Brants; TNT=Trigrams'n'Tags)

statistical part-of-speech tagger that is optimized for training on a large variety of tagged corpora in different languages and virtually any tagset, and incorporates methods of smoothing and of handling unknown words.  Free for non-commercial use.
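
TnT's core is a second-order (trigram) hidden Markov model decoded with Viterbi search; the Python sketch below shows the idea in miniature with a first-order (bigram) model and no smoothing or unknown-word handling (a simplification for illustration, not TnT's actual code).

    import math
    from collections import Counter

    train = [[("the", "DET"), ("dog", "N"), ("barks", "V")],
             [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
             [("dogs", "N"), ("bark", "V")]]

    trans, emit, tags = Counter(), Counter(), Counter()
    for sent in train:
        prev = "<s>"
        for w, t in sent:
            trans[(prev, t)] += 1     # tag-bigram counts
            emit[(t, w)] += 1         # tag-word emission counts
            tags[t] += 1
            prev = t
    tags["<s>"] = len(train)

    def logp(count, total):
        # Unsmoothed maximum-likelihood estimate (TnT smooths these).
        return math.log(count / total) if count else float("-inf")

    def viterbi(words):
        tagset = [t for t in tags if t != "<s>"]
        best = {"<s>": (0.0, [])}     # tag -> (log prob, best path so far)
        for w in words:
            new = {}
            for t in tagset:
                e = logp(emit[(t, w)], tags[t])
                score, prev_t = max((best[p][0] + logp(trans[(p, t)], tags[p]) + e, p)
                                    for p in best)
                new[t] = (score, best[prev_t][1] + [t])
            best = new
        return max(best.values())[1]

    print(viterbi(["the", "dog", "sleeps"]))   # -> ['DET', 'N', 'V']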

TOSCA-ICLE Tagger

(for English)

an MS-DOS tagger & lemmatiser developed originally for the ICLE and ICE-GB projects. Has 17 major word classes, plus features for subclasses and additional semantic, syntactic and morphological information (the total number of different tags is 220).

[The older (and not as refined) TOSCA-LOB tagger is an MS-DOS program which produces output in the LOB (London/Oslo/Bergen) Corpus format. For more info, see here.]

Tree Tagger (Stuttgart)

(trainable for any language)

a language-independent tagger and lemmatiser developed at Stuttgart. Free for research, education and evaluation (for the Sun-Solaris, Linux and MacOS versions; for the Windows version, you'll need to contact the owners). Parameter files for tagging English, German, Italian and French are available.

Xerox Tagger (trainable for any language)

implemented in Common Lisp and tested on UNIX and Macintosh. Source code available from the ftp site.

Web-Based TAGGER

a Web interface to the TAGGER program. Enter some English text in the input area and click the "OK" button. Parsed text is returned either in XML format or in an easy-to-read marked form.

Wmatrix

a web-based environment which allows access to some of UCREL's corpus annotation and retrieval tools. All processing is done on the remote web server, so users gain access from any platform that provides a browser. Tools included in Wmatrix are CLAWS (part-of-speech tagger), USAS (semantic field tagger) and a lemmatiser. Wmatrix also provides production of frequency lists and statistical comparison of those lists. Wmatrix/Matrix is a new kind of method and tool for advancing the statistical analysis of electronic corpora. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of the UK 2001 general election manifestos of the Labour and Liberal Democrat parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English-language data.

Free XML Tools & Software 

Lars Marius Garshol's index of free XML tools

  • Lots more taggers (for various languages) listed here (Stanford site), or consult the Natural Language Software Registry (NLSR, hosted at DFKI)
  • For a brief, readable introduction to automatic POS-tagging, see here  (Linda Van Guilder's overview of tagging systems)

Format conversion Tools

Replace Text (formerly called BK ReplaceEm)

a free text search-and-replace program that operates in batch mode across multiple files at once. Can do multiple search-replace operations per file; supports regular expressions; creates a log file and you can specify output location. Note: Replace Text is no longer supported and has known problems with some Windows 7 installations.
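
Since the program is no longer supported, the core job is worth knowing how to script yourself; a minimal Python equivalent (the pattern, folder and extension are placeholder examples):

    # Batch regex search-and-replace across files (a generic stand-in
    # for Replace Text, not the program itself).
    import re
    from pathlib import Path

    pattern = re.compile(r"\bcolour\b")            # placeholder pattern
    replacement = "color"
    for path in Path("corpus").glob("**/*.txt"):   # placeholder folder
        text = path.read_text(encoding="utf-8")
        new_text, n = pattern.subn(replacement, text)
        if n:
            path.write_text(new_text, encoding="utf-8")
            print(f"{path}: {n} replacement(s)")   # crude log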

HTML TIDY

Dave Raggett's free tool for fixing HTML mistakes automatically and tidying up sloppy editing into nicely laid-out markup (performs wonders on HTML saved from Microsoft Word). Also outputs/converts to XML and XHTML, and can be used to validate, correct, and pretty-print XML files.

OpenJade/OpenSP

The osx program, part of the OpenSP package (a successor to James Clark's sp package) can automatically convert SGML files to corresponding XML files. OpenSP is maintained along with OpenJade

 

HTML code strippers

(for removing HTML tags from a saved web page, to feed into concordancers)

Web2Text

HTML to ASCII text converter. "Unlike most others, however, this one not only has an easy to use graphical interface but it actually produces a nicely laid out text version, and keeps URLs visible. A minimum of post-conversion editing required."

HTMASC

(Shareware)

NOTETAB LIGHT 

(Freeware. A 'Pro' version is also available for purchase.)

StripTags

a basic SGML/HTML tag stripper for Windows by William Fletcher. It removes everything between pairs of < >, so it can fail in those rare cases in which a > is embedded within a comment or an attribute. It also does not translate HTML entities (e.g. "&eacute;" → é).
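
The naive approach, and its failure mode, fits in a few lines of Python (a generic illustration, not Fletcher's code):

    import re
    from html import unescape

    def strip_tags(html):
        # Naive: delete everything between < and >.  Fails when '>' occurs
        # inside a comment or an attribute value, exactly as noted above.
        return re.sub(r"<[^>]*>", "", html)

    s = '<p title="a > b">caf&eacute;</p>'
    print(strip_tags(s))             # ' b">caf&eacute;'  <- the failure case
    print(unescape(strip_tags(s)))   # html.unescape also translates entities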

Web Snaggers/ Web Crawlers for corpus-building

(for grabbing web pages/entire sites for offline reading/processing)

 

Fonts & Tools for Multilingual Computing

Fonts in Cyberspace

IPA Fonts and fonts for many languages from the Summer Institute of Linguistics. See also the fonts page at the LINGUIST LIST site and the Glasgow University site.

Alan Wood's Unicode Resources

Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications

Font set for Japanese, Chinese and other characters (Konjaku-Mojikyo)

The "Konjaku-Mojikyo" includes about 20,000 Chinese characters defined by Unicode (ISO 10646), and about 50,000 Chinese characters collected in the "Dai Kanwa Dictionary" by Professor Morohashi. Plus: oracle-bone inscriptions, Siddham (Sanskrit) characters, Japanese Kana, Chu Nom (used in medieval Vietnam), Shui script (used by the Shui ethnic minority in China), Tangut (Xixia) script, symbols and so forth.

KickKeys (for Windows)

Free downloadable transliteration software that enables you to use the normal English keyboard to write in various languages. Ships ready with key maps and fonts for French, German/Scandinavian, Italian, Portuguese and Spanish, along with Assamese, Bengali, Bulgarian, Belarusian, Hindi, Tamil, Russian and Ukrainian (Cyrillic script). It even supports typing right-to-left languages like Farsi in English Windows. Allows users to specify their own key mapping, change existing key mappings and work with any font. You can use these features in WordPad, Microsoft Word, Outlook, Outlook Express, Excel, FrontPage, PowerPoint and other common Windows applications. It also comes with graphical tools that allow you to build keymaps for all other languages and fonts.

IPA pics for the web (Universitat de Lleida)

*.gif picture format IPA symbols for HTML pages

OR James Tauber's page here OR try the IPA-GIF server (UPenn) here

IPAKLICK

a web-browser-based tool (javascript keyboard) that makes it easy to insert strings of IPA-symbols (Unicode) into a text (via the clipboard). The site also contains links to free Unicode fonts that include the IPA-symbols and to 'superlinguistic' names for consonants, in which the coarticulatory and perceptual effects of consonants on vowels are exploited.

IPA-SAM phonetic fonts

free TrueType fonts for Windows. With them installed, you can display phonetic symbols on the screen and print them out in any size. The IPA-SAM character set includes all the symbols of the International Phonetic Alphabet as currently recognized by the IPA. There are three typefaces: Doulos (similar to Times), Sophia (sans serif) and Manuscript (similar to Courier, monospaced). All are available in regular, bold, italic, and bold italic.

SAMPA/X-SAMPA

(Speech Assessment Methods Phonetic Alphabet)

SAMPA is an ASCII encoding of the phonemes of particular languages, based on the International Phonetic Alphabet (IPA). Alternative site here.

Unicode web site

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode enables a single software product or a single web site to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

 

[TOP of this page]

Back to HOME (tiny.cc/corpora) [Bookmarks HOME]

[ If you've surfed in from somewhere else & want to know what this site is about, click the home icon to go to my entrance page ]


This current page was last updated: 08 March 2010 15:52:51
David Lee

If you found this web site useful, or found an outdated link, don't forget to let me know.