Corpus Linguistics Workshop: Talk by Paul Rayson on semantic tagging of Early Modern English

Date(s)

Wednesday 25th March 2015 (13:00-14:30)

Description

The Corpus Linguistics Workshop will be hosting an invited talk next Wednesday, March 25th: we are very pleased to welcome Paul Rayson, from Lancaster University.

Paul Rayson is director of the UCREL Research Centre and a senior lecturer in the School of Computing and Communications. His methodological contributions are in the areas of key semantic domains and corpus analysis software – Wmatrix was in fact developed by Paul. His talk will be entitled: “Can you adapt a modern semantic tagger for Early Modern English corpora?” (see the full abstract below).

The talk will take place on Wednesday, March 25th, in Trent A35, from 1:00 pm (please note that the time and day differ from usual!). There will be coffee and tea at the end of the talk. The event is open to everyone but, because of catering and seating arrangements, please let us know if you’re thinking of attending by sending an email to lorenzo.mastropierro@nottingham.ac.uk or viola.wiegand@nottingham.ac.uk.

Abstract: 'Can you adapt a modern semantic tagger for Early Modern English corpora?'

In this talk, I will present joint research from the Samuels project (www.gla.ac.uk/samuels/), where we are carrying out a number of case studies on two very large corpora of around 1-2 billion words each: (a) the Early English Books Online (EEBO) Text Creation Partnership (TCP) corpus, consisting of over 53,000 transcribed books published between 1473 and 1700, and (b) two hundred years of UK Parliamentary Hansard, made up of over 7 million files. I will describe the changes that we've made to the Wmatrix tag wizard in order to address historical spelling variation and meaning change over time. I will also describe the latest version of the VARD (Variant Detector) software, which allows us to pre-process historical corpora and match modern forms to historical variants, thus improving tagging accuracy. In order to have a historically valid taxonomy, we have adopted the Historical Thesaurus of English (developed at the University of Glasgow) and the Oxford English Dictionary, thus helping us improve methods for the automatic semantic analysis of historical texts. The Historical Thesaurus contains 793,742 word forms arranged into 225,131 semantic categories. The combination and scale of the corpora and the size of the taxonomy pose significant computational challenges for existing retrieval methods (Wmatrix) and annotation software (USAS), and I will describe our current solutions to these problems.

Centre for Research in Applied Linguistics

The University of Nottingham
Nottingham
NG7 2RD

telephone: +44 (0) 115 951 5900
fax: +44 (0) 115 951 5924
email: cral@nottingham.ac.uk
