Teenage Health Freak Corpus: Spelling Errors

As the Teenage Health Freak Corpus consists of unedited typed messages, it is inevitable that spelling errors and typos will be present in the corpus. Another feature, which for some analytical purposes poses the same problems as spelling errors, is the deliberate use of abbreviations and acronyms of the kind found in text messages, instant messages and internet forums.


Volume of Spelling Errors

The first step was to investigate what types of spelling error occur in the corpus and how frequent they are, and therefore how big a problem they may pose for corpus analysis. To do this, 50 messages were selected at random from each year in the corpus, using a random number generator in a Python script. The samples were then analysed by hand to identify spelling errors. The results of this stage of the processing can be seen below.

Year	No. of Words	No. of Errors	Percentage Errors
2004	1209	76	6.3
2005	1403	116	8.3
2006	758	89	11.7
2007	898	55	6.1
2008	1000	70	7.0
2009	871	60	6.9
All Years	6139	466	7.6
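
The sampling step described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name, the (year, text) input format and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_messages_by_year(messages, per_year=50, seed=None):
    """Return {year: [text, ...]} with up to `per_year` random messages per year.

    `messages` is assumed to be an iterable of (year, text) pairs; the real
    corpus format is not described in this section.
    """
    rng = random.Random(seed)          # seeded so a sample can be reproduced
    by_year = defaultdict(list)
    for year, text in messages:
        by_year[year].append(text)
    return {
        year: rng.sample(texts, min(per_year, len(texts)))  # without replacement
        for year, texts in sorted(by_year.items())
    }

# Toy data standing in for the corpus (invented, not real messages).
toy_messages = [(2004 + i % 6, f"message {i}") for i in range(600)]
samples = sample_messages_by_year(toy_messages, per_year=50, seed=1)
```

Sampling without replacement within each year means no message is analysed twice, and fixing the seed allows the same selection to be drawn again later.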

Assuming these samples are representative of our corpus, we could expect around 168,592 words to be incorrectly, or at least unconventionally, spelled. With a number this large, further investigation of the spelling errors in the corpus was warranted.
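
The percentages in the table, and the kind of extrapolation made from them, can be reproduced with a few lines of Python. The counts are copied from the table above; the total corpus word count is not given in this section, so `corpus_words` is a placeholder argument and the value passed in the example is purely illustrative.

```python
# (words, errors) per sampled year, copied from the table above.
sample_counts = {
    2004: (1209, 76),
    2005: (1403, 116),
    2006: (758, 89),
    2007: (898, 55),
    2008: (1000, 70),
    2009: (871, 60),
}

def percentage_errors(words, errors):
    """Errors as a percentage of words, rounded to one decimal place."""
    return round(100 * errors / words, 1)

total_words = sum(words for words, _ in sample_counts.values())
total_errors = sum(errors for _, errors in sample_counts.values())
overall_rate = total_errors / total_words      # unrounded sample error rate

def expected_errors(corpus_words, rate=overall_rate):
    """Scale the sample error rate up to a corpus of `corpus_words` words."""
    return round(corpus_words * rate)

# Illustrative only: one million words is NOT the real corpus size.
illustrative_estimate = expected_errors(1_000_000)
```

The estimate is only as good as the assumption that the 300 sampled messages are representative of the whole corpus.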

Type of Spelling Errors

Understanding the types of spelling error regularly encountered in the corpus is important because spelling correction algorithms often work better with some types of error than with others. Finding out which types of error we have to deal with could therefore help with the selection and evaluation of algorithms. To investigate this, errors were first classified into the five main classes explained and illustrated in the table below.

Error Class	Description	Examples
Chat-style	abbreviations or acronyms which might be expected in text messages or Instant Messaging	u > you; 4 > for; cuz > because; sum > some
Phonetic	words which can reasonably be pronounced in the same way as the original word and are less likely to be typographical	probarbly > probably; egsisting > existing; marige > marriage
Typographical	errors which are more likely to be caused by mistyping	iam > i am; resulst > results; alchohl > alcohol
Emphasis	deliberate errors made for emphasis (typically additions)	soooo > so; yoooo > yo
Unclassified	errors that don't seem to fit in any of the categories	pencise > penis

The results of the analysis can be seen in the table below. If a word includes more than one class of error it is counted in each relevant class, which is why the class totals add up to slightly more than the 466 errors in the sample.

Error Class	Total Occurrences
Typographical	257
Chat-Style	125
Phonetic	83
Emphasis	3
Unclassified	1
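
Counting a word once for every class it belongs to can be sketched as below. The annotation format is an assumption made for the example, and the last entry is an invented combination included only to show the double counting; it is not taken from the corpus.

```python
from collections import Counter

# Hypothetical hand annotations: (typed form, intended form, set of classes).
annotations = [
    ("u", "you", {"Chat-style"}),
    ("probarbly", "probably", {"Phonetic"}),
    ("resulst", "results", {"Typographical"}),
    ("soooo", "so", {"Emphasis"}),
    # Invented example carrying two classes at once.
    ("frendzz", "friends", {"Chat-style", "Typographical"}),
]

def tally_classes(annotated_errors):
    """Count each error once for every class it has been assigned."""
    counts = Counter()
    for _typed, _target, classes in annotated_errors:
        counts.update(classes)
    return counts

class_counts = tally_classes(annotations)
```

With this scheme the sum of the class counts can exceed the number of annotated words, as it does in the table above.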

Typographical Errors

The largest class of errors in the corpus is typographical errors. Included in this figure are 123 errors which involve only a missing apostrophe. These were included in typographical errors because instances of “Im” and “I'm” will be treated as different tokens by corpus processing software. For many corpus tasks, however, this will probably not be of much concern. Even if these examples are removed, typographical errors are still the largest class, with 134 examples in our selection.
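
The effect of a missing apostrophe on token counts, and one simple way of neutralising it, can be illustrated as follows. The mapping is a small hand-made assumption for the example, not a list drawn from the corpus; forms such as “its” or “were” are deliberately left out because they are valid words in their own right and would need context to resolve.

```python
from collections import Counter

# Illustrative mapping only; ambiguous forms are deliberately excluded.
APOSTROPHE_FIXES = {
    "im": "i'm",
    "ive": "i've",
    "dont": "don't",
    "didnt": "didn't",
    "doesnt": "doesn't",
}

def restore_apostrophes(tokens):
    """Lower-case each token and restore a missing apostrophe where known."""
    return [APOSTROPHE_FIXES.get(tok.lower(), tok.lower()) for tok in tokens]

tokens = "Im worried but I'm not sure and im scared".split()
raw_counts = Counter(tokens)                        # three separate variants
fixed_counts = Counter(restore_apostrophes(tokens))
```

Before normalisation “Im”, “I'm” and “im” are three separate types; afterwards they are counted together.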

If we look further into the typographical errors, 36 involve errors of space placement.

Space Placement Error	Examples	Count
Deletion	eachother > each other; iam > i am	25
Insertion	when ever > whenever; every thing > everything	9
Transposition	o fmy > of my; wantt o > want to	2
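
The three kinds of space placement error in the table lend themselves to simple dictionary checks. The sketch below is an assumption about how such checks might look, using a tiny stand-in word list; a real run would need a full lexicon, and all function names are invented for the example.

```python
# Tiny stand-in lexicon; a real run would need a full word list.
LEXICON = {
    "each", "other", "i", "am", "when", "ever", "whenever",
    "every", "thing", "everything", "of", "my", "want", "to",
}

def split_deletion(token, lexicon=LEXICON):
    """Space deletion: split an unknown token into two known words, if possible."""
    if token in lexicon:
        return None                      # already a word, nothing to repair
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None

def join_insertion(first, second, lexicon=LEXICON):
    """Space insertion: return the joined form if it is a known word."""
    joined = first + second
    return joined if joined in lexicon else None

def shift_transposition(first, second, lexicon=LEXICON):
    """Space transposition: move one letter across the space, then test both words."""
    candidates = []
    if len(second) > 1:
        candidates.append((first + second[0], second[1:]))    # "o fmy"
    if len(first) > 1:
        candidates.append((first[:-1], first[-1] + second))   # "wantt o"
    for left, right in candidates:
        if left in lexicon and right in lexicon:
            return left, right
    return None
```

Insertions are the hardest of the three to detect automatically, because both halves (“when” and “ever”) are usually valid words themselves, so a joined form can only ever be offered as a candidate.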

A further 7 are examples of word substitution where the substituted word is not a homophone of the intended word (homophones are included in phonetic errors). Examples of these include “you” > “your” and “my” > “me”.

The remaining typographical errors fall into the categories in the table below.

Error Type	Examples	Count
Letter Deletion	becaue > because; syptoms > symptoms	35
Letter Transposition	lieks > likes; develpoed > developed	18
Letter Insertion	piulls > pills; pregnaunt > pregnant	17
Letter Substitution	mush > much; ma > my	14
Complex Combination	alchohl > alcohol; pregnate > pregnant	5
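
The categories in the table correspond to the single-letter edit operations familiar from edit distance measures. As a rough sketch (the classification above was done by hand, and this function is only an assumed approximation of it), a typed word and its target can be compared by stripping their common prefix and suffix and inspecting what is left:

```python
def classify_single_edit(typed, target):
    """Name the single-letter edit separating `typed` from `target`.

    Returns None if the words are identical and "Complex Combination"
    if more than one edit is involved.
    """
    if typed == target:
        return None
    shorter = min(len(typed), len(target))
    # Length of the common prefix.
    p = 0
    while p < shorter and typed[p] == target[p]:
        p += 1
    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < shorter - p and typed[len(typed) - 1 - s] == target[len(target) - 1 - s]:
        s += 1
    typed_mid = typed[p:len(typed) - s]
    target_mid = target[p:len(target) - s]
    if not typed_mid and len(target_mid) == 1:
        return "Letter Deletion"
    if len(typed_mid) == 1 and not target_mid:
        return "Letter Insertion"
    if len(typed_mid) == 1 and len(target_mid) == 1:
        return "Letter Substitution"
    if len(typed_mid) == 2 and typed_mid == target_mid[::-1]:
        return "Letter Transposition"
    return "Complex Combination"
```

Applied to the examples in the table, “becaue” is labelled a deletion, “lieks” a transposition, “piulls” an insertion and “mush” a substitution, while “alchohl” and “pregnate” fall through to the complex case.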

Chat-Style Errors

A further analysis of the chat-style errors showed that the overwhelming majority are abbreviations rather than acronyms.

There were only two examples of acronyms, both occurring at the end of the same message: wb for write back and the more commonly used asap. (Here we are talking specifically about acronyms for chat-related functions rather than things such as BMI for Body Mass Index.)

The abbreviations used tend to fit general patterns or conventions, and there is generally a one-to-one relationship between abbreviations and target words. In the selection analysed we have examples of:

vowel changes

vowels being missed out of words (jst > just; bt > but; thr > there; rly > really)
vowels and final e being changed for a single vowel (sum > some; lyk > like)
diphthongs changing to a single vowel (duznt > doesn't; frendz > friends; shud > should)

consonant changes

s changing to z even where this does not result in an abbreviation (frendz > friends; itz > it's)
th going to z, d or f (za > the; deir > their; fink > think)
f changing to v (ov > of)
silent h missing (wen > when; wat > what)
final g dropped (aveing > having)
opening consonant dropped (aveing > having)

syllable changes

er changes to a (ova > over; uva > other)
ough shortened (tho > though)

full word changes

numbers being used in abbreviations (4 > for; m8 > mates)
letters standing for words (n > and; r > are; u > you; y > why; bf > boyfriend)
word shortening (brill > brilliant)

two words joined

of appended with a (loadsa > loads of; kinda > kind of)
to appended with a (wanna > want to)
other contractions (dunno > don't know; waza > what's up)

others

please changing to plz
because abbreviated to cus, cuz, cos, coz

This is a fairly small set of messages and there is likely to be more variety in the corpus as a whole. In this selection the most varied abbreviations are found with the word “because”, where we have “cos”, “coz”, “cus” and “cuz”; even these, however, are combinations of the single features described above. An interesting observation on spelling in general, which is particularly true of chat-style abbreviations, is that there is a huge difference between messages, with some users avoiding chat-style language and others making full use of it. This may reflect the user's familiarity with instant messaging, forum writing and perhaps text messaging, but it also reflects the choice of register considered appropriate for addressing medical questions to Dr Ann, with some users selecting very formal registers and others much more informal ones.
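
Because the relationship between abbreviation and target word is generally one-to-one, a plain lookup table is an obvious first approach to normalising chat-style forms. The sketch below is an assumption about how that might be done, with entries taken from the examples listed above; the function name and the whitespace tokenisation are choices made for the example.

```python
# Entries taken from the examples listed above; far from exhaustive.
CHAT_STYLE = {
    "u": "you", "r": "are", "n": "and", "y": "why", "4": "for",
    "cus": "because", "cuz": "because", "cos": "because", "coz": "because",
    "sum": "some", "lyk": "like", "jst": "just", "bt": "but", "rly": "really",
    "shud": "should", "wen": "when", "wat": "what", "ov": "of", "tho": "though",
    "plz": "please", "wanna": "want to", "kinda": "kind of",
    "loadsa": "loads of", "dunno": "don't know",
}

def normalise_chat_style(message, table=CHAT_STYLE):
    """Replace whitespace-separated chat-style tokens using a lookup table."""
    return " ".join(table.get(tok.lower(), tok) for tok in message.split())

example = normalise_chat_style("wen shud u go cuz i dunno")
```

A table like this cannot tell a deliberate “4” or “sum” from the abbreviation, and it ignores punctuation attached to a token, so its output would still need checking in context.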

Phonetic Errors

Phonetic errors have been separated from typographical errors because the two classes have different relationships with the target word. In the case of typographical errors, the relationship between the typed word and the target word is based on the position of letters on the keyboard or on sequences of frequently typed letters. With phonetic errors there is a more direct relationship between the typed word and the target word, which phonetic-based algorithms should be able to handle effectively.

More detailed analysis of the phonetic errors follows the same pattern as that used for typographical errors and the results can be seen in the table below.

Error Type	Examples	Count
Letter Insertion	dissorder > disorder; scruews > screws; drinkes > drinks	36
Letter Substitution	shrivals > shrivels; raisen > raisin; descusting > disgusting	26
Letter Deletion	gaynes > gayness; realy > really; obsesive > obsessive	19
Homophone Substitution	too > to; no > know; band > banned	17
Complex Combination	flemmy > phlegmy; masterbaiting > masturbating; sigerate > cigarette	7
Multiple Letter Substitution	dieing > dying; egsisting > existing	3
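
To illustrate what a phonetic-based algorithm does with errors of this kind, the sketch below implements Soundex, chosen only because it is a familiar example; nothing in this section says which algorithms were evaluated. Words that sound alike are intended to receive the same four-character code.

```python
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the four-character Soundex code of `word` ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit                # new consonant sound
        if ch not in "hw":               # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]            # pad or truncate to four characters

same_code = soundex("marige") == soundex("marriage")          # both M620
different_code = soundex("probarbly") != soundex("probably")  # P616 vs P611
```

Several of the errors in the table, such as “realy”, “dissorder” and “raisen”, share a code with their targets, but the match is not guaranteed: the inserted “r” in “probarbly” changes its code, so it would be missed by a simple code comparison.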

Emphasis

The examples of emphasis in the messages used for analysis only involve the word “so” being emphasised with the addition of several “o”s and the word “yo” being emphasised in the same way. In the larger corpus, however, examples have been seen which involve “please” being extended on the “e”. This is by no means the only, or even the most common, way that emphasis is expressed in the corpus. Capital letters are very frequently used, as are repeated exclamation marks and question marks. Repeated words, in particular the word “please”, are also found.
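
The emphasis devices described above can each be picked out with a simple pattern. The following is a minimal sketch under stated assumptions: a letter repeated three or more times is treated as lengthening, a fully capitalised word of more than one letter as capital-letter emphasis, and two or more “!” or “?” in a row as repeated punctuation. The thresholds and function names are choices made for the example.

```python
import re

_LETTER_RUN = re.compile(r"([A-Za-z])\1{2,}")   # same letter three or more times
_PUNCT_RUN = re.compile(r"[!?]{2,}")            # two or more ! or ? in a row

def emphasis_markers(message):
    """Return the simple emphasis cues found in `message`."""
    words = message.split()
    return {
        "lengthened": [w for w in words if _LETTER_RUN.search(w)],
        "all_caps": [w for w in words
                     if len(w) > 1 and w.isalpha() and w.isupper()],
        "repeated_punctuation": _PUNCT_RUN.findall(message),
    }

def collapse_lengthening(word):
    """Reduce each run of three or more identical letters to a single letter."""
    return _LETTER_RUN.sub(r"\1", word)

markers = emphasis_markers("PLEASE help me soooo much!!!")
```

Collapsing a run to a single letter recovers “so”, “yo” and “please” from their lengthened forms, but it would give the wrong result for a target word that itself contains a double letter, so the collapsed form is best treated as a candidate to be checked against a word list.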