About NameX®: The world's most powerful name variant tool
What is NameX?
NameX® is a technology developed for finding surname and forename variants in databases of persons' names.
Introduction to NameX
NameX is a software plugin for databases and applications that shows spelling variations of people's names and allows simultaneous searching across name variants.
NameX was originally developed for users of online genealogical databases, and was specifically designed around the needs of genealogical researchers. Unlike static name-variant solutions that were developed prior to the Internet, NameX dynamically references the constantly increasing volume of names in The Origins Network's databases to improve accuracy and provide more name variant options over time. NameX both solves a key problem in genealogical research and uses genealogical research to provide the best solution to a generic technical problem.
Differences in the way people's names are spelled are much more common than for everyday words, and this is a particular challenge for software applications. Traditional approaches have principally looked at phonetic variation, where the phonetics are based upon sounds which map to spellings in a particular language such as English. Because most countries, such as the United States, are inhabited by people who originate from all over the world, their names are spelled in ways that often reflect particular pronunciations that are different from English, so phonetic approaches are intrinsically flawed. NameX looks at many different aspects of spellings and uses a combination of a variety of algorithms and reference data to offer outstanding results both in precision and recall.
Genealogical research involves searching historical records about people and poses one of the biggest technical challenges for software designed to look for name variants. By meeting the stringent requirements of the professional genealogist, NameX provides a best-of-breed solution to name searching generally. So while designed initially to meet the needs of genealogists, NameX provides the best available solution to virtually any application involving searching databases of persons' names. NameX currently cross-references nearly 6 million first or last name variants from a reference set of nearly half a billion forename/surname combinations.
What is a name variant?
Database applications that involve the storing and searching for personal data often need to allow for a degree of variation in the spelling of both surnames and forenames.
Variations can be introduced into database applications as a result of typographical errors where letters are interchanged (eg Nobel and Noble), letters are substituted (eg Stevens and Stephens), letters are added (eg Colins and Collins) or removed (eg Clarke and Clark).
Variations can also be introduced as a result of alternate spelling for names with the same or very similar pronunciation (eg Cavenagh and Kavenagh; Ewal and Yule; Sean and Shawn). This is a particular problem where the transcription is from the spoken word.
Forename abbreviations or diminutives introduce further problems and in many cases there is little or no phonetic link (eg William with Wm, Billy, Bill, etc or Elizabeth with Lizzie, E'beth, Elz, Beth, etc).
In addition, we do not want to match known masculine names with known feminine variants and vice versa: eg Alexander should match with Alex and Alexandar but not with Alexandria.
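The gender rule above can be illustrated with a minimal sketch. The lookup table `FORENAME_GENDER` and the function `gender_compatible` are hypothetical names invented for this example and are not part of NameX; the sketch also assumes that names missing from the table are compatible with anything.

```python
# Hypothetical gender table; a real system would hold far more entries.
FORENAME_GENDER = {
    "alexander": "m",
    "alex": "m",
    "alexandar": "m",
    "alexandria": "f",
}


def gender_compatible(name_a: str, name_b: str) -> bool:
    """Return False only when both names have known, conflicting genders."""
    gender_a = FORENAME_GENDER.get(name_a.lower())
    gender_b = FORENAME_GENDER.get(name_b.lower())
    if gender_a is None or gender_b is None:
        return True  # assumption: unknown names are never filtered out
    return gender_a == gender_b


candidates = ["Alex", "Alexandar", "Alexandria"]
matches = [c for c in candidates if gender_compatible("Alexander", c)]
print(matches)  # ['Alex', 'Alexandar']
```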
More on name variants
Why do we need another name matching system?
Many techniques have been used to assist with the important problem of matching variant names. However, most of these techniques were developed for general word matching and as a result they are not optimised for name matching. NameX was designed specifically for name matching and achieves significantly higher precision with higher recall than the alternatives.
What is wrong with Soundex and Metaphone?
Soundex is a simple algorithm that transforms any word into a code comprising a leading letter followed by up to three digits. For example, Kavenagh gets the code K152 and this may be matched to other names with the same code such as Kavnaugh and Kavenough. Unfortunately the same code also matches with a huge number of unrelated names such as Kyppins and Koppensteiner. This code also fails to match any name with a different leading letter such as Cavenagh. So while Soundex offers a reasonable level of recall it has a very low precision when matching names.
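To make the behaviour described above concrete, here is a short sketch of the widely used American Soundex rules. The function name `soundex` is chosen for this example; the code is not taken from NameX.

```python
# Standard Soundex consonant groups; vowels and h, w, y carry no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code (letter plus three digits)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for surname in ["Kavenagh", "Kavnaugh", "Kyppins", "Koppensteiner", "Cavenagh"]:
    print(surname, soundex(surname))
# The four K names all give K152; Cavenagh gives C152.
```

The output shows both weaknesses at once: unrelated names share a code, and a change of leading letter breaks the match.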
Metaphone is another algorithm that attempts to normalise words by removing vowels after the first letter and mapping common consonant variations. For example, Kavenagh gets the code KFNK and this matches with Kavnaugh, Kavenough and Cavenagh. Unfortunately the same code also matches many unrelated names such as Gavnik and Kaffanke. While Metaphone is better suited to name matching than Soundex, it is still far from ideal.
Neither Soundex nor Metaphone includes any of the following matches for Kavenagh: Kavena, Cavena, Kavanha, Kavan. Also, neither algorithm offers any assistance with the problem of forename abbreviations.
Search our database to compare the results using NameX, Soundex and Metaphone
What algorithms does NameX use?
NameX uses a combination of phonetic and other techniques for name variant identification.
All surnames are restricted to an alphabet of 27 characters (a-z and the apostrophe " ' "). Double-barrelled names are split into their component parts and treated independently. All accented characters are converted to their closest matching letter (eg à, â, ä and å are all mapped to " a "); this mapping is used to simplify the thesaurus and ensures that similar names are located without regard to the use of accented characters.
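The normalisation described above might look something like the following sketch. It assumes that accents can be removed by Unicode decomposition and that double-barrelled names are joined by a hyphen; the function name `normalise_surname` is hypothetical and the details of the real NameX mapping are not documented here.

```python
import unicodedata

ALLOWED = set("abcdefghijklmnopqrstuvwxyz'")


def normalise_surname(raw: str) -> list:
    """Split on hyphens, strip accents, keep only a-z and the apostrophe."""
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks (eg "ä" becomes "a").
    decomposed = unicodedata.normalize("NFKD", raw)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat a typographic apostrophe like a plain one.
    lowered = stripped.lower().replace("\u2019", "'")
    parts = lowered.split("-")  # assumption: hyphen joins double-barrelled names
    cleaned = ["".join(ch for ch in part if ch in ALLOWED) for part in parts]
    return [part for part in cleaned if part]


print(normalise_surname("Smythe-Björklund"))  # ['smythe', 'bjorklund']
print(normalise_surname("O'Cléirigh"))        # ["o'cleirigh"]
```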
Once a potential match has been identified a weighting is computed as follows:
- All surnames are converted to a phonetic encoding.
- The degree of similarity for potential matches is computed using two different matching algorithms. Both the original and the phonetic versions are compared using the two matching algorithms to produce a total of four match scores.
- The Soundex and Metaphone codes are also computed.
- A final match score is computed from the weighted average of the four match scores already calculated combined with the results of the Soundex and Metaphone encodings.
- For forenames, NameX uses a knowledge of gender associations so that close phonetic matches between the sexes can be avoided (eg John and Joan or Alexander and Alexandria).
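The scoring steps above can be sketched as follows. The source does not name the two matching algorithms, the phonetic encoding, the weights, or how the Soundex and Metaphone results are combined, so everything here is an assumption: edit-distance similarity and Python's `difflib` ratio stand in for the two matchers, `toy_phonetic` is an invented encoding, and agreement of the Soundex and Metaphone codes is passed in as a simple count (`codes_agree`) that adds a small bonus.

```python
from difflib import SequenceMatcher


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def sequence_similarity(a: str, b: str) -> float:
    """Second stand-in matcher: difflib's matching-blocks ratio."""
    return SequenceMatcher(None, a, b).ratio()


def toy_phonetic(name: str) -> str:
    """Invented encoding: merge c/q with k and z with s, drop later vowels and h/w/y."""
    out = []
    for i, ch in enumerate(name.lower()):
        ch = {"c": "k", "q": "k", "z": "s"}.get(ch, ch)
        if i > 0 and ch in "aeiouhwy":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def match_score(a, b, codes_agree=0, weights=(0.3, 0.3, 0.2, 0.2), code_bonus=0.05):
    """Weighted average of four similarity scores plus an assumed code bonus."""
    a, b = a.lower(), b.lower()
    pa, pb = toy_phonetic(a), toy_phonetic(b)
    scores = (
        levenshtein_similarity(a, b),    # matcher 1 on original spellings
        sequence_similarity(a, b),       # matcher 2 on original spellings
        levenshtein_similarity(pa, pb),  # matcher 1 on phonetic forms
        sequence_similarity(pa, pb),     # matcher 2 on phonetic forms
    )
    base = sum(w * s for w, s in zip(weights, scores))
    return round(100 * min(1.0, base + code_bonus * codes_agree))


print(match_score("Galbraith", "Gailbriath", codes_agree=2))
print(match_score("Galbraith", "Smith"))
```

The scores this toy produces will not match the NameX figures; it only shows the shape of the calculation.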
Why are weighted pairs important?
Neither Soundex nor Metaphone is able to rank their matching names so that the closer matches can be identified. Therefore an application that uses these techniques has an all-or-nothing decision to make regarding the inclusion of name variants.
NameX assigns a match score to all name variants so that each candidate is offered as a weighted pair and all variants for a given name can be supplied as a ranked set. For example, NameX has generated more than 90 matches for the surname Galbraith including:
- Gailbriath - with a match score of 99
- Gallbreath - with a match score of 97
- Galbrealth - with a match score of 85
- Galbruth - with a match score of 83
- Galgraithe - with a match score of 75
By assigning a weight to each name variant the application has the option of limiting a particular search to the better matches thus improving precision at the expense of recall.
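As a small illustration of how an application might use weighted pairs, the sketch below filters the Galbraith variants listed here by a score threshold. The scores come from the list; the function name `variants_above` is hypothetical.

```python
# Match scores for variants of Galbraith, as listed in the text.
GALBRAITH_VARIANTS = {
    "Gailbriath": 99,
    "Gallbreath": 97,
    "Galbrealth": 85,
    "Galbruth": 83,
    "Galgraithe": 75,
}


def variants_above(variants: dict, threshold: int) -> list:
    """Return variant names scoring at or above threshold, best first."""
    ranked = sorted(variants.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


print(variants_above(GALBRAITH_VARIANTS, 90))  # ['Gailbriath', 'Gallbreath']
print(variants_above(GALBRAITH_VARIANTS, 80))
# ['Gailbriath', 'Gallbreath', 'Galbrealth', 'Galbruth']
```

Raising the threshold shortens the list (favouring precision); lowering it lengthens the list (favouring recall).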
What is meant by Precision and Recall?
Precision measures the percentage of correct names in a match list; 100% indicates that the match list does not contain any invalid names.
Recall measures the percentage of all possible correct names appearing in the match list; 100% indicates that the match list contains every correct name.
In an ideal world it should be possible to achieve 100% precision with 100% recall but in practice, for large volumes of data, this is not possible. In general, higher precision leads to lower recall and vice versa.
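The two definitions can be written down directly. In the sketch below the sets are a toy example built from the Soundex discussion earlier on this page (which names are "correct" is a judgment made for illustration only), and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Names sharing the Soundex code of Kavenagh in the earlier example.
returned = {"Kavnaugh", "Kavenough", "Kyppins", "Koppensteiner"}
# Names treated as correct variants for this toy calculation.
relevant = {"Kavnaugh", "Kavenough", "Cavenagh"}

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=50% recall=67%
```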
The above diagram shows how precision drops off as recall increases by dropping the NameX match score threshold. The diagram also shows how Soundex and Metaphone compare to NameX with both providing reasonable levels of recall but with poor precision. Whilst the performance of Soundex and Metaphone is fixed for any given name, the range indicated in the diagram shows the expected performance across a spread of names.
When weighted name pairs are available it is possible to tune precision and recall dynamically. For example, point 'a' on the diagram above represents relatively high precision with lower recall and would be achieved by setting a high match score threshold. In contrast, point 'b' on the diagram represents relatively high recall but with lower precision and would be achieved by selecting a low match score threshold.
In all cases NameX provides better Precision than either Soundex or Metaphone.
Working with NameX
NameX data is organised as a thesaurus of name pairs with weights, and would normally be held in a relational database. Database applications using NameX need simply include a "sub-select" to include matching names from the thesaurus above the selected score threshold.
Alternatively, NameX is available as a Web Service where the interface is provided as an XML transaction over an http connection. You can also use the NameX software to create your own thesauri.
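The sub-select approach mentioned above can be sketched with an in-memory SQLite database. The table and column names (`thesaurus`, `people`, `name`, `variant`, `score`) and the sample people are assumptions made for this example; the actual format in which the thesaurus is supplied is not described here.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny hypothetical thesaurus and a table of people to search."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE thesaurus (
            name    TEXT NOT NULL,
            variant TEXT NOT NULL,
            score   INTEGER NOT NULL,
            PRIMARY KEY (name, variant)
        );
        CREATE TABLE people (
            id       INTEGER PRIMARY KEY,
            forename TEXT NOT NULL,
            surname  TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO thesaurus VALUES (?, ?, ?)",
        [("Galbraith", "Gailbriath", 99), ("Galbraith", "Galgraithe", 75)],
    )
    conn.executemany(
        "INSERT INTO people (forename, surname) VALUES (?, ?)",
        [("Mary", "Galbraith"), ("John", "Gailbriath"),
         ("Ann", "Galgraithe"), ("Tom", "Smith")],
    )
    return conn


def search_surname(conn, surname, threshold):
    """Find people whose surname is the search name or a variant above threshold."""
    sql = """
        SELECT forename, surname FROM people
        WHERE surname = :name
           OR surname IN (SELECT variant FROM thesaurus
                          WHERE name = :name AND score >= :threshold)
        ORDER BY id
    """
    return conn.execute(sql, {"name": surname, "threshold": threshold}).fetchall()


conn = build_demo_db()
print(search_surname(conn, "Galbraith", 90))
# [('Mary', 'Galbraith'), ('John', 'Gailbriath')]
print(search_surname(conn, "Galbraith", 70))
# [('Mary', 'Galbraith'), ('John', 'Gailbriath'), ('Ann', 'Galgraithe')]
```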
NameX as a Thesaurus
NameX is available as a standard thesaurus (continually updated) currently containing 128 million variants for 2.1 million distinct surnames and 7 million variants for nearly 400,000 distinct forenames.
The names in the standard thesaurus come from all over the world although the majority have European origins. The vast majority of all Anglo-Saxon name variants are included.
Creating your own thesaurus
We offer a service, using NameX software, to build custom thesauri for specific name collections.
Normally we build thesauri for customers, but where, for example, there are security issues, you can license the NameX software for your private use.
Using a NameX thesaurus
NameX allows you to vary the "match score" threshold continuously, which lets you trade off precision against recall and so affects how many names are returned. In The Origins Network applications we have selected only two values of the match score threshold, to return either "close variants" or "all variants", but in other applications you may wish to retain more flexibility, eg by offering a larger selection of fixed threshold values, or by allowing the threshold value to be set exactly.
How fast is NameX?
The performance of NameX is dependent upon the speed of the relational database used to hold the thesauri. Since the sub-select can easily be controlled by a clustered index the overhead of fetching the additional variants is normally a small percentage of the overall search time. In our demonstration database NameX returns the list of variants of a given name in under one second.
More on name variants
Name variants may arise for a wide variety of reasons. Some variants exist only within databases, where a "real" name was incorrectly captured; others are genuine variants of the same original name; others may be termed "partial" variants, where two names with different spellings may be variants of the same name, such that they may have been used by the same person (cf the variety of spellings Shakespeare used himself) or by related persons; names with the same spelling may actually be variants of totally different names.
- true variants
These include spelling variants which date back to a time when there was no consistency in spelling, leading to multiple common forms today. Eg: Smith/Smyth/Smythe; Green/Greene
- names transliterated from other languages
These are really "true" variants, though they arise for a different reason. For example from Gaelic (Irish and Scottish): O'Cleirigh/O'Clery/Cleary; MacDhomhnuill/MacDonald; from East Europe: Grunfeld/Greenfield. Sometimes variants can arise which effectively look like different surnames, eg MacDonald and MacConnell are both forms of MacDhomhnuill. These are nevertheless true variants, since they are alternative versions of the same original name.
- altered spelling of a name by the owner to make it match a particular culture
Names, both surnames and forenames, are often changed when people from one background and language assimilate into another. Such changes may involve direct translation from one language to another - eg Grunfeld/Greenfield, Neumann/Newman - but were often to a name which had some similarity to the original: eg Raussman/Roseman, Durckheim/Dirk, Lemmle/Lemon.
In many cultures it is common to add a prefix or suffix to a name to indicate "son of" (or daughter of). These affixes often lead to variants, where the affix may be dropped or used almost arbitrarily even by members of the same family; this is often the case with the Irish "O'" - two people, one called O'Donnell and one called Donnell might actually be siblings. The suffixes "s" and "son" both indicate patronymics: Hughes = son of Hugh; Williams = son of William. With Welsh names, "ap" and "s" can appear almost interchangeably in patronymics, eg ap William = Williams. With Welsh names, the "ap" often got shortened to "p", leading, for example, to Pritchard for ap Richard, and Pugh for ap Hugh. These variants may not be so relevant for people living today, where Pritchard and Pugh have effectively become different surnames from Richard and Hugh, but in family history and genealogical research it can be very important to recognise that "ap Richard", "Richards", "Pritchard" and possibly also "Richardson" (if the family were very anglicised) are all direct variants, and are all linked directly to "Richard".
- partial variants
For example: Macara/McAra/Maccarra are genuine variants of an old Scottish name. But Macara is also an Italian name. No thesaurus can determine whether "Macara" is Scottish or Italian, and so it is virtually impossible to disentangle variant spellings of the Scottish name from variants of the Italian name.
- names originating in languages which use non-Roman character sets
With languages which use, for example, Arabic script, or the Cyrillic or Greek alphabets, there is often no defined spelling in English for names originating in that language. For example, the NameX thesaurus contains such Arabic name variants as Al-Amari/Al-Amiri/Al-Ammary/Al-Amri/Al-Amry, Al-Shari/Al-Shahery/Ashari, Hussain/Hussein/Hussian/Hussien/Husseyin/Al-Hussain/Al-Hussayen; and such Russian variants as Tchaikovsky/Tchaikovski/Tchaikowsky/Chaikovsky/Chaikovskiy; etc.
- names transcribed incorrectly from oral pronunciation
This commonly happened when families moved to a different country, and immigration officials wrote the names down. This often led to quite wild variations, where strong accents made it difficult to interpret what was being said, often coupled with illiteracy so that the owner of the name could not check the correctness of what was written.
- accidental variants
"Accidental" variants are one of the biggest problems in large databases - possibly the biggest problem. These are essentially mistakes or errors in the spelling of names, introduced, for example, by incorrect reading of a handwritten original name, or simply by an error in keying or transcription. The frequency of such variants can be estimated from accuracies achieved in data entry. High quality data entry typically has a character accuracy of greater than 99.995%, ie less than one character error in 20,000. So if we assume the average length of a surname to be 8 characters, then we would expect up to one character error in about 2,500 names; in a database containing, say, 5 million names, this would imply that up to 2,000 names might have at least one character wrong. Probably the majority of data capture projects have a lower character accuracy than 99.995%, leading to much higher incidences of incorrectly captured names. The NameX thesaurus is particularly strong in the number of such variants it includes (test it to see).
Wrongly keyed characters include:
- characters adjacent on the keyboard to the correct character, eg "n" for "m" and vice versa, or "o" for "i", so that Wilson becomes Wilsom or Wolson;
- double-keyed characters, such that Wilson becomes Wwilson or Wiilson;
- transposed characters, eg "sl" instead of "ls", so that "Wilson" becomes "Wislon".
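The three kinds of keying error listed above are easy to generate mechanically, which is one way to test how well a matcher copes with them. The sketch below is illustrative only: `ADJACENT_KEYS` is a deliberately tiny, hypothetical subset of a keyboard adjacency map, and `keying_variants` is not a NameX function.

```python
# Tiny illustrative subset of keyboard neighbours, taken from the examples above.
ADJACENT_KEYS = {"m": "n", "n": "m", "i": "o", "o": "i"}


def keying_variants(name: str) -> set:
    """Generate single-error variants: adjacent key, doubled key, transposition."""
    lower = name.lower()
    variants = set()
    for i, ch in enumerate(lower):
        # Double-keyed character, eg wilson -> wiilson.
        variants.add(lower[:i] + ch + lower[i:])
        # Adjacent-key substitution, eg wilson -> wilsom.
        for sub in ADJACENT_KEYS.get(ch, ""):
            variants.add(lower[:i] + sub + lower[i + 1:])
        # Transposition with the following character, eg wilson -> wislon.
        if i + 1 < len(lower) and lower[i + 1] != ch:
            variants.add(lower[:i] + lower[i + 1] + ch + lower[i + 2:])
    variants.discard(lower)
    return variants


print(sorted(keying_variants("Wilson")))
```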
Accidental variants are extremely common in large databases and particularly in genealogical applications. Consider parish register entries. This data often exists in two forms: the original and the "Bishop's Transcript" (BT). The latter should be a copy of the former, but if the former contained a mistake this was sometimes corrected in the BT; or a mistake might be introduced in the BT. So if we have a computerised index to such records, there are multiple sources of potential errors, and consequent name variants, in such an index: the original written down wrongly; the BT copy differing from the original; a transcribed copy of the original or the BT having been copied incorrectly; a typewritten version of the transcription having an error; data keyed into a computer (or captured via OCR) from the typewritten version having an error. However small the probability of errors at each stage, with all the stages involved it is inevitable that some proportion of the names in the computer database will differ from the original. So if you want to increase the probability of finding most if not all of the entries which may be of interest, a good name variant identification program is essential. And this is so regardless of how consistent the original spelling of the name was.
If the database of names is at all large (eg millions of records), it is generally of high value not only to retrieve all entries of potential interest, but also to avoid retrieving large numbers of entries where the names are of no conceivable relevance; wading through huge numbers of irrelevant records is not only boring but can be costly (if cost is a factor), and may actually mean that some relevant entries are thrown out because of the amount of "clutter". So a means of processing names is highly desirable which allows a trade-off between precision (ie the probability that a returned name is a relevant variant) and recall (ie the proportion of relevant variants that are returned). Increasing precision results in fewer names being returned, but increases the likelihood that the names are relevant; increasing recall will return more names which are unlikely to be relevant, but increases the chances of finding a variant which - though "wild" - might just be the one you want.
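The data-entry estimate given earlier in this section, and the point that errors accumulate across copying stages, can both be checked with a few lines of arithmetic. The sketch assumes that character errors are independent and, for the multi-stage case, that every stage has the same character accuracy; both are simplifications, and the function names are hypothetical.

```python
def expected_character_errors(names: int, avg_length: int, char_accuracy: float) -> float:
    """Expected number of wrong characters, an upper bound on names with errors."""
    return names * avg_length * (1.0 - char_accuracy)


def prob_name_has_error(avg_length: int, char_accuracy: float, stages: int = 1) -> float:
    """Probability that a name picks up at least one character error."""
    return 1.0 - char_accuracy ** (avg_length * stages)


# Figures from the text: 5 million names, 8 characters each, 99.995% accuracy.
upper_bound = expected_character_errors(5_000_000, 8, 0.99995)
print(round(upper_bound))  # 2000

# The same accuracy applied at one stage and at five successive copying stages.
one_stage = prob_name_has_error(8, 0.99995, stages=1)
five_stages = prob_name_has_error(8, 0.99995, stages=5)
print(f"{one_stage:.4%} of names after one stage, {five_stages:.4%} after five")
```

Under these assumptions the share of affected names grows almost in proportion to the number of stages.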
- shortened versions of names
Shorter versions of names are particularly common with forenames. In most languages forename diminutives are not necessarily merely contracted versions, and sometimes are actually longer than the original name (eg the Russian forename Ivan has diminutives Vanya and Ivanushka). Forename diminutives often look quite different from the original name, eg Peggy = Margaret; Sandy = Alexander; Billy = William. The NameX thesaurus includes many such diminutives.
Of equal, if not greater, importance in many databases are deliberately abbreviated versions of forenames: eg Mgt, Margt, Marg't for Margaret; Eliz, Elizth, Eliz'th, E'beth, etc, for Elizabeth. The NameX thesaurus includes thousands of such abbreviations.
In the case of surnames shortened versions usually involve removal of a patronymic indicator (eg "Mac", "O", "ap", "Fitz").
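A naive sketch of removing the patronymic indicators named above is shown below. It is purely illustrative: the prefix list is limited to the four examples in the text, the minimum stem length is an arbitrary assumption, and a rule this simple will wrongly strip names that merely begin with the same letters (eg Machin), which is one reason a curated thesaurus is more reliable than prefix rules.

```python
# The four indicators mentioned in the text, in lower case.
PATRONYMIC_PREFIXES = ("mac", "fitz", "ap ", "o'")


def strip_patronymic(surname: str, min_stem: int = 3) -> str:
    """Return the surname without a leading patronymic indicator, if any."""
    lower = surname.strip().lower()
    for prefix in PATRONYMIC_PREFIXES:
        stem = lower[len(prefix):]
        # Only strip when enough of the name is left to be meaningful.
        if lower.startswith(prefix) and len(stem) >= min_stem:
            return stem
    return lower


for surname in ["O'Donnell", "MacDonald", "ap Richard", "Fitzgerald", "Mace"]:
    print(surname, "->", strip_patronymic(surname))
```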
The NameX Thesaurus
The NameX thesaurus is continually updated. This is done in two ways:
- As new lists of names become available they are processed to check if they contain new names, ie names not already in the thesaurus. If so these names are automatically paired (where possible) with existing thesaurus entries.
- From many of the data sources we work with at The Origins Network, we obtain or create lists of name variants, which can relate to modern names or to names dating back to the earliest usage of surnames (c.12th century). Often these variants would not have been identified by algorithmic means, so the NameX thesaurus is not simply machine-created; it also relies on our own knowledge of variants. This means that the NameX thesaurus can be used with confidence in applications involving both contemporary and historical data.
NameX is a product developed and owned by Image Partners Limited (IPL).
You can license our NameX thesauri (surname and forename) for incorporation into your own application. NameX data is organised as a thesaurus of name pairs with weights and would normally be held in a relational database. We will provide the thesauri in a format suitable for you to integrate into your application. Database applications using NameX need simply include a sub-select to include matching names from the Thesaurus above the selected threshold. An updated version of the thesaurus will be provided to you periodically.
Contact us for further details
Web Service Licence
NameX is also available as a Web Service, where the interface is provided as an XML transaction over an http connection. The advantage of this approach is that you do not need to build the NameX thesaurus into your own database management system, and you have access to our continually updated thesaurus.
We can process your own lists of names, either to create a thesaurus wholly based upon your name lists, or combined with our main thesaurus. You need only provide us with a flat file of the names you wish to have processed and we will return the set of weighted name pairs.
NameX is a registered trademark.
If you are interested in using NameX in your own applications, please contact us.
Address: Ecomis Ltd, Simon's Lee, Rackham Road, Amberley, West Sussex BN18 9NR, UK