Scientific names are critical metadata elements in biodiversity data. They are the scaffolding upon which all biological information hangs. However, scientific names are imperfect identifiers. Some taxa share the same name (e.g. homonyms across nomenclature codes), and there can be many names for the same taxon. Names change because of taxonomic and nomenclatural revisions, and they can be persistently misspelled in the literature. Optical scanning of printed material compounds the problem by introducing recognition errors, which adds further uncertainty to data integration.
This resolution service tries to answer the following questions about a string representing a scientific name:
- Is this a name?
- Is it spelled correctly?
- Is this name currently in use?
- What other names are related to this name (e.g. synonyms, lexical variants)?
- If this name is a homonym, which is the correct one?
Matching Process
1. Exact Matching
Submitted names are checked for exact matches against names in specified data sources or against the entire resolver database. If "resolve_once" is specified in the API, found names are immediately removed from the process instead of being resolved against all specified data sources. This significantly accelerates matching and can be used to discover if a string is in fact a name.
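The effect of "resolve_once" can be illustrated with a minimal sketch. It assumes each data source is simply an in-memory set of name strings; the function name, the source identifiers, and the data layout are hypothetical and are not the resolver's actual implementation.

```python
def exact_match(names, sources, resolve_once=False):
    """Return {name: [source_id, ...]} for names found verbatim in the sources.

    `sources` maps a (hypothetical) source id to a set of name strings.
    """
    matches = {}
    for name in names:
        for source_id, known_names in sources.items():
            if name in known_names:
                matches.setdefault(name, []).append(source_id)
                if resolve_once:
                    break  # first hit is enough; skip the remaining sources
    return matches


# Illustrative data only.
sources = {
    "source_a": {"Pardosa moesta", "Aotus"},
    "source_b": {"Aotus"},
}
all_hits = exact_match(["Aotus", "Afina"], sources)
first_hit = exact_match(["Aotus", "Afina"], sources, resolve_once=True)
```

Here all_hits lists both sources for "Aotus", while first_hit stops at the first one. "Afina" is absent from both results and would move on to the next step.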
2. Exact Matching of Canonical Forms
Name strings are often supplied with complex authorship information [e.g. Racomitrium canescens f. epilosum (H. Müll. ex Milde) G. Jones in Grout]. The Global Name parser strips authorship and rank information from names [e.g. Racomitrium canescens epilosum], which makes it possible to compare the string with other variants of the same name. Resulting canonical forms are checked for exact matches against canonical forms in specified data sources or in the entire resolver database. All found names are removed from the process at the completion of this step.
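To make the idea of a canonical form concrete, here is a deliberately naive sketch. It is not the Global Names parser: it keeps the first word plus the lowercase epithets that follow, skips a small, assumed list of rank markers, and stops at the first token that looks like authorship. Real name strings need a far more careful parser.

```python
# Assumed, incomplete list of rank markers, for illustration only.
RANK_MARKERS = {"f.", "fo.", "forma", "var.", "subsp.", "ssp.", "subvar."}


def toy_canonical_form(name_string):
    """Strip rank markers and trailing authorship from a name string (toy version)."""
    tokens = name_string.split()
    if not tokens:
        return ""
    parts = [tokens[0]]  # genus or uninomial
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            continue  # drop rank information
        if token.isalpha() and token.islower():
            parts.append(token)  # looks like an epithet
        else:
            break  # parenthesis, capital letter, or year: assume authorship starts here
    return " ".join(parts)


canonical = toy_canonical_form(
    "Racomitrium canescens f. epilosum (H. Müll. ex Milde) G. Jones in Grout"
)
```

Note that this toy version keeps any lowercase word, so "Pardosa moesta spider" is returned unchanged, with the junk word treated as an epithet.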
3. Fuzzy Matching of Canonical Forms
Mistakes, misspellings, or OCR errors can create incorrect variants of scientific names. Remaining canonical forms generated from the previous step are fuzzily matched against canonical forms in specified data sources. We use a modified version of the TaxaMatch algorithm developed by Tony Rees. After this step all found names are removed from the process.
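As a rough illustration of fuzzy matching, the sketch below uses Python's standard difflib similarity ratio as a stand-in. This is not the TaxaMatch algorithm, and the cutoff value is an arbitrary assumption. It only shows the general shape of the step: each unmatched canonical form is compared with the known canonical forms, and close candidates are kept.

```python
import difflib


def fuzzy_match(canonical, known_canonicals, cutoff=0.75):
    """Return up to three known canonical forms that resemble `canonical`.

    Uses difflib's similarity ratio, NOT TaxaMatch; the cutoff is an assumption.
    """
    return difflib.get_close_matches(canonical, known_canonicals, n=3, cutoff=cutoff)


known = ["Alina", "Aotus", "Pardosa moesta"]
misspelled = fuzzy_match("Pardosa mosta", known)
risky = fuzzy_match("Afina", known)
```

With these inputs "Pardosa mosta" is paired with "Pardosa moesta", and "Afina" is paired with "Alina". A short name differing by one letter can easily be a false positive, which is why fuzzy matches call for a lower confidence.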
4. Exact Matching of Specific Parts of Names
Some names that the parser recognizes as infraspecific are not found during the previous steps. This may be because the name is unknown to the resolver database, or because a 'junk' word was wrongly included and the parser took it for an infraspecific epithet. The algorithm extracts the specific (species-level) canonical forms from names recognized as infraspecific and tries to match this subset against the specified data sources or the entire resolver database. For example, "Pardosa moesta spider" will be cleansed and matched as "Pardosa moesta". All found names are removed from the process prior to proceeding to the following steps.
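The reduction to the specific part can be sketched as keeping only the first two words of a canonical form that has three or more. The helper name is hypothetical, and the sketch assumes the input is already a canonical form without authorship or rank markers.

```python
def specific_part(canonical):
    """Return the binomial part of an apparently infraspecific canonical form.

    Returns None when the form has fewer than three words, since there is
    nothing to strip.
    """
    words = canonical.split()
    if len(words) > 2:
        return " ".join(words[:2])
    return None


cleansed = specific_part("Pardosa moesta spider")
unchanged = specific_part("Pardosa moesta")
```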
5. Fuzzy Matching of Specific Parts of Names
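The specific canonical forms that are still unmatched after step 4 are fuzzily matched against the data sources, in the same way as in step 3. All found names are then removed from the process.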
6. Exact Matching of Genus Part of Names
Remaining names in the process as well as all remaining binomial canonical forms are reduced to the genus part and matched against the data sources or the entire database.
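The cascade as a whole can be sketched as follows. The sketch makes several simplifying assumptions: inputs are already canonical forms (so steps 1 and 2 collapse into one), there is a single in-memory set of known canonical forms, and difflib stands in for the TaxaMatch-based fuzzy matcher. All function names and the cutoff are hypothetical.

```python
import difflib


def closest(candidate, known, cutoff=0.85):
    """Best fuzzy hit for `candidate`, or None (difflib stand-in, assumed cutoff)."""
    hits = difflib.get_close_matches(candidate, sorted(known), n=1, cutoff=cutoff)
    return hits[0] if hits else None


def resolve(canonicals, known):
    """Return {canonical: (step, match)}; a name leaves the process at its first hit."""
    known = set(known)
    genera = {name.split()[0] for name in known}
    results = {}
    for canonical in canonicals:
        words = canonical.split()
        species = " ".join(words[:2]) if len(words) > 2 else None
        genus = words[0] if len(words) > 1 else None
        attempts = [
            ("exact canonical", lambda: canonical if canonical in known else None),
            ("fuzzy canonical", lambda: closest(canonical, known)),
            ("exact specific part", lambda: species if species in known else None),
            ("fuzzy specific part", lambda: closest(species, known) if species else None),
            ("exact genus", lambda: genus if genus in genera else None),
        ]
        for step, attempt in attempts:
            match = attempt()
            if match is not None:
                results[canonical] = (step, match)
                break  # matched names are removed from later steps
    return results


known = {"Pardosa moesta", "Aotus trivirgatus"}
results = resolve(
    ["Pardosa moesta", "Pardosa mosta", "Pardosa moesta spider", "Aotus azarae"],
    known,
)
```

Each name is reported with the first step that found it, which mirrors the rule that found names are removed before the next step runs.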
Taxonomic Context
If the "with_context" parameter is set to "true", the overall taxonomic group of all matched names is collected throughout the process. Scores for possible homonym matches are down-weighted if the resolved names do not belong to the overall taxonomic group of the queried list. If this is undesirable behavior, this parameter may be set to "false".
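The down-weighting can be sketched like this. The sketch assumes the "overall taxonomic group" is simply the most common group among all candidate matches, and that the penalty is a fixed multiplier. Both are illustrative assumptions, and the field names are hypothetical.

```python
from collections import Counter


def apply_context(candidates, penalty=0.5):
    """Down-weight candidates outside the most common taxonomic group.

    `candidates` is a list of dicts with 'name', 'group' and 'score' keys.
    The penalty factor is an arbitrary illustrative value.
    """
    if not candidates:
        return None, []
    counts = Counter(c["group"] for c in candidates)
    context_group, _ = counts.most_common(1)[0]
    adjusted = []
    for c in candidates:
        score = c["score"]
        if c["group"] != context_group:
            score *= penalty  # candidate falls outside the list's context
        adjusted.append({**c, "score": score})
    return context_group, adjusted


candidates = [
    {"name": "Aotus", "group": "Plantae", "score": 0.988},
    {"name": "Aotus", "group": "Animalia", "score": 0.988},
    {"name": "Pardosa moesta", "group": "Animalia", "score": 0.988},
    {"name": "Monochamus galloprovincialis", "group": "Animalia", "score": 0.988},
]
context_group, adjusted = apply_context(candidates)
```

In a list dominated by animal names, the plant homonym of Aotus ends up with a lower score than the animal one.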
Confidence Score
Not all matches are equally reliable. For example, if the name Aotus is matched perfectly to a plant genus, the match is still incorrect if the queried name actually refers to a genus of monkeys. Another example is poor fuzzy matching: the name Afina can be fuzzily matched to the genus Alina in the order Lepidoptera. Matches of trinomial or binomial names have greater accuracy. Matching of authorship information further increases the likelihood of a correct match. However, different authorship does not always mean different taxonomic meaning. For example, Monochamus galloprovincialis (Olivier, 1795) and Monochamus galloprovincialis Secchi, 1998 both refer to the same species, where the former indicates the original author of the name and the latter is merely a reference to the name. The name resolver produces a "confidence score" to accommodate all these potential issues. The score is produced from a curvilinear plot of weighted decisions.
We start at 0 on the x-axis and assign positive values to events that increase the probability score, and negative values to events that decrease it. For example, an exact match of a binomial name increases the probability significantly, so we move the slider 3 points to the right, to a corresponding score of 0.988. However, if the authorship of the name was not matched correctly, we move the slider 2 points to the left, to a corresponding score of 0.75. We try to map the confidence level to the resulting scores. For example, 0.5 means neutral confidence, whereas 0.99 means high confidence.
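The slider arithmetic from the example can be written out as a small sketch. The formula of the curve is not given here, so the sketch does not guess it: it only looks up the positions whose scores are stated in the text (0, 1 and 3). The event names and the lookup table are illustrative assumptions.

```python
# Slider adjustments taken from the example in the text; the names are hypothetical.
EVENT_POINTS = {
    "exact_binomial_match": 3,
    "authorship_mismatch": -2,
}

# Only the (position, score) pairs stated in the text; the full curve is not reproduced.
STATED_SCORES = {0: 0.5, 1: 0.75, 3: 0.988}


def slider_position(events):
    """Sum the adjustments for the observed events, starting from 0."""
    return sum(EVENT_POINTS[event] for event in events)


def stated_score(position):
    """Return the score given in the text for this position, or None if unstated."""
    return STATED_SCORES.get(position)


exact_only = stated_score(slider_position(["exact_binomial_match"]))
with_mismatch = stated_score(
    slider_position(["exact_binomial_match", "authorship_mismatch"])
)
```

An exact binomial match alone lands on position 3. Adding an authorship mismatch moves it back to position 1, which is the 0.75 score quoted above.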