Geographic Name Resolution Service
The Geographic Name Resolution Service (GNRS) corrects and standardizes political division names, in accordance with the GeoNames database of geographic place names.
Why a GNRS?
Correct political division names are essential for assessing the accuracy of geographic coordinates. One of BIEN's assessments verifies that the coordinates of an observation fall within its declared political divisions. Points falling outside are labelled as inaccurate so that they can be excluded from analyses. Such "bad" observations are also excluded from use as inputs for species range models.
Coordinates can be assessed only if their declared political divisions match the names of political division polygons in the GADM (Database of Global Administrative Areas) reference database. Unfortunately, political division name spellings are notoriously fickle. Variation due to misspellings, abbreviations, alternative codes, local usages, and language differences means that a large fraction of observation records within the BIEN database have political divisions that do not match a GADM name. Indeed, even reference databases such as GADM and GeoNames sometimes use different names for the same political division!
Such mismatches result in a large volume of otherwise "good" data being discarded because its accuracy cannot be verified. For this reason, correcting and standardizing political division names is as important for biodiversity data quality as correcting and standardizing taxonomic names.
The GNRS accepts as input a UTF-8 CSV text file consisting of a header plus one or more one- to three-part political division names. Each name consists of a country (first level) and, optionally, a state_province (second level) and a county_parish (third level). Only the country name is always required. A value for state_province is required only if county_parish is also included. An optional user-supplied unique ID may also be included for joining back to the original data where accents or hidden characters interfere with joins. Fields must be separated with commas, and may also be surrounded with double quotes. Double quotes are required around any value that contains commas. The first line of the file must be the header containing the column names.
|user_id||User-supplied unique ID||No|
|country||Country name||Yes|
|state_province||State/province name||No (yes if followed by county_parish)|
|county_parish||Third political division name||No|
Valid input file examples
Example 1 (with user_id):
user_id,country,state_province,county_parish
2,U.S.A., Arizona, Pima County
3,United States, Illinois
6,México,Oaxaca,Municipio de Ixtlán
Example 2 (no user_id):
user_id,country,state_province,county_parish
,U.S.A., Arizona, Pima County
,United States, Illinois
,México,Oaxaca,Municipio de Ixtlán
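The input rules above can be checked before submission. The following is a minimal sketch in Python, not part of the GNRS itself: the function names `validate_row` and `read_input` are hypothetical, and the third sample row is invented to show a rejected record and a quoted value containing a comma.

```python
import csv
import io


def validate_row(row):
    """Return a list of problems found in one parsed input row."""
    problems = []
    # Short rows leave trailing columns as None, so normalise to "".
    country = (row.get("country") or "").strip()
    state = (row.get("state_province") or "").strip()
    county = (row.get("county_parish") or "").strip()
    if not country:
        problems.append("country is required")
    if county and not state:
        problems.append("state_province is required when county_parish is given")
    return problems


def read_input(text):
    """Parse CSV text whose first line is the header; yield (row, problems)."""
    # skipinitialspace drops the blank that follows a comma, as in "U.S.A., Arizona".
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    if "country" not in (reader.fieldnames or []):
        raise ValueError("header must contain a 'country' column")
    for row in reader:
        yield row, validate_row(row)


sample = (
    "user_id,country,state_province,county_parish\n"
    "2,U.S.A., Arizona, Pima County\n"
    "3,United States, Illinois\n"
    # Hypothetical bad row: county_parish given without state_province.
    '7,Mexico,,"Ixtlán, Municipio de"\n'
)
results = list(read_input(sample))
for row, problems in results:
    print(row["user_id"], problems or "ok")
```

Reading a real file would use `open(path, encoding="utf-8", newline="")` in place of `io.StringIO`.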
The matching process
Each component of the political division triplet is compared against the GeoNames database using exact and fuzzy matching. For each name component, matches are attempted against a variety of name variants until a match is found or all matching options are exhausted. Exact matches are used first, beginning with standard English-language names, followed by accented and unaccented (plain ASCII) versions of alternate names in a variety of languages, including the official language of the political division itself. Matching is also attempted against various standard and non-standard abbreviations and codes, such as 2- and 3-character ISO codes, FIPS codes and postal abbreviations. If all exact matches fail, then a second round of fuzzy matching is performed against standard and alternate full names. To avoid excessive false positives, codes and abbreviations are not fuzzy matched.
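The order of operations described above (exact matches across name variants first, then fuzzy matches against full names only) might be sketched as follows. This is an illustration under stated assumptions, not the GNRS implementation: the `REFERENCE` table is toy data, `match_name` is a hypothetical name, the standard library's `difflib` ratio stands in for the trigram score used by the GNRS, and picking the highest-scoring fuzzy candidate is an assumption.

```python
from difflib import SequenceMatcher

# Toy reference data (hypothetical); the real GNRS uses GeoNames-derived tables.
REFERENCE = {
    "Mexico": {
        "standard name": ["Mexico"],
        "alternate name": ["México", "Mexique"],
        "iso code": ["MX", "MEX"],
    },
    "United States": {
        "standard name": ["United States"],
        "alternate name": ["United States of America", "Estados Unidos"],
        "iso code": ["US", "USA"],
    },
}
EXACT_ORDER = ["standard name", "alternate name", "iso code"]
FUZZY_ORDER = ["standard name", "alternate name"]  # codes are not fuzzy matched


def score(a, b):
    """Stand-in similarity in [0, 1]; NOT the trigram score the GNRS uses."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_name(submitted, threshold=0.5):
    """Try exact matches over all variant types, then fuzzy over full names."""
    key = submitted.strip().casefold()
    # Round 1: exact, case-insensitive, in a fixed order of variant types.
    for variant in EXACT_ORDER:
        for canonical, variants in REFERENCE.items():
            if any(key == v.casefold() for v in variants[variant]):
                return {"matched": canonical, "method": "exact",
                        "variant": variant, "score": None}
    # Round 2: fuzzy, full names only; keep the best candidate above threshold.
    best = None
    for variant in FUZZY_ORDER:
        for canonical, variants in REFERENCE.items():
            for v in variants[variant]:
                s = score(key, v)
                if s >= threshold and (best is None or s > best["score"]):
                    best = {"matched": canonical, "method": "fuzzy",
                            "variant": variant, "score": s}
    return best  # None when every option is exhausted


print(match_name("MX"))
print(match_name("Untied States"))
print(match_name("Qzx"))
```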
The matching process is hierarchical. Once a country has been found, only second-level names (i.e., state/province names) within that country are searched. If no match is found at a given level, lower levels are not searched and matching stops.
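A small sketch of this hierarchical search, assuming a nested lookup structure: the `HIERARCHY` data and the function names are hypothetical, and a plain case-insensitive comparison stands in for the full exact and fuzzy matching of each name.

```python
# Hypothetical nested reference: country -> state/province -> counties.
HIERARCHY = {
    "United States": {
        "Arizona": {"Pima County", "Maricopa County"},
        "Illinois": {"Cook County"},
    },
    "Mexico": {
        "Oaxaca": {"Ixtlán de Juárez"},
        "Sonora": {"Hermosillo"},
    },
}


def find(name, candidates):
    """Case-insensitive exact lookup among the candidates of one level."""
    if not name:
        return None
    wanted = name.strip().casefold()
    for candidate in candidates:
        if candidate.casefold() == wanted:
            return candidate
    return None


def resolve(country, state_province=None, county_parish=None):
    """Match level by level; stop at the first level that fails."""
    result = {"country": None, "state_province": None, "county_parish": None}
    c = find(country, HIERARCHY)
    if c is None:
        return result  # no country: lower levels are not searched
    result["country"] = c
    # Only states/provinces inside the matched country are candidates.
    s = find(state_province, HIERARCHY[c])
    if s is None:
        return result
    result["state_province"] = s
    result["county_parish"] = find(county_parish, HIERARCHY[c][s])
    return result


print(resolve("united states", "Arizona", "Pima County"))
print(resolve("Mexico", "Arizona", "Pima County"))
```

In the second call the country matches, but "Arizona" is not a state of Mexico, so the county is never searched even though "Pima County" exists elsewhere in the reference data.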
For each of the three name components, the GNRS reports the match method used ("exact" or "fuzzy"), the name variant matched ("standard name", "alternate name", "ascii name", "iso code", etc.), and, if fuzzy matching was used, the match score. In addition, an overall assessment of match success is provided ("full match", "partial match", "no match"), along with the level ("country", "state_province", "county_parish") of the submitted and matched names.
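One way the overall assessment could be derived from the per-level results is sketched below. The exact rules are not given here, so this assumes that "full match" means every submitted level was matched, "partial match" means only some were, and that the reported level is the lowest one present. The function name and the output keys are hypothetical; see the GNRS Data Dictionary for the actual field names.

```python
LEVELS = ["country", "state_province", "county_parish"]


def assess(submitted, matched):
    """Summarise a result; both arguments are dicts keyed by level name.

    A missing key, None, or an empty string means the level is absent.
    """
    submitted_levels = [lvl for lvl in LEVELS if submitted.get(lvl)]
    matched_levels = [lvl for lvl in LEVELS if matched.get(lvl)]
    if not matched_levels:
        status = "no match"
    elif len(matched_levels) == len(submitted_levels):
        status = "full match"
    else:
        status = "partial match"
    return {
        # Hypothetical key names, chosen for this sketch only.
        "match_status": status,
        "level_submitted": submitted_levels[-1] if submitted_levels else None,
        "level_matched": matched_levels[-1] if matched_levels else None,
    }


summary = assess(
    {"country": "U.S.A.", "state_province": "Arizona", "county_parish": "Pima Cnty"},
    {"country": "United States", "state_province": "Arizona", "county_parish": None},
)
print(summary)
```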
The GNRS performs fuzzy matching using the trigram algorithm, as implemented by the "similarity" function of the PostgreSQL extension pg_trgm. Trigram match scores >= 0.5 (50%) are considered matches, and scores < 0.5 are considered non-matches. The 50% match score threshold was chosen by inspection as providing the best mix of true versus false positives and negatives.
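To make the score concrete, here is a pure-Python approximation of trigram similarity. It follows the documented behaviour of pg_trgm (lowercase the text, treat non-alphanumeric characters as word separators, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams), but it is a sketch for illustration and may differ from the extension in edge cases. The helper names `trigrams` and `is_match` are hypothetical.

```python
import re


def trigrams(text):
    """Set of three-character substrings, approximating pg_trgm's rules."""
    grams = set()
    # [^\W_]+ keeps runs of letters and digits; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_match(a, b, threshold=0.5):
    """Apply the 0.5 threshold described above."""
    return similarity(a, b) >= threshold


# "Arizona" and "Arizonia" share 6 trigrams out of 11 distinct ones.
print(similarity("Arizona", "Arizonia"), is_match("Arizona", "Arizonia"))
print(similarity("Arizona", "Illinois"), is_match("Arizona", "Illinois"))
```

Because the score is a ratio over sets of trigrams, short names are penalised heavily by a single wrong character, which is one reason a fixed threshold has to be chosen by inspection.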
Due to limitations of the PostgreSQL pg_trgm extension, fuzzy matching is effective only for Latin-based languages. Political division names in languages such as Japanese and Arabic can be matched exactly but not approximately. Fortunately, the GeoNames database contains a large number of name variants (if not misspellings) in non-Latin-based languages.
The GNRS reference database
Although the GNRS database is based largely on GeoNames, the original tables are restructured into country, state_province and county_parish tables to remove unneeded content (for example, names of cities and geographic features) and to improve search speed. Additional content has been added to include widely used name variants not present in GeoNames. These include HASC codes, common postal abbreviations, and alternative formulations of second- and third-level names with and without rank labels such as "State", "Province", "County", "Municipality", etc.
Rank-labelled variants of second- and third-level political division names are searched in multiple languages in a variety of formulations that include and exclude prepositions and articles, as appropriate to each language. For example, the Mexican state of "Sonora" would be searched as "Sonora", "Estado de Sonora", "State of Sonora", "Sonora State", "État de Sonora", etc. These searches are performed on both accented and non-accented versions of each name. All searches are case-insensitive.
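The variant expansion described above might look like the following sketch. The template list is a hypothetical subset built from the "Sonora" example, and the function names are illustrative; the GNRS stores such variants in its reference tables rather than generating them this way.

```python
import unicodedata

# Hypothetical rank-label templates; "{}" marks where the bare name goes.
TEMPLATES = ["{}", "Estado de {}", "State of {}", "{} State", "État de {}"]


def strip_accents(text):
    """Return the text without combining accent marks (e.g. É -> E)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_variants(name):
    """Accented and unaccented rank-labelled forms, folded for case."""
    variants = set()
    for template in TEMPLATES:
        full = template.format(name)
        for form in (full, strip_accents(full)):
            variants.add(form.casefold())
    return variants


def matches_variant(submitted, name):
    """Case-insensitive membership test against the generated variants."""
    return submitted.strip().casefold() in search_variants(name)


for variant in sorted(search_variants("Sonora")):
    print(variant)
print(matches_variant("ESTADO DE SONORA", "Sonora"))
```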
Currently, the GNRS operates in batch mode only. This mode requires shell access to the BIEN servers and is therefore for internal use only. Planned APIs and a web user interface will make the GNRS services available to the public for general use. Stay tuned.
For definitions of fields returned by the GNRS, see the GNRS Data Dictionary. Understanding the meaning of these fields is important as many are displayed within the BIEN analytical database. The raw GNRS output from processing all BIEN observations is also available for inspection within the BIEN database.