Geographic Name Resolution Service
The Geographic Name Resolution Service (GNRS) corrects and standardizes political division names in compliance with the GeoNames database of geographic place names.
Why a GNRS?
Correctly spelled political division names are essential for assessing the accuracy of geographic coordinates. One of BIEN's key validations verifies that the coordinates of a species observation fall within its declared political divisions. Points falling outside are labelled as errors so that they can be excluded from analyses, including use as inputs for species range modeling.
Coordinates can be assessed only if their declared political divisions match the names of political division polygons in the GADM (Database of Global Administrative Areas) reference database. Unfortunately, political division name spellings are notoriously fickle. Variation due to misspellings, abbreviations, alternative codes, local usages, and language differences means that a large fraction of observation records within the BIEN database have political divisions that do not match a GADM name. Indeed, even reference databases such as GADM and GeoNames sometimes use different names for the same political division! For this reason, GADM political division names must also be standardized to GeoNames before their associated spatial attributes are used for point-in-polygon validation of geocoordinates and associated political divisions.
Misspelled political division names can result in a large volume of otherwise "good" data being discarded because its accuracy cannot be verified. In assessing the quality of biodiversity data, correcting and standardizing political division names is just as important as correcting and standardizing taxonomic names.
The GNRS accepts as input a UTF-8 CSV text file consisting of a header plus one or more one- to three-part political division names. Each name consists of a country (first level), a state_province (second level) and county_parish (third level). Only country name is always required. A value for state_province is required only if county_parish is also included. A user-supplied unique ID (optional) may also be included for joining back to the original data where accents or hidden characters interfere with joins. Fields must be separated with commas, and may also be surrounded with double quotes. Double quotes are required around any values which contain commas. The first line of the file must be the header containing the column names. For more details, see GNRS Input Format.
|Column|Description|Required?|
|user_id|User-supplied unique ID|No|
|country|Country name|Yes|
|state_province|State/province name|No (yes if followed by county_parish)|
|county_parish|Third political division name|No|
Valid input file examples
Example 1 (with user_id):
2,U.S.A., Arizona, Pima County
3,United States, Illinois
6,México,Oaxaca,Municipio de Ixtlán
Example 2 (no user_id):
,U.S.A., Arizona, Pima County
,United States, Illinois
,México,Oaxaca,Municipio de Ixtlán
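The input rules above can be sketched in Python as follows. This is a minimal illustration, not part of the GNRS itself: the function name `build_gnrs_input` is hypothetical, and the column names are the four listed in the table above. The standard `csv` module adds double quotes only around values that contain commas, which matches the quoting rule described earlier.

```python
import csv
import io

# Column names taken from the input format described above.
FIELDS = ["user_id", "country", "state_province", "county_parish"]


def build_gnrs_input(rows):
    """Return CSV text (header plus one line per name) for a list of dicts.

    Hypothetical helper: missing keys become empty fields, and a ValueError
    is raised when a row breaks the rules stated in the text.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        clean = {f: (row.get(f) or "").strip() for f in FIELDS}
        if not clean["country"]:
            raise ValueError("country is always required")
        if clean["county_parish"] and not clean["state_province"]:
            raise ValueError("state_province is required when county_parish is given")
        writer.writerow(clean)
    return buf.getvalue()


rows = [
    {"user_id": "2", "country": "U.S.A.", "state_province": "Arizona",
     "county_parish": "Pima County"},
    {"user_id": "3", "country": "United States", "state_province": "Illinois"},
    {"user_id": "6", "country": "México", "state_province": "Oaxaca",
     "county_parish": "Municipio de Ixtlán"},
]
csv_text = build_gnrs_input(rows)
```

To produce a file, write `csv_text` with `open(path, "w", encoding="utf-8", newline="")` so that the result is UTF-8 as the GNRS expects.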
For definitions of fields returned by the GNRS, see the GNRS Data Dictionary. Understanding the meaning of these fields is important; many are also stored within the BIEN analytical database. The raw GNRS output from processing all BIEN observations is also available for inspection within the BIEN database.
The matching process
Each component of the political division triplet is compared against the GeoNames database using exact and fuzzy matching. For each name component, matches are attempted against a variety of name variants until a match is found or all matching options are exhausted. Exact matches are used first, beginning with standard English-language names, followed by accented and unaccented (plain ASCII) versions of alternate names in a variety of languages, including the official language of the political division itself. Matching is also attempted against various standard and non-standard abbreviations and codes, such as 2- and 3-character ISO codes, FIPS codes and postal abbreviations. If all exact matches fail, then a second round of fuzzy matching is performed against standard and alternate full names. To avoid excessive false positives, codes and abbreviations are not fuzzy matched.
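As a rough illustration of the exact-matching rounds described above, the sketch below tries a name against an ordered series of lookup tables and reports which kind of variant matched. The tables, the function name `exact_match`, and the result keys are all invented for this example and do not reflect the actual GNRS schema. The fuzzy round is deliberately left out.

```python
# Toy lookup tables for a single country, invented for illustration.
# The order follows the text: standard name, alternate names, then codes.
EXACT_ROUNDS = [
    ("standard name", {"mexico": "MX"}),
    ("alternate name", {"méxico": "MX", "mexique": "MX"}),
    ("iso code", {"mx": "MX", "mex": "MX"}),
]


def exact_match(name):
    """Return details of the first exact match, or None if every round fails."""
    key = name.strip().casefold()  # case-insensitive comparison
    for variant, table in EXACT_ROUNDS:
        if key in table:
            return {
                "id": table[key],
                "match_method": "exact",
                "name_matched": variant,
            }
    return None  # a real resolver would now fall back to fuzzy matching
```

With these toy tables, `exact_match("México")` is reported as an alternate-name match, `exact_match("MEX")` as a code match, and the misspelling `"Mexcio"` returns `None`, leaving it for the fuzzy round.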
The matching process is hierarchical. Once a country has been found, only second-level names (i.e., state/province names) within that country are searched. If no match is found at a given level, lower levels are not searched and matching stops.
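The hierarchical search can be sketched as nested lookups in which each level is searched only within the parent that was already matched. The reference tables and identifiers below are toy data invented for this example, and only exact matching is shown.

```python
# Toy reference tables, invented for illustration only.
COUNTRIES = {"united states": "US", "mexico": "MX"}
STATES = {
    "US": {"arizona": "US.AZ", "illinois": "US.IL"},
    "MX": {"oaxaca": "MX.OAX"},
}
COUNTIES = {
    "US.AZ": {"pima county": "US.AZ.PIMA"},
}
LEVELS = ["country", "state_province", "county_parish"]


def resolve(country, state_province="", county_parish=""):
    """Match each level within its matched parent; stop at the first failure."""
    matched = dict.fromkeys(LEVELS)  # every level starts unmatched (None)

    country_id = COUNTRIES.get(country.strip().casefold())
    if country_id is None:
        return matched  # no country, so lower levels are never searched
    matched["country"] = country_id

    # Only states belonging to the matched country are candidates.
    state_id = STATES.get(country_id, {}).get(state_province.strip().casefold())
    if state_id is None:
        return matched
    matched["state_province"] = state_id

    # Only counties belonging to the matched state are candidates.
    matched["county_parish"] = COUNTIES.get(state_id, {}).get(
        county_parish.strip().casefold()
    )
    return matched
```

For instance, `resolve("Mexico", "Arizona", "Pima County")` matches only the country: Arizona is not a state of Mexico, so the county is never searched.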
For each of the three name components, the GNRS reports the match method used ("exact" or "fuzzy"), the name variant matched ("standard name", "alternate name", "ascii name", "iso code", etc.), and, if fuzzy matching was used, the match score. In addition, an overall assessment of match success is provided ("full match", "partial match", "no match"), along with the level ("country", "state_province", "county_parish") of the submitted and matched names.
The GNRS performs fuzzy matching using the trigram algorithm, as implemented by the function "similarity" in the PostgreSQL extension pg_trgm. Trigram match scores >= 0.5 (50%) are considered matches, and scores < 0.5 are considered non-matches. The 50% threshold was chosen by inspection as providing the best balance between false positives and false negatives.
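To give a feel for how trigram scores behave, here is a pure-Python approximation of trigram similarity. It follows the behaviour documented for pg_trgm (lower-casing, splitting on non-alphanumeric characters, padding each word with two leading spaces and one trailing space, and dividing shared trigrams by total distinct trigrams), but it is only a sketch and may differ from PostgreSQL in edge cases such as locale handling. The helper name `is_fuzzy_match` is hypothetical.

```python
import re

THRESHOLD = 0.5  # match threshold stated in the text


def trigrams(text):
    """Return the set of padded trigrams for every word in the text."""
    grams = set()
    # Words are runs of letters and digits; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(a, b, threshold=THRESHOLD):
    """Apply the score threshold to decide whether two names match."""
    return similarity(a, b) >= threshold
```

Under this approximation, "Arizona" and the misspelling "Arizone" share 6 of 10 distinct trigrams, giving a score of 0.6, which clears the 0.5 threshold. "Sonora" and "Oaxaca" share none and score 0.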
Due to limitations of the PostgreSQL pg_trgm extension, fuzzy matching is effective only for languages written in Latin-based scripts. Political division names in languages such as Japanese and Arabic can be matched exactly but not approximately. Fortunately, the GeoNames database contains a large number of name variants (although not misspellings) in non-Latin scripts.
The GNRS reference database
Although the GNRS database is based largely on GeoNames, the original tables are restructured into country, state_province and county_parish tables to remove unneeded content (for example, names of cities and geographic features) and to improve search speed. Additional content has been added to include widely used name variants not present in GeoNames. These include HASC codes, common postal abbreviations, and alternative formulations of second- and third-level names with and without rank labels such as "State", "Province", "County", "Municipality", etc.
Rank-labelled variants of second- and third-level political division names are searched in multiple languages in a variety of formulations that include and exclude prepositions and articles, as appropriate to each language. For example, the Mexican state of "Sonora" would be searched as "Sonora", "Estado de Sonora", "State of Sonora", "Sonora State", "État de Sonora", etc. These searches are performed on both accented and non-accented versions of each name. All searches are case-insensitive.
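The sketch below shows one way such search keys could be generated. It is an illustration under assumptions, not the GNRS implementation: the template list is hypothetical and covers only the "Sonora" formulations quoted above, accents are stripped with Unicode decomposition, and `casefold` stands in for case-insensitive comparison.

```python
import unicodedata

# Hypothetical rank-label templates, limited to the formulations quoted above.
STATE_TEMPLATES = [
    "{name}",
    "Estado de {name}",
    "State of {name}",
    "{name} State",
    "État de {name}",
]


def strip_accents(text):
    """Remove combining marks, e.g. 'México' becomes 'Mexico'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(name, templates=STATE_TEMPLATES):
    """Return case-folded accented and unaccented variants of a name."""
    keys = set()
    for template in templates:
        variant = template.format(name=name)
        keys.add(variant.casefold())
        keys.add(strip_accents(variant).casefold())
    return keys
```

Calling `search_keys("Sonora")` yields six keys: the five formulations plus the unaccented "etat de sonora".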
There are currently three interfaces for the GNRS.
- GNRS Batch Application. The core GNRS application. Collaborators with shell accounts on our BIEN servers can access the GNRS directly via the unix command line.
- GNRS API. The GNRS API provides public access to the GNRS via a standardized interface that can be invoked using a variety of languages. General instructions and an example of how to access the GNRS API in R are provided here.
- RGNRS. The RGNRS R package is a wrapper around the GNRS API. It provides a variety of functions that simplify working with the GNRS for users familiar with R. The RGNRS package is not yet on CRAN but can be downloaded from the RGNRS GitHub repository.
An additional option is to install your own instance of the GNRS by following the instructions in the GNRS GitHub repository. The GNRS runs on Unix-type operating systems (preferably Ubuntu; not tested on other *nix flavors) and requires PostgreSQL 10 or later. PHP is needed only if you will also be using the API. Detailed instructions for installing GNRS Batch are provided here. You will also need to build your own GNRS database (described here) and a local instance of GeoNames (more information here and here).