Why do we need a GNRS?
Correctly-spelled political division names are essential for assessing the accuracy of geographic coordinates. A key BIEN data quality validation verifies that the coordinates of a species observation fall within its declared political divisions. Points falling outside are labelled as errors so that they can be excluded from analyses and species range modeling.
The accuracy of geocoordinate locations can be assessed only if the associated political division name matches the name of a corresponding political division polygon within the reference GADM (Global Administrative Areas) database. Unfortunately, political division name spellings are notoriously variable, especially for lower-level political units such as county and municipality. Misspellings, abbreviations, codes, local usages, and language differences mean that a large fraction of observation records within the BIEN database have non-matching political division names, even when the coordinates are correctly located. Indeed, different reference databases sometimes use different names for the same political division! For this reason, GADM political division names must also be standardized to GeoNames prior to using their associated spatial attributes for point-in-polygon validation of geocoordinates and associated political divisions.
Misspelled political division names can result in a large number of good data points being discarded because their accuracy cannot be verified. In assessing the quality of biodiversity data, validating political division names can be just as important as correcting and standardizing taxonomic names.
Spelling Variation of Political Divisions
The following are actual, alternative spellings in the BIEN database of the political division hierarchy for Cherokee County, Alabama:
United States, Alabama, Cherokee County
US, Alabama, CHEROKEE COUNTY
US, Alabama, cherokee
US, Alabama, Cherokee Co.
U.S.A., Alabama, Cherokee
USA, AL, CHEROKEE
Although it may be readily apparent to the human eye that all of the above names refer to the same place, it takes a lot more work for a computer to match them all to "United States, Alabama, Cherokee County". The GNRS does this matching, returning a variety of standardized plain English and native language spellings, along with standard abbreviations, codes and numeric identifiers. For Cherokee County, information returned by the GNRS includes the following fields:
country: | United States |
state_province: | Alabama |
county_parish: | Cherokee |
geonameid: | 4054880 |
gid_0: | USA |
gid_1: | USA.1_1 |
gid_2: | USA.1.10_1 |
country_iso: | US |
state_province_iso: | AL |
county_parish_iso: | 19 |
The above political division is now a machine-readable entity that can be rapidly validated and processed by computer algorithms.
Input
Input to the GNRS is a UTF-8 CSV text file consisting of a header plus one or more one- to three-part political division names. Each name consists of a country (first level), a state_province (second level) and county_parish (third level). Only country name is always required. A value for state_province is required only if county_parish is also included. A user-supplied unique ID (optional) may also be included for joining back to the original data where accents or hidden characters interfere with joins. Fields must be separated with commas, and may also be surrounded with double quotes. Double quotes are required around any values which contain commas. The first line of the file must be the header containing the column names. For more details, see GNRS Input Format.
Field definitions:
Column Name | Meaning | Required? |
user_id | User-supplied unique ID | No |
country | Country name | Yes |
state_province | State/province name | No (yes if followed by county_parish) |
county_parish | Third political division name | No |
Examples of valid input files for the GNRS API and GNRS R package
Example 1 (with user_id):
1,USA,Arizona,Pima
2,U.S.A., Arizona, Pima County
3,United States,Illinois,
4,Mexico,Sonora,
5,Mexico,Oaxaca,Ixtlan
6,México,Oaxaca,Municipio de Ixtlán
Example 2 (no user_id):
,USA,Arizona,Pima
,U.S.A., Arizona, Pima County
,United States, Illinois,
,Mexico,Sonora,
,Mexico,Oaxaca,Ixtlan
,México,Oaxaca,Municipio de Ixtlán
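The input rules above can be sketched in Python as follows. This is a minimal illustration, not part of the GNRS itself: the function name `write_gnrs_input` and the sample rows are hypothetical, and the header column names are taken from the field definitions table. The standard `csv` module adds double quotes only around values that contain commas.

```python
import csv
import io

# Column names taken from the field definitions table above.
FIELDS = ["user_id", "country", "state_province", "county_parish"]

def write_gnrs_input(rows, handle):
    """Write political division rows as CSV with a header line (hypothetical helper)."""
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(FIELDS)
    for user_id, country, state_province, county_parish in rows:
        if not country:
            raise ValueError("country is always required")
        if county_parish and not state_province:
            raise ValueError("state_province is required when county_parish is given")
        writer.writerow([user_id, country, state_province, county_parish])

# Sample rows invented for illustration.
rows = [
    ("1", "USA", "Arizona", "Pima"),
    ("2", "United States", "Illinois", ""),
    ("3", "México", "Oaxaca", "Municipio de Ixtlán"),
    ("4", "Korea, Republic of", "", ""),  # contains a comma, so it gets quoted
]

buffer = io.StringIO()
write_gnrs_input(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, open it with `open(path, "w", encoding="utf-8", newline="")` so that the output is UTF-8, as the GNRS expects.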
Examples of valid input for the GNRS website
If you are using the GNRS website, just enter <country>,<state>,<county>. You can skip user_id. But don't forget to enter both commas, even if county or state and county are missing.
USA,Arizona,Pima
U.S.A., Arizona, Pima County
United States, Illinois,
Mexico,Sonora,
Mexico,Oaxaca,Ixtlan
México,Oaxaca,Municipio de Ixtlán
Try it out! Copy the above six lines and paste them into the GNRS web interface.
Output
Output from the GNRS is a table of standardized, fully-spelled political division names, abbreviations and codes (ISO, HASC, etc.), along with similarity scores for any names requiring fuzzy matching and a short narrative summary of how each name was matched. Both GeoNames and GADM names, codes and identifiers are returned. For definitions of fields returned by the GNRS, see the GNRS Data Dictionary. Understanding the meaning of these fields is important; many are also stored within the BIEN analytical database. The raw GNRS output from processing all BIEN observations is retained for inspection within the BIEN database.
The matching process
Each component of the political division triplet is compared against the GeoNames database using exact and fuzzy matching. For each name component, matches are attempted against a variety of name variants until a match is found or all matching options are exhausted. Exact matches are used first, beginning with standard English-language names, followed by accented and unaccented (plain ASCII) versions of alternate names in a variety of languages, including the official language of the political division itself. Matching is also attempted against various standard and non-standard abbreviations and codes, such as 2- and 3-character ISO codes, HASC codes and postal abbreviations. If all exact matches fail, then a second round of fuzzy matching is performed against standard and alternate full names. To avoid excessive false positives, codes and abbreviations are not fuzzy matched.
The matching process is hierarchical. Once a country has been found, only second-level names (i.e., state/province names) within that country are searched. If no match is found at a given level, lower levels are not searched and matching stops.
For each of the three name components, the GNRS reports the match method used ("exact" or "fuzzy"), the name variant matched ("standard name", "alternate name", "ascii name", "iso code", etc.), and, if fuzzy matching was used, the match score. In addition, an overall assessment of match success is provided ("full match", "partial match", "no match"), along with the level ("country", "state_province", "county_parish") of the submitted and matched names.
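The hierarchical search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the GNRS implementation: the `REFERENCE` data, `find_exact` and `resolve` are hypothetical, only exact matching against lower-cased variants is shown, and the rule used to derive "full match", "partial match" and "no match" is an assumed interpretation of those labels.

```python
LEVELS = ["country", "state_province", "county_parish"]

# Toy reference data invented for illustration; the real GNRS searches
# database tables derived from GeoNames.
REFERENCE = {
    "United States": {
        "variants": {"united states", "usa", "us", "u.s.a."},
        "states": {
            "Alabama": {
                "variants": {"alabama", "al"},
                "counties": {
                    "Cherokee": {
                        "variants": {"cherokee", "cherokee county", "cherokee co."},
                    },
                },
            },
        },
    },
}

def find_exact(name, candidates):
    """Return (standard name, entry) for a case-insensitive exact hit, else (None, None)."""
    key = name.strip().lower()
    for standard, entry in candidates.items():
        if key in entry["variants"]:
            return standard, entry
    return None, None

def resolve(country, state_province="", county_parish=""):
    """Match each level only within the parent matched at the level above."""
    submitted = [country, state_province, county_parish]
    child_keys = ["states", "counties", None]
    matched = {}
    candidates = REFERENCE
    for level, value, child_key in zip(LEVELS, submitted, child_keys):
        if not value.strip():
            break  # nothing submitted at this level
        standard, entry = find_exact(value, candidates)
        if standard is None:
            break  # no match here, so lower levels are not searched
        matched[level] = standard
        candidates = entry[child_key] if child_key else {}
    n_submitted = sum(1 for value in submitted if value.strip())
    if matched and len(matched) == n_submitted:
        status = "full match"
    elif matched:
        status = "partial match"
    else:
        status = "no match"
    return {"matched": matched, "match_status": status}

print(resolve("USA", "AL", "CHEROKEE"))
print(resolve("US", "Alabama", "Pima"))
```

Restricting each search to the children of the already-matched parent keeps the candidate set small and avoids confusing identically named counties in different states.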
Fuzzy matching
The GNRS performs fuzzy matching using the trigram algorithm, as implemented by the function "similarity" in the PostgreSQL extension pg_trgm. Trigram match scores >= 0.5 (50%) are considered matches, and scores < 0.5 are considered non-matches. The 50% match score threshold was chosen by inspection as providing the best balance between true and false positives and negatives.
Due to limitations of the PostgreSQL pg_trgm extension, fuzzy matching is only effective for languages written in Latin-based scripts. Political division names in languages such as Japanese and Arabic can be matched exactly but not approximately. Fortunately, the GeoNames database contains a large number of name variants (if not misspellings) in languages that use other scripts.
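To make the trigram score concrete, here is a pure-Python approximation of how PostgreSQL's pg_trgm `similarity` function scores two strings: each string is lower-cased and split into alphanumeric words, each word is padded with two leading spaces and one trailing space, and the score is the number of shared trigrams divided by the number of distinct trigrams in either string. This is a sketch for illustration only; the GNRS calls the database function directly, and the helper names and the `THRESHOLD` constant below are hypothetical.

```python
import re

THRESHOLD = 0.5  # match threshold stated in the text above

def trigrams(text):
    """Approximate the set of trigrams pg_trgm extracts from a string."""
    grams = set()
    # Words are runs of letters and digits; all other characters separate words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_fuzzy_match(submitted, reference):
    """Apply the 0.5 threshold to a submitted name and a reference name."""
    return similarity(submitted, reference) >= THRESHOLD

for submitted in ["Arizonia", "Sonora"]:
    score = similarity(submitted, "Arizona")
    print(submitted, round(score, 3), is_fuzzy_match(submitted, "Arizona"))
```

Because the score depends on overlapping three-character sequences, very short strings such as two-letter codes share few trigrams even when they differ by one character, which is consistent with the decision not to fuzzy match codes and abbreviations.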
The GNRS reference database
Although the GNRS database is based largely on GeoNames, the original tables are restructured into country, state_province and county_parish tables to remove unneeded content (for example, names of cities and geographic features) and to improve search speed. Additional content has been added to include widely-used name variants not present in GeoNames. These include HASC codes, common postal abbreviations, and alternative formulations of second- and third-level names with and without rank labels such as "State", "Province", "County", "Municipality", etc. GADM political divisions are matched to their corresponding GeoNames political units within the GNRS database, and both GADM and GeoNames names, codes and identifiers are returned to the user.
Rank-labelled variants of second- and third-level political division names are searched in multiple languages in a variety of formulations that include and exclude prepositions and articles, as appropriate to each language. For example, the Mexican state of "Sonora" would be searched as "Sonora", "Estado de Sonora", "State of Sonora", "Sonora State", "État de Sonora", etc. These searches are performed on both accented and non-accented versions of each name. All searches are case-insensitive.
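The generation of rank-labelled search variants might look roughly like the sketch below. The template list is hypothetical and covers only the state-level forms quoted in the example above; the real GNRS handles more ranks and languages. Accents are removed with Unicode decomposition, and keys are case-folded so that comparisons are case-insensitive.

```python
import unicodedata

# Hypothetical templates based on the "Sonora" example above.
STATE_TEMPLATES = [
    "{name}",
    "Estado de {name}",
    "State of {name}",
    "{name} State",
    "État de {name}",
]

def strip_accents(text):
    """Drop combining marks after Unicode decomposition (é becomes e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_keys(name, templates=STATE_TEMPLATES):
    """Build case-folded accented and unaccented variants of a name."""
    keys = set()
    for template in templates:
        variant = template.format(name=name)
        keys.add(variant.casefold())
        keys.add(strip_accents(variant).casefold())
    return keys

for key in sorted(search_keys("Sonora")):
    print(key)
```

A submitted name can then be normalized the same way (case-folded, with or without accents) and looked up in the resulting set.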
GNRS interfaces
- Core GNRS application. The core GNRS service can be accessed directly from the Linux shell. Details of installation and usage are in the GNRS GitHub repository.
- GNRS API. The GNRS API provides a standardized programming interface to the BIEN GNRS that can be invoked using a variety of languages. See the GNRS GitHub repository for examples of how to access the GNRS API in R and PHP.
- RGNRS. The RGNRS R package is a wrapper around the GNRS API. It provides a variety of functions that simplify working with the GNRS for users familiar with R. The RGNRS package is not yet on CRAN but can be downloaded from the RGNRS GitHub repository.
- GNRS website. Our new web-based graphical interface lets you use the GNRS without the need for programming.
Build your own GNRS!
You can install your own instance of the GNRS by following the instructions in the GNRS GitHub repository. The GNRS runs on Unix-type operating systems (preferably Ubuntu; not tested on other *nix flavors) and requires PostgreSQL 10 or later. PHP is needed only if you will also be using the API. Detailed instructions for installing GNRS Batch are provided here. You will also need to build your own GNRS database (described here) and a local instance of GeoNames (more information here and here).