As with BIEN 2, data in the BIEN 3 database undergo a number of validations and standardizations, including taxonomic name resolution, geographic name resolution, geovalidation and application of a standard higher taxonomic classification. All validations and standardizations used in BIEN 2 have been applied to BIEN 3, with major algorithmic improvements, particularly for geographic name resolution and geovalidation.
- Taxonomic name resolution
- Geographic name resolution
- Required for geovalidation
- Asserted names of the three political divisions (country, state/province and county/parish) are translated to standard GADM names
- Country names are standardized first, then states within countries, then counties within states
- Standardization consists of various steps, including:
- Converting unconverted UTF-8 and extended ASCII codes
- Converting alternative codes (for example, two-character ISO codes for countries) to actual names
- Matching against both accented and plain ASCII versions of names
- Lookups against tables of synonyms and alternative names in multiple languages from the GeoNames database
- Scripts by Jim Regetz
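The standardization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual BIEN scripts: the lookup tables (`ISO2_TO_NAME`, `COUNTRIES`, `COUNTRY_SYNONYMS`, `STATES_BY_COUNTRY`) and all function names are hypothetical stand-ins for the GADM and GeoNames tables, and only a few example entries are shown.

```python
import unicodedata

# Hypothetical, tiny lookup tables (assumptions for illustration only).
ISO2_TO_NAME = {"BR": "Brazil", "MX": "Mexico"}
COUNTRIES = ["Brazil", "Côte d'Ivoire", "Mexico"]
COUNTRY_SYNONYMS = {"Brasil": "Brazil", "Ivory Coast": "Côte d'Ivoire"}
STATES_BY_COUNTRY = {"Brazil": ["Pará", "São Paulo"]}


def match_key(name):
    """Plain-ASCII, lower-case key so accented and unaccented forms compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.strip().lower()


def resolve_name(asserted, standard_names, synonyms=None):
    """Match against standard names first, then against known synonyms."""
    wanted = match_key(asserted)
    for std in standard_names:
        if match_key(std) == wanted:
            return std
    for alt, std in (synonyms or {}).items():
        if match_key(alt) == wanted:
            return std
    return None


def resolve_country(asserted):
    """Translate two-character ISO codes, then fall back to name matching."""
    code = asserted.strip().upper()
    if code in ISO2_TO_NAME:
        return ISO2_TO_NAME[code]
    return resolve_name(asserted, COUNTRIES, COUNTRY_SYNONYMS)


def resolve_state(country, asserted):
    """States are only matched within an already-resolved country."""
    if country is None:
        return None
    return resolve_name(asserted, STATES_BY_COUNTRY.get(country, []))
```

For example, `resolve_state(resolve_country("Brasil"), "sao paulo")` resolves the country through the synonym table and then matches the unaccented state name against the accented standard form.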
- Geovalidation
- "Geovalidation" as used here means checking that the latitude and longitude of a taxon observation fall within its declared political divisions.
- BIEN 3 uses a completely new geovalidation pipeline, developed by Jim Regetz
- The pipeline runs in PostGIS/PostgreSQL, thereby taking advantage of Postgres's ability to natively execute spatial joins
- Political division spatial data come from the GADM database of Global Administrative Areas.
- Optimizations include simplification of political division boundaries using the PostGIS implementation of the Douglas-Peucker algorithm
- Geovalidation of all 1,707,970 unique localities within the entire BIEN 3 database to the level of county/parish takes about 2 hours (compared with several weeks in BIEN 2 to validate to state level only).
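To make the idea concrete, here is a minimal pure-Python sketch of the two ingredients described above: boundary simplification and a point-in-polygon check. It is not the BIEN pipeline, which runs as spatial joins in PostGIS. The sketch substitutes a textbook Douglas-Peucker routine and a ray-casting test, and the polygon, tolerance and function names are assumptions chosen for illustration.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (or to a if a == b)."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Drop vertices that deviate from the simplified line by <= tolerance."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > tolerance:
        left = douglas_peucker(points[: index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right  # avoid duplicating the split vertex
    return [points[0], points[-1]]


def point_in_polygon(lon, lat, ring):
    """Ray casting: count boundary crossings of a ray running east from the point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical closed boundary ring as (lon, lat) pairs, with one near-collinear vertex.
county = [(0, 0), (5, 0.01), (10, 0), (10, 10), (0, 10), (0, 0)]
simplified = douglas_peucker(county, tolerance=0.1)
```

Simplification reduces the number of vertices each containment test must visit, at the cost of small errors for points very close to a boundary, so the tolerance has to be chosen with that trade-off in mind.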
- Counts of individuals per species per plot
- Aggregation of individuals and counts of abundance per species for individuals-based plots
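As a rough illustration of this aggregation, the sketch below counts individuals per species within each plot. The record layout (one `(plot_id, species)` pair per individual) and the function name are assumptions, not the BIEN schema.

```python
from collections import Counter


def abundance_per_species(individuals):
    """Count individuals per (plot, species) from one record per individual."""
    return Counter((plot_id, species) for plot_id, species in individuals)


# Hypothetical individual-level records: one tuple per stem or tree.
records = [
    ("plot_A", "Quercus alba"),
    ("plot_A", "Quercus alba"),
    ("plot_A", "Acer rubrum"),
    ("plot_B", "Quercus alba"),
]
counts = abundance_per_species(records)
```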
- Combining of plots and specimens
- Observations from both plots and specimens are combined into a single table of georeferenced taxon occurrences
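One way to picture the combined table is as a single record type that both sources are mapped onto. The sketch below is only an illustration: the `Occurrence` fields and the input column names are hypothetical and do not reflect the actual BIEN 3 schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    source_type: str  # "plot" or "specimen"
    taxon: str
    latitude: float
    longitude: float


def combine(plot_rows, specimen_rows):
    """Map plot and specimen rows onto one common occurrence record."""
    occurrences = []
    for row in plot_rows:
        occurrences.append(
            Occurrence("plot", row["species"], row["plot_lat"], row["plot_lon"])
        )
    for row in specimen_rows:
        occurrences.append(
            Occurrence("specimen", row["taxon"], row["lat"], row["lon"])
        )
    return occurrences


# Hypothetical input rows with source-specific column names.
plots = [{"species": "Acer rubrum", "plot_lat": 35.9, "plot_lon": -79.0}]
specimens = [{"taxon": "Pinus taeda", "lat": 34.2, "lon": -77.9}]
table = combine(plots, specimens)
```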
- Detection and flagging of suspected cultivated specimens
- Uses original cultivated flags, if any, plus algorithms based on (a) key words in the locality description (e.g., "cultivated", "planted", "garden"), (b) known distributions of specific higher taxa (e.g., no pines south of Nicaragua), and (c) proximity to locations of herbaria and botanical gardens.
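A minimal sketch of how the original flag, criterion (a) and criterion (c) might be combined is shown below. Criterion (b) is omitted. The keyword pattern, the 3 km distance threshold, the garden coordinates and all function names are assumptions made for illustration, not the values used in BIEN 3.

```python
import math
import re

# Whole-word match so that, for example, the genus "Gardenia" is not flagged.
CULTIVATED_WORDS = re.compile(r"\b(?:cultivated|planted|gardens?)\b", re.IGNORECASE)

# Hypothetical herbarium/botanical garden locations as (lat, lon).
GARDENS = [(40.0, -75.0)]
MAX_DISTANCE_KM = 3.0  # assumed threshold


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cultivated_reasons(locality, lat, lon, original_flag=False):
    """Return the list of reasons an observation is suspected to be cultivated."""
    reasons = []
    if original_flag:
        reasons.append("original flag")
    if CULTIVATED_WORDS.search(locality or ""):
        reasons.append("keyword in locality")
    if any(haversine_km(lat, lon, g_lat, g_lon) <= MAX_DISTANCE_KM
           for g_lat, g_lon in GARDENS):
        reasons.append("near herbarium or garden")
    return reasons
```

Returning the reasons rather than a single boolean keeps the flag auditable, since a reviewer can see which rule triggered it.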
- Designation of major higher taxon for each species
- All nodes of the NCBI Taxonomy phylogenetic tree are included in BIEN 3
- Each observation in BIEN 3 is joined to the NCBI phylogenetic backbone by family (using APG III families returned by the TNRS during name resolution; see Taxonomic name resolution, above)
- Ancestor lookups are used to populate the column `higherPlantGroup`, which provides convenient categories of major higher taxa
- Values of `higherPlantGroup`: "bryophytes", "ferns and allies", "Flowering plants", "gymnosperms (conifers)", "gymnosperms (non-conifer)"
- Embryophytes (land plants) have a non-null value of `higherPlantGroup`; non-Embryophytes are null.
- Other, more detailed or custom breakdowns are possible by querying the phylogenetic backbone directly
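The ancestor lookup can be pictured as walking up the tree from a family until a node that defines one of the groups is reached. The sketch below uses a deliberately simplified toy tree with most intermediate nodes omitted. The `PARENT` and `GROUP_BY_NODE` tables and the function names are assumptions for illustration and do not reproduce the actual NCBI Taxonomy nodes or the BIEN 3 queries.

```python
# Toy child -> parent table (simplified; not the real NCBI tree).
PARENT = {
    "Fabaceae": "Magnoliopsida",
    "Pinaceae": "Pinopsida",
    "Cycadaceae": "Cycadopsida",
    "Pteridaceae": "Polypodiopsida",
    "Sphagnaceae": "Bryophyta",
    "Ulvaceae": "Chlorophyta",
    "Magnoliopsida": "Spermatophyta",
    "Pinopsida": "Spermatophyta",
    "Cycadopsida": "Spermatophyta",
    "Spermatophyta": "Tracheophyta",
    "Polypodiopsida": "Tracheophyta",
    "Tracheophyta": "Embryophyta",
    "Bryophyta": "Embryophyta",
    "Embryophyta": "Viridiplantae",
    "Chlorophyta": "Viridiplantae",
}

# Assumed mapping from marker nodes to higherPlantGroup values.
GROUP_BY_NODE = {
    "Magnoliopsida": "Flowering plants",
    "Pinopsida": "gymnosperms (conifers)",
    "Cycadopsida": "gymnosperms (non-conifer)",
    "Polypodiopsida": "ferns and allies",
    "Bryophyta": "bryophytes",
}


def ancestors(node):
    """Yield successive ancestors of a node, nearest first."""
    while node in PARENT:
        node = PARENT[node]
        yield node


def higher_plant_group(family):
    """Return the group of the nearest marker ancestor, or None if there is none."""
    for node in [family, *ancestors(family)]:
        if node in GROUP_BY_NODE:
            return GROUP_BY_NODE[node]
    return None
```

In this toy version a family outside the embryophytes, such as a green alga, reaches the root without meeting a marker node and therefore yields `None`, mirroring the null values described above.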
- Normalization and indexing of data sources
- Metadata pertaining to data sources and data ownership are linked to the observations they provide
- Enables dataset-level application of access rules and proper attribution of sources
- Standardization of plot methodology metadata
- Standardization of unconstrained vocabulary pertaining to plot methodology enables more reliable selection of inventories collected using standard methodologies (for example, "0.1 ha transect, >=2.5 cm dbh", "1 ha plot, >=10 cm dbh").
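As a rough illustration of what such standardization involves, the sketch below maps free-text methodology descriptions onto labels in the style of the examples above. The regular expressions, the assumption that every description mentions an area in hectares and a minimum diameter in centimetres, and the function name are all hypothetical. The real vocabulary is broader than this.

```python
import re

AREA = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(?:ha|hectares?)\b")
MIN_DBH = re.compile(r"(\d+(?:\.\d+)?)\s*cm")


def standardize_methodology(text):
    """Return a label like '0.1 ha transect, >=2.5 cm dbh', or None if unparseable."""
    t = text.lower().replace("≥", ">=")
    area = AREA.search(t)
    dbh = MIN_DBH.search(t)
    if not (area and dbh):
        return None
    kind = "transect" if "transect" in t else "plot"
    return f"{float(area.group(1)):g} ha {kind}, >={float(dbh.group(1)):g} cm dbh"
```

Descriptions that cannot be parsed return `None` rather than a guess, so they can be reviewed by hand instead of being silently misclassified.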