Why Use BIEN Data? A Tutorial on Flags & Data Augmentation
Most raw botanical records—herbarium specimens, vegetation plots, and
observations—contain at least one error or bias. Names may be misspelled,
outdated, or ambiguous; coordinates may fall in the ocean or on a country centroid;
and cultivated or non-native individuals may masquerade as wild populations.
BIEN (the Botanical Information and Ecology Network) does not just
store plant data—it augments every record so that you can
filter, subset, and defend your data for reproducible science. This tutorial shows you
how that augmentation works, how to read the key flags, and how to turn flagged data
into answers for real ecological questions.
What you will learn: (1) why raw biodiversity data need cleaning; (2) how BIEN's validation services add
50+ flags to each record; (3) the seven "key" flags you will use most;
(4) two complementary filtering strategies; and (5) copy-paste R recipes that map flags
to specific research questions.
1. The problem: raw plant data are messy
Before any analysis, it helps to know how much cleaning botanical data actually need.
In the BIEN reference paper (Enquist et al. 2026), across the full pool of processed records:
- 65.4–72.5% of taxonomic names were erroneous or unclear.
- Of all observation records whose name could be parsed,
159,189,390 (55.96%) had an issue with taxonomy or coordinate location.
- Roughly half of biodiversity records do not meet rigorous criteria for
species distribution modelling (SDM) because of various errors and biases.
This is the core motivation for data augmentation: if you cannot see which records
are problematic, you cannot build a reproducible, defensible dataset. BIEN makes the problems
visible—and filterable.
2. How BIEN augments each record
Every observation that passes through BIEN's validation services is tagged with flags that
encode quality, geographic relevance, and taxonomic accuracy. Four services do the work:
- TNRS (Taxonomic Name Resolution Service): standardizes and resolves plant names.
- GNRS (Geographic Name Resolution Service): standardizes political/place names.
- GVS (Geocoordinate Validation Service): checks that coordinates are plausible and match stated localities.
- NSR (Native Status Resolver): determines native vs. introduced status by region.
Together, these services add more than 50 distinct flags to each record (documented in Tables S1–S4 of the reference paper).
You rarely need all of them—the rest of this tutorial focuses on the ones that matter most.
3. Try it: download BIEN data in R
The BIEN R package
is the easiest way to pull augmented records. The script below downloads occurrence records for a
single, deliberately tricky species—Xanthium strumarium (common cocklebur), which is
widely introduced—and asks BIEN to return the augmentation flags alongside the coordinates.
Show R script — download augmented BIEN occurrences
# install.packages("BIEN") # first time only
library(BIEN)
library(dplyr)
# Ask BIEN to RETURN the augmentation flags (they are optional arguments).
# By default some services are summarized away; set them to TRUE to see the flags.
occ <- BIEN_occurrence_species(
  species = "Xanthium strumarium",
  cultivated = TRUE,            # include + flag cultivated records (is_cultivated)
  only.geovalid = FALSE,        # RETURN non-geovalid rows too, so you can SEE/filter is_geovalid
  new.world = NULL,             # NULL = return both New and Old World records
  all.taxonomy = TRUE,          # full TNRS taxonomy incl. scrubbed_species_binomial
  native.status = TRUE,         # NSR native/introduced flags (is_introduced, native_status)
  natives.only = FALSE,         # KEEP introduced records (default TRUE would drop them)
  observation.type = TRUE,      # plot / specimen / literature / checklist
  political.boundaries = TRUE,  # country / state / county from GNRS
  collection.info = TRUE        # collector, catalog number, date
)
# Peek at the augmentation columns most useful for filtering.
# any_of() skips any column that your package version does not return,
# instead of stopping with an error.
occ %>%
  select(any_of(c("scrubbed_species_binomial", "latitude", "longitude",
                  "is_cultivated", "is_introduced", "is_geovalid", "is_centroid",
                  "higher_plant_group", "observation_type"))) %>%
  head(8)
# Tip: run ?BIEN_occurrence_species to see every argument and returned column
# for YOUR installed package version. Returned columns can differ by version.
What the output looks like
A schematic view of the returned data frame (values illustrative; real queries return
hundreds to thousands of rows):
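| scrubbed_species_binomial | latitude | longitude | is_cultivated | is_introduced | is_geovalid | is_centroid | higher_plant_group | observation_type |
|---|---|---|---|---|---|---|---|---|
| Xanthium strumarium | 40.10 | -88.20 | 0 | 0 | 1 | 0 | flowering plants | specimen |
| Xanthium strumarium | 48.85 | 2.35 | 0 | 1 | 1 | 0 | flowering plants | specimen |
| Xanthium strumarium | 51.51 | -0.13 | 1 | NA | 1 | 0 | flowering plants | specimen |
| Xanthium strumarium | -33.87 | 145.00 | 0 | 1 | 0 | NA | flowering plants | plot |
| Xanthium strumarium | 19.43 | -99.13 | 0 | 1 | 1 | 1 | flowering plants | specimen |
| Xanthium strumarium | 35.68 | 139.69 | NA | NA | 0 | 0 | flowering plants | literature |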
Notice the NA (null) values: a null is not a "no." A null in
is_introduced means native status was undetermined, not that the record is
native. How you treat nulls is exactly the "liberal vs. conservative" decision described in
Section 6 below.
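In R, the practical consequence is that comparing a null with == returns NA rather than FALSE, so a filter such as is_introduced == 0 quietly drops undetermined rows. The following is a minimal sketch on a toy data frame (invented values mirroring the schematic rows, base R only, not real BIEN output) that makes the two treatments of nulls explicit.

```r
# Toy flags mirroring the schematic rows (illustrative values only).
toy <- data.frame(
  id            = 1:6,
  is_cultivated = c(0, 0, 1, 0, 0, NA),
  is_introduced = c(0, 1, NA, 1, 1, NA)
)

# Comparing NA with == yields NA, not FALSE.
toy$is_introduced == 0
# TRUE FALSE NA FALSE FALSE NA

# Tabulate every state of the flag, including nulls.
table(toy$is_introduced, useNA = "ifany")

# Conservative: only rows explicitly flagged 0 (nulls excluded).
conservative_ids <- toy$id[!is.na(toy$is_introduced) & toy$is_introduced == 0]

# Liberal: rows flagged 0 plus rows whose status is undetermined.
liberal_ids <- toy$id[is.na(toy$is_introduced) | toy$is_introduced == 0]

conservative_ids  # 1
liberal_ids       # 1 3 6
```

Writing the null handling out with is.na() keeps the choice visible in the script instead of leaving it to the default behaviour of the filter.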
4. The seven key flags
BIEN augments records with 50+ flags, but these seven do most of the work in day-to-day analysis.
| Flag | What it tells you | Values |
|---|---|---|
| is_cultivated | Whether the record is a cultivated (planted/garden) specimen rather than a wild individual. | 1 = cultivated; 0 = non-cultivated; null = undetermined. |
| is_introduced | Flags observations of non-native (exotic) species in that region (from the NSR). Used to exclude exotics from native-range analyses. | 1 = introduced; 0 = not flagged as introduced; null = undetermined. |
| observation_type | The record's source, so you can filter by data type to match your study. | plot, specimen, literature, checklist. |
| is_geovalid | Whether the geographic coordinates are validated and plausible (from the GVS). | 1 = verified/accurate; 0 or null = erroneous or unverified. |
| higher_plant_group | Taxonomic group, letting you keep target plants and exclude algae, fungi, bacteria, etc. | e.g. flowering plants, ferns and allies, bryophytes. |
| is_centroid | Whether the point is georeferenced to an administrative centroid (country/state center) rather than a true locality. | 1 = centroid; 0 = accurately georeferenced. |
| scrubbed_species_binomial | The standardized, TNRS-resolved name; reduces ambiguity and keeps names consistent across datasets. | Resolved Genus species string. |
For is_geovalid, both 0 and null mean "do not trust the coordinates." For is_cultivated and is_introduced, null means "we could not determine it," so excluding nulls is conservative and keeping them is liberal.
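Before choosing a filter, it can help to count how many records fall into each state of each flag. The sketch below uses a hypothetical helper, flag_summary(), which is not part of the BIEN package, and a toy data frame with invented values; it only assumes that the flags are coded 1, 0, or NA as described in the table.

```r
# Hypothetical helper (not part of the BIEN package): count 1 / 0 / null
# for each flag column that is actually present in the data frame.
flag_summary <- function(df, flags = c("is_cultivated", "is_introduced",
                                       "is_geovalid", "is_centroid")) {
  flags <- intersect(flags, names(df))  # ignore flags this query did not return
  counts <- vapply(flags, function(f) {
    x <- df[[f]]
    c(flagged_1 = sum(x == 1, na.rm = TRUE),
      flagged_0 = sum(x == 0, na.rm = TRUE),
      null      = sum(is.na(x)))
  }, FUN.VALUE = c(flagged_1 = 0, flagged_0 = 0, null = 0))
  as.data.frame(t(counts))  # one row per flag
}

# Toy data (illustrative values only; is_centroid deliberately absent).
toy <- data.frame(
  is_cultivated = c(0, 0, 1, 0, 0, NA),
  is_introduced = c(0, 1, NA, 1, 1, NA),
  is_geovalid   = c(1, 1, 1, 0, 1, 0)
)

summary_tbl <- flag_summary(toy)
summary_tbl
```

A large count in the null column is a signal that the liberal and conservative filters will give noticeably different datasets.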
5. Two ways to filter and subset
BIEN gives you two complementary strategies. Most rigorous workflows combine them.
- Flag filtering. Select or exclude records based on flags: for example, drop
cultivated and introduced individuals, keep only geovalid points, and remove centroids.
- Geographic & political filtering. Subset by bounding box, polygon, or political
division (country/state/county) to focus on your region of interest.
Show R script — a defensible "clean wild-plant" filter
library(dplyr)
clean <- occ %>%
  filter(
    higher_plant_group == "flowering plants",    # target group only
    is_geovalid == 1,                            # trustworthy coordinates
    is.na(is_centroid) | is_centroid == 0,       # drop admin-centroid points
    is.na(is_cultivated) | is_cultivated == 0,   # drop cultivated specimens
    !is.na(scrubbed_species_binomial)            # keep only resolved names
  )
# CONSERVATIVE native-only set: also require is_introduced == 0 (drops nulls)
native_conservative <- clean %>% filter(is_introduced == 0) # keep only records NOT flagged introduced
# LIBERAL native-ish set: keep not-introduced AND undetermined (0 or NA), drop only known exotics
native_liberal <- clean %>% filter(is.na(is_introduced) | is_introduced == 0)
nrow(occ); nrow(clean); nrow(native_conservative); nrow(native_liberal)
Show R script — geographic / political subsetting
library(dplyr)
# (a) Political: keep only records BIEN placed in a chosen country
usa <- clean %>% filter(country == "United States")
# (b) Bounding box: a rough spatial window (min/max lon & lat)
bbox <- clean %>%
  filter(longitude >= -125, longitude <= -66,
         latitude >= 24, latitude <= 50)
# (c) Region-first at download time is often faster than filtering afterward:
# BIEN_occurrence_box(min.lat, max.lat, min.long, max.long, cultivated = TRUE,
# native.status = TRUE, natives.only = FALSE,
# only.geovalid = FALSE, observation.type = TRUE)
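Polygon subsetting is normally done with a spatial package, but the idea can be shown without dependencies. The sketch below uses a hypothetical helper, point_in_polygon(), that implements even-odd ray casting in base R. It treats longitude and latitude as planar coordinates, which is an assumption that is only reasonable for small regions away from the poles and the antimeridian, and the polygon and points are invented for illustration.

```r
# Hypothetical helper (not part of BIEN): even-odd ray casting.
# px, py: point coordinates; vx, vy: polygon vertices in order
# (the ring does not need to repeat its first vertex).
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the horizontal
    # line through each point?
    straddles <- (vy[i] > py) != (vy[j] > py)
    # Longitude at which that edge crosses the horizontal line.
    x_cross <- (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]
    # Each crossing to the right of the point flips inside/outside.
    inside <- xor(inside, straddles & (px < x_cross))
    j <- i
  }
  inside
}

# Invented study-region polygon (lon/lat vertices) and toy points.
poly_lon <- c(-110, -95, -100, -115)
poly_lat <- c(30, 32, 42, 40)
pts <- data.frame(longitude = c(-105.00, -88.20, -99.13),
                  latitude  = c(35.00, 40.10, 19.43))

inside <- point_in_polygon(pts$longitude, pts$latitude, poly_lon, poly_lat)
in_region <- pts[inside, ]
in_region
```

For real analyses, a dedicated spatial library that handles projections and complex geometries is the safer choice; this sketch only illustrates what a polygon filter does to the rows.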
6. Winnowing: liberal vs. conservative thresholds
Building a high-confidence dataset is a sequential process. Following the BIEN
workflow for species distribution modelling (Figure 4 of the reference paper), records are
progressively winnowed:
- All records → geovalidated. Apply the GNRS and GVS so geographic metadata and coordinates are validated (is_geovalid == 1).
- Exclude cultivated & centroids. Use the NSR and GVS to drop cultivated specimens and administrative-centroid points.
- Choose a taxonomic/native threshold. The liberal threshold keeps TNRS "No opinion" names and NSR is_introduced = NULL records; the conservative threshold excludes them for stricter quality control.
Roughly half of records do not survive rigorous SDM-grade filtering. That is not a flaw in BIEN; it is BIEN making an
otherwise invisible problem explicit and reproducible. Always report which threshold you used.
The R recipes above vary only the native-status half of this decision
(is_introduced). The taxonomic half—whether to keep TNRS
“No opinion” names—is controlled through the resolved-name and taxonomic-status
columns (e.g. scrubbed_taxonomic_status); the liberal path keeps them and the
conservative path excludes them. Decide both halves explicitly.
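The winnowing sequence, with both halves of the threshold decided explicitly, can be sketched on a toy data frame as follows. The values are invented, the code uses base R only, and the status strings "Accepted" and "No opinion" in scrubbed_taxonomic_status are assumptions about how that column is coded; check the values in your own download before reusing the comparison.

```r
# Toy records (illustrative values only).
toy <- data.frame(
  is_geovalid   = c(1, 1, 1, 0, 1, 1, NA, 1),
  is_cultivated = c(0, 0, 1, 0, NA, 0, 0, 0),
  is_centroid   = c(0, 0, 0, 0, 0, 1, 0, 0),
  is_introduced = c(0, NA, 0, 0, 0, 0, 0, 1),
  scrubbed_taxonomic_status = c("Accepted", "Accepted", "Accepted", "Accepted",
                                "No opinion", "Accepted", "Accepted", "Accepted"),
  stringsAsFactors = FALSE
)

is_one  <- function(x) !is.na(x) & x == 1  # TRUE only when explicitly flagged 1
not_one <- function(x) is.na(x) | x == 0   # TRUE for 0 and for null

# Step 1: keep geovalidated records (0 and null are both dropped).
step1 <- toy[is_one(toy$is_geovalid), ]

# Step 2: drop records flagged as cultivated or as centroids.
step2 <- step1[not_one(step1$is_cultivated) & not_one(step1$is_centroid), ]

# Step 3a, liberal: keep undetermined native status and "No opinion" names.
liberal <- step2[not_one(step2$is_introduced), ]

# Step 3b, conservative: require is_introduced == 0 and exclude "No opinion".
conservative <- step2[
  !is.na(step2$is_introduced) & step2$is_introduced == 0 &
    !is.na(step2$scrubbed_taxonomic_status) &
    step2$scrubbed_taxonomic_status != "No opinion", ]

# Record counts after each step, useful for a methods section.
winnow <- c(all = nrow(toy), geovalid = nrow(step1),
            wild_non_centroid = nrow(step2),
            liberal = nrow(liberal), conservative = nrow(conservative))
winnow
```

Reporting the count after each step, rather than only the final number, shows readers where records were lost and which threshold was responsible.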
7. From flags to science questions
The same augmented dataset supports many questions—you simply change which flags you filter on.
A few worked examples:
7a. Species distribution modelling (native range)
You want wild, native, well-georeferenced points. Use the conservative filter, then feed coordinates
to your SDM.
Show R script — SDM-ready native points
sdm_points <- occ %>%
  dplyr::filter(
    is_geovalid == 1,
    is.na(is_centroid) | is_centroid == 0,
    is.na(is_cultivated) | is_cultivated == 0,
    is_introduced == 0,   # conservative: exclude records flagged introduced
    !is.na(scrubbed_species_binomial)
  ) %>%
  dplyr::distinct(scrubbed_species_binomial, longitude, latitude)
# sdm_points now holds de-duplicated, native, geovalid coordinates ready for
# background/pseudo-absence generation and model fitting.
# Note: distinct() removes exact duplicate coordinates only; it does not spatially thin them.
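Removing exact duplicate coordinates does not reduce spatial clustering, which many SDM workflows also address. One simple option, shown here as a sketch rather than as the BIEN workflow, is to keep one record per grid cell. The helper thin_by_grid() and the 0.5-degree cell size are hypothetical choices for illustration, and degree cells are not equal-area, so treat this as a rough first pass.

```r
# Hypothetical helper: keep the first record in each grid cell of
# `cell_deg` degrees (a simple form of spatial thinning).
thin_by_grid <- function(df, cell_deg = 0.5) {
  cell_x <- floor(df$longitude / cell_deg)
  cell_y <- floor(df$latitude / cell_deg)
  cell_id <- paste(cell_x, cell_y, sep = "_")
  df[!duplicated(cell_id), , drop = FALSE]
}

# Toy coordinates (illustrative): three points cluster in one cell.
pts <- data.frame(
  longitude = c(-88.20, -88.21, -88.35, -87.60, -99.13),
  latitude  = c( 40.10,  40.12,  40.40,  40.10,  19.43)
)

thinned <- thin_by_grid(pts, cell_deg = 0.5)
nrow(pts)      # 5
nrow(thinned)  # 3
```

With several species in one table, apply the helper per species (for example by splitting on scrubbed_species_binomial) so that one species does not thin out another.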
7b. Invasion / non-native biogeography
Here the "exotics" are the point of the study—so you keep introduced records instead of
discarding them.
Show R script — introduced-range occurrences
introduced <- occ %>%
  dplyr::filter(is_introduced == 1, is_geovalid == 1) %>%
  dplyr::select(scrubbed_species_binomial, country, latitude, longitude)
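For an invasion study it is often useful to see, per country, how many geovalid records are introduced, not introduced, or undetermined before deciding what to keep. A minimal sketch with toy data follows; the values are invented and say nothing about the real status of any species in these countries.

```r
# Toy records (illustrative values only).
toy <- data.frame(
  country       = c("France", "France", "Australia", "Mexico", "Mexico", "Japan"),
  is_introduced = c(1, 1, 1, 0, NA, NA),
  is_geovalid   = c(1, 1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only records with trusted coordinates (drops 0 and null).
geovalid <- toy[!is.na(toy$is_geovalid) & toy$is_geovalid == 1, ]

# Give nulls their own explicit label so they are counted, not lost.
status <- ifelse(is.na(geovalid$is_introduced), "undetermined",
                 ifelse(geovalid$is_introduced == 1, "introduced", "not_introduced"))

by_country <- table(country = geovalid$country, status = status)
by_country
```

Countries dominated by "undetermined" records are the ones where the liberal and conservative thresholds will disagree most.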
7c. Regional checklist / floristic inventory
Checklists are highly sensitive to synonyms. Standardize names, subset to your region, and
reduce to a unique accepted-name list—then verify names against original submissions (see caveats).
Show R script — a regional species checklist
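# Note: occ in this tutorial holds a single species, so this returns a one-name list.
# For a real checklist, start from a multi-species download for the region
# (e.g. BIEN_occurrence_country("Mexico"), which can be a very large query).
checklist <- occ %>%
  dplyr::filter(country == "Mexico",
                is_geovalid == 1,
                is.na(is_cultivated) | is_cultivated == 0,
                !is.na(scrubbed_species_binomial)) %>%
  dplyr::distinct(scrubbed_species_binomial) %>%
  dplyr::arrange(scrubbed_species_binomial)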
7d. Trait–environment & macroecology
Join clean occurrences to trait data (via BIEN_trait_species()), keeping source and
unit metadata so the merge stays auditable. Match on the standardized
scrubbed_species_binomial to avoid silent name mismatches.
Show R script — occurrences joined to traits
traits <- BIEN_trait_species(
  species = "Xanthium strumarium",
  source.citation = TRUE   # request source/citation columns (off by default)
) # returns trait_name, trait_value, unit, plus source/citation columns
# Keep provenance: never drop unit or source before you have checked them.
# This is a many-to-many join: every occurrence row pairs with every trait row,
# so the result can be much larger than either input.
joined <- clean %>%
  dplyr::left_join(traits, by = "scrubbed_species_binomial",
                   suffix = c("_occ", "_trait"))  # disambiguate any shared columns (e.g. coordinates)
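Because a species usually has many trait measurements, joining them directly to occurrences pairs every point with every measurement. If one value per point is wanted, a common alternative is to summarize traits first. The sketch below uses toy data with placeholder species names and base R only; the trait name, the choice of the mean as the summary, and the assumption that trait values arrive as text are all illustrative.

```r
# Toy occurrences and trait measurements (placeholder names and values).
occ_toy <- data.frame(
  scrubbed_species_binomial = c("Aus bus", "Aus bus", "Aus cus"),
  latitude  = c(10.1, 10.4, -3.2),
  longitude = c(-70.2, -70.9, 20.5),
  stringsAsFactors = FALSE
)
traits_toy <- data.frame(
  scrubbed_species_binomial = c("Aus bus", "Aus bus", "Aus bus", "Aus cus"),
  trait_name  = c("whole plant height", "whole plant height",
                  "seed mass", "whole plant height"),
  trait_value = c("1.2", "1.6", "0.05", "0.4"),  # assumed to arrive as text
  unit        = c("m", "m", "g", "m"),
  stringsAsFactors = FALSE
)

# Convert to numeric, then average within species x trait x unit so that
# measurements in different units are never pooled.
traits_toy$value_num <- as.numeric(traits_toy$trait_value)
trait_means <- aggregate(value_num ~ scrubbed_species_binomial + trait_name + unit,
                         data = traits_toy, FUN = mean)

# Pick one trait, then join: one trait row per species keeps one row per point.
height <- trait_means[trait_means$trait_name == "whole plant height", ]
joined_toy <- merge(occ_toy, height, by = "scrubbed_species_binomial", all.x = TRUE)

nrow(joined_toy) == nrow(occ_toy)  # TRUE: the join did not multiply rows
```

Grouping by unit as well as trait name keeps the summary auditable, in line with the advice to retain unit and source metadata.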
In your methods, report (1) the access date, (2) the query scope, (3) the exact flag filters you applied, and
(4) the BIEN db / package versions. That single paragraph makes your dataset re-buildable by anyone.
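One way to make that reporting routine is to assemble the four items in the script itself. The sketch below is an assumption about how you might structure such a record, not a BIEN feature; the field names and the helper pkg_version() are hypothetical, and the filter strings should be replaced with the ones you actually applied.

```r
# Hypothetical helper: installed version of a package, or NA if not installed.
pkg_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

provenance <- list(
  access_date  = format(Sys.Date(), "%Y-%m-%d"),
  query_scope  = 'BIEN_occurrence_species(species = "Xanthium strumarium")',
  flag_filters = c("is_geovalid == 1",
                   "is.na(is_centroid) | is_centroid == 0",
                   "is.na(is_cultivated) | is_cultivated == 0",
                   "is_introduced == 0"),
  bien_package = pkg_version("BIEN"),
  r_version    = R.version.string
  # Also record the BIEN database version; the BIEN package provides
  # BIEN_metadata_database_version() for this (needs a network connection).
)

# Print one line per item, ready to paste into a methods section.
cat(paste0(names(provenance), ": ",
           vapply(provenance, paste, FUN.VALUE = character(1), collapse = "; ")),
    sep = "\n")
```

Saving this list alongside the filtered data (for example with saveRDS()) keeps the record attached to the dataset it describes.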
8. Caveats & best practice for names
- Some names still need human review. Name services typically lag several years behind the
current literature; recent synonymy changes or newly described species may not be resolved
correctly by TNRS. For extensive biodiversity studies this is usually acceptable; for a checklist
it may not be.
- Compare accepted names with original names. For synonym-sensitive lists, cross-check
each accepted name against its originally submitted name so nothing is silently merged or missed.
- Follow links for verification. Use Name_matched_url and Accepted_name_url to inspect matched or
accepted names in linked taxonomic resources. See the TNRS instructions
for interpreting output and best practices.
- Nulls are decisions, not defaults. Decide, and document, whether you keep or
drop null values in is_introduced and is_cultivated.
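The cross-check between submitted and accepted names can be sketched as follows. The example uses placeholder names and base R only. The column name verbatim_scientific_name is an assumption about what a query with all.taxonomy = TRUE returns, so confirm it against your own download. Taking the first two words of the submitted string is a deliberate simplification that ignores hybrids and infraspecific ranks.

```r
# Toy name table with placeholder taxa (not real names).
names_tbl <- data.frame(
  verbatim_scientific_name  = c("Aus bus L.", "Aus cus Mill.", "Aus bus", "Aus dus"),
  scrubbed_species_binomial = c("Aus bus", "Aus bus", "Aus bus", NA),
  stringsAsFactors = FALSE
)

# Reduce each submitted name to its first two words (genus + epithet).
submitted_binomial <- sub("^([^ ]+ [^ ]+).*$", "\\1",
                          names_tbl$verbatim_scientific_name)

# Flag rows where the name was changed by resolution, or not resolved at all.
needs_review <- is.na(names_tbl$scrubbed_species_binomial) |
  submitted_binomial != names_tbl$scrubbed_species_binomial

review_list <- unique(data.frame(
  submitted = submitted_binomial[needs_review],
  accepted  = names_tbl$scrubbed_species_binomial[needs_review],
  stringsAsFactors = FALSE
))
review_list
```

The resulting table is the short list a taxonomist would need to check by hand: merged synonyms on one side and unresolved names on the other.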
9. Learn more & how to cite
- Reference paper (cite when using BIEN flags/augmentation):
Enquist BJ, Boyle B, Maitner BS, et al. (2026). BIEN: A biodiversity informatics ecosystem
advancing open and reproducible workflows for plant observation, plot and trait data.
Methods in Ecology and Evolution, 17(5), 1556–1584.
doi:10.1111/2041-210X.70274.
See Sections 3.2–3.4 and Supporting Information S3.2, Tables S1–S4.
- BIEN R package:
Maitner BS, et al. (2018). The BIEN R package. Methods in Ecology and Evolution, 9(2),
373–379.
doi:10.1111/2041-210X.12861
- TNRS:
tnrs.biendata.org/instructions
- Explore occurrences interactively:
biendata.org
Note. The R output shown above is schematic to illustrate flag structure; the exact
columns returned depend on your query arguments and on the installed BIEN db and R package versions.
Run ?BIEN_occurrence_species to confirm available arguments and returned fields, and
always report the access date, query scope, filters, and versions used so your dataset is reproducible.
