Why Use BIEN Data? A Tutorial on Flags & Data Augmentation

Most raw botanical records—herbarium specimens, vegetation plots, and
observations—contain at least one error or bias. Names may be misspelled,
outdated, or ambiguous; coordinates may fall in the ocean or on a country centroid;
and cultivated or non-native individuals may masquerade as wild populations.
BIEN (the Botanical Information and Ecology Network) does not just
store plant data—it augments every record so that you can
filter, subset, and defend your data for reproducible science. This tutorial shows you
how that augmentation works, how to read the key flags, and how to turn flagged data
into answers for real ecological questions.

What you will learn.
(1) Why raw biodiversity data need cleaning; (2) how BIEN's validation services add
50+ flags to each record; (3) the seven "key" flags you will use most;
(4) two complementary filtering strategies; and (5) copy-paste R recipes that map flags
to specific research questions.

1. The problem: raw plant data are messy

Before any analysis, it helps to know how much cleaning botanical data actually need.
In the BIEN reference paper (Enquist et al. 2026), across the full pool of processed records:

65.4–72.5% of taxonomic names were erroneous or unclear.
Of all observation records whose name could be parsed,
159,189,390 (55.96%) had an issue with taxonomy or coordinate location.
Roughly half of biodiversity records do not meet rigorous criteria for
species distribution modelling (SDM) because of various errors and biases.

This is the core motivation for data augmentation: if you cannot see which records
are problematic, you cannot build a reproducible, defensible dataset. BIEN makes the problems
visible—and filterable.

2. How BIEN augments each record

Every observation that passes through BIEN's validation services is tagged with flags that
encode quality, geographic relevance, and taxonomic accuracy. Four services do the work:

TNRS — Taxonomic Name Resolution Service: standardizes and resolves plant names.
GNRS — Geographic Name Resolution Service: standardizes political/place names.
GVS — the geocoordinate validation service: checks that coordinates are plausible and match stated localities.
NSR — the native-status resolver: determines native vs. introduced status by region.

A record that passes through all BIEN data services is augmented with
more than 50 distinct flags (documented in Tables S1–S4 of the reference paper).
You rarely need all of them—the rest of this tutorial focuses on the ones that matter most.

3. Try it: download BIEN data in R

The BIEN R package is the easiest way to pull augmented records. The script below downloads
occurrence records for a single, deliberately tricky species, Xanthium strumarium (common
cocklebur), which is widely introduced, and asks BIEN to return the augmentation flags
alongside the coordinates.

Show R script — download augmented BIEN occurrences

# install.packages("BIEN")   # first time only
library(BIEN)
library(dplyr)

# Ask BIEN to RETURN the augmentation flags. Most flag columns are optional:
# switch the reporting arguments on (TRUE) and the default filters
# (only.geovalid, natives.only) off (FALSE) so flagged rows are returned, not dropped.
occ <- BIEN_occurrence_species(
  species              = "Xanthium strumarium",
  cultivated           = TRUE,   # include + flag cultivated records (is_cultivated)
  only.geovalid        = FALSE,  # RETURN non-geovalid rows too, so you can SEE/filter is_geovalid
  new.world            = NULL,   # NULL = return both New and Old World records
  all.taxonomy         = TRUE,   # full TNRS taxonomy incl. scrubbed_species_binomial
  native.status        = TRUE,   # NSR native/introduced flags (is_introduced, native_status)
  natives.only         = FALSE,  # KEEP introduced records (default TRUE would drop them)
  observation.type     = TRUE,   # plot / specimen / literature / checklist
  political.boundaries = TRUE,   # country / state / county from GNRS
  collection.info      = TRUE    # collector, catalog number, date
)

# Peek at the augmentation columns most useful for filtering:
occ %>%
  select(any_of(c(
    "scrubbed_species_binomial", "latitude", "longitude",
    "is_cultivated", "is_introduced", "is_geovalid", "is_centroid",
    "higher_plant_group", "observation_type"
  ))) %>%   # any_of() skips columns your package version does not return
  head(8)

# Tip: run ?BIEN_occurrence_species to see every argument and returned column
# for YOUR installed package version. Returned columns can differ by version.

What the output looks like

A schematic view of the returned data frame (values illustrative; real queries return
hundreds to thousands of rows):

scrubbed_species_binomial	latitude	longitude	is_cultivated	is_introduced	is_geovalid	is_centroid	higher_plant_group	observation_type
Xanthium strumarium	40.10	-88.20	0	0	1	0	flowering plants	specimen
Xanthium strumarium	48.85	2.35	0	1	1	0	flowering plants	specimen
Xanthium strumarium	51.51	-0.13	1	NA	1	0	flowering plants	specimen
Xanthium strumarium	-33.87	145.00	0	1	0	NA	flowering plants	plot
Xanthium strumarium	19.43	-99.13	0	1	1	1	flowering plants	specimen
Xanthium strumarium	35.68	139.69	NA	NA	0	0	flowering plants	literature

Notice the NA (null) values: a null is not a "no." A null in
is_introduced means native status was undetermined, not that the record is
native. How you treat nulls is exactly the "liberal vs. conservative" decision described in
Section 6 below.
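To make the point about nulls concrete, here is a minimal base R sketch on a toy data frame that copies the flag values from the schematic table above. `toy` is a hypothetical stand-in for `occ`, and the values are illustrative, not real BIEN output.

```r
# Toy stand-in for `occ` (illustrative values only)
toy <- data.frame(
  is_cultivated = c(0, 0, 1, 0, 0, NA),
  is_introduced = c(0, 1, NA, 1, 1, NA),
  is_geovalid   = c(1, 1, 1, 0, 1, 0)
)

# Count every value of a flag, NA included, before deciding how to filter
introduced_counts <- table(toy$is_introduced, useNA = "always")
print(introduced_counts)

# Comparing NA with anything yields NA, so `== 0` never matches a null
na_comparison <- NA == 0

# which() skips NA results: only rows explicitly flagged 0 are returned
not_introduced_rows <- which(toy$is_introduced == 0)

# Nulls have to be asked for explicitly
undetermined_rows <- which(is.na(toy$is_introduced))
```

Tabulating each flag with `useNA = "always"` before filtering shows how many records hinge on the null decision.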

4. The seven key flags

BIEN augments records with 50+ flags, but these seven do most of the work in day-to-day analysis.

Flag	What it tells you	Values
`is_cultivated`	Whether the record is a cultivated (planted/garden) specimen rather than a wild individual.	`1` = cultivated; `0` = non-cultivated; `null` = undetermined.
`is_introduced`	Flags observations of non-native (exotic) species in that region (from the NSR). Used to exclude exotics from native-range analyses.	`1` = introduced; `0` = not flagged as introduced; `null` = undetermined.
`observation_type`	The record's source, so you can filter by data type to match your study.	`plot`, `specimen`, `literature`, `checklist`.
`is_geovalid`	Whether the geographic coordinates are validated and plausible (from the GVS).	`1` = verified/accurate; `0` or `null` = erroneous or unverified.
`higher_plant_group`	Taxonomic group, letting you keep target plants and exclude algae, fungi, bacteria, etc.	e.g. `flowering plants`, `ferns and allies`, `bryophytes`.
`is_centroid`	Whether the point is georeferenced to an administrative centroid (country/state center) rather than a true locality.	`1` = centroid; `0` = not flagged as a centroid; `null` = undetermined.
`scrubbed_species_binomial`	The standardized, TNRS-resolved name—reduces ambiguity and keeps names consistent across datasets.	Resolved Genus species string.

Read the nulls carefully. For is_geovalid, both 0 and
null mean "do not trust the coordinates." For is_cultivated and
is_introduced, null means "we could not determine it"—so excluding
nulls is conservative and keeping them is liberal.
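One way to make the null decision explicit is to route it through a single helper. The sketch below is an assumption for illustration: `keep_unflagged()` is a hypothetical function, not part of the BIEN package, and the flag values are made up.

```r
# Hypothetical helper (not a BIEN function): TRUE marks rows to keep.
# nulls = "drop" is the conservative choice, nulls = "keep" the liberal one.
keep_unflagged <- function(flag, nulls = c("drop", "keep")) {
  nulls <- match.arg(nulls)
  unflagged <- !is.na(flag) & flag == 0   # FALSE (never NA) for null flags
  if (nulls == "keep") unflagged | is.na(flag) else unflagged
}

is_introduced <- c(0, 1, NA, 1, 1, NA)    # illustrative values

conservative <- keep_unflagged(is_introduced, nulls = "drop")
liberal      <- keep_unflagged(is_introduced, nulls = "keep")

sum(conservative)   # rows kept when nulls are excluded
sum(liberal)        # rows kept when nulls are retained
```

Because the helper never returns NA, the choice between the two policies is visible in the call rather than hidden in how a comparison happens to treat missing values.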

5. Two ways to filter and subset

BIEN gives you two complementary strategies. Most rigorous workflows combine them.

Flag filtering. Select or exclude records based on flags—for example, drop
cultivated and introduced individuals, keep only geovalid points, and remove centroids.
Geographic & political filtering. Subset by bounding box, polygon, or political
division (country/state/county) to focus on your region of interest.

Show R script — a defensible "clean wild-plant" filter

library(dplyr)

clean <- occ %>%
  filter(
    higher_plant_group == "flowering plants",   # target group only
    is_geovalid == 1,                            # trustworthy coordinates
    is.na(is_centroid) | is_centroid == 0,       # drop admin-centroid points
    is.na(is_cultivated) | is_cultivated == 0,   # drop cultivated specimens
    !is.na(scrubbed_species_binomial)            # keep only resolved names
  )

# CONSERVATIVE native-only set: also require is_introduced == 0 (drops nulls)
native_conservative <- clean %>% filter(is_introduced == 0)   # keep only records NOT flagged introduced

# LIBERAL native-ish set: keep not-introduced AND undetermined (0 or NA), drop only known exotics
native_liberal <- clean %>% filter(is.na(is_introduced) | is_introduced == 0)

nrow(occ); nrow(clean); nrow(native_conservative); nrow(native_liberal)

Show R script — geographic / political subsetting

library(dplyr)

# (a) Political: keep only records BIEN placed in a chosen country
usa <- clean %>% filter(country == "United States")

# (b) Bounding box: a rough spatial window (min/max lon & lat)
bbox <- clean %>%
  filter(longitude >= -125, longitude <= -66,
         latitude  >=   24, latitude  <=  50)

# (c) Region-first at download time is often faster than filtering afterward:
#     BIEN_occurrence_box(min.lat, max.lat, min.long, max.long, cultivated = TRUE,
#                         native.status = TRUE, natives.only = FALSE,
#                         only.geovalid = FALSE, observation.type = TRUE)
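Polygon subsetting is mentioned above but not shown. The following is a minimal base R sketch of a ray-casting point-in-polygon test on toy coordinates. `point_in_polygon()` and the triangular region are hypothetical; the test treats longitude and latitude as planar and ignores holes and the antimeridian, so for real analyses a dedicated spatial package such as sf is likely the better choice.

```r
# Ray-casting point-in-polygon test (planar approximation).
# poly_x / poly_y: polygon vertices in order; the ring is closed automatically.
point_in_polygon <- function(x, y, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- rep(FALSE, length(x))
  j <- n
  for (i in seq_len(n)) {
    xi <- poly_x[i]; yi <- poly_y[i]
    xj <- poly_x[j]; yj <- poly_y[j]
    # Does the edge from vertex j to vertex i straddle each point's latitude?
    straddles <- (yi > y) != (yj > y)
    # Longitude at which the edge crosses that latitude
    x_cross <- (xj - xi) * (y - yi) / (yj - yi) + xi
    # Each crossing to the east of the point flips its inside/outside state
    inside <- xor(inside, straddles & (x < x_cross))
    j <- i
  }
  inside
}

# Toy points and a made-up triangular study region
pts <- data.frame(
  longitude = c(-105, -90, -105, -120, -108),
  latitude  = c(33, 33, 45, 33, 31)
)
region_lon <- c(-110, -100, -105)
region_lat <- c(30, 30, 40)

pts$in_region <- point_in_polygon(pts$longitude, pts$latitude,
                                  region_lon, region_lat)
in_region <- pts[pts$in_region, ]
```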

6. Winnowing: liberal vs. conservative thresholds

Building a high-confidence dataset is a sequential process. Following the BIEN
workflow for species distribution modelling (Figure 4 of the reference paper), records are
progressively winnowed:

All records → geovalidated. Apply the GNRS and GVS so geographic metadata and coordinates are validated (is_geovalid == 1).
Exclude cultivated & centroids. Use the NSR and GVS to drop cultivated specimens and administrative-centroid points.
Choose a taxonomic/native threshold. The liberal threshold keeps TNRS "No opinion" names and NSR is_introduced = NULL records; the conservative threshold excludes them for stricter quality control.

The practical lesson from this winnowing: about half of raw botanical records
do not survive rigorous SDM-grade filtering. That is not a flaw in BIEN—it is BIEN making an
otherwise invisible problem explicit and reproducible. Always report which threshold you used.
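The winnowing steps above can be reported as an attrition table, so readers see how many records each step removes. This is a sketch on made-up flag values, not the reference paper's workflow code; `winnow()` and the step names are hypothetical.

```r
toy <- data.frame(
  is_geovalid   = c(1, 1, 1, 0, 1, 0, 1, NA),
  is_centroid   = c(0, 0, 0, NA, 1, 0, 0, 0),
  is_cultivated = c(0, 0, 1, 0, 0, NA, NA, 0),
  is_introduced = c(0, 1, NA, 1, 1, NA, NA, 0)
)

# Each step is a logical vector; `%in%` returns FALSE for NA, never NA
base_steps <- list(
  all_records    = rep(TRUE, nrow(toy)),
  geovalid       = toy$is_geovalid %in% 1,       # null counts as not geovalid
  not_centroid   = !(toy$is_centroid %in% 1),    # null is kept
  not_cultivated = !(toy$is_cultivated %in% 1)   # null is kept
)

# Apply the steps cumulatively and count the survivors after each one
winnow <- function(steps) {
  cumulative <- Reduce(`&`, steps, accumulate = TRUE)
  data.frame(step = names(steps),
             remaining = vapply(cumulative, sum, integer(1)))
}

liberal <- winnow(c(base_steps,
                    list(native_liberal = !(toy$is_introduced %in% 1))))
conservative <- winnow(c(base_steps,
                         list(native_conservative = toy$is_introduced %in% 0)))

print(liberal)
print(conservative)
```

The two tables differ only in their last row, which is the number to report alongside the threshold you chose.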

The R recipes above vary only the native-status half of this decision
(is_introduced). The taxonomic half—whether to keep TNRS
“No opinion” names—is controlled through the resolved-name and taxonomic-status
columns (e.g. scrubbed_taxonomic_status); the liberal path keeps them and the
conservative path excludes them. Decide both halves explicitly.
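A minimal sketch of the taxonomic half, on a toy table. The status strings other than "No opinion" (here "Accepted") and the choice to drop null statuses on the conservative path are assumptions for illustration; check the values that your own download actually contains.

```r
toy_names <- data.frame(
  scrubbed_species_binomial = c("Xanthium strumarium", "Xanthium strumarium",
                                "Xanthium spinosum", "Xanthium spinosum", NA),
  scrubbed_taxonomic_status = c("Accepted", "No opinion", "Accepted", NA, NA),
  stringsAsFactors = FALSE
)

resolved     <- !is.na(toy_names$scrubbed_species_binomial)
status_known <- !is.na(toy_names$scrubbed_taxonomic_status)
no_opinion   <- toy_names$scrubbed_taxonomic_status %in% "No opinion"

# Liberal: keep every resolved name, including "No opinion" ones
tax_liberal <- toy_names[resolved, ]

# Conservative: additionally drop "No opinion" and (by assumption) null statuses
tax_conservative <- toy_names[resolved & status_known & !no_opinion, ]

table(toy_names$scrubbed_taxonomic_status, useNA = "always")
```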

7. From flags to science questions

The same augmented dataset supports many questions—you simply change which flags you filter on.
A few worked examples:

7a. Species distribution modelling (native range)

You want wild, native, well-georeferenced points. Use the conservative filter, then feed coordinates
to your SDM.

Show R script — SDM-ready native points

sdm_points <- occ %>%
  dplyr::filter(
    is_geovalid == 1,
    is.na(is_centroid) | is_centroid == 0,
    is.na(is_cultivated) | is_cultivated == 0,
    is_introduced == 0,                     # conservative: exclude records flagged introduced
    !is.na(scrubbed_species_binomial)
  ) %>%
  dplyr::distinct(scrubbed_species_binomial, longitude, latitude)

# sdm_points now holds de-duplicated, native, geovalid coordinates (distinct() removes
# exact duplicates only; it does not spatially thin them), ready for
# background/pseudo-absence generation and model fitting.
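Removing exact duplicates is not the same as spatial thinning. If your SDM workflow calls for thinning, one simple option is to keep a single record per species per grid cell. The sketch below assumes a cell size in decimal degrees (so cells are not equal-area), and `thin_to_grid()` is a hypothetical helper, not a BIEN function.

```r
# Keep the first record per species per grid cell (cell size in degrees)
thin_to_grid <- function(df, cell_deg = 0.5) {
  cell_x <- floor(df$longitude / cell_deg)
  cell_y <- floor(df$latitude / cell_deg)
  cell_id <- paste(df$scrubbed_species_binomial, cell_x, cell_y, sep = "_")
  df[!duplicated(cell_id), , drop = FALSE]
}

# Toy points: rows 1 and 3 are exact duplicates, row 2 is a near neighbour
pts <- data.frame(
  scrubbed_species_binomial = "Xanthium strumarium",
  longitude = c(-88.20, -88.21, -88.20, 2.35),
  latitude  = c(40.10, 40.12, 40.10, 48.85)
)

deduplicated <- unique(pts)                       # drops exact duplicates only
thinned      <- thin_to_grid(pts, cell_deg = 0.5) # one point per cell

nrow(deduplicated)
nrow(thinned)
```

The appropriate cell size depends on the resolution of your environmental layers, so treat 0.5 degrees as a placeholder.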

7b. Invasion / non-native biogeography

Here the "exotics" are the point of the study—so you keep introduced records instead of
discarding them.

Show R script — introduced-range occurrences

introduced <- occ %>%
  dplyr::filter(is_introduced == 1, is_geovalid == 1) %>%
  dplyr::select(scrubbed_species_binomial, country, latitude, longitude)
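A common next step is to summarize where the introduced records fall. This is a small base R sketch on made-up rows; the country values are illustrative, not BIEN results.

```r
toy <- data.frame(
  country       = c("France", "France", "Australia", "Mexico", "United States", NA),
  is_introduced = c(1, 1, 1, 1, 0, 1),
  is_geovalid   = c(1, 1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Same logic as the filter above: flagged introduced AND geovalid
introduced_toy <- toy[toy$is_introduced %in% 1 & toy$is_geovalid %in% 1, ]

# Records per country, keeping rows whose country is unresolved visible
by_country <- sort(table(introduced_toy$country, useNA = "ifany"),
                   decreasing = TRUE)
print(by_country)
```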

7c. Regional checklist / floristic inventory

Checklists are highly sensitive to synonyms. Standardize names, subset to your region, and
reduce to a unique accepted-name list—then verify names against original submissions (see caveats).

Show R script — a regional species checklist

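# NOTE: `occ` from Section 3 holds a single species, so it would give a one-row list.
# For a real checklist, start from a multi-species download with the same flag
# arguments (e.g. BIEN_occurrence_country("Mexico", ...)).
checklist <- occ %>%
  dplyr::filter(country == "Mexico",
                is_geovalid == 1,
                is.na(is_cultivated) | is_cultivated == 0,
                !is.na(scrubbed_species_binomial)) %>%
  dplyr::distinct(scrubbed_species_binomial) %>%
  dplyr::arrange(scrubbed_species_binomial)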

7d. Trait–environment & macroecology

Join clean occurrences to trait data (via BIEN_trait_species()), keeping source and
unit metadata so the merge stays auditable. Match on the standardized
scrubbed_species_binomial to avoid silent name mismatches.

Show R script — occurrences joined to traits

traits <- BIEN_trait_species(
  species         = "Xanthium strumarium",
  source.citation = TRUE   # request source/citation columns (off by default)
)  # returns trait_name, trait_value, unit, plus source/citation columns

# Keep provenance: never drop unit or source before you have checked them.
# Each occurrence row is repeated once per trait measurement (recent dplyr versions
# warn about this many-to-many match); columns present in both tables get suffixes.
joined <- clean %>%
  dplyr::left_join(traits, by = "scrubbed_species_binomial",
                   suffix = c("_occ", "_trait"))

Reproducibility habit. Whatever the question, record four things with your results:
(1) the access date, (2) the query scope, (3) the exact flag filters you applied, and
(4) the BIEN db / package versions. That single paragraph makes your dataset re-buildable by anyone.
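One way to capture those four items is a small provenance record saved next to your results. The field names and the filter strings below are illustrative assumptions; the database version is left as a placeholder to fill in by hand or from the package's metadata functions (the BIEN package is believed to offer BIEN_metadata_database_version(), which needs a network connection, so check your installed version).

```r
provenance <- list(
  access_date  = format(Sys.Date(), "%Y-%m-%d"),
  query_scope  = 'BIEN_occurrence_species(species = "Xanthium strumarium")',
  flag_filters = c("is_geovalid == 1", "is_centroid 0 or NA",
                   "is_cultivated 0 or NA", "is_introduced == 0"),
  r_version    = R.version.string,
  # Package version only if BIEN is installed; NA otherwise
  bien_package = if (requireNamespace("BIEN", quietly = TRUE)) {
    as.character(utils::packageVersion("BIEN"))
  } else {
    NA_character_
  },
  bien_db      = NA_character_   # placeholder: record the database version here
)

# One "name: value" line per item; multi-value items are joined with "; "
provenance_lines <- sprintf(
  "%s: %s",
  names(provenance),
  vapply(provenance, paste, character(1), collapse = "; ")
)
writeLines(provenance_lines)
# writeLines(provenance_lines, "bien_provenance.txt")  # optional: save to file
```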

8. Caveats & best practice for names

Some names still need human review. Name services tend to lag the current literature,
sometimes by several years, so recent synonymy changes or newly described species may not be
resolved correctly by TNRS. For extensive biodiversity studies this is usually acceptable; for
a checklist it may not be.
Compare accepted names with original names. For synonym-sensitive lists, cross-check
each accepted name against its originally submitted name so nothing is silently merged or missed.
Follow links for verification. Use Name_matched_url and
Accepted_name_url to inspect matched or accepted names in linked taxonomic resources.
See the TNRS instructions (tnrs.biendata.org/instructions) for interpreting output and best practices.
Nulls are decisions, not defaults. Decide—and document—whether you keep or
drop null values in is_introduced and is_cultivated.
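The cross-check between accepted and originally submitted names can be sketched in base R. The taxon names below are invented placeholders, and the column `verbatim_scientific_name` is assumed to be present in your download (it is expected when full taxonomy is requested, but confirm for your package version). The two-word comparison ignores authorship and does not handle hybrids or infraspecific ranks.

```r
toy_names <- data.frame(
  verbatim_scientific_name  = c("Exemplum album L.", "Exemplum album",
                                "Exemplum vetus Mill.", "Exemplum albun",
                                "Exemplum rubrum"),
  scrubbed_species_binomial = c(rep("Exemplum album", 4), "Exemplum rubrum"),
  stringsAsFactors = FALSE
)

# First two words of the submitted name, ignoring any author string
submitted_binomial <- vapply(
  strsplit(toy_names$verbatim_scientific_name, " +"),
  function(words) paste(head(words, 2), collapse = " "),
  character(1)
)

# Rows where resolution changed the binomial (synonym, misspelling, ...)
toy_names$name_changed <- submitted_binomial != toy_names$scrubbed_species_binomial
review <- unique(toy_names[toy_names$name_changed,
                           c("verbatim_scientific_name", "scrubbed_species_binomial")])

# How many distinct submitted binomials feed each accepted name
sources_per_accepted <- tapply(submitted_binomial,
                               toy_names$scrubbed_species_binomial,
                               function(x) length(unique(x)))

print(review)
print(sources_per_accepted)
```

Accepted names fed by more than one submitted binomial are the ones worth checking by hand in a synonym-sensitive list.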

9. Learn more & how to cite

Reference paper (cite when using BIEN flags/augmentation):
Enquist BJ, Boyle B, Maitner BS, et al. (2026). BIEN: A biodiversity informatics ecosystem
advancing open and reproducible workflows for plant observation, plot and trait data.
Methods in Ecology and Evolution, 17(5), 1556–1584. doi:10.1111/2041-210X.70274.
See Sections 3.2–3.4 and Supporting Information S3.2, Tables S1–S4.
BIEN R package:
Maitner BS, et al. (2018). The BIEN R package. Methods in Ecology and Evolution, 9(2),
373–379. doi:10.1111/2041-210X.12861
TNRS: tnrs.biendata.org/instructions
Explore occurrences interactively: biendata.org

Note. The R output shown above is schematic to illustrate flag structure; the exact
columns returned depend on your query arguments and on the installed BIEN db and R package versions.
Run ?BIEN_occurrence_species to confirm available arguments and returned fields, and
always report the access date, query scope, filters, and versions used so your dataset is reproducible.