Why Use BIEN Data? A Tutorial on Flags & Data Augmentation

Most raw botanical records—herbarium specimens, vegetation plots, and
observations—contain at least one issue that can affect analysis: names may be misspelled,
outdated, synonymous, or ambiguous; coordinates may be imprecise, misplaced, or assigned to centroids;
and cultivated or non-native individuals may be mixed with wild native occurrences.
BIEN (the Botanical Information and Ecology Network) does not just
store plant data—it augments every record so that you can
filter, subset, and defend your data for reproducible science. This tutorial shows you
how that augmentation works, how to read key flags and fields, and how to turn augmented data
into answers for real ecological questions.

What you will learn.
(1) Why raw biodiversity data need cleaning; (2) how BIEN's validation services add
50+ flags and fields to each record; (3) the key flags and fields you will use most;
(4) two complementary filtering strategies; and (5) copy-paste R recipes that map BIEN fields
to specific research questions. Each code block is introduced by a
“Show R script” label.

1. The problem: raw plant data are messy

Before any analysis, it helps to know how much cleaning botanical data actually need.
In the BIEN reference paper (Enquist et al. 2026), across the full pool of processed records:

65.4–72.5% of taxonomic names had an issue (for example, misspelling, ambiguity, or synonymy).
Of all observation records whose name could be parsed,
159,189,390 (55.96%) had an issue with taxonomy or coordinate location.
Roughly half of biodiversity records do not meet rigorous criteria for
species distribution modelling (SDM) because of various errors and biases.

This is the core motivation for data augmentation: if you cannot see which records
are problematic, you cannot build a reproducible, defensible dataset. BIEN makes the problems
visible—and filterable.

2. How BIEN augments each record

Every observation that passes through BIEN's validation services is augmented with flags and fields
that encode quality, geographic relevance, taxonomic accuracy, and record metadata. Four services do the work:

TNRS (Taxonomic Name Resolution Service): standardizes and resolves plant names.
GNRS (Geographic Name Resolution Service): standardizes political/place names.
GVS (Geocoordinate Validation Service): checks that coordinates are plausible and match stated localities.
NSR (Native Species Resolver): determines native vs. introduced status by region.

A record that passes through all BIEN data services is augmented with
more than 50 distinct flags and fields (documented in Tables S1–S4 of the reference paper).
You rarely need all of them—the rest of this tutorial focuses on the ones that matter most.

3. Try it: download BIEN data in R

The BIEN R package is the easiest way to pull augmented records. The script below downloads occurrence records for a
single, deliberately tricky species—Xanthium strumarium (common cocklebur), which is
widely introduced—and asks BIEN to return the augmentation columns alongside the coordinates.

Show R script — download augmented BIEN occurrences

# install.packages("BIEN")   # first time only
library(BIEN)
library(dplyr)

# Ask BIEN to return the augmentation columns (flags and fields) via optional arguments.
# Several of these columns are omitted by default; the arguments below switch them on.
occ <- BIEN_occurrence_species(
  species              = "Xanthium strumarium",
  cultivated           = TRUE,   # include + flag cultivated records (is_cultivated)
  only.geovalid        = FALSE,  # RETURN non-geovalid rows too, so you can SEE/filter is_geovalid
  new.world            = NULL,   # R NULL here means "do not restrict by New vs. Old World" (not the same as NA in returned columns)
  all.taxonomy         = TRUE,   # full TNRS taxonomy incl. scrubbed_species_binomial
  native.status        = TRUE,   # NSR native/introduced flags (is_introduced, native_status)
  natives.only         = FALSE,  # KEEP introduced records (default TRUE would drop them)
  observation.type     = TRUE,   # plot / specimen / literature / checklist
  political.boundaries = TRUE,   # country / state / county from GNRS
  collection.info      = TRUE    # collector, catalog number, date
)

# Peek at the augmentation columns most useful for filtering.
# any_of() skips columns your package version does not return instead of erroring.
occ %>%
  select(any_of(c("scrubbed_species_binomial", "latitude", "longitude",
                  "is_cultivated", "is_introduced", "is_geovalid", "is_centroid",
                  "higher_plant_group", "observation_type"))) %>%
  head(8)

# Tip: run ?BIEN_occurrence_species to see every argument and returned column
# for YOUR installed package version. Returned columns can differ by version.
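
Because returned column names can differ between package versions, it can help to check which of the flag columns used in this tutorial are actually present before filtering. The sketch below is only an illustration: `occ_demo` is a small stand-in data frame rather than a live query, and `check_flag_columns()` is a hypothetical helper, not part of the BIEN package.

```r
# Stand-in for a BIEN download; real queries return many more columns.
occ_demo <- data.frame(
  scrubbed_species_binomial = c("Xanthium strumarium", "Xanthium strumarium"),
  latitude      = c(40.10, 48.85),
  longitude     = c(-88.20, 2.35),
  is_introduced = c(0, 1),
  is_geovalid   = c(1, 1)
)

# Columns this tutorial filters on.
expected <- c("scrubbed_species_binomial", "latitude", "longitude",
              "is_cultivated", "is_introduced", "is_geovalid", "is_centroid",
              "higher_plant_group", "observation_type")

# Hypothetical helper: report which expected columns are present or missing.
check_flag_columns <- function(df, wanted) {
  list(present = intersect(wanted, names(df)),
       missing = setdiff(wanted, names(df)))
}

cols <- check_flag_columns(occ_demo, expected)
cols$missing
```

If a column you need is listed as missing, the query arguments (or the package version) are the first things to check.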

What the output looks like

A schematic view of the returned data frame (values illustrative; real queries return
hundreds to thousands of rows):

scrubbed_species_binomial	latitude	longitude	is_cultivated	is_introduced	is_geovalid	is_centroid	higher_plant_group	observation_type
Xanthium strumarium	40.10	-88.20	0	0	1	0	flowering plants	specimen
Xanthium strumarium	48.85	2.35	0	1	1	0	flowering plants	specimen
Xanthium strumarium	51.51	-0.13	1	NA	1	0	flowering plants	specimen
Xanthium strumarium	-33.87	145.00	0	1	0	NA	flowering plants	plot
Xanthium strumarium	19.43	-99.13	0	1	1	1	flowering plants	specimen
Xanthium strumarium	35.68	139.69	NA	NA	0	0	flowering plants	literature
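
A quick way to see how many records fall into each answer (yes, no, unknown) is to count them per flag column. The sketch below re-enters the six illustrative rows from the schematic table by hand as `occ_demo`; it is not real BIEN output.

```r
# The flag columns from the schematic table above, typed in by hand.
occ_demo <- data.frame(
  is_cultivated = c(0, 0, 1, 0, 0, NA),
  is_introduced = c(0, 1, NA, 1, 1, NA),
  is_geovalid   = c(1, 1, 1, 0, 1, 0),
  is_centroid   = c(0, 0, 0, NA, 1, 0)
)

# Count no (0), yes (1), and unknown (NA) for every flag column.
flag_counts <- sapply(occ_demo, function(x) {
  c(no      = sum(x == 0, na.rm = TRUE),
    yes     = sum(x == 1, na.rm = TRUE),
    unknown = sum(is.na(x)))
})
flag_counts
```

The `unknown` row is the one to watch: it shows how many records each filtering decision about missing values will affect.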

Does BIEN use 0, NULL, NA, and FALSE to mean the same thing?
No. BIEN does not treat 0, SQL NULL, R NA, R NULL, and
FALSE as equivalent.

A BIEN flag is a yes/no/unknown indicator about a record. Other BIEN fields store resolved names,
source type, taxonomy, geography, or metadata. For flag fields, the mental model is simple:
yes, no, or unknown. Is this record introduced? Is it
cultivated? Did this coordinate pass geovalidation? BIEN keeps those answers separate.
This matters because unknown is not the same as no.

Beginner answer	BIEN/R value	Plain meaning
Yes	`1`	BIEN flagged, detected, or confirmed the condition, depending on the column.
No	`0`	BIEN did not flag the condition, or recorded a negative result, depending on the column.
Unknown	SQL `NULL` / R `NA`	Unknown, missing, or not determined.

For example, in R output:

`is_introduced` value	Meaning
`1`	BIEN treats the record as introduced or non-native.
`0`	BIEN did not flag the record as introduced.
`NA`	BIEN did not determine introduced/native status.

An NA in is_introduced does not mean native. It also does not
mean introduced. It means unknown.

The confusing part is that databases and R use different words for missing values. In the BIEN
database, missing or undetermined values are stored as SQL NULL. When those data are returned
to R, they usually appear as NA.

R also has its own object called NULL. That is different. In a BIEN query argument,
new.world = NULL means "do not apply a New World / Old World restriction." It is an
instruction about the query, not a missing value in an output column.

The practical rule is simple:

Value	How to read it
`1`	detected, flagged, or passed, depending on the column definition
`0`	not flagged, negative, or failed, depending on the column definition
`NA` in R output	unknown or missing
SQL `NULL` in the database	unknown or missing
R `NULL` in function arguments	no object or argument supplied
`FALSE`	a logical value; not a BIEN missing-value code

Always interpret 1 and 0 relative to the specific BIEN field definition. Some fields
encode detected problems, while others encode passed checks or accepted status. The main decision for
users is how to treat unknown values.
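
The distinctions above can be seen directly in base R. The following minimal sketch uses a three-element vector as a stand-in for any BIEN flag column; nothing in it is specific to the BIEN package.

```r
flag <- c(1, 0, NA)   # yes, no, unknown

# Comparisons with NA return NA, not FALSE.
flag == 0             # FALSE TRUE NA

# which() drops NA, so this is implicitly a conservative choice.
kept_conservative <- flag[which(flag == 0)]

# Keeping unknowns has to be requested explicitly.
kept_liberal <- flag[flag == 0 | is.na(flag)]

# R NULL is an absent object, not a missing value inside a column.
is.null(NULL)         # TRUE
length(NULL)          # 0
is.na(NA)             # TRUE

# FALSE is an ordinary logical value; it coerces to 0, not to "missing".
FALSE == 0            # TRUE
```

Here `kept_conservative` holds only the `0`, while `kept_liberal` holds the `0` and the `NA`.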

4. Seven key BIEN flags and fields

BIEN augments records with 50+ flags and fields, but these seven do most of the work in day-to-day analysis.

Field	What it tells you	Values
`is_cultivated`	Whether the record is a cultivated (planted/garden) specimen rather than a wild individual.	`1` = cultivated; `0` = not flagged as cultivated; SQL `NULL` (R `NA`) = undetermined.
`is_introduced`	Flags observations of non-native (exotic) species in that region (from the NSR). Used to exclude exotics from native-range analyses.	`1` = introduced or treated as non-native; `0` = not flagged as introduced; SQL `NULL` (R `NA`) = undetermined.
`observation_type`	The record's source, so you can filter by data type to match your study.	`plot`, `specimen`, `literature`, `checklist`.
`is_geovalid`	Whether the geographic coordinates pass BIEN geovalidation and are plausible (from the GVS).	`1` = passes BIEN geovalidation; `0` = fails BIEN geovalidation; SQL `NULL` (R `NA`) = geovalidation could not be assessed.
`higher_plant_group`	Taxonomic group, letting you keep target plants and exclude algae, fungi, bacteria, etc.	e.g. `flowering plants`, `ferns and allies`, `bryophytes`.
`is_centroid`	Whether the point is georeferenced to an administrative centroid (country/state center) rather than a true locality.	`1` = likely administrative centroid; `0` = not flagged as centroid; SQL `NULL` (R `NA`) = centroid status unavailable.
`scrubbed_species_binomial`	The standardized, TNRS-resolved name—reduces ambiguity and keeps names consistent across datasets.	Resolved Genus species string.

is_introduced = 1: introduced or treated as non-native.
is_introduced = 0: not flagged as introduced; for filtering, typically retained as native/present without evidence of introduction (not definitive proof of native status).
is_introduced = NA in R: introduced/native status unknown.
is_cultivated = 1: cultivated.
is_cultivated = 0: not flagged as cultivated.
is_cultivated = NA in R: cultivation status unknown.
is_geovalid = 1: passes BIEN geovalidation.
is_geovalid = 0: fails BIEN geovalidation.
is_geovalid = NA in R: geovalidation could not be assessed.

Read missing values carefully. SQL NULL in BIEN appears as R NA in returned
data frames. Missing/undetermined values are not equivalent to 0 or FALSE. For
is_geovalid, treat 0 as failed validation and NA as unknown validation status.
For is_cultivated and is_introduced, NA means status was undetermined.
Note too that is_geovalid == 1 means coordinates are plausible and consistent with the stated
locality—not that they are precise—and is_centroid == 0 means only "not flagged as a
centroid," not verified as an accurate point.

Final rule: 1 and 0 must be read in relation to the specific field definition.
SQL NULL / R NA means missing or undetermined. R NULL is different.
FALSE is not a BIEN missing-value code.

5. Two ways to filter and subset

BIEN gives you two complementary strategies. Most rigorous workflows combine them.

Flag and field filtering. Select or exclude records based on BIEN fields—for example, drop
cultivated and introduced individuals, keep only geovalid points, and remove centroids.
Geographic & political filtering. Subset by bounding box, polygon, or political
division (country/state/county) to focus on your region of interest.

Show R script — an example broad-use clean plant filter

library(dplyr)

clean <- occ %>%
  filter(
    higher_plant_group == "flowering plants",   # target group only
    !is.na(is_geovalid) & is_geovalid == 1,      # keep records that pass BIEN geovalidation; NA (unknown) dropped explicitly
    is.na(is_centroid) | is_centroid == 0,       # drop known centroid points; keep unknowns
    is.na(is_cultivated) | is_cultivated == 0,   # drop known cultivated records; keep unknowns
    !is.na(scrubbed_species_binomial)            # keep only resolved names
  )

# CONSERVATIVE native-only set: also require is_introduced == 0 (drops NA)
native_conservative <- clean %>% filter(is_introduced == 0)   # keep only records NOT flagged introduced

# LIBERAL native-ish set: keep records not flagged introduced AND undetermined (0 or NA), drop only known exotics
native_liberal <- clean %>% filter(is_introduced == 0 | is.na(is_introduced))

nrow(occ); nrow(clean); nrow(native_conservative); nrow(native_liberal)

This example uses a liberal treatment of is_centroid and is_cultivated by keeping
records with unknown status. For finer-scale spatial analyses, users may choose a stricter filter.

Some BIEN filters keep both 0 and missing values. This does not mean they are equivalent.
It is a liberal filtering choice. For example, keeping is_introduced == 0 | is.na(is_introduced)
means "drop known introduced records, but keep records where introduced/native status is unknown." A
conservative filter would keep only is_introduced == 0, dropping unknowns. Conservative means
stricter filtering, not necessarily less biased.

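# `dat` is a placeholder for any BIEN occurrence data frame (for example `clean` from the script above).
# The native pipe |> needs R >= 4.1; use %>% on older versions.

# Liberal filter: drop known introduced records, keep unknown introduced/native status
dat |>
  dplyr::filter(is_introduced == 0 | is.na(is_introduced))

# Conservative filter: keep only records not flagged as introduced, dropping unknowns
dat |>
  dplyr::filter(is_introduced == 0)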

The liberal filter keeps both 0 and NA, but it does not treat them as the same value.
It treats them as different evidence classes that the user has chosen to retain.

This contrast only bites when is_introduced is genuinely NA. Some records instead carry an
explicit native-status category (for example a native_status value of "unknown"), which is not the
same as NA. Before assuming the two filters differ, check how your returned columns encode
undetermined status, for example with with(occ, table(native_status, is_introduced, useNA = "always")).
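
As a rough illustration of that check, the sketch below cross-tabulates the two columns on a toy data frame and then counts how many rows separate the liberal and conservative filters. The `native_status` values used here are invented for the example; the codes in a real download may differ.

```r
# Toy data: native_status values are illustrative assumptions, not real NSR codes.
occ_demo <- data.frame(
  native_status = c("native", "introduced", "unknown", NA, "native", "unknown"),
  is_introduced = c(0, 1, 0, NA, 0, NA)
)

# Cross-tabulate the two columns, keeping NA as its own category.
with(occ_demo, table(native_status, is_introduced, useNA = "always"))

conservative <- occ_demo[which(occ_demo$is_introduced == 0), ]
liberal      <- occ_demo[which(occ_demo$is_introduced == 0 |
                               is.na(occ_demo$is_introduced)), ]

# The two filters differ only by the rows where is_introduced is NA.
nrow(liberal) - nrow(conservative)
```

In this toy case one "unknown" record carries `is_introduced = 0` and so survives both filters, which is exactly the situation the cross-tabulation is meant to expose.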

Show R script — geographic / political subsetting

library(dplyr)

# (a) Political: keep only records BIEN placed in a chosen country
usa <- clean %>% filter(country == "United States")

# (b) Bounding box: a rough spatial window (min/max lon & lat)
bbox <- clean %>%
  filter(longitude >= -125, longitude <= -66,
         latitude  >=   24, latitude  <=  50)

# (c) Region-first at download time is often faster than filtering afterward:
#     BIEN_occurrence_box(min.lat, max.lat, min.long, max.long, cultivated = TRUE,
#                         native.status = TRUE, natives.only = FALSE,
#                         only.geovalid = FALSE, observation.type = TRUE)
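
The script above covers political and bounding-box subsetting; polygon subsetting is usually done with a spatial package. As a dependency-free sketch of the idea, the hypothetical helper below applies a standard ray-casting test to planar longitude/latitude values. It ignores map projection, the antimeridian, and points exactly on an edge, so treat it as an illustration rather than a replacement for a proper spatial library.

```r
# Hypothetical helper: ray-casting point-in-polygon test on planar lon/lat.
# poly_lon / poly_lat are the polygon vertices in order (first vertex not repeated).
point_in_polygon <- function(lon, lat, poly_lon, poly_lat) {
  n <- length(poly_lon)
  inside <- rep(FALSE, length(lon))
  j <- n
  for (i in seq_len(n)) {
    # Does a horizontal ray from the point cross the edge between vertices j and i?
    straddles <- (poly_lat[i] > lat) != (poly_lat[j] > lat)
    x_cross <- (poly_lon[j] - poly_lon[i]) * (lat - poly_lat[i]) /
      (poly_lat[j] - poly_lat[i]) + poly_lon[i]
    crosses <- straddles & (lon < x_cross)
    crosses[is.na(crosses)] <- FALSE   # missing coordinates are never "inside"
    inside <- xor(inside, crosses)     # an odd number of crossings means inside
    j <- i
  }
  inside
}

# Toy triangle and three toy points (the last has missing coordinates).
tri_lon <- c(0, 10, 0)
tri_lat <- c(0, 0, 10)
pts <- data.frame(longitude = c(2, 8, NA), latitude = c(2, 8, NA))

pts$in_region <- point_in_polygon(pts$longitude, pts$latitude, tri_lon, tri_lat)
pts[pts$in_region, ]
```

Only the first toy point falls inside the triangle; the record with missing coordinates is excluded rather than raising an error.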

6. Winnowing: liberal vs. conservative thresholds

Building a high-confidence dataset is a sequential process. Following the BIEN
workflow for species distribution modelling (Figure 4 of the reference paper), records are
progressively winnowed:

All records → geovalidated. Apply GNRS and GVS so geographic metadata are standardized and coordinates can be tested. Keep records with is_geovalid == 1, meaning they pass BIEN geovalidation.
Exclude cultivated records & centroids. Use BIEN fields to drop records flagged as cultivated and points flagged as administrative centroids.
Choose a taxonomic/native threshold. The liberal threshold keeps TNRS "No opinion" names and records with missing introduced status (R NA; SQL NULL in BIEN db), while the conservative threshold keeps only records explicitly not flagged as introduced (is_introduced == 0).

In the BIEN reference workflow, roughly half of records may not survive SDM-grade filtering,
depending on the chosen thresholds. That is not a flaw in BIEN—it is BIEN making an
otherwise invisible problem explicit and reproducible. Always report which threshold you used.
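
One way to make the winnowing reportable is to record how many records survive each step. The sketch below does this on an eight-row toy data frame with the conservative native threshold; the step names and the `winnow` table are illustrative choices, not a BIEN output format.

```r
# Toy flags for eight records.
occ_demo <- data.frame(
  is_geovalid   = c(1, 1, 1, 0, 1, NA, 1, 1),
  is_cultivated = c(0, 0, 1, 0, NA, 0, 0, 0),
  is_centroid   = c(0, 1, 0, 0, 0, 0, NA, 0),
  is_introduced = c(0, 0, 0, 1, NA, 0, 1, 0)
)

# Each step is a logical vector; unknowns are handled explicitly at every step.
steps <- list(
  geovalid       = with(occ_demo, !is.na(is_geovalid) & is_geovalid == 1),
  not_cultivated = with(occ_demo, is.na(is_cultivated) | is_cultivated == 0),
  not_centroid   = with(occ_demo, is.na(is_centroid) | is_centroid == 0),
  not_introduced = with(occ_demo, !is.na(is_introduced) & is_introduced == 0)
)

# Apply the steps cumulatively and record how many rows survive each one.
keep <- Reduce(`&`, steps, accumulate = TRUE)
winnow <- data.frame(
  step      = c("all records", names(steps)),
  n_records = c(nrow(occ_demo), vapply(keep, sum, integer(1)))
)
winnow
```

Reporting a table like `winnow` alongside the chosen threshold lets readers see where records were lost, not just how many remain.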

Missingness is usually not random. Because NSR native-status coverage is densest in the
New World, is_introduced = NA is more common in some regions than others. Liberal and
conservative filters therefore change not just sample size but the geographic and environmental composition
of the retained records, which can bias SDMs and range estimates. Compare both filters as a sensitivity
check rather than assuming either is unbiased.
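
A simple way to look for this kind of geographic pattern is to compute the share of unknown values per region. The sketch below uses a toy data frame with made-up values; with real data, `country` would come from a query run with `political.boundaries = TRUE`.

```r
# Toy records; the values are invented to illustrate the calculation only.
occ_demo <- data.frame(
  country       = c("United States", "United States", "Mexico",
                    "France", "France", "Japan"),
  is_introduced = c(0, 1, 0, NA, 1, NA)
)

# Share of records per country whose introduced/native status is unknown.
na_share <- tapply(is.na(occ_demo$is_introduced), occ_demo$country, mean)
sort(na_share, decreasing = TRUE)
```

Large differences in `na_share` between regions are a sign that the liberal and conservative filters will retain geographically different subsets.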

The R recipes above vary only the native-status half of this decision
(is_introduced). The taxonomic half—whether to keep TNRS
“No opinion” names—is controlled through the resolved-name and taxonomic-status
columns (e.g. scrubbed_taxonomic_status); the liberal path keeps them and the
conservative path excludes them. Decide both halves explicitly.
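
To make both halves of the decision explicit in code, the two thresholds can be written as separate logical conditions. This is a sketch under assumptions: the toy `scrubbed_taxonomic_status` values ("Accepted", "Synonym", "No opinion") are illustrative, so check the values in your own download before reusing the comparison.

```r
# Toy records; taxonomic-status values are assumed for illustration.
occ_demo <- data.frame(
  scrubbed_species_binomial = c("Xanthium strumarium", "Xanthium strumarium",
                                "Xanthium strumarium", NA),
  scrubbed_taxonomic_status = c("Accepted", "Synonym", "No opinion", NA),
  is_introduced             = c(0, NA, 0, 0),
  stringsAsFactors = FALSE
)

resolved       <- !is.na(occ_demo$scrubbed_species_binomial)
no_opinion     <- !is.na(occ_demo$scrubbed_taxonomic_status) &
  occ_demo$scrubbed_taxonomic_status == "No opinion"
not_introduced <- !is.na(occ_demo$is_introduced) & occ_demo$is_introduced == 0
unknown_status <- is.na(occ_demo$is_introduced)

# Liberal: keep "No opinion" names and unknown introduced status.
liberal <- occ_demo[resolved & (not_introduced | unknown_status), ]

# Conservative: exclude "No opinion" names and require is_introduced == 0.
conservative <- occ_demo[resolved & !no_opinion & not_introduced, ]

c(liberal = nrow(liberal), conservative = nrow(conservative))
```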

7. From BIEN fields to science questions

The same augmented dataset supports many questions—you simply change which fields you filter on.
A few worked examples:

7a. Species distribution modelling (native range)

For native-range SDMs, a common high-confidence starting point is a conservative filter emphasizing
records that pass BIEN geovalidation, excluding known cultivated records and centroids, and using
is_introduced == 0 as a threshold appropriate to the question.

Show R script — SDM-ready native points

sdm_points <- occ %>%
  dplyr::filter(
    !is.na(is_geovalid) & is_geovalid == 1,   # keep geovalidated only; NA (unknown) dropped explicitly
    is.na(is_centroid) | is_centroid == 0,     # drop known centroid points; keep unknowns
    is.na(is_cultivated) | is_cultivated == 0, # drop known cultivated records; keep unknowns
    is_introduced == 0,                     # conservative: keep only 0; drops introduced (1) and unknown (NA)
    !is.na(scrubbed_species_binomial)
  ) %>%
  dplyr::distinct(scrubbed_species_binomial, longitude, latitude)

# sdm_points now holds de-duplicated (exact-coordinate), geovalid points not flagged as introduced.
# distinct() is NOT spatial thinning: before modelling, apply distance-based
# thinning (e.g. spThin) to reduce sampling-bias pseudoreplication, and evaluate
# with spatial/blocked cross-validation (random CV leaks under spatial autocorrelation).
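
To show how thinning differs from `distinct()`, here is a deliberately simple grid-based sketch: it keeps the first point in each latitude/longitude cell. This is a stand-in technique, not the distance-based thinning that packages such as spThin perform, and `thin_by_grid()` is a hypothetical helper. Cells measured in degrees are not equal-area, so the approach is only a rough approximation away from the equator.

```r
# Toy points: three of them are near-duplicates in space but not exact duplicates.
pts <- data.frame(
  longitude = c(-88.20, -88.21, -88.90, -87.10, -88.22),
  latitude  = c( 40.10,  40.11,  40.70,  41.30,  40.12)
)

# Hypothetical helper: keep the first point in each grid cell of `cell_deg` degrees.
thin_by_grid <- function(df, cell_deg = 0.5) {
  cell_x <- floor(df$longitude / cell_deg)
  cell_y <- floor(df$latitude / cell_deg)
  df[!duplicated(data.frame(cell_x, cell_y)), , drop = FALSE]
}

thinned <- thin_by_grid(pts, cell_deg = 0.5)
nrow(pts)       # 5 points before thinning
nrow(thinned)   # 3 points after thinning
```

Exact-coordinate de-duplication would have kept all five toy points, because none of them share identical coordinates.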

7b. Invasion / non-native biogeography

Here the "exotics" are the point of the study—so you keep introduced records instead of
discarding them.

Show R script — introduced-range occurrences

introduced <- occ %>%
  dplyr::filter(is_introduced == 1, !is.na(is_geovalid) & is_geovalid == 1) %>%
  dplyr::select(scrubbed_species_binomial, country, latitude, longitude)

7c. Regional checklist / floristic inventory

Checklists are highly sensitive to synonyms. Standardize names, subset to your region, and
reduce to a unique accepted-name list—then verify names against original submissions (see caveats).

Show R script — a regional species checklist

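# `occ` above holds a single species, so a real checklist needs a multi-species
# regional download first (for example via BIEN_occurrence_country(), which can be large).
checklist <- occ %>%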
  dplyr::filter(country == "Mexico",
                !is.na(is_geovalid) & is_geovalid == 1,
                is.na(is_cultivated) | is_cultivated == 0, # drop known cultivated records; keep unknowns
                !is.na(scrubbed_species_binomial)) %>%
  dplyr::distinct(scrubbed_species_binomial) %>%
  dplyr::arrange(scrubbed_species_binomial)

7d. Trait–environment & macroecology

Join clean occurrences to trait data (via BIEN_trait_species()), keeping source and
unit metadata so the merge stays auditable. Match on the standardized
scrubbed_species_binomial to avoid silent name mismatches. BIEN_trait_species()
returns long data (one row per trait measurement), so aggregate to one row per
species × trait before joining; otherwise every occurrence is repeated once per raw
measurement. Even after aggregation the join returns one row per occurrence × trait,
so filter to a single trait or reshape to wide format if you need one row per occurrence.

Show R script — occurrences joined to traits

traits <- BIEN_trait_species(
  species        = "Xanthium strumarium",
  source.citation = TRUE   # request source/citation columns (off by default)
)  # returns trait_name, trait_value, unit, plus source/citation columns

# Aggregate numeric traits first: one row per species x trait (keeps unit + evidence count).
# This example uses a mean for numeric traits only.
# Categorical traits require a different summary rule.
# Check that units are harmonized before averaging trait values.
trait_summ <- traits %>%
  dplyr::mutate(trait_value_numeric = suppressWarnings(as.numeric(trait_value))) %>%
  dplyr::filter(!is.na(trait_value_numeric)) %>%
  dplyr::group_by(scrubbed_species_binomial, trait_name) %>%
  dplyr::summarise(mean_value = mean(trait_value_numeric, na.rm = TRUE),
                   n_trait_values = dplyr::n(),
                   unit = dplyr::first(unit), .groups = "drop")

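# The join returns one row per occurrence x trait (long format): the row count no longer
# depends on the number of raw measurements, but it still grows with the number of traits.
# dplyr >= 1.1 warns about a many-to-many relationship here; that warning is expected.
# Filter trait_summ to one trait, or reshape it wide, if you need one row per occurrence.
joined <- clean %>%
  dplyr::left_join(trait_summ, by = "scrubbed_species_binomial")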
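
The script's reminder to check units before averaging can be turned into a quick test. The sketch below counts distinct units per trait on a toy table; the trait names, values, and units are invented for illustration, and only the column names follow the script above.

```r
# Toy trait table; names, values, and units are illustrative only.
traits_demo <- data.frame(
  scrubbed_species_binomial = "Xanthium strumarium",
  trait_name  = c("whole plant height", "whole plant height",
                  "seed mass", "seed mass"),
  trait_value = c("1.2", "0.9", "45", "0.05"),
  unit        = c("m", "m", "mg", "g"),
  stringsAsFactors = FALSE
)

# Number of distinct units reported for each trait.
units_per_trait <- tapply(traits_demo$unit, traits_demo$trait_name,
                          function(u) length(unique(u)))

# Traits with more than one unit need conversion before any mean is taken.
mixed_units <- names(units_per_trait)[units_per_trait > 1]
mixed_units
```

In the toy data, "seed mass" is reported in both mg and g, so averaging the raw values would be meaningless until one unit is converted.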

Reproducibility habit. Whatever the question, record these with your results:
(1) the access date, (2) the query scope, (3) the exact flag filters you applied,
(4) the BIEN db / package versions and the TNRS/GNRS/NSR service versions, (5) any random seed
(SDM background/pseudo-absence and cross-validation folds are stochastic), and (6) a saved data snapshot
or per-step record counts, since re-querying a live database may not return byte-identical results.
That paragraph makes your dataset re-buildable by anyone.
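
One lightweight way to keep those items together is a small provenance record saved next to the data snapshot. The sketch below is one possible layout, not a BIEN convention: the field names are arbitrary, the seed value is a placeholder, and the database and service versions still have to be added from your own query (the BIEN package's metadata functions, such as BIEN_metadata_database_version(), may help if available in your version).

```r
set.seed(42)  # placeholder seed; record whichever value you actually use

# Package version if BIEN is installed, otherwise NA (keeps the sketch runnable).
bien_version <- tryCatch(as.character(utils::packageVersion("BIEN")),
                         error = function(e) NA_character_)

provenance <- list(
  access_date  = format(Sys.Date(), "%Y-%m-%d"),
  query_scope  = "BIEN_occurrence_species: Xanthium strumarium",
  filters      = c("is_geovalid == 1",
                   "is_centroid == 0 or NA",
                   "is_cultivated == 0 or NA",
                   "is_introduced == 0"),
  bien_package = bien_version,
  r_version    = R.version.string,
  seed         = 42
)

# Save alongside the data snapshot so the two travel together.
out_file <- file.path(tempdir(), "bien_provenance.rds")
saveRDS(provenance, out_file)
str(readRDS(out_file))
```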

8. Caveats & best practice for names

Some names still need human review. Any name service lags the current literature,
sometimes by several years, so recent synonymy changes or newly described species may not be resolved
correctly by TNRS. For extensive biodiversity studies this is usually acceptable; for a checklist
it may not be.
Compare accepted names with original names. For synonym-sensitive lists, cross-check
each accepted name against its originally submitted name so nothing is silently merged or missed.
Follow links for verification. Use Name_matched_url and
Accepted_name_url to inspect matched or accepted names in linked taxonomic resources.
See the TNRS instructions (tnrs.biendata.org/instructions) for interpreting output and best practices.
Missing values require decisions. Decide—and document—whether you keep or
drop missing values, which appear as SQL NULL in the BIEN database and R NA in returned data.
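
The comparison of accepted and original names can be sketched as a simple lookup of unique name pairs. The example below uses placeholder names ("Aus bus" and so on) rather than real taxa, and it assumes the download includes a column holding the originally submitted name, called `verbatim_scientific_name` here; confirm the column name in your own output.

```r
# Placeholder names, not real taxa; the verbatim column name is an assumption.
names_demo <- data.frame(
  verbatim_scientific_name  = c("Aus bus", "Aus buss", "Aus cus", "Aus bus"),
  scrubbed_species_binomial = c("Aus bus", "Aus bus", "Aus bus", "Aus bus"),
  stringsAsFactors = FALSE
)

# Unique submitted-name / accepted-name pairs.
name_pairs <- unique(names_demo)

# Pairs where the submitted name was changed (spelling fix or synonym merge).
changed <- name_pairs[name_pairs$verbatim_scientific_name !=
                        name_pairs$scrubbed_species_binomial, ]

# How many distinct submitted names were merged into each accepted name.
merged_counts <- tapply(name_pairs$verbatim_scientific_name,
                        name_pairs$scrubbed_species_binomial,
                        function(v) length(unique(v)))
changed
merged_counts
```

The `changed` table is the list to review by hand: each row is a name that was altered during resolution.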

9. Learn more & how to cite

Reference paper (cite when using BIEN flags/augmentation): Enquist BJ, Boyle B, Maitner BS, et al. (2026). BIEN: A biodiversity informatics ecosystem advancing open and reproducible workflows for plant observation, plot and trait data. Methods in Ecology and Evolution, 17(5), 1556–1584. doi:10.1111/2041-210X.70274. See Sections 3.2–3.4 and Supporting Information S3.2, Tables S1–S4.
BIEN R package: Maitner BS, et al. (2018). The BIEN R package. Methods in Ecology and Evolution, 9(2), 373–379. doi:10.1111/2041-210X.12861
TNRS: tnrs.biendata.org/instructions
Explore occurrences interactively: biendata.org

Note. The R output shown above is schematic to illustrate flag structure; the exact
columns returned depend on your query arguments and on the installed BIEN db and R package versions.
Run ?BIEN_occurrence_species to confirm available arguments and returned fields, and
always report the access date, query scope, filters, and versions used so your dataset is reproducible.