By Hannah L. Owens and Jamie M. Kass, on behalf of all co-authors*
There are billions of species occurrence records served by aggregator databases. The Global Biodiversity Information Facility (GBIF) serves over 1.8 billion occurrence records for species from across the tree of life (GBIF Secretariat 2021), and the Botanical Information and Ecology Network (BIEN) serves over 200 million plant observations (Botanical Information and Ecology Network 2021). The primary datasets these aggregators serve are the result of millions of hours of work by museums and community science initiatives (among others) and are constantly updated as taxonomy changes and data are accrued. Citing the primary datasets that supply data to GBIF and BIEN, together with accession dates, facilitates reproducibility and scientific transparency. These citations also support primary data providers by acknowledging their role as an essential link in the research chain.
However, when researchers download occurrence datasets from multiple primary providers via aggregator databases (as is common in broad-scale biogeographic and macroecological studies), managing and effectively communicating the metadata can be incredibly time-consuming. This is where our new R package, occCite, comes in. occCite is designed to facilitate searches of dataset aggregation services (currently, GBIF and BIEN) that store and manage metadata on primary data providers, database accession dates, DOIs, and taxonomic sources in a unified framework within the R environment. Search results are organized as single objects that can be passed to functions that produce visual and statistical summaries and formatted citations.
occCite’s Two Main Steps
Taxonomic Rectification. By default, occQuery() checks species’ names against the GBIF backbone taxonomy. Alternatively, users can call studyTaxonList() to prepare a data object containing the species’ names to be searched, checked against a taxonomy of their choice from the Global Names Index (http://gni.globalnames.org/).
Text Summaries. When the print() method is used on an occCiteData object, tables summarizing taxonomic cleaning results, search results with counts of occurrences for each species from each dataset aggregator, and the GBIF DOIs associated with each species’ search are returned.
Summary Plots. occCite provides three types of plots for results from occQuery() when the plot() method is used on an occCiteData object: a histogram showing occurrences by year, a waffle plot showing the proportion of results supplied by GBIF versus BIEN, and a waffle plot showing the proportion of occurrences supplied by each primary data provider. These plots can be generated either for all search results or by species.
Maps. Interactive leaflet maps can be generated from occCiteData objects via the occCiteMap() function, for all search results or by species. Users can specify occurrence point marker colors and symbologies. Hovering over a point in the interactive map provides information on the species name, coordinates, date, dataset, and dataset aggregator that supplied it.
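To make the text and plot summaries above more concrete, here is a minimal sketch of the kind of tallies they describe: occurrence counts per species per aggregator, and each primary provider's share of the records. It is written in Python as a toy illustration only; it is not occCite code, and the records, dataset names, and function names below are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (species, aggregator, primary dataset).
# These values are made up for illustration; they are not real search results.
records = [
    ("Protea cynaroides", "GBIF", "Museum A"),
    ("Protea cynaroides", "GBIF", "Museum A"),
    ("Protea cynaroides", "GBIF", "Community science B"),
    ("Protea cynaroides", "BIEN", "Herbarium C"),
]


def counts_by_aggregator(records):
    """Count occurrences for each (species, aggregator) pair."""
    return Counter((species, aggregator) for species, aggregator, _ in records)


def provider_proportions(records):
    """Return each primary dataset's share of all records."""
    totals = Counter(dataset for _, _, dataset in records)
    n = sum(totals.values())
    return {dataset: count / n for dataset, count in totals.items()}


if __name__ == "__main__":
    # Table of counts per species and aggregator.
    for (species, aggregator), n in sorted(counts_by_aggregator(records).items()):
        print(f"{species}\t{aggregator}\t{n}")
    # Share of records per primary provider (what a waffle plot would display).
    for dataset, share in sorted(provider_proportions(records).items()):
        print(f"{dataset}\t{share:.0%}")
```

In occCite itself, these summaries come from calling print() and plot() on an occCiteData object, as described above; the sketch only shows the bookkeeping behind such tables.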
The Future of occCite
occCite has been integrated as a module in the development version of Wallace, a modular, R-based graphical user interface for modeling species’ ecological niches and geographic distributions (Kass et al. 2018). When Wallace users opt to include data source citations in occurrence data searches, occCite will be invoked to run the search and generate citations.
In the future, we plan to expand the number of dataset aggregators that occCite queries and to add various fit-for-purpose filtering actions (e.g., duplicate removal, temporal downsampling, geographic and environmental outlier removal). We also plan to add comparative summary plots for raw versus filtered data and for different occCiteData objects. We hope you’ll keep up-to-date via our GitHub website (hannahlowens.github.io/occCite/) for these and other exciting developments!
CRAN release: https://CRAN.R-project.org/package=occCite
YouTube Tutorial: https://www.youtube.com/watch?v=7qSCULN_VjY&t=17s
Botanical Information and Ecology Network. 2021. BIEN, the Botanical Information and Ecology Network. bien.nceas.ucsb.edu, accessed 6 August 2021.
GBIF Secretariat. 2021. GBIF: Global Biodiversity Information Facility. gbif.org, accessed 6 August 2021.
Kass, JM, Vilela, B, Aiello‐Lammens, ME, Muscarella, R, Merow, C, and Anderson, RP. 2018. Wallace: A flexible platform for reproducible modeling of species niches and distributions built for community expansion. Methods in Ecology and Evolution, 9: 1151-1156. DOI: 10.1111/2041-210X.12945
*Originally written for Ecography blog
It's official! Starting today, I am a Marie Skłodowska-Curie Fellow, with a project titled "MARDIGRAS: Elucidating MARine DIversity GRAdients with Empirical and Theoretical ModelS". I'm thrilled for the opportunity to take my experiences of the last few years working on biodiversity patterns in butterflies and birds, and puzzle through how to infer biodiversity patterns and their underlying mechanisms in marine fishes.
The aim of the project is to take a deep dive (sorry not sorry for the pun) into understanding worldwide biodiversity patterns for three groups of marine fishes (my beloved Gadiformes, aka codfishes; Scombriformes, aka mackerels and tunas; and Beloniformes, aka flyingfishes), develop a mechanistic model of how biodiversity patterns arise in marine systems, and then compare diversity patterns among the three groups of fishes and against the mechanistic model.
While there is extensive macroecological literature on diversity gradients in terrestrial systems, especially regarding latitude, marine systems seem to be following a different set of rules. First, geographic patterns of diversity appear to be neither clear nor ubiquitous. Recently, Chaudhary et al. (2016) found a bimodal diversity curve with respect to latitude (that is, more species were found at middle latitudes than at the equator), whereas Rabosky et al. (2018) found a unimodal curve with diversity concentrated at the equator (which is the expected pattern for terrestrial groups). Admittedly, these studies had different organismal foci and employed different methods, but this disparity is striking.
Second, the mechanisms underlying diversity patterns in marine systems appear to be quite different. Generally, in terrestrial systems it is thought that speciation is highest in the tropics, as this is where the most energy is concentrated (among other explanations). However, the aforementioned study by Rabosky and colleagues found that speciation (in marine fishes) was highest at high latitudes! Some of this may be attributable to the unique properties of ocean ecosystems compared to terrestrial ones. As such, one of my project goals is to take a mechanistic model that was developed for terrestrial tropical biodiversity (Rangel et al. 2018) and adapt it for the marine context, using diversity patterns in codfishes, mackerels, and flyingfishes to evaluate how realistic the model is.
Stay tuned as I make progress in this exciting area, either here or by following #mardigrasProj on Twitter!
Recently, I got the very exciting news that a project I've been leading was awarded second place in the Ebbe Nielsen Challenge, an annual contest put on by the Global Biodiversity Information Facility (GBIF). The idea behind the challenge is to recognize projects that use GBIF-supplied biodiversity data and tools to innovate and promote open science.
My colleagues (Cory Merow of the University of Connecticut, Brian Maitner of the University of Arizona, and Vijay Barve & Rob Guralnick of the Florida Museum of Natural History) and I submitted an R package called occCite. occCite helps track where species occurrence data comes from. When we are trying to understand why species are found in a particular place, we often download our data from aggregators like the Global Biodiversity Information Facility. GBIF is a meta-database that serves data from over a thousand other sources, from museums like the Florida Museum to community science initiatives like eBird and iNaturalist. Often, the datasets we download contain data from multiple primary sources, and it can take a long time to track down a good citation for each source. occCite looks at the raw data we've downloaded and generates summaries of data sources, including formatted citations for inclusion in research papers. Citing primary data providers is important not just so that the research we do is reproducible, but also so primary providers like museums can keep track of how the data they provide is being used. Museums can then use this information to demonstrate how relevant their collections are for ongoing research.
I came up with the idea for occCite after spending the better part of a week creating tables and collecting appropriate citations for a paper I wrote on mapping butterfly diversity. That paper used occurrence data from 37 papers, four community science websites, three natural history museums (obtained directly), four aggregator databases (like GBIF), a colleague's personal collection, and Flickr. Through occCite, you can download all known records from hundreds of museums and community scientists. That data comes not just with where and when the species were found, but also with tables showing how many records came from each source, as well as pre-formatted citations for that data.
If you are interested in learning more about how to use occCite, I made a video tutorial (because after attending a recent workshop on how to make videos, it seems much less daunting). Here it is:
Last summer, Rob Guralnick, my postdoc advisor at the time, challenged me to come up with a way of estimating climate stability from a series of rasters representing climate change through time (such as one might obtain from WorldClim or PaleoView). We were discussing this because we had a collaborator who was interested in the role climate stability might have played in generating observed geographic patterns of biodiversity in the Neotropics. Less than a year later, the paper's out, and so is an R package that provides the climate stability estimates I generated, as well as tools for you to generate your own climate stability estimates!
The paper can be found here: https://doi.org/10.17161/bi.v14i0.9786
The package is on CRAN: cran.r-project.org/package=climateStability
And here is a small vignette explaining how it works: climatestability_vignette.html
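For readers wondering what "estimating climate stability from a series of rasters" might look like in practice, here is a minimal sketch in Python. It rests on an assumption made for this example only: that stability in each grid cell is the inverse of the mean absolute change per unit time between consecutive time slices. This is not necessarily the method used in the paper or the climateStability package (see the paper and vignette for that), and the function name and toy values are hypothetical.

```python
def climate_stability(time_slices, slice_period):
    """Per-cell stability from a time-ordered list of equally shaped grids.

    Assumption (for this sketch only): deviation is the mean absolute change
    between consecutive time slices divided by the time elapsed between them,
    and stability is the inverse of that deviation.
    """
    n_rows, n_cols = len(time_slices[0]), len(time_slices[0][0])
    stability = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Values of this cell through time.
            series = [grid[r][c] for grid in time_slices]
            # Rate of change between each consecutive pair of time slices.
            steps = [abs(b - a) / slice_period for a, b in zip(series, series[1:])]
            deviation = sum(steps) / len(steps)
            # A cell that never changes is treated as infinitely stable.
            row.append(float("inf") if deviation == 0 else 1.0 / deviation)
        stability.append(row)
    return stability


# Toy example: three 1 x 2 "rasters" of temperature, 1000 years apart.
slices = [
    [[10.0, 10.0]],
    [[12.0, 10.0]],
    [[16.0, 10.0]],
]
result = climate_stability(slices, slice_period=1000)
```

In this toy input, the first cell warms over time and gets a finite stability value, while the second cell never changes and is treated as infinitely stable. A real workflow would read raster files and would probably rescale the result for mapping.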
Universitetsparken 15, byg 3
2100 Copenhagen Ø, Denmark
Copyright © 2015