
Making curved labels for polygons in QGIS?



I have been able to find answers on how to do this in older versions of QGIS, but can't seem to come across anything that deals with the latest installment.

I need a way to make labels curve to irregularly shaped polygons; in this case rivers.

I have been assuming that this option would be listed under "Placement" in the "Labels" options but have been unable to find it anywhere.

I know it was present up until QGIS 2.2 and think it must be included in 2.6.1. I'm just not seeing it.

I am aware of the Easy custom labeling plugin curved option, but that doesn't seem to work as it should.


Currently there are no curved polygon labels.

As dassouki suggested, you can use different approaches to create lines which can then be labeled with curved labels. It might be easiest to just draw some labeling lines manually - if the number of features in question is not too high.


Making curved labels for polygons in QGIS? - Geographic Information Systems

Geocachers did not show a priori preferences for different types of land use.

Stated preferences exposed the appreciation of the montado over other forests.

Preferences for open and aquatic landscapes were also exposed.

Geocaching is a good indicator for cultural ecosystem services.

Promoting montado visitation can decrease dependence on provisioning services.


Introduction

Pangolins (Pholidota: Manidae) are insectivorous mammals found in parts of Africa and Asia (Hua et al., 2015). They are considered the world’s most trafficked wild mammal due to significant consumer demand for their scales and meat (Challender, Harrop & MacMillan, 2015; Cheng, Xing & Bonebrake, 2017). Historically, both African and Asian species have been traded locally for consumption, but as local population levels have declined in parts of Asia (Irshad et al., 2015; Challender, Nash & Waterman, 2020; Wu et al., 2004), researchers have documented a shift in demand from Asia for African pangolins (Challender, Harrop & MacMillan, 2015; Heinrich et al., 2016), which is believed to be the leading cause of declines in African pangolin populations (IUCN, 2020). In addition, habitat destruction and slow reproductive rates restrict the rate at which pangolins can recover from overexploitation (Heinrich et al., 2016), and issues with disease control and dietary husbandry limit the success of captive breeding programs (Hua et al., 2015). As all eight species are listed as either Vulnerable, Endangered or Critically Endangered by the International Union for Conservation of Nature (IUCN; IUCN, 2020; Heinrich et al., 2016; Cheng et al., 2017), a better understanding of the threats to, and conservation status of, pangolins is paramount for protecting them.

Although all eight pangolin species have been listed under Appendix I of the Convention on International Trade in Endangered Species (CITES) since 2016, pangolin trafficking has often been poorly documented and not effectively monitored, if detected at all (Heinrich et al., 2016), so the actual impact of the global illegal trade on pangolin populations and distributions remains unknown. Furthermore, the lack of adequate modern-day records of pangolin presence makes it hard to investigate geographic changes and consequently predict their extinction risks. Effective species threat assessment relies heavily on changes in the geographical distribution of the species over time (criterion B, IUCN Red List Categories and Criteria; IUCN, 2020). Thus, understanding how pangolin distributions have changed in the past decades will provide more insights into their possible population declines and ultimately inform science-based conservation actions.

One possible solution to better understand the conservation status of pangolins is to compare their past and current distributions to highlight regions that may have previously been targeted by traffickers, i.e., regions where species ranges have become smaller, without any obvious associated anthropogenic changes. Museum specimen records can provide both the temporal and spatial data needed to analyse distributional trends (Boakes et al., 2010; Pyke & Ehrlich, 2010; Lister et al., 2011; McLean et al., 2016; Meineke et al., 2018), without relying on expensive, time-consuming, long-term surveys (Newbold, 2010; although museum records have other limitations which we highlight in the Discussion). As a result, historical specimen records can be readily used to improve current threat evaluations for pangolins given the paucity of modern data.

Using pangolin museum specimen records from the Global Biodiversity Information Facility (GBIF; GBIF, 2019) and the Natural History Museum, London (NHM), with geographic range maps and habitat classifications by the IUCN SSC Pangolin Specialist Group (IUCN, 2020), we produced area of habitat (AOH) maps representing present-day ranges of pangolins and then investigated geographic range contractions in pangolins over the last 150 years by examining overlaps between historical specimen localities and the AOH present-day ranges. We also investigated the effects of land-use change as a proxy for habitat loss, and human population size changes as a proxy for increased exploitation (Woodroffe, 2000).


Multispectral, Aerial Disease Detection for Myrtle Rust (Austropuccinia psidii) on a Lemon Myrtle Plantation

180 cm), the mentioned settings achieved a ground sampling distance of approximately 2.8 cm per pixel. At the plantation, we took advantage of an existing experiment in which the impact of fungicide was being assessed on lemon myrtle trees affected by myrtle rust (Lancaster et al., in preparation) utilizing fungicide shown to be effective at controlling myrtle rust [23]. We recorded aerial multispectral images from trees that were free of active disease, having had fungicide successfully applied to them (“treated”), and trees showing symptoms of active myrtle rust infection (“untreated”). Leaves from treated trees showed mostly no signs of A. psidii infection, although some had small purple spots, likely due to infection occurring prior to fungicide application. We exclude the influence of other biotic agents as no other serious pest or pathogen on lemon myrtle was known prior to A. psidii (Manager Gary Mazzorana, Australian Rainforest Products, Lismore, Australia, personal communication). The experimental design consisted of two treated and two untreated rows of trees, separated by rows of trees designated as “buffer” trees to avoid accidental treatment of trees intended to be untreated (Figure 1).


2 Proposed algorithm

Figure 2: High-level overview of the proposed approach, showing the initialization phase and the interactive loop. Information provided by the user modifies the input of the network, not the network itself, allowing an effective interaction.

We now describe in detail the proposed approach for interactive multi-class segmentation of aerial images. In particular, our goal is to train a neural network with two purposes:

producing an initial high quality segmentation map of the scene without any external help

using annotations provided by an operator to quickly enhance its initial prediction.

To achieve this, we propose a neural network which keeps its original structure but takes as input a concatenation of the classic inputs (e.g. RGB) and of the annotations (N channels, one per class). These annotations are clicked points. Note that only the inputs of the network are modified and not its weights: this is what makes the approach fast. Figure 2 presents a high-level overview of our approach.

We first define our training strategy and then present our study on the annotations themselves.

2.1 Training strategy

In the following, we assume that we have a segmentation reference composed of N classes. Ground-truth maps are the core of our training strategy. On the one hand, they are classically used to compute and back-propagate the loss. On the other hand, they are also randomly sparsified to sample annotations. In other words, only a few pixels from the ground truth are kept to be used as annotations. According to their class, these annotations are encoded in the N annotation channels given as input to the algorithm. To train under various annotation layouts, the number of sampled annotations is random in each training example. Since the network has to be able to create an accurate segmentation map without them, the possibility of a lack of annotations is also sampled. Concretely, this situation means that the annotation channels are filled with zeros.

If the annotations are sampled independently of their class, the following problem may occur. During the evaluation phase, annotations on under-represented classes can be ignored by the network because it has barely seen any annotation points of these classes during training. Therefore, it has not learned how to use them to enhance its predictions. To overcome this issue, we use a frequency balancing strategy to sample the annotations based on the class distribution. It allows the network to see annotations from each class equally often during training and, therefore, to be efficiently guided once the training is done.
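The sampling described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the `max_clicks` and `p_empty` parameters, and the specific balancing rule (pick a class uniformly among those present, then a pixel uniformly within that class) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_annotations(ground_truth, n_classes, max_clicks=5, p_empty=0.2, rng=None):
    """Sparsify a ground-truth label map into N binary annotation channels.

    ground_truth: 2-D list of class ids in [0, n_classes).
    Returns n_classes channels (2-D lists of 0/1) with the same shape.
    All names and default values here are hypothetical.
    """
    rng = rng or random.Random()
    height, width = len(ground_truth), len(ground_truth[0])
    channels = [[[0] * width for _ in range(height)] for _ in range(n_classes)]

    # Sometimes provide no annotation at all, so the network also learns
    # to segment without help (all channels stay filled with zeros).
    if rng.random() < p_empty:
        return channels

    # Group pixel coordinates by class.
    pixels_by_class = defaultdict(list)
    for y, row in enumerate(ground_truth):
        for x, cls in enumerate(row):
            pixels_by_class[cls].append((y, x))
    present = sorted(pixels_by_class)

    # The number of clicks is random in each training example.
    n_clicks = rng.randint(1, max_clicks)
    for _ in range(n_clicks):
        # Frequency balancing (one possible form, assumed here): choose the
        # class uniformly, so rare classes are clicked as often as common ones.
        cls = rng.choice(present)
        y, x = rng.choice(pixels_by_class[cls])
        channels[cls][y][x] = 1
    return channels
```

Sampling pixels uniformly over the whole map instead would almost never click a class that covers only a few pixels, which is the failure mode the balancing is meant to avoid.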

2.2 Annotation representation

We investigate two aspects of the annotation representation: how to position clicks in order to sample the most useful information, and how to encode clicks to get the best benefit.

Click positioning.

Fixing a wrong segmentation requires providing the system with additional information about the right division. New samples provided by clicks may represent either the inside of an instance or its border.

The first case seems to be the most intuitive. Clicked pixels are inside instances and the annotation points represent the class associated to these instances. Contrary to [41], we do not sample them at a minimal distance from the boundaries since we assume that an annotator might click near an edge to fine-tune the prediction. For the second case where the annotations represent the borders of the instances, the channel associated to a click corresponds to a class randomly chosen among the ones adjacent to the clicked border.

Aiming to ease the burden on the end users, we also explored softer constraints on the annotations. Instead of using N annotation channels, we summarized them into a single annotation channel. For the border strategy, this single channel only indicates the presence of a border. For the inside point strategy, it only indicates where the network has initially made a mistake. To implement this latter strategy, we had to slightly modify the training process. The network performs a first inference to create a segmentation map used to find mislabelled regions. Annotations are then sampled in these areas and a second inference is performed. Only this second inference is used to back-propagate the gradients. However, as shown in Section 4.4, none of these simplified annotations seems promising for efficiently guiding the segmentation task.

Click encoding.

User clicks can be encoded in various ways, each of which may provide the system with more or less spatial information, as shown in Figure 3. In particular, we consider:

Small binary area around the annotation points

Euclidean distance transform maps around these points

Figure 3: Binary (left) and distance transform (right) click.

As shown in Section 4, the inside point strategy with distance transform encoding seems to be our most successful combination.
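To make the two click encodings concrete, here is a small brute-force sketch in plain Python. It is an illustration under stated assumptions rather than the authors' code: the function names, the `radius` and `cap` parameters, and the convention of filling the distance map with `cap` when there are no clicks are all hypothetical.

```python
import math


def encode_binary(shape, clicks, radius=1):
    """Binary encoding: 1.0 within `radius` pixels of any click, else 0.0."""
    height, width = shape
    return [
        [
            1.0 if any(math.hypot(y - cy, x - cx) <= radius for cy, cx in clicks) else 0.0
            for x in range(width)
        ]
        for y in range(height)
    ]


def encode_distance(shape, clicks, cap=255.0):
    """Distance encoding: Euclidean distance to the nearest click.

    Distances are truncated at `cap`; with no clicks the map is filled
    with `cap` (an assumed convention, not taken from the paper).
    """
    height, width = shape
    if not clicks:
        return [[cap] * width for _ in range(height)]
    return [
        [
            min(cap, min(math.hypot(y - cy, x - cx) for cy, cx in clicks))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

The binary map tells the network only where a click is, while the distance map also carries a graded notion of how far every other pixel lies from the nearest click. The brute-force loops cost O(pixels × clicks), which is fine for illustration but would be replaced by a proper distance transform on real images.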



Notes on the Medinilla (Melastomataceae) of Palawan, Philippines, Including Two New Species: M. simplicymosa and M. ultramaficola

J. Peter Quakenbush, 1,* Pastor L. Malabrigo Jr, 2,3 Arthur Glenn A. Umali, 2 Adriane B. Tobias, 4 Lea Magarce-Camangeg, 5 Yu Pin Ang, 6 Rene Alfred Anton Bustamante 6

1 Department of Biological Sciences, Western Michigan University, Kalamazoo, Michigan 49008-5200, USA
2 Department of Forest Biological Sciences, College of Forestry and Natural Resources, University of the Philippines Los Baños, Laguna, Philippines
3 Museum of Natural History, University of the Philippines Los Baños, Laguna, Philippines
4 Graduate School, University of the Philippines Los Baños, Laguna, Philippines
5 College of Sciences, Palawan State University, Tiniguiban Heights, Puerto Princesa City, Palawan, 5300 Philippines
6 Philippine Taxonomic Initiative, Inc., Botanica Building, El Nido, Palawan, 5313 Philippines



Hyptidendron pulcherrimum Antar & Harley, sp. nov. (Hyptidinae, Lamiaceae), a new narrowly endemic species from Minas Gerais, Brazil

Guilherme Medeiros Antar, 1,* Raymond Mervyn Harley, 2 José Floriano Barêa Pastore, 3 Paulo Minatel Gonella, 4 Paulo Takeo Sano 5

1 Universidade de São Paulo, Instituto de Biociências, Departamento de Botânica, Rua do Matão 277, 05508-090, São Paulo, SP (Brazil)
2 Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, England (United Kingdom)
3 Universidade Federal de Santa Catarina, Campus de Curitibanos, Rod. Ulysses Gaboardi, km 3, 89520-000, Curitibanos, SC (Brazil)
4 Universidade Federal de São João del-Rei, Campus Sete Lagoas, Rodovia MG-424, km 47, 35701-970, Sete Lagoas, MG (Brazil)
5 Universidade de São Paulo, Instituto de Biociências, Departamento de Botânica, Rua do Matão 277, 05508-090, São Paulo, SP (Brazil)

* Corresponding author


Hyptidendron Harley, one of the 19 genera recognized for the subtribe Hyptidinae, has some species with a narrow distribution in campos rupestres (a Brazilian vegetational formation), often restricted to a single mountain range. We report a new species, Hyptidendron pulcherrimum Antar & Harley, sp. nov., endemic to a single mountain in the Serra do Padre Ângelo, a disjunct area of campos rupestres from where some new angiosperm species have been recently described. The new species is unique due to the morphological combination of flowers arranged in dichasial cymes, indumentum composed of curved, rigid, broad-based hairs, leaves petiolate, glabrescent and bullate, corolla tomentose, with the tube curved, 7.5-10 mm long and one slightly winged nutlet per fruiting calyx. The new species is compared with Hyptidendron vauthieri (Briq.) Harley, the most similar species morphologically. We also provide a complete description, diagnosis, illustration, distribution map with the new species and closely related species, a photograph plate, and a preliminary conservation status assessment.

© Publications scientifiques du Muséum national d'Histoire naturelle, Paris.

Guilherme Medeiros Antar, Raymond Mervyn Harley, José Floriano Barêa Pastore, Paulo Minatel Gonella, and Paulo Takeo Sano, "Hyptidendron pulcherrimum Antar & Harley, sp. nov. (Hyptidinae, Lamiaceae), a new narrowly endemic species from Minas Gerais, Brazil," Adansonia 43(1), 1-8 (18 January 2021). https://doi.org/10.5252/adansonia2021v43a1

Received: 12 March 2020 Accepted: 16 June 2020 Published: 18 January 2021


Materials and methods

Study site and species description

The study was conducted in El Yunque National Forest (EYNF) in north-eastern Puerto Rico (Fig. 1). The EYNF is the largest protected area (115 km²) of primary forest in Puerto Rico (Lugo 1994) and comprises a series of mountain chains rising to an elevation of 1074 m a.s.l. This elevation gradient has a strong effect on temperature, rain, humidity and the distribution of plants and animals (Garcia-martino et al. 1996; Wang et al. 2003; González et al. 2007; Gould et al. 2008; Willig et al. 2011; Brokaw et al. 2012). There are four main forest types along the elevational gradient in EYNF: Tabonuco forest, which is dominated by Dacryodes excelsa and occurs between 150 and 600 m a.s.l.; Palo Colorado forest, which is dominated by Cyrilla racemiflora and occurs between 600 and 950 m a.s.l.; Elfin forest, which is dominated by Tabebuia rigida and Eugenia boriquensis and occurs above 950 m a.s.l.; and Sierra Palm forest, which is dominated by Prestoea montana and can occur anywhere along the elevational gradient. In addition to the four major forest types, EYNF has a considerable area in old secondary forest (>40 years) that occurs mostly at low elevations near the border of the reserve.

Setophaga angelae is a small passerine bird, endemic to the main island of Puerto Rico (Kepler & Parkes 1972). Currently, its distribution is restricted to two protected areas separated by 150 km: EYNF and the Maricao Commonwealth Forest (MCF). The estimated population size is 1800 mature individuals according to the IUCN Red List (BirdLife International 2012). Besides having a small population size and a restricted geographical distribution, S. angelae is described as rare and cryptic, which could explain its late discovery (Kepler & Parkes 1972). At the time of its description, S. angelae was assumed to be restricted to high elevation areas within the Elfin forest (above 950 m a.s.l.), although individuals could be found as low as 250 m a.s.l., and in a variety of habitats including Palo Colorado forest, Podocarpus coriaceus forest, secondary forest, coffee plantation and pasturelands (González 2008). Banding studies suggest that S. angelae is monogamous and territorial throughout the year (Delannoy-Juliá 2009). The territory size was estimated to be approximately one hectare per pair (Kepler & Parkes 1972). Vocalizations include the territorial song (common song), an alarm call and a duet song (https://arbimon.sieve-analytics.com/project/elevation).

Sampling design and autonomous recordings

Because elevation is a well-known proxy for habitat type, temperature and animal and plant communities (Brokaw et al. 2012; Kéry, Gardner & Monnerat 2010), we collected acoustic data in 60 sites in EYNF along three elevational transects (95–1074 m a.s.l.; 20 sampling sites per elevational transect) between March 27 and May 6, 2015. The elevational transects took advantage of roads and trails, but all recorders were placed more than 200 m from any road. Along each elevational transect, two recorders, separated by 200 m, were deployed at 100-m elevation intervals (from 95 to 1074 m a.s.l.). Recorders collected data at each site within a transect for approximately 1 week and were then moved to another elevation transect. The study occurred during the breeding season when song rate is highest (Arroyo-Vasquez 1992). Due to the small home range of S. angelae (approximately 1 ha; Kepler & Parkes 1972), we believe that it is unlikely that birds from one territory would be recorded by more than one recorder.

Each recorder consisted of one LG smartphone enclosed in a waterproof case with an external connector linked to a Monoprice microphone. The ARBIMON Touch application (https://play.google.com/store/apps/details?id=touch.arbimon.com.arbimontouch&hl=en) was used to schedule recording events. Recorders were placed on trees at a height of 1·5 m and programmed to record 1 min of audio every 10 min, for a total of 144 1-min recordings per day. We performed field tests in our study area and found that S. angelae vocalizations can be detected by our recorders up to approximately 50 m. Therefore, a site is defined here as a three-dimensional hemispherical space with a radius of approximately 50 m around the recorder.

Bioacoustics data processing and management

The spectrograms of all recordings (n = 38 255) were visually inspected, and if the species appeared to be present, we listened to the recordings to make the final decision. This resulted in a detection/non-detection matrix that was then used to fit occupancy models that accounted for imperfect detectability (Fig. 2). The results of these analyses were used as the ‘gold standard’ for comparing results based on three different approaches that used a species identification model created in the ARBIMON analytical platform (https://arbimon.sieve-analytics.com). Below, we summarize the six steps used in creating a species identification model:

  1. Create a template of the vocalization and validate a set of recordings: For the model, we used the territorial song because it is the most distinct and most common vocalization. Fifteen examples of the territorial song were selected to create the template, and 208 recordings were used for the validation data set (i.e. recordings where the song was present or absent).
  2. Create a correlation vector between the song template and the spectrogram. The song template was applied to each of the validated recordings. In this step, the template traverses each spectrogram and produces a vector of similarities for each recording (i.e. correlations between the template and sections of the spectrogram). The correlation was generated by the OpenCV function MatchTemplate (Bradski & Kaehler 2008 ).
  3. Extract features of the vectors from the 208 validated recordings. In this step, 12 features of the correlation vector are extracted: mean, median, minimum, maximum, standard deviation, maximum–minimum, skewness, kurtosis, hyper-skewness, hyper-kurtosis, histogram and cumulative frequency histogram.
  4. Create a RandomForest (RF) classifier: the features of the validated recordings (i.e. present or absent) are input into a RandomForest classifier (Breiman 2001 ). The goal was to train the RF model for a binary decision of presence or absence of the territorial song in a recording based on the feature vectors. A confusion matrix is provided (Table S1). The model was adjusted to reduce false positives.
  5. Apply a Threshold approach: this is an alternative approach that is based on manually setting the maximum similarity correlation level of the vectors necessary to assign a recording as having a positive detection. A confusion matrix is provided (Table S1). The model was adjusted to reduce false positives.
  6. Classify all recordings: the RF model and Threshold model were applied to all recordings. This resulted in a data set with a classification of presence or absence based on the RF model and Threshold model for each of the 38 255 recordings.
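Steps 2, 3 and 5 above can be sketched on a toy one-dimensional signal. This is only an illustration under simplifying assumptions: the study applied OpenCV template matching to two-dimensional spectrograms, whereas the sketch below slides a 1-D template and uses Pearson correlation; the function names and the 0.9 threshold are hypothetical, only a subset of the 12 features is computed, and the RandomForest step is omitted.

```python
import math
import statistics


def correlation_vector(template, signal):
    """Slide the template over the signal; Pearson correlation at each offset."""
    n = len(template)
    t_mean = statistics.fmean(template)
    t_dev = [t - t_mean for t in template]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    scores = []
    for start in range(len(signal) - n + 1):
        window = signal[start:start + n]
        w_mean = statistics.fmean(window)
        w_dev = [w - w_mean for w in window]
        w_norm = math.sqrt(sum(d * d for d in w_dev))
        denom = t_norm * w_norm
        # A flat window has no defined correlation; score it as 0.
        scores.append(sum(a * b for a, b in zip(t_dev, w_dev)) / denom if denom else 0.0)
    return scores


def summary_features(scores):
    """A subset of the summary features named in step 3."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)

    def standardized_moment(k):
        if std == 0:
            return 0.0
        return statistics.fmean(((s - mean) / std) ** k for s in scores)

    return {
        "mean": mean,
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        "std": std,
        "range": max(scores) - min(scores),
        "skewness": standardized_moment(3),
        "kurtosis": standardized_moment(4),
    }


def threshold_detect(scores, threshold=0.9):
    """Threshold approach: positive if the best match reaches the threshold."""
    return max(scores) >= threshold
```

In the workflow described above, the feature dictionary would be the input to the classifier in step 4, while `threshold_detect` corresponds to the simpler alternative in step 5.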

We then compared the results of the manual validation process with the results from the RF and Threshold approaches. This procedure resulted in four data sets: the manual validation, Threshold, RandomForest and Combined (Table 1). The Threshold, RandomForest and Combined data sets were constructed by manually verifying all the positive detections from the automated species identification models and converting any false-positive detections to true negatives. False-negative detections were assumed to be true-negative detections. We chose not to change false-negative detections because occupancy models can account for this type of error. The Combined data set only included recordings with positive detection in both the RandomForest and Threshold models. Although it is possible to confuse the vocalizations of the Bananaquit Coereba flaveola and Elfin Woods Warbler in the field, we are confident that we do not have any false positives in our data sets because the spectrogram analyses allowed us to visualize and compare the vocalizations, making it easy to distinguish the species.

Data set      Recordings  Classification presence  Manually confirmed presence
Full          38 255      -                        888
RandomForest  38 255      1603                     194
Threshold     38 255      437                      62
Combined      38 255      67                       51
  • All 38 255 recordings were manually inspected for the Full data set. For the RandomForest and Threshold data sets, all recordings were classified using the species model and the recordings that were classified as present were manually inspected. The Combined data set only included recordings where both the RandomForest and Threshold models agreed, and these recordings were also manually inspected.

The analyses were based on recordings between 05:00 and 19:00, but to simplify the detection matrix, we summarize detections in two-hour intervals. This simplification resulted in seven sampling occasions per day, where each sampling occasion included 12 recordings in each two-hour interval. Therefore, our most basic sample unit is defined here as one interval with 12 1-min recordings.
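The collapse of per-recording results into seven two-hour sampling occasions can be illustrated with a short sketch. The function name and the input layout (a mapping from the recording's start minute of the day to a detection flag) are assumptions made for this example, as is treating the window as 05:00 inclusive to 19:00 exclusive.

```python
def occasion_history(detections, start_hour=5, end_hour=19, interval_hours=2):
    """Collapse per-recording detections into one 0/1 value per occasion.

    detections: dict mapping start minute of day (0..1439) to True/False.
    Returns a list with one entry per sampling occasion (7 by default).
    """
    interval = interval_hours * 60
    first, last = start_hour * 60, end_hour * 60
    history = [0] * ((last - first) // interval)
    for minute, detected in detections.items():
        # Recordings outside the analysis window are ignored.
        if detected and first <= minute < last:
            history[(minute - first) // interval] = 1
    return history


# One recording every 10 min; detections at 04:50, 05:30 and 18:50.
day = {minute: False for minute in range(0, 1440, 10)}
for minute in (290, 330, 1130):
    day[minute] = True
print(occasion_history(day))
```

With one recording every 10 min, each two-hour occasion contains 12 recordings, which matches the basic sample unit defined above; the 04:50 detection falls outside the window and is dropped.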

Occupancy modelling

We used the detection/non-detection matrix generated after the validation of the classified data to fit single-season occupancy models using the package Unmarked in R (Fiske & Chandler 2011). The occupancy probability of each sampling site was estimated taking into account imperfect detection, following a standard maximum-likelihood hierarchical approach (MacKenzie et al. 2002). Our models include a sampling level describing the probability of detection conditioned on occupancy (p), and a biological level describing the probability (ψ) that a site is occupied. Both p and ψ are allowed to vary according to habitat characteristics. Because both elevation and forest type are expected to influence S. angelae occurrence (Kepler & Parkes 1972; Anadón-Irizarry 2006; Arendt, Qian & Mineard 2013), we chose to include these variables in our occupancy models. We included three continuous and standardized variables representing the effect of elevation on both occupancy and detection parameters: ‘Elevation’, ‘Elevation²’ and ‘Elevation³’, which provide a first-, second- and third-order polynomial function of the elevation data, respectively (Kéry et al. 2010). Additionally, we included the effect of per cent cover of six forest types (Tabonuco forest, Secondary forest, Palo Colorado forest, Sierra Palm forest, Elfin forest and Riparian forest) and forest cover in the occupancy and detection parameters. The per cent cover of each forest type was estimated within a buffer with a radius of 100 m centred on the location of each recorder. Forest type classification was based on vegetation classification maps developed by the USDA Forest Service (Gould et al. 2008). Lastly, we included a variable ‘Hour’, coded as 1–7 for each of the seven 2-h sampling periods. This variable was included in the detection parameter, because it is a good predictor of bird vocal activity (Catchpole & Slater 2003). We also included a second-order (‘Hour²’) and third-order (‘Hour³’) polynomial function of the hour data.
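For readers unfamiliar with the single-season occupancy likelihood of MacKenzie et al. (2002), the sketch below writes it out in plain Python. It is not the code used in the study, which fitted the models with the Unmarked package; the function names and the coefficient layout are assumptions, and only the elevation polynomial is shown as a covariate on ψ.

```python
import math


def inv_logit(x):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def occupancy_probability(elevation, betas):
    """psi as a cubic polynomial of standardized elevation on the logit scale.

    betas = (intercept, b1, b2, b3); hypothetical coefficient layout.
    """
    b0, b1, b2, b3 = betas
    return inv_logit(b0 + b1 * elevation + b2 * elevation ** 2 + b3 * elevation ** 3)


def site_likelihood(history, psi, p):
    """Likelihood of one site's detection history.

    history: list of 0/1 per sampling occasion.
    p: list of detection probabilities, one per occasion.
    """
    detected_path = psi
    for y, p_j in zip(history, p):
        detected_path *= p_j if y else (1.0 - p_j)
    # A site with no detections may be occupied but missed, or unoccupied.
    unoccupied_path = (1.0 - psi) if not any(history) else 0.0
    return detected_path + unoccupied_path


def log_likelihood(histories, psis, ps):
    """Sum of log-likelihoods over sites; maximized when fitting the model."""
    return sum(
        math.log(site_likelihood(h, psi, p))
        for h, psi, p in zip(histories, psis, ps)
    )
```

The second term in `site_likelihood` is what lets the model treat an all-zero history as ambiguous rather than as a confirmed absence, which is how imperfect detection is accounted for.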

To create a distribution map for the species in EYNF, we added a grid of 4032 hexagonal polygons of 3·1 ha each over a map of EYNF and extracted the per cent vegetation cover of each forest type. We used the function ‘predict’ from the Unmarked package to estimate the probability of occupancy for each hexagon. We used QGIS (QGIS Development Team 2015) to map the expected probability of occupancy across EYNF.


Materials and Methods

Fieldwork

In addition to historical collections of Elaphoglossum made over 200 years in Madagascar, collecting efforts focused especially on the genus have been conducted on the island since 2004, mainly in protected areas, as most, if not all, of the remaining wet natural forests are included in National Parks and other natural reserves. Collecting permits were granted by Madagascar National Parks and the Ministère de l’Environnement, et du Développement Durable (project numbers: 70/19/MEDD/SG/DGF/DSAP/SCB.Re, and 207/15/MEEMF/SG/DGF/DAPT/SCBT, and 199/15/MEEMF/SG/DGF/DAPT/SCBT, and 241/11/MEF/SG/DGF/DCB.SAP/SCB).

Plants were systematically sampled as modern collections, that is, including herbarium specimens, silica-dried leaf samples, and photos (Gaudeul & Rouhan, 2013). Complete sets of all collections made during these field trips are deposited at TAN or TEF, and with a few exceptions at P; duplicates, when available, have been sent elsewhere (or will be sent right after publication), particularly to K, MO, NBG, NY (herbarium codes follow Thiers, 2018).

Herbarium-based studies

The taxonomic revision that led to defining taxa and building novel identification keys is based on the examination of over 2,600 herbarium specimens representing 2,186 gatherings housed at P, and on field observations of most Elaphoglossum species. All specimens were databased and are freely available in the Paris Herbarium database at https://science.mnhn.fr/institution/mnhn/collection/p/item/search?lang=en_US. Additional specimens from other herbaria were examined in hand and annotated (BM, G, K, MO, NY, P, PR, TAN, TEF, US) or examined as online images (B, BR, PRE). All measurements, colors and other details included in the descriptions were based on herbarium specimens and data derived from field notes. In evaluating the variability of each species, habitat and ecology were noted in the field, but information on these features was also taken from other herbarium labels.

Illustrations and morphological characters

Herbarium specimens were examined under a Leica MZ6 dissection microscope, and close-up images acquired through a Leica DFC425 camera provided illustrations for each taxon; scales were mounted in glycerin gelatin between slide and cover slip, and these permanent slides were imaged using a Nikon CoolScan V ED slide scanner. The terminology used to describe the plants is based on Lellinger (2002).

Distribution maps

Distribution maps of new taxa were based on all cited specimens and generated with QGIS 2.14 (QGIS Geographic Information System. Open Source Geospatial Foundation Project. http://qgis.osgeo.org). A background map included five altitudinal ranges corresponding globally to those generally recognized in Madagascar (Humbert, 1955; Faramalala, 1995): 0–400 m (green), 400–800 m (yellow), 800–1,200 m (light brown), 1,200–1,800 m (medium brown) and >1,800 m (dark brown). Localities of specimens were represented by red dots (and open circles represented the six main cities in Madagascar). Distribution is also described in the text for each species and subspecies according to the five Malagasy phytogeographic domains as defined by Humbert (1955): East, Sambirano, Center, West, and South.

New botanical taxa

New botanical taxa were described only after considering all species known at least in Madagascar, Africa, Western Indian Ocean Islands (Comoros, Seychelles, La Réunion, Mauritius), and circumaustral islands from the Atlantic and the Indian Ocean. Thus, a morphological comparison to most closely-related species from those areas is provided through diagnoses and keys.

The electronic version of this article in Portable Document Format (PDF) will represent a published work according to the International Code of Nomenclature for algae, fungi, and plants (ICN), and hence the new names contained in the electronic version are effectively published under that Code from the electronic edition alone. In addition, new names contained in this work which have been issued with identifiers by IPNI will eventually be made available to the Global Names Index. The IPNI LSIDs can be resolved and the associated information viewed through any standard web browser by appending the LSID contained in this publication to the prefix “http://ipni.org/”. The online version of this work is archived and available from the following digital repositories: PeerJ, PubMed Central, and CLOCKSS.


Supporting Information

S1 Fig. Bayesian Information Criterion (BIC) as a function of number of clusters for plots 1–3.

Ten different combinations of constraints for multivariate mixture models have been tested: EII = spherical, equal volume; VII = spherical, unequal volume; EEI = diagonal, equal volume and shape; VEI = diagonal, varying volume, equal shape; EVI = diagonal, equal volume, varying shape; VVI = diagonal, varying volume and shape; EEE = ellipsoidal, equal volume, shape, and orientation; EEV = ellipsoidal, equal volume and equal shape; VEV = ellipsoidal, equal shape; VVV = ellipsoidal, varying volume, shape, and orientation.

S1 Table. Georeferenced values of δ¹⁵N (‰), δ¹³C (‰) and N concentration (g N kg⁻¹) used to create isoscapes.



Over the past decade the abundance of location-aware mobile devices has simplified recording of high-precision, high-accuracy geospatial data for the distribution of organisms. Several mobile apps are now available for this purpose (e.g., iNaturalist, iSpot, eBird); these contribute to the quality of citizen science databases (Spyratos and Lutz 2014). However, most biodiversity specimens collected prior to the 1990s do not have a latitude and longitude associated with them (Beaman and Conn 2003). This means that many of the world’s three billion biodiversity specimens (Beach et al. 2010), including insects on pins, plants on sheets, and fish in jars, some collected as long as three centuries ago, are not easily mapped. Therefore, their value as a historical baseline for research, education, and policymaking is limited (Cook et al. 2014; Hanken 2013).

Citizen science participants are playing an increasingly important role in transcribing specimen label data ( Ellwood et al. 2015 ), but the expansion of georeferencing of specimen collection localities by public participants lags, partly owing to the dearth of online tools enabling georeferencing and the lack of experiments assessing the quality of the data produced. Here we present two experiments in which locality descriptions were georeferenced (assigned a latitude and longitude coordinate) by both expert and novice participants. We compare the data generated by the two groups and suggest downstream analyses to produce the most accurate locality estimates.

Georeferencing of historical localities is just one of many applications within the field of historical GIS (Gregory and Ell 2007). While we focus here on members of the public georeferencing biodiversity specimens, research in the digital humanities also has made important contributions to current georeferencing methodologies and technologies. For example, Georeferencer, an online application designed to enable crowd-sourced rectifying of digital images of historic maps, has been modified and successfully implemented by numerous European institutions (Fleet et al. 2012). These efforts have resulted in tens of thousands of maps available online for increased discoverability, integration with modern map layers, improved visualizations, and a host of specialized research projects (Fleet et al. 2012; Holdsworth 2003; www.bl.uk/maps/georefabout.html). Like other fields, the digital humanities have turned to volunteers and crowd-sourcing to improve the rate at which historic documents are georeferenced (Offen 2012).

Volunteered Geographic Information (VGI) is a term coined in 2007 (Goodchild 2007) to recognize the fact that Internet-based media were incorporating geographic information wherever possible, including websites and mobile device apps for shopping, mapping, social connections, and weather (Sui and Goodchild 2011). VGI has grown tremendously over the last decade, as evidenced by the millions of registered users on OpenStreetMap (openstreetmap.org; Haklay and Weber 2008), a world map created and maintained by volunteers, and WikiMapia (wikimapia.org), a highly annotated world map with embedded links to related Wikipedia articles. OpenStreetMap also has a humanitarian arm of volunteers who apply their geographical skills in poorly mapped parts of the world that are in need of aid, e.g., after the years-long rebellion in the Central African Republic and after the 2015 earthquake in Nepal (hot.openstreetmap.org).

Geotagging also has grown in popularity as text messaging systems, social media outlets, and photo sharing sites (in particular Flickr.com) have enabled users to include geographic information with these various media (Barve 2014; Kumar and Seitz 2014). Participation in, and demand for, this functionality illustrates a general public interest in working with geographic interfaces, expanding geographic data, and improving freely available geographic information. Specific applications of geotagging have allowed researchers to track epidemic outbreaks (Lampos and Cristianini 2010), leverage the public’s interest in visiting clean water bodies for improved water quality (Keeler et al. 2015), and improve epidemiology research (Doherty et al. 2011).

While research applications of VGI are relatively common (Sui et al. 2013), working with volunteers to add geographic information based on a textual description is relatively uncommon. In one of the few existing examples, volunteers added geographical information to social media posts to provide targeted and specific help to victims of the 2010 earthquake in Haiti (Meier 2012). Immediately after the earthquake, Haitian and college student volunteers in Boston, Massachusetts, scoured the web for social media posts related to the event and created a live map of the locations from which they were sent. Some of these posts had geographic information embedded in them, while others were textual descriptions of a location (e.g., “trapped under house at corner of Main and 1st”; Camponovo and Freundschuh 2014; Meier 2012) that needed to be given a latitude and longitude. Volunteers classified the posts based on the type of aid that was needed and added them to the map; relief organizations then were able to use the live map to provide timely, appropriate help to individuals around the country.

Though less immediately urgent, the approach needed when georeferencing biodiversity specimens is similar to the above example. That is, citizen science participants read locality information in the form of short textual descriptions and transform that information into a latitude and longitude (i.e., a point on a map) and some measure of uncertainty, such as the radius of a circle. Biodiversity research specimens include a description of the locality that references political units (e.g., country, state, county); proximity to the nearest town or other geographical features; and/or the habitat (e.g., roadside, forest, lakeshore). Most descriptions require some interpretation and inference on the part of the georeferencer. The biodiversity research community previously established best practices for this type of work (Chapman et al. 2006); however, these practices were described prior to the recent expansion of VGI (Elwood et al. 2011; Goodchild 2007).

Georeferenced biodiversity specimens are crucial for many research applications including conservation (e.g., Miller et al. 2012; Rivers et al. 2011), estimating species ranges and extinctions (e.g., Boakes et al. 2010; Gotelli et al. 2012; Tingley and Beissinger 2009), habitat modeling (e.g., Fernández et al. 2015; Hope et al. 2013; Zhang et al. 2012), and natural resources management (e.g., Taylor et al. 2013). However, the level of accuracy and precision of georeferenced data impacts the quality of the downstream research (Graham et al. 2008; Rowe 2005). Taking advantage of the irreplaceable historical data provided by georeferenced biodiversity specimens will require a tremendous effort to georeference specimens currently in collections (Beach et al. 2010) using efficient methods leading to precise results (e.g., Guo et al. 2008).

Consider an example locality description from the label of a plant specimen collected in 1927 in Highlands County, Florida, which reads “High pine land Lake Stearns, Fla.” (Fig. 1 ). Turning this locality into a point on a map requires that a georeferencer find the town of Lake Stearns, determine where high pine habitat is likely to occur, and designate a point with a radius of uncertainty that encompasses the most likely collection location(s) of this specimen. To further complicate this process, habitat types and town names change over time. Since the time this specimen was collected nearly 90 years ago the town of Lake Stearns has changed its name to Lake Placid, and the high pine habitat where this specimen was collected may have ceased to exist. Even an expert georeferencer may have trouble as map layers usually reflect only current information, and finding historical town names and habitat types can be challenging. Also, specimen collection localities may be intentionally imprecise if a species is rare (e.g., to reduce illegal harvesting), and during some time periods and at some locations in the last three centuries, collectors were uncertain about precise locations because fine-scale maps and distinguishing features of the landscape were unavailable. Although many collection locality descriptions may be more straightforward than the one provided in this example, considering the breadth of heterogeneity in locality descriptions, can citizen science participants contribute accurate and appropriately precise specimen georeferences?

Label from a plant specimen from the Robert K. Godfrey Herbarium, Florida State University, Tallahassee, FL, US, demonstrating the potential challenges of georeferencing collection localities. In this case, the town has changed names since 1927, the locality description is imprecise, and the habitat is likely now residential development. Labels with such characteristics may be especially difficult for citizen science participants to georeference without local knowledge.

To investigate this question, we engaged undergraduate students as a proxy for the general population of citizen science participants. While we do not have data demonstrating that these students are comparable to the general citizen science community, they are a subset of the general population and represent a range of abilities, levels of innate interest, and prior experience with geographical information and biodiversity research. We chose to use students so that we could generate sufficient data in the absence of an established citizen science georeferencing platform and community. We asked:

1. How accurate are student georeferencers compared to automated georeferencing software and experts? Does student involvement improve on the accuracy of a georeferencing algorithm?

2. What method is most effective at estimating an accurate consensus georeference from replicate points for the same collection locality? Is the consensus generated in this way more accurate than the individual points?

3. How do the best georeferencers compare to the group as a whole? That is, is it useful to only consider the points produced by the most accurate georeferencers?

To address our research questions, we conducted two experiments in which undergraduate students and experts georeferenced the same collection localities. The two experiments differed in the spatial distribution of collection localities (seven states in the USA vs. Florida’s Apalachicola National Forest), the biology of the organisms (fish vs. plants), and the number of student georeferences for each locality (1–2 vs. 6–15, respectively). We addressed question 1 with both datasets and questions 2 and 3 with the many-georeferences-per-location dataset.

Each of the experiments relied on GEOLocate software ( www.museum.tulane.edu/geolocate/ ), which uses an automated georeferencing algorithm to make the human georeferencing more efficient. The algorithm interprets strings of text and provides a suggested point location and radius of uncertainty. GEOLocate displays the most likely point as a green dot and shows red dots for other possible, though less likely, points based on the GEOLocate algorithm. A user can choose one of these suggestions or create another point. GEOLocate also includes features that allow a user to view different map layers, expand the screen, zoom and pan, mark a spot, measure, and save a point. All participants used GEOLocate to assess, navigate, and extract spatial information.

Fish experiment: Thousands of fish localities each georeferenced by one or two students

In the first experiment, 3,372 U.S. fish collection localities from Fishnet2 (fishnet2.net/aboutFishNet.html) were each georeferenced by one (or occasionally two) undergraduate student georeferencers at Tulane University (New Orleans, Louisiana, USA) using GEOLocate’s Collaborative Georeferencing platform (CoGe; museum.tulane.edu/geolocate/community). The data were grouped into seven state datasets and distributed among 11 students (undergraduate students in Natural Resource Conservation and Biodiversity Informatics classes taught at Tulane) and eight trained and experienced project technicians, such that each dataset was georeferenced by at least one student and at least one trained, experienced technician. Students and technicians corrected the geolocation recommended by GEOLocate when necessary and saved the latitude and longitude of the chosen location. Student training involved a 50-minute overview on georeferencing biodiversity data followed by demonstrations on using GEOLocate and CoGe. The technicians were hired specifically to georeference fish specimen localities as part of a research grant. They received two days of training, encompassing basic geographic principles, georeferencing methodologies and standards, and project protocols. Many of them had GIS experience prior to the project, and all of them had months of experience georeferencing localities in the project by the time of the experiment.

At Tulane, data processing and analyses were conducted using PostgreSQL 9.3, PostGIS 2.1, Microsoft Access 2010, Microsoft Excel 2010, and Microsoft Excel 2013. We compared the distances between student and expert points with the distances between the most highly suggested GEOLocate point and expert points. Records that were not resolvable by GEOLocate were excluded from GEOLocate comparisons. Because we had only one or two student results for each technician result for each locality in the fish dataset, we could not compute means and medians across student results as in the plant experiment.
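The point-to-point comparison described above comes down to a distance between two latitude/longitude pairs. The authors report using PostGIS for this; the following is only a minimal Python sketch of the same idea, assuming a spherical Earth and the haversine formula. The function name `haversine_km` and the radius constant are illustrative choices, not taken from the study.

```python
import math

# Assumed mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical student and expert points for one locality.
student = (30.25, -84.60)
expert = (30.20, -84.65)
distance_km = haversine_km(*student, *expert)
```

A spherical approximation is usually adequate at the kilometre scale reported here; a projected or ellipsoidal calculation would be the more careful choice for thresholds as small as 100 m.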

Plant experiment: Hundreds of plant localities each georeferenced by many students

In the second experiment, 270 plant collection localities from Florida’s Apalachicola National Forest (ANF) each were georeferenced by 6–15 students at Florida State University (FSU, Tallahassee, Florida, USA) using GEOLocate’s standard online platform. The plant collection locality descriptions were taken from the database of FSU’s Robert K. Godfrey Herbarium (www.herbarium.bio.fsu.edu). Each student was provided an Excel worksheet with collection information parsed into columns: specimen barcode, scientific name, country, state, county, and locality description. The locality description was an aggregation of entries in the following herbarium database fields: Nearest Named Place, Special Geographic Unit, Verbatim Directions to Locality, and Habitat. An example is “Bristol, Apalachicola National Forest by Fla Rt. 12, S of Bristol, Apalachicola National Forest, just within boundary, longleaf pine savanna.” An additional column contained links that took the student directly to the GEOLocate website with the specimen’s locality description preloaded in the interface. The full Excel file had 17 different worksheets, each listing 16 specimens (with the exception of the last worksheet, which had only 14 specimens).

Each of 154 Florida State University junior and senior undergraduate students enrolled in the course Plant Biology was assigned one worksheet (i.e., 16 or 14 specimen localities) from within the full file to georeference. As a class, students were provided with both a 30-minute training session and written instructions that included a step-by-step guide for augmenting the Excel file with a latitude and longitude (but not a measure of uncertainty) obtained from their work using GEOLocate. Although each worksheet was assigned to the same number of students, some students did not follow directions, so certain worksheets were completed more frequently than others. In the end, each specimen was georeferenced 6–15 times (mode = 8, median = 9).

When a student followed a specimen’s link to GEOLocate, they were asked to use GEOLocate’s automated georeferencing algorithm (a button labeled “Georeference”) to produce suggested points. They could then pan, zoom, and open other map layers to show different features, including political boundaries, streets, and aerial photos, until they found the closest approximation of the textual description. Then they cut and pasted the latitude and longitude into Excel. Completion of these tasks, regardless of accuracy, earned the student credit for the required assignment. However, students could opt out of the experiment by choosing not to complete an Institutional Review Board–approved waiver. Students were given one week to complete the assignment; during that time they could email one of us (GN) for guidance or help.

Independent of the student work, two local botanists with extensive collecting experience in ANF volunteered to also complete the georeferencing tasks. As local experts, they were familiar with habitat types in the ANF, specific plant populations, favored collection areas, and field collection protocols. This knowledge provided them the advantage over students of being able to more easily interpret and georeference label information. These individuals included a radius of uncertainty with their georeferences and made note of challenging or vague locality descriptions. The experts produced one point for each specimen, which henceforth are referred to as “expert” points.

A small subset of student points in the plant dataset were interpreted as outliers and were removed from the dataset. Such errors included latitude and/or longitude of 0, positive or negative latitude or longitude when the opposite was appropriate for the hemisphere, values that were incomplete, and values that were placed at the exact centroid of the nearby town of Apalachicola (an occasional mistake by the GEOLocate algorithm that students did not always correct; the town lies outside the boundaries of ANF). We consider this data-cleaning step to be a reasonable approximation of what can be done by any project doing georeferencing with citizen science participants, and we did not use any special knowledge of the expert points at this step. Analyses were conducted with the remaining points in QGIS version 2.6.1 Brighton (QGIS Development Team 2014), Environmental Systems Research Institute’s ArcGIS version 10.2 (Environmental Systems Research Institute 2014), and R statistical software version 3.1.1 (R Core Development Team 2014).
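The screening rules listed above can be expressed as a simple filter. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name `is_plausible`, the tolerance, and the centroid coordinates are placeholders, and the sign checks assume a study area in the northern and western hemispheres, as is the case for Florida.

```python
import math

# Placeholder centroid for the town of Apalachicola; NOT the value used in the study.
HYPOTHETICAL_TOWN_CENTROID = (29.7258, -84.9832)


def is_plausible(lat, lon, bad_centroid=None, tol=1e-6):
    """Return True if a student point passes the basic screening rules."""
    if lat is None or lon is None:
        return False  # incomplete value
    if lat == 0 or lon == 0:
        return False  # latitude and/or longitude of 0
    if lat < 0 or lon > 0:
        return False  # wrong sign for a northern/western-hemisphere study area
    if bad_centroid is not None:
        clat, clon = bad_centroid
        if (math.isclose(lat, clat, abs_tol=tol)
                and math.isclose(lon, clon, abs_tol=tol)):
            return False  # point sits exactly on the known erroneous centroid
    return True


# Hypothetical student points: one valid, one with a sign error, one on the centroid.
points = [(30.2, -84.7), (30.2, 84.7), HYPOTHETICAL_TOWN_CENTROID]
kept = [p for p in points if is_plausible(*p, bad_centroid=HYPOTHETICAL_TOWN_CENTROID)]
```

None of these checks needs the expert points, which is what makes the step reproducible by a project that has no expert baseline.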

We calculated distance statistics between the expert point and points generated by students for each collection locality, including mean distance of student points and minimum and maximum distance of student points. For these plant experiment data, we calculated a mean and median georeferenced point for each collection locality from the replicate student points using ESRI’s ArcMap spatial statistics tools Mean Center and Median Center, respectively. The Mean Center is simply the average X and average Y coordinate among all the points, while the Median Center tool uses an iterative algorithm to calculate the point that minimizes the summed Euclidean distance to all the student points for a given specimen record. The median point gives less weight to anomalous georeferences. For comparison, we also calculated the distance between the expert point and those suggested as most likely by the GEOLocate algorithm.
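To illustrate the difference between the two consensus points, here is a minimal sketch in plain Python. It is not ESRI's implementation: the median is computed with the Weiszfeld iteration, a standard method for the geometric median that is assumed here to be comparable to what the Median Center tool does. Coordinates are treated as planar (projected) X/Y values, and the function names are illustrative.

```python
import math


def mean_center(points):
    """Average X and average Y of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def median_center(points, tol=1e-9, max_iter=1000):
    """Geometric median via Weiszfeld iteration, starting from the mean center."""
    x, y = mean_center(points)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < tol:
                # Simplification: if the estimate lands on a data point, stop there.
                return (px, py)
            w = 1.0 / d  # nearer points get larger weights
            num_x += w * px
            num_y += w * py
            denom += w
        new_x, new_y = num_x / denom, num_y / denom
        if math.hypot(new_x - x, new_y - y) < tol:
            return (new_x, new_y)
        x, y = new_x, new_y
    return (x, y)


# Hypothetical replicate points: four clustered georeferences and one anomalous one.
replicates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean_pt = mean_center(replicates)      # pulled far toward the anomalous point
median_pt = median_center(replicates)  # stays close to the cluster
```

With these made-up points the mean center is dragged about 20 units along each axis by the single anomalous point, while the median center remains inside the cluster.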

Individual students were evaluated for accuracy by comparing their mean distance from expert points (as measured using uncertainty radii for the specific specimens) for all specimens georeferenced by that individual. To determine the increased accuracy brought about by removing the least accurate georeferencers, we re-ran some of the analyses by first excluding 19 students whose complete set of georeferenced points averaged 100 uncertainty radii or greater from the expert’s points, and then by excluding the bottom half (least accurate) of georeferencers. The first exclusion removes those participants who are perhaps least likely to contribute to a citizen science project requiring this skill set, given their poor aptitude for it or their poor engagement in the activity. The second left us with a proxy for those members of the public who are devoted to a citizen science project and likely to become experienced in a way that becomes recognizable to the project. A disproportionate percentage of online tasks often are completed by a very small number of committed citizen science participants ( Eveleigh et al. 2014 ).
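The two exclusion steps can be sketched as follows. This is a hedged illustration only: the record layout, function names, and the floor-of-half rule are assumptions, and every uncertainty radius is assumed to be greater than zero.

```python
from collections import defaultdict


def mean_ur_by_student(records):
    """Mean distance from the expert point, in uncertainty radii, per student.

    records: iterable of (student_id, distance_m, uncertainty_radius_m),
    with every uncertainty radius assumed to be > 0.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for student_id, distance_m, radius_m in records:
        totals[student_id][0] += distance_m / radius_m
        totals[student_id][1] += 1
    return {sid: total / count for sid, (total, count) in totals.items()}


def keep_below_threshold(scores, threshold=100.0):
    """Students whose mean distance is under `threshold` uncertainty radii."""
    return {sid for sid, score in scores.items() if score < threshold}


def keep_best_half(scores):
    """The more accurate half of students (floor of n/2, an assumed rule)."""
    ranked = sorted(scores, key=scores.get)
    return set(ranked[: len(ranked) // 2])


# Hypothetical records: (student, distance to expert in m, expert radius in m).
records = [
    ("s1", 500.0, 1000.0), ("s1", 1500.0, 1000.0),
    ("s2", 250000.0, 1000.0),
    ("s3", 3000.0, 1000.0),
    ("s4", 8000.0, 1000.0),
]
scores = mean_ur_by_student(records)
```

Here `keep_below_threshold(scores)` drops only the hypothetical student `s2`, while `keep_best_half(scores)` retains `s1` and `s3`.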

Results

How accurate are student georeferencers?

Fish experiment: Eleven students produced 4,433 georeferences for 3,372 localities (1,061 localities georeferenced twice). The mean distance of student points from those of expert georeferencers ranged from 1.5–75.5 km (mean = 21.3 km). We defined outliers as student points that were greater than two standard deviations from the overall mean displacement of each student’s result from the expert result; outlier distance ranged from 13� km across all determinations. Georeferences with greater than a 25 km deviation were typically placed in the wrong county and/or state, and should be detectable through data validation routines involving spatial queries against administrative units in the absence of expert points. Numbers of outliers ranged from just 0–17 georeferences (mean = 6.5) per student. Excluding outliers, per-student mean distances between student and expert georeferencer determinations decreased to 0.9–40.7 km (overall mean = 8.3 km). Forty percent of student georeferences were within 0.5 km of the expert points, 53% were within 1 km, and 81% were within 5 km (Fig. 2). Considering the uncertainty radius assigned by the experts, 71% of student points were within one uncertainty radius of the expert, and 90% were within 10 (Table 1).
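A two-standard-deviation outlier rule and the cumulative distance cut-offs reported above can be sketched in a few lines. This is an interpretation rather than the authors' code: it assumes the rule is applied to one student's list of distances at a time, uses the sample standard deviation, and flags only distances above the mean.

```python
import statistics


def flag_outliers(distances_km, k=2.0):
    """Flag distances more than k sample standard deviations above the mean."""
    mean = statistics.fmean(distances_km)
    sd = statistics.stdev(distances_km)
    cutoff = mean + k * sd
    return [d > cutoff for d in distances_km]


def share_within(distances_km, cutoff_km):
    """Fraction of distances that fall within a distance cut-off."""
    return sum(d <= cutoff_km for d in distances_km) / len(distances_km)


# Hypothetical distances (km) from the expert point for one student.
distances = [1.0] * 9 + [100.0]
flags = flag_outliers(distances)
retained = [d for d, is_outlier in zip(distances, flags) if not is_outlier]
```

In this made-up example only the 100 km value is flagged, and `share_within(distances, 1.0)` returns 0.9.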

Distribution of the distance of student georeferences from expert points in the fish experiment at Tulane University with outliers removed.

Comparison of student points, consensus student points (using mean and median), and GEOLocate automated points to expert points measured by uncertainty radius (UR) for the fish and plant experiments. Because relatively few of the collection locations in the fish experiment were georeferenced by multiple students, we do not report comparisons with the consensus student points for that experiment.

We found that involving students in the process, rather than relying on the GEOLocate-suggested point alone, increased the percentage of points within each of the uncertainty radii cut-offs (Table 1; e.g., 71.07% vs. 49.09%, respectively, within 1 uncertainty radius as assigned by the expert georeferencers) and each of the absolute distance cut-offs below 10,000 m (Table 2).

Comparison of student points, GEOLocate automated points, and median of student points to expert points measured by absolute distance for the fish and plant experiments. Because relatively few of the collection locations in the fish experiment were georeferenced by multiple students, we do not report comparisons with the consensus student points for that experiment.

Plant experiment: A total of 2,425 georeferences were produced by students, and after removing outliers, 2,408 (99%) remained. The mean distance between student points and the expert point for each collection locality ranged from 0.18–37.08 km, with an overall mean student distance from the respective expert point of 4.62 km.

To compare use of the automated georeferencing algorithm of GEOLocate alone with the additional involvement of the student georeferencers, we narrowed the number of collection localities to 251 because GEOLocate’s suggested points for the other specimens were returned as errors. The most successful consensus georeferencing method (use of the median point for the replicate student points) places a greater proportion of points within the uncertainty radii thresholds than the GEOLocate-suggested point (Table 1). When measuring that distance in meters, the median point outperforms GEOLocate alone, except at a cut-off of 100 m (where GEOLocate alone has a slight advantage; Table 2).

Which method is most effective for producing an accurate consensus georeference?

For the plant data, use of the median georeferenced point as a consensus of replicate student georeferences is better than the mean georeferenced point at each of several uncertainty distances from the expert point (e.g., 12.22% of the mean points and 18.15% of the median points are within 1 uncertainty radius of their expert point; Table 1). Unless otherwise indicated, we will use the median georeferenced point as the standard for comparison of the consensus point with the expert point.

The same is true when we consider distance from the expert point using absolute distance (Fig. 3). For more than half of the student points in the plant experiment (58.60%; 1,411 of 2,408 points), the median point for a collection locality is at least 10 m closer to the expert point than the individual student point itself. About a quarter of the student points (25.83%; 622 points) are at least 10 m closer to the expert points than the median point (Table 2). The remainder have similar distances to the expert point as the median point.

Distribution of the distance between mean (black bars) and median (gray bars) consensus of student replicate georeferences from the expert points in the plant experiment at Florida State University with outliers removed.

Is it useful to differentiate data based on georeferencer performance?

About 39% (99 of 254) of the single best student points for a collection locality are within one uncertainty radius of the expert point for that locality (Table 1 ), and about 43% of the single best student points are within 100 m of the expert point (Table 2 ). Examining the 99 single best points within one uncertainty radius we found that 48 (31%) of the 154 students contributed to them and just four students (3%) were responsible for 24 of those points.

We removed 19 of the 154 students contributing to the plant experiment using our threshold for identifying the least talented or motivated georeferencers, reducing the number of georeferenced points from 2408 to 2095 and the number of localities from 258 to 254. Using this reduced data set, the percentage of localities within one uncertainty radius of the expert increased from 18.15% with the full dataset to 23.33% (Table 1 ). Similarly, the percentage of localities that fell within 100 meters of the expert point increased from 5.56% with the full dataset to 23.70% with the reduced dataset (Table 2 ).

When we included only the best 74 (48%) of the plant georeferencers (1,185 points), the distance of the median points from the expert points, as measured by uncertainty radii, improved on the results of the full dataset, but not strikingly (e.g., 18.15% of the medians are within one uncertainty radius for the whole dataset vs. 20.47% for the subset; Table 1). The absolute distance, however, shows a marked improvement (e.g., 5.56% of the medians are within 100 m for the total dataset vs. 23.90% of the medians for this subset; Table 2).

Our results provide a first approximation of what can be expected from citizen science participants with minimal georeferencing training. This is a valuable contribution, for while OpenStreetMap ( Haklay and Weber 2008 ) and WikiMapia ( wikimapia.org ) have demonstrated enthusiasm for volunteered geographic information ( Goodchild 2007 ), we are not aware of studies that have assessed the quality of citizen science georeferencing of collection localities for biodiversity specimens or, more generally, of points contributed by georeferencing novices using locality descriptions (e.g., as done by Meier 2012 in another domain). We consider the results encouraging and suggest that they might serve as a benchmark against which to compare future changes to the process, several of which we suggest here.

Our use of undergraduate students as proxies for the general citizen science population, in the absence of an established georeferencing citizen science platform and community, merits further discussion. Coleman et al. ( 2009 ) present a hierarchy of volunteer participation in the context of contributing geographic data. By their definitions, we expect our student volunteers to mostly be neophytes—“an individual without a formal background in a subject, but who possesses the interest, time, and willingness to offer an opinion” (page 338). Whether the potential population of citizen science participants who would contribute data in this way represents a similar fraction of neophytes remains unanswered by our study. Potentially a greater fraction of those who would be motivated to contribute, and possibly some of our more experienced undergraduate volunteers, would qualify as expert amateurs—“someone who may know a great deal about a subject, practices it passionately on occasion, but still does not rely on it for a living”—as would our expert volunteers in the plant experiment. (Our experts from the fish experiment would qualify as expert professionals in Coleman et al.’s scheme—“someone who has studied and practices a subject … [and] relies on that knowledge for a living.”) By Coleman et al.’s estimation, and further analysis by Lauriault and Mooney ( 2014 ), “expert amateurs” may be the most productive volunteer contributors of geographic information, although positive and negative motivations vary across projects and can influence relative involvement of a group. Targeting expert amateurs, or educating neophytes to become expert amateurs, in the biodiversity community might be an effective strategy for increasing contributions and improving their quality beyond that reported here. 
Expert amateurs might be found as members of native plant societies, entomological clubs, sportsmen’s groups, online communities such as iNaturalist ( inaturalist.org ), and conservation and environmental organizations. Members of historical societies may provide additional local knowledge and a familiarity with regional geographic and landscape features. Future research on the topic could benefit from including a broader demographic of citizen science participants in experiments, along with additional methods such as surveys, to understand the advantages and limitations to working with each of these groups.

Despite large differences in the spatial extent of the areas considered in the experiments (seven states in the US vs. a national forest) and the biology of the organisms (fish in aquatic habitat vs. plants in, mostly, terrestrial habitat), the experiments produced strikingly similar average distances between student- and expert-contributed points (8.3 km with a range of 0.9–40.7 km and 4.6 km with a range of 0.2–37.1 km, respectively). However, when the distance is measured by uncertainty radii assigned for each collection locality by the experts, differences emerge. Relatively more of the contributed fish georeferences (71%) are within an uncertainty radius of the expert point than the plant georeferences (15%), perhaps because the extent of fish habitat is more easily identified on a map than that of plants and there is often relatively less of it. Also, the relatively larger uncertainty radii of the fish experiment (expert mean = 4,136 m, range = 0�,118 m) than the plant experiment (mean = 1,054 m, range = 16–21,095 m) made it easier for students to place a point within the uncertainty radius of the expert in that experiment.

Creation of a consensus point from replicates for a collection locality improved upon the overall percentage of points within one uncertainty radius in the plant experiment (the fish experiment did not consistently include replicates) when the consensus was produced as the median point, but not the mean point (Table 1). The median is less sensitive to outliers and makes more sense than the mean for building consensus in this context. We do not address the relationship between number of replicates used to produce the median and the median’s accuracy here, but the relationship has clear importance when designing efficient citizen science projects in the domain. We expect a plateau above which more replicates do not improve accuracy of the median and therefore might represent wasted effort if other statistics are not also being estimated with the additional points. We expect that the location of such a plateau will vary from project to project for reasons discussed above (habitat requirements differ, as do typical sizes of uncertainty radii), and that location needs to be determined in a pilot study specific to that dataset until patterns begin to emerge across datasets. The additional points beyond those needed to improve the median might be important if used to estimate a measure of uncertainty for the locality if there is a relationship between the spread of points and the uncertainty that an expert might assign the locality (e.g., as an uncertainty radius or polygon; Chapman 2006). The relationship between spread and uncertainty might plateau at a different place than the accuracy of the median.
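The difference between the two consensus methods can be illustrated with a short Python sketch. It assumes that the median point is taken coordinate-wise (the median latitude paired with the median longitude); the text does not state how the median point was computed, so this definition, the function name `consensus_point`, and the sample coordinates are all hypothetical. The sketch also ignores longitude wrap-around at the antimeridian.

```python
from statistics import mean, median


def consensus_point(points, method="median"):
    """Combine replicate (lat, lon) points for one locality into a single point.

    method: "median" (coordinate-wise median) or "mean" (coordinate-wise mean).
    """
    if not points:
        raise ValueError("at least one point is required")
    if method not in ("median", "mean"):
        raise ValueError("method must be 'median' or 'mean'")
    aggregate = median if method == "median" else mean
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (aggregate(lats), aggregate(lons))


# Four replicates cluster tightly; the fifth is a gross outlier (made-up values).
replicates = [
    (30.00, -85.00),
    (30.01, -85.01),
    (30.02, -84.99),
    (30.01, -85.00),
    (31.00, -84.00),
]
median_point = consensus_point(replicates, "median")
mean_point = consensus_point(replicates, "mean")
```

In this made-up example the median stays inside the cluster of four agreeing replicates, while the mean is pulled roughly a fifth of the way toward the outlier.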

The accuracy of the data clearly improved beyond that produced using the automated GEOLocate algorithm when students were part of the workflow. The percentage of GEOLocate-generated points within an uncertainty radius of the expert points was improved upon by the students in both experiments (e.g., 12.75% vs. 15.16% within 1 uncertainty radius for the plants; Table 1), and even more so when the median was calculated (18.15%). Note that the GEOLocate algorithm may have provided an important step in the student and expert contributions, especially in the fish experiment where the spatial extent of possible localities was very large. We cannot say whether the involvement of a georeferencing algorithm improved or reduced the accuracy of student points, because the experiment did not make that contrast. Future studies may wish to include an additional experiment that determines accuracy of citizen science participants in the absence of an algorithm. Further consideration of the topic, particularly by researchers in the field of human computation and machine learning, could investigate how the automated georeferencing algorithm could be improved by closing the loop, that is, by providing feedback to it in the form of citizen-science contributed data.

While the median-point consensus of replicates represented an improvement on the percentage of individual points within threshold numbers of uncertainty radii (e.g., 18.15% vs. 15.16% within 1 uncertainty radius for the plants; Table 1), the fact that the single best point for each locality is even more often within those thresholds (38.98% within 1 uncertainty radius for plants; Table 1) invites the question: are there ways to assess the likelihood that a contributed point is the best for a collection locality in the absence of expert points for all collection localities? One way that this might be accomplished is to assess the overall performance of georeferencers, assigning them reputation scores that reflect attributes such as success with localities for a handful of points that experts have georeferenced. A likelihood of success with such an approach is suggested by the fact that the 99 single best points within one uncertainty radius for plants were contributed by 31% of contributors (and not 65%, which would be one best per each of 99 of the 154 total students). Furthermore, a quarter of those 99 points were contributed by just four students.

We also looked at this relationship in another way, asking if the accuracy of the median point improves when data from only the best georeferencers are considered. In the case of thresholds of uncertainty radii, the percentages improved at most thresholds, but generally not dramatically (e.g., 50.37% at a threshold of 5 uncertainty radii for all georeferencers, 51.48% with exclusion of the 19 worst georeferencers, and 55.51% with the exclusion of the worst half of georeferencers; Table 1). The improvement is most striking, though, when the absolute distance of median from expert point is considered at low thresholds (e.g., 5.56% within 100 meters for all georeferencers and 23.70% and 23.90% with exclusion of 19 worst and worst half, respectively). This relationship can become especially relevant when the fitness for use depends on a precision within some absolute distance. For example, considering global latitudinal diversity gradients, modeling species distributions, and relocating a population are three activities that typically require increasingly precise data.
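The idea of recomputing the median from only the better georeferencers can be sketched as follows. This is a hypothetical illustration rather than the procedure used in the study: it assumes georeferencers are ranked by their mean distance to expert points on a set of calibration localities, that the median is taken coordinate-wise, and that the function names and the `keep_fraction` parameter are invented for the example.

```python
import math
from statistics import mean, median


def rank_georeferencers(calibration_errors):
    """Return contributor ids sorted best first.

    calibration_errors: dict mapping contributor id to a list of distances (m)
    between that contributor's points and the expert points.
    """
    return sorted(calibration_errors, key=lambda c: mean(calibration_errors[c]))


def filtered_median_point(contributions, ranking, keep_fraction=0.5):
    """Median point for one locality using only the top-ranked contributors.

    contributions: list of (contributor id, lat, lon) for a single locality.
    Returns None if no retained contributor georeferenced this locality.
    """
    n_keep = max(1, math.ceil(len(ranking) * keep_fraction))
    trusted = set(ranking[:n_keep])
    kept = [(lat, lon) for who, lat, lon in contributions if who in trusted]
    if not kept:
        return None
    lats = [lat for lat, _ in kept]
    lons = [lon for _, lon in kept]
    return (median(lats), median(lons))
```

Setting `keep_fraction=0.5` corresponds to excluding the worst half of georeferencers; a caller could instead drop a fixed number of contributors by slicing the ranking directly.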

Hunter et al. (2013) provide a case study of an implementation involving data validation and trust metrics for improving the quality and measuring the reliability of citizen science data within Coral Watch (www.coralwatch.org). A similar approach could be used to develop a weighted index of reputation based on some combination of (1) total number of user contributions, (2) frequency of user contributions, (3) geospatial deviation from known results, and (4) geospatial deviation for identical localities from users with higher reputation. Liu and Liu (2015) demonstrate a learning algorithm that can assess the quality of crowd-sourced data and provide results from only the strongest combination of contributors. The ability to sort “good” data from “bad” data, in an environment where the correct information is not known at the start, has obvious applications to the field of citizen science georeferencing, and we anticipate incorporating techniques similar to this in future work.
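One possible shape for such a weighted index is sketched below in Python. Everything specific in it is an assumption made for illustration: the weights, the caps used to normalize contribution counts and rates, the error scale, and the names `ContributorStats` and `reputation` do not come from Hunter et al. or from this study. The sketch only shows how the four listed components could be normalized to a common 0 to 1 scale and combined.

```python
from dataclasses import dataclass


@dataclass
class ContributorStats:
    total_contributions: int      # component (1)
    contributions_per_week: float  # component (2)
    mean_error_known_m: float      # component (3): deviation from expert-georeferenced points
    mean_error_peers_m: float      # component (4): deviation from higher-reputation users


def reputation(stats, weights=(0.2, 0.1, 0.5, 0.2),
               volume_cap=100, rate_cap=10.0, error_scale_m=1000.0):
    """Weighted reputation score in [0, 1]; higher is better (hypothetical scheme)."""
    # Volume and frequency saturate at their caps so prolific users cannot dominate.
    volume = min(stats.total_contributions / volume_cap, 1.0)
    rate = min(stats.contributions_per_week / rate_cap, 1.0)
    # Errors map to (0, 1]: zero error gives 1.0, an error of error_scale_m gives 0.5.
    known = 1.0 / (1.0 + stats.mean_error_known_m / error_scale_m)
    peers = 1.0 / (1.0 + stats.mean_error_peers_m / error_scale_m)
    components = (volume, rate, known, peers)
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)
```

In practice the weights would need to be tuned against localities with known expert georeferences, and component (4) requires an initial reputation ranking before it can be computed.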

It is important to realize that, as illustrated in Fig. 1, there are specimens for which a precise georeference is not warranted and for which the actual collection locality is obscured by the changes of time. For example, 23% of the single best points for the plant localities were not within 1 km of the expert point, despite there being 6–15 replicates for each. Based on the plant dataset, types of labels that resulted in large discrepancies between expert and student points included these cases: (a) Directional labels that do not specify how the distance is measured. For example, in the case of “Sumatra flatwoods pond, 16 miles N of Sumatra, flatwoods pond,” students measured 16 miles due north, while the experts followed the main road out of Sumatra, which veered to the northeast. This was a common problem, with three of the ten most poorly placed student points falling into this category. (b) Labels with overly general or contradictory information. For example, in the case of “4 miles NE of Sumatra, by Fla. Rt. 379,” there is likely an error because Route 379 runs in a northwesterly direction from Sumatra. The issue of flagging collection localities that are likely to fall into these categories for georeferencing by experts or even the original collector (if still living) merits future consideration. Collection localities could perhaps be classified algorithmically with natural language processing into those requiring triage of this type, making citizen science engagement for georeferencing more efficient.