University of Twente Student Theses


Accounting for sampling bias in species distribution modelling using rarefaction

Gichangi, Douglas Mutuota (2021) Accounting for sampling bias in species distribution modelling using rarefaction.

Abstract: Predicting the spatial distribution of species is vital for conservation planning, and accurate predictions require an appropriate sampling design. Ecological data that originate from unscientific sources are often biased towards the most accessible areas, such as roads and nature parks. Moreover, most methods of species distribution modelling (SDM) assume random and uniformly distributed samples. Spatially biased samples may therefore lead to over- or underprediction, making distribution models unreliable. Presence/absence data are generally preferred to presence-only data because they contain more information about a species’ habitat. However, collecting presence/absence occurrence data requires laborious field surveys and enormous resources, which makes such data rare. By contrast, plenty of presence-only (PO) observations exist in herbaria, museums, and online databanks, most of which are electronically accessible. In many instances, remedial treatment is required to make PO data reliable, for example to correct the effects of sampling bias. Clustering of data may lead to model overprediction in areas that are intensively sampled. This effect can be mitigated by de-clustering the data, for example through rarefaction, or by introducing randomly distributed background samples. The average nearest neighbour method was used to test two different wild boar observation datasets for spatial bias. Spatial rarefaction was used to de-cluster the presence-only data. Then a dataset with a similar number of observations (n) was selected from the original dataset. Five different methods of species distribution modelling (Boosted Regression Trees, Random Forests, Maximum Entropy, Support Vector Machine and Generalized Linear Models) were fitted with the two datasets and environmental predictors.
The environmental predictors included 50 m resolution Euclidean distance maps from roads, nature reserves, heath and moor, farmlands, forest, and artificial surfaces. Randomization was undertaken by replicating the models twenty times for each method using bootstrapping. Consistency in the model predictions across methods was assessed by computing the standard deviation of the spatial predictions and comparing the zonal statistics for the various environmental variables. The FBE dataset was found to be more clustered than the volunteer observations, which is consistent with the way the different observers are distributed. For all methods, the models from the rarefied datasets were significantly different from those from the clustered ones. The machine learning methods (RF, SVM, MaxEnt) tended to be more tolerant of survey bias caused by clustering than the empirical models (BRT, GLM). This was demonstrated through the t-test statistic, whereby the results for the machine learning models were less significant than those for the rest. However, all the models performed well for both datasets, with a mean AUC above 0.8. Regarding variable importance across model permutations, the distance to nature reserves contributed most while the distance to water contributed least. However, the variability was more evenly distributed for the rarefied dataset than for the clustered one. Moreover, all models showed high prediction uncertainty in water areas, while cultivated areas had the least. Bias correction therefore significantly improved the performance of the species distribution models.
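The two bias-handling steps named in the abstract, the average nearest neighbour test and spatial rarefaction, can be illustrated with a minimal Python sketch. This is not the thesis workflow: the abstract does not state which software or thinning rule was used, so the greedy minimum-distance thinning below, the function names, the 10-unit threshold, and the toy coordinates are all assumptions made for illustration. The ratio is the observed mean nearest-neighbour distance divided by the value expected under complete spatial randomness, 0.5 / sqrt(n / area); values well below 1 suggest clustering.

```python
import math
import random


def mean_nearest_neighbour_distance(points):
    """Mean distance from each point to its closest other point (O(n^2))."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def nearest_neighbour_ratio(points, area):
    """Observed mean NN distance over the distance expected for a random
    pattern of the same density, 0.5 / sqrt(n / area)."""
    expected = 0.5 / math.sqrt(len(points) / area)
    return mean_nearest_neighbour_distance(points) / expected


def rarefy(points, min_dist, seed=0):
    """Greedy thinning (an assumed rule, not necessarily the thesis's):
    visit points in random order and keep a point only if it lies at
    least min_dist away from every point already kept."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    kept = []
    for x, y in shuffled:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept


# Toy data (invented): 200 observations clustered around one location
# inside a hypothetical 1000 x 1000 study area.
rng = random.Random(42)
clustered = [(rng.gauss(500, 20), rng.gauss(500, 20)) for _ in range(200)]
study_area = 1000.0 * 1000.0
min_dist = 10.0  # hypothetical rarefaction distance, in map units

thinned = rarefy(clustered, min_dist)

print("points before / after:", len(clustered), "/", len(thinned))
print("NN ratio before:", nearest_neighbour_ratio(clustered, study_area))
print("NN ratio after: ", nearest_neighbour_ratio(thinned, study_area))
```

The thinning is order-dependent, so a different seed keeps a different subset; repeating it over several seeds and retaining the largest subset is one common way to reduce that arbitrariness.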
Item Type: Essay (Master)
Faculty: ITC: Faculty of Geo-information Science and Earth Observation
Programme: Geoinformation Science and Earth Observation MSc (75014)