A quick guide to species distribution modelling with fuzzySim

fuzzySim is an R package initially designed for computing fuzzy similarity in species distributions. Meanwhile, it can also produce fuzzy species occurrence data to calculate fuzzy similarity from. Among the most widely used methods to produce such fuzzy occurrence data is generalized linear modeling of species presence-absence records, which can provide both occurrence probability and environmental favourability. This tutorial will explore such modelling and its applications.

Note that these procedures can be applied to any other modelling techniques that produce presence probability. Generalized Linear Models (GLMs) are used just as an example which is available in base R.

Installing and loading fuzzySim

The fuzzySim package works within the free and open-source R statistical software, so you first need to download, install and open R (available at http://www.r-project.org). In this tutorial, in monospaced font are the commands that you need to run. For commands that generate visible results in the R console, these are usually shown below them, preceded by hash marks (##). Note that all commands are case-sensitive, so you must respect upper- and lower-case letters; that you must always use straight (', ") rather than curly quotes and apostrophes; and that R is only ready to receive a new command when there’s a prompt sign (>) at the end of the R console; if not, it’s still waiting for an operation to be finished or for you to complete a previous command – watch out for unclosed parentheses or such.

Install fuzzySim by running the command below in R (when connected to the internet):

install.packages("fuzzySim")

You only need to install the package once (for each version of the package or of R), but you need to load it every time you open a new R session in which you intend to use fuzzySim (no need for an internet connection anymore), by getting it from your R library with the following command:

library(fuzzySim)

Preparing and modelling the data

NOTE: This package works with species occurrence data in presence-absence tables. If your data are point occurrences and raster maps, check out this blog post on how to convert them into tabular presence-absence data.

For species distribution modelling, you’ll need a table with species presences and absences in wide format, i.e., one species per column and their presences and absences as ones and zeros; if your data are in long format, with all species in the same column, check out help(splist2presabs) for a way to convert them. You’ll also need the values of a set of predictor variables to use in the model(s). For an example of how your data should be organised, look at the rotif.env sample dataset that comes with fuzzySim (a global dataset of rotifer distribution records published with this article). The following command will load this dataset in your R session:

data(rotif.env)

You can get more information on this dataset with the following command, which should open an R Documentation window:

help(rotif.env)

Now look at the first rows of this dataset:

head(rotif.env)

##    TDWG4   LEVEL_NAME            REGION_NAME        CONTINENT      Area
## 1 ABT-OO      Alberta         Western_Canada NORTHERN_AMERICA 663485.40
## 2 AFG-OO  Afghanistan           Western_Asia   ASIA-TEMPERATE 641921.77
## 3 AGE-BA Buenos_Aires Southern_South_America SOUTHERN_AMERICA 306187.95
## 4 AGE-CH        Chaco Southern_South_America SOUTHERN_AMERICA  99203.11
## 5 AGE-CN   Corrientes Southern_South_America SOUTHERN_AMERICA  88614.06
## 6 AGE-ER   Entre_Rios Southern_South_America SOUTHERN_AMERICA  78071.93
##   Altitude AltitudeRange HabitatDiversity HumanPopulation  Latitude  Longitude
## 1   769.07          3346               12         3461492  54.95520 -114.45960
## 2  1797.41          6347               13        32755566  33.78802   65.98809
## 3    92.66          1092               12        15548773 -36.64692  -60.54985
## 4   115.57           230                7         1090382 -26.38870  -60.76430
## 5    67.60           195                9         1029757 -28.75806  -57.78881
## 6    44.22           125                9         1296896 -32.03426  -59.20174
##   Precipitation PrecipitationSeasonality TemperatureAnnualRange Temperature
## 1        454.96                    52.23                 454.56       0.429
## 2        309.59                    92.11                 403.11      11.728
## 3        813.76                    29.92                 272.70      15.055
## 4        935.89                    57.13                 257.05      21.847
## 5       1292.63                    28.18                 236.53      20.720
## 6       1059.91                    31.85                 256.20      18.215
##   TemperatureSeasonality UrbanArea Abrigh Afissa Apriod Bangul Bcalyc Bplica
## 1               11465.98      1085      0      0      1      1      0      0
## 2                8812.06       790      1      0      1      1      1      1
## 3                5040.31         0      1      1      0      1      1      1
## 4                4147.56         0      0      0      0      1      0      0
## 5                4192.44         0      0      0      0      0      1      1
## 6                4637.56         0      0      0      0      0      1      0
##   Bquadr Burceo Cgibba Edilat Flongi Kcochl Kquadr Ktropi Lbulla Lclost Lhamat
## 1      0      0      0      0      1      1      1      0      0      0      0
## 2      1      1      1      0      0      1      1      1      1      1      1
## 3      1      1      0      1      1      1      1      1      1      1      1
## 4      1      0      1      0      1      0      0      1      1      1      1
## 5      1      0      0      1      1      0      1      0      0      0      0
## 6      1      0      0      1      1      1      0      1      1      0      0
##   Lluna Llunar Lovali Lpatel Lquadr Mventr Ppatul Pquadr Pvulga Specti Tpatin
## 1     1      1      1      0      0      0      1      0      1      1      0
## 2     1      0      1      1      1      0      1      1      1      0      1
## 3     1      1      1      0      1      1      1      1      1      0      1
## 4     0      1      1      0      1      1      1      1      1      0      0
## 5     1      0      1      0      0      1      1      1      1      0      0
## 6     0      0      0      0      0      0      1      1      0      0      1
##   Tsimil Ttetra
## 1      0      1
## 2      1      0
## 3      0      1
## 4      1      1
## 5      1      0
## 6      0      0

Show the column names of this dataset, to see which columns contain the species data and which contain the variables:

names(rotif.env)

##  [1] "TDWG4"                    "LEVEL_NAME"              
##  [3] "REGION_NAME"              "CONTINENT"               
##  [5] "Area"                     "Altitude"                
##  [7] "AltitudeRange"            "HabitatDiversity"        
##  [9] "HumanPopulation"          "Latitude"                
## [11] "Longitude"                "Precipitation"           
## [13] "PrecipitationSeasonality" "TemperatureAnnualRange"  
## [15] "Temperature"              "TemperatureSeasonality"  
## [17] "UrbanArea"                "Abrigh"                  
## [19] "Afissa"                   "Apriod"                  
## [21] "Bangul"                   "Bcalyc"                  
## [23] "Bplica"                   "Bquadr"                  
## [25] "Burceo"                   "Cgibba"                  
## [27] "Edilat"                   "Flongi"                  
## [29] "Kcochl"                   "Kquadr"                  
## [31] "Ktropi"                   "Lbulla"                  
## [33] "Lclost"                   "Lhamat"                  
## [35] "Lluna"                    "Llunar"                  
## [37] "Lovali"                   "Lpatel"                  
## [39] "Lquadr"                   "Mventr"                  
## [41] "Ppatul"                   "Pquadr"                  
## [43] "Pvulga"                   "Specti"                  
## [45] "Tpatin"                   "Tsimil"                  
## [47] "Ttetra"

You can see that predictor variables are in columns 5 to 17 and species presence/absence data are in columns 18 to 47. You can get distribution models for multiple species simultaneously with the multGLM function of fuzzySim. You must specify the name of the dataset, the names or index numbers of the columns containing the species data and the variables, and optionally the name or index number of the column containing the row identifiers. There are a number of additional options on how to select variables for the models; the following command will create an object named rotif.mods containing models built for each of these species with the default multGLM settings:

rotif.mods <- multGLM(data = rotif.env, sp.cols = 18:47, var.cols = 5:17, id.col = 1)

The object output by multGLM is a list containing three elements: a dataframe with the resulting predictions (of which you can see the first rows by typing head(rotif.mods$predictions)), a list named models with the model objects (which you can call by typing rotif.mods$models), and a list of vectors naming the variables included in each model (rotif.mods$variables). Let’s check out one of the models, for example the first one in the list:

summary(rotif.mods$models[[1]])

## 
## Call:
## glm(formula = Abrigh ~ HabitatDiversity + TemperatureSeasonality + 
##     Area + Precipitation + PrecipitationSeasonality + HumanPopulation, 
##     family = binomial, data = model$model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7438  -0.9705  -0.5980   1.0993   1.9609  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.224e+00  1.003e+00  -2.218 0.026539 *  
## HabitatDiversity          3.558e-01  9.173e-02   3.879 0.000105 ***
## TemperatureSeasonality   -2.197e-04  5.342e-05  -4.112 3.93e-05 ***
## Area                      6.668e-07  2.735e-07   2.439 0.014744 *  
## Precipitation            -6.174e-04  2.554e-04  -2.418 0.015611 *  
## PrecipitationSeasonality -1.208e-02  4.935e-03  -2.448 0.014350 *  
## HumanPopulation           1.141e-08  5.400e-09   2.112 0.034660 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.85  on 290  degrees of freedom
## Residual deviance: 344.63  on 284  degrees of freedom
## AIC: 358.63
## 
## Number of Fisher Scoring iterations: 4

You can also call a model by the name of the species in the modelled dataset (result will be the same as above):

summary(rotif.mods$models[["Abrigh"]])

## 
## Call:
## glm(formula = Abrigh ~ HabitatDiversity + TemperatureSeasonality + 
##     Area + Precipitation + PrecipitationSeasonality + HumanPopulation, 
##     family = binomial, data = model$model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7438  -0.9705  -0.5980   1.0993   1.9609  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.224e+00  1.003e+00  -2.218 0.026539 *  
## HabitatDiversity          3.558e-01  9.173e-02   3.879 0.000105 ***
## TemperatureSeasonality   -2.197e-04  5.342e-05  -4.112 3.93e-05 ***
## Area                      6.668e-07  2.735e-07   2.439 0.014744 *  
## Precipitation            -6.174e-04  2.554e-04  -2.418 0.015611 *  
## PrecipitationSeasonality -1.208e-02  4.935e-03  -2.448 0.014350 *  
## HumanPopulation           1.141e-08  5.400e-09   2.112 0.034660 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.85  on 290  degrees of freedom
## Residual deviance: 344.63  on 284  degrees of freedom
## AIC: 358.63
## 
## Number of Fisher Scoring iterations: 4

Additional modelling options

Let’s now take a closer look at the different modelling options available in multGLM. If you type help(multGLM), under Usage, you can see which are the default parameters:

multGLM(data, sp.cols, var.cols, id.col = NULL, family = "binomial",
test.sample = 0, FDR = FALSE, correction = "fdr", FDR.first = TRUE,
corSelect = FALSE, cor.thresh = 0.8, cor.method = "pearson", step = TRUE,
trace = 0, start = "null.model", direction = "both", select = "AIC",
trim = TRUE, Y.prediction = FALSE, P.prediction = TRUE, Favourability = TRUE,
group.preds = TRUE, TSA = FALSE, coord.cols = NULL, degree = 3, verbosity = 2,
test.in = "Rao", test.out = "LRT", p.in = 0.05, p.out = 0.1, ...)

So, the command above should produce the same results as the multGLM command executed before, where most of these arguments were not specified explicitly. The first three arguments (data, sp.cols and var.cols) do not have default values, so they always need to be specified by the user; but the remaining parameters have their default values set, so for example you can keep id.col as NULL if you don’t have or don’t want to use an ID column. The family argument currently has only one option available in multGLM, so the function will produce an error message if you try to specify a different one.

The test.sample argument is 0 by default, but it can be increased if you want part of the data to be reserved for testing the model, and thus not used for model training. You can specify either a value between 0 and 1, for a proportion of the data to choose randomly (e.g. 0.2 for 20%); an integer number, for a particular number of cases to choose randomly among the rows in data; a vector of integers, for the index numbers of the particular rows to set aside; or “Huberty”, for his rule of thumb on how many data should be set aside based on the number of variables.

The FDR argument, which is FALSE by default, indicates whether there should be a pre-selection of variables based on the significance of their bivariate relationship with the species’ occurrence. If you set it to TRUE, the FDR function is called automatically – see help(FDR) for more info on the procedure, which you may also try directly on your dataset.

The corSelect argument, which is also FALSE by default, indicates whether there should be a pre-selection of variables with pair-wise correlations above the threshold given in the next argument (cor.thresh). If you set it to TRUE, the corSelect function is called automatically – see help(corSelect) for more info on the procedure, which you may also try directly on your dataset.

The step argument, which is TRUE by default, defines whether variables should be included in the models with a stepwise selection procedure, by default based on Akaike’s Information Criterion (AIC), using the step function of R. The three following arguments are relevant only when step = TRUE: trace shows (or not, if FALSE) the intermediate results of the stepwise inclusion of model variables; start defines whether the inclusion of the variables should start forward (with “null.model”) or backward (with “full.model)”; and direction specifies in which direction the variable selection should proceed (“forward”, “backward”, or “both”; see help(step) for more info.)

Arguments Y.prediction, P.prediction and Favourability define the type of predictions you want in the output predictions table. Y (FALSE by default) is the prediction in the scale of the predictor variables (i.e. the logit equation); P is the prediction in the scale of the response variable (i.e. probability, varying between 0 and 1); and Favourability is the prevalence-independent version of probability (also between 0 and 1), which can be directly compared across species, regions and time periods (see Details in help(Fav) for more info).

Argument TSA, which is FALSE by default, lets you define whether you want a trend surface analysis (calculated individually for each species) to be automatically added as an additional spatial variable in each model. The two following arguments, coord.cols and degree, are used when TSA = TRUE. See help(multTSA) for more information.

Other arguments have been added to this function along the years. You can refer to the help(multGLM) help file for detailed info and references on each of them.

Model evaluation

You can analyse these models and evaluate their performance with the modEvA R package, which is also available on CRAN and on R-Forge together with another short tutorial.

Model extrapolation and comparison

You can use all these models at once to predict onto a different dataset (which can be either a table, a raster stack or a SpatRaster object) containing the same variables (with the same names) but for another region or time period: see the Examples in help(getPreds) for how to do this.

If you’ve calculated Favourability (which is provided by default with the multGLM() function, but which you can get from any other presence probability model with the Fav() function), which is directly comparable among species with different prevalences, you can then use fuzzy logic to combine the predictions of different models (with no need to choose a threshold and convert them to binary predictions) using intersection, union and other logical operations. These can get you, for example, the favourability for the simultaneous occurrence of a set of species, or the favourability for occurrence of at least one of the species in a given set (see e.g. this article for illustrated details). You can also calculate the fuzzy consensus between a bunch of models, i.e., the fuzzy equivalent of the proportion of models that agree that the species can potentially occur at each site.

names(rotif.mods$predictions)
# get the favourability (_F) columns:
fav_cols <- grep("_F", names(rotif.mods$predictions))

rotif.mods$predictions$Fav_all <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "intersection")
rotif.mods$predictions$Fav_any <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "union")
rotif.mods$predictions$consensus <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "consensus")

You can also quantify overall similarity between models, e.g. with niche comparison metrics (such as Schoener’s D and Warren’s I, which quantify how similar two sets of model predictions are) available in the modOverlap function, or with fuzzy versions of traditionally binary similarity indices (Jaccard, Baroni, Sorensen, Simpson) with the fuzSim function. Jaccard and Baroni’s indices have the advantage of having associated tables of significance, so you can check if the similarity between your models is higher, lower, or similar to what could be expected by chance given your sample size.

fav_current <- rotif.mods$predictions$Ttetra_F
# imagine you have a model prediction for this species in a future time
# (here we'll simulate one by randomly jittering the current predictions)
fav_imag <- jitter(fav_current, amount = 0.2)
fav_imag[fav_imag < 0] <- 0
fav_imag[fav_imag > 1] <- 1
# now compute similarity between current and imaginary predictions:
modOverlap(fav_current, fav_imag)

## $SchoenerD
## [1] 0.9037397
## 
## $WarrenI
## [1] 0.9896593
## 
## $HellingerDist
## [1] 0.1438103

fuzSim(fav_current, fav_imag, method = "Jaccard")

## [1] 0.8229377

fuzSim(fav_current, fav_imag, method = "Baroni")

## [1] 0.9038297

You can also quantify the overall range change (the fuzzy equivalent of expansion area, contraction area, maintenance area and overall balance) based on the predictions of two models for different time periods:

fuzzyRangeChange(fav_current, fav_imag)  # results in the plot are proportions

##                 Number Proportion
## Gain             14.70       0.10
## Loss            -13.01      -0.09
## Stable positive   2.43       0.02
## Stable negative   2.57       0.02
## Balance           1.69       0.01

# unless otherwise specified:
fuzzyRangeChange(fav_current, fav_imag, prop = FALSE)

##                 Number
## Gain             14.70
## Loss            -13.01
## Stable positive   2.43
## Stable negative   2.57
## Balance           1.69

Saving your results

You can save your model predictions to disk, for example in CSV format, with the following command:

write.csv(rotif.mods$predictions, file = "predictions.csv")

Modelling your own data

If you want to try this out with your own species data, get the data into R – for example, save your table in a text file, with column names in the first row and columns separated by tabulators, name the file mydata.txt, save it in your R working directory (type getwd() to find out where it is), and then import it to R with the following command:

mydata <- read.table("mydata.txt", header = TRUE, sep = "\t")

As noted above, if your data are point occurrences and raster maps, you can check out this blog post on how to convert them into tabular presence-absence data.

Then reproduce the operations above, but replacing rotif.env with mydata (or whatever name you’ve assigned in the above command) and specifying your column index numbers accordingly. If you use fuzzySim in publications, please cite the paper:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858 (DOI: 10.1111/2041-210X.12372).

That’s it! You can contact me with any suggestions or concerns, but first remember to check for updates to the package or to this tutorial at http://fuzzysim.r-forge.r-project.org. This tutorial was built with RStudio + rmarkdown + knitr. Thanks!