fuzzySim is an R package initially designed for computing fuzzy similarity in species distributions. Meanwhile, it can also produce fuzzy species occurrence data to calculate fuzzy similarity from. Among the most widely used methods to produce such fuzzy occurrence data is generalized linear modeling of species presence-absence records, which can provide both occurrence probability and environmental favourability. This tutorial will explore such modelling and its applications.
Note that these procedures can be applied to any other modelling techniques that produce presence probability. Generalized Linear Models (GLMs) are used just as an example which is available in base R.
The fuzzySim package works within the free
and open-source R statistical software, so you first need to
download, install and open R (available at http://www.r-project.org). In this tutorial, in
monospaced
font are the commands that you need to run. For
commands that generate visible results in the R console, these are
usually shown below them, preceded by hash marks (##). Note that all
commands are case-sensitive, so you must respect upper-
and lower-case letters; that you must always use
straight ('
, "
)
rather than curly quotes and apostrophes; and that
R is only ready to receive a new command when there’s a prompt
sign (>) at the end of the R console; if not, it’s still
waiting for an operation to be finished or for you to complete a
previous command – watch out for unclosed parentheses or such.
Install fuzzySim by running the command below in R (when connected to the internet):
install.packages("fuzzySim")
You only need to install the package once (for each version of the package or of R), but you need to load it every time you open a new R session in which you intend to use fuzzySim (no need for an internet connection anymore), by getting it from your R library with the following command:
library(fuzzySim)
NOTE: This package works with species occurrence data in presence-absence tables. If your data are point occurrences and raster maps, check out this blog post on how to convert them into tabular presence-absence data.
For species distribution modelling, you’ll need a table with species
presences and absences in wide format, i.e., one species per
column and their presences and absences as ones and zeros; if your data
are in long format, with all species in the same column, check
out help(splist2presabs)
for a way to convert them. You’ll
also need the values of a set of predictor variables to use in the
model(s). For an example of how your data should be organised, look at
the rotif.env sample dataset that comes with fuzzySim
(a global dataset of rotifer distribution records published with this
article). The following command will load this dataset in your R
session:
data(rotif.env)
You can get more information on this dataset with the following command, which should open an R Documentation window:
help(rotif.env)
Now look at the first rows of this dataset:
head(rotif.env)
## TDWG4 LEVEL_NAME REGION_NAME CONTINENT Area
## 1 ABT-OO Alberta Western_Canada NORTHERN_AMERICA 663485.40
## 2 AFG-OO Afghanistan Western_Asia ASIA-TEMPERATE 641921.77
## 3 AGE-BA Buenos_Aires Southern_South_America SOUTHERN_AMERICA 306187.95
## 4 AGE-CH Chaco Southern_South_America SOUTHERN_AMERICA 99203.11
## 5 AGE-CN Corrientes Southern_South_America SOUTHERN_AMERICA 88614.06
## 6 AGE-ER Entre_Rios Southern_South_America SOUTHERN_AMERICA 78071.93
## Altitude AltitudeRange HabitatDiversity HumanPopulation Latitude Longitude
## 1 769.07 3346 12 3461492 54.95520 -114.45960
## 2 1797.41 6347 13 32755566 33.78802 65.98809
## 3 92.66 1092 12 15548773 -36.64692 -60.54985
## 4 115.57 230 7 1090382 -26.38870 -60.76430
## 5 67.60 195 9 1029757 -28.75806 -57.78881
## 6 44.22 125 9 1296896 -32.03426 -59.20174
## Precipitation PrecipitationSeasonality TemperatureAnnualRange Temperature
## 1 454.96 52.23 454.56 0.429
## 2 309.59 92.11 403.11 11.728
## 3 813.76 29.92 272.70 15.055
## 4 935.89 57.13 257.05 21.847
## 5 1292.63 28.18 236.53 20.720
## 6 1059.91 31.85 256.20 18.215
## TemperatureSeasonality UrbanArea Abrigh Afissa Apriod Bangul Bcalyc Bplica
## 1 11465.98 1085 0 0 1 1 0 0
## 2 8812.06 790 1 0 1 1 1 1
## 3 5040.31 0 1 1 0 1 1 1
## 4 4147.56 0 0 0 0 1 0 0
## 5 4192.44 0 0 0 0 0 1 1
## 6 4637.56 0 0 0 0 0 1 0
## Bquadr Burceo Cgibba Edilat Flongi Kcochl Kquadr Ktropi Lbulla Lclost Lhamat
## 1 0 0 0 0 1 1 1 0 0 0 0
## 2 1 1 1 0 0 1 1 1 1 1 1
## 3 1 1 0 1 1 1 1 1 1 1 1
## 4 1 0 1 0 1 0 0 1 1 1 1
## 5 1 0 0 1 1 0 1 0 0 0 0
## 6 1 0 0 1 1 1 0 1 1 0 0
## Lluna Llunar Lovali Lpatel Lquadr Mventr Ppatul Pquadr Pvulga Specti Tpatin
## 1 1 1 1 0 0 0 1 0 1 1 0
## 2 1 0 1 1 1 0 1 1 1 0 1
## 3 1 1 1 0 1 1 1 1 1 0 1
## 4 0 1 1 0 1 1 1 1 1 0 0
## 5 1 0 1 0 0 1 1 1 1 0 0
## 6 0 0 0 0 0 0 1 1 0 0 1
## Tsimil Ttetra
## 1 0 1
## 2 1 0
## 3 0 1
## 4 1 1
## 5 1 0
## 6 0 0
Show the column names of this dataset, to see which columns contain the species data and which contain the variables:
names(rotif.env)
## [1] "TDWG4" "LEVEL_NAME"
## [3] "REGION_NAME" "CONTINENT"
## [5] "Area" "Altitude"
## [7] "AltitudeRange" "HabitatDiversity"
## [9] "HumanPopulation" "Latitude"
## [11] "Longitude" "Precipitation"
## [13] "PrecipitationSeasonality" "TemperatureAnnualRange"
## [15] "Temperature" "TemperatureSeasonality"
## [17] "UrbanArea" "Abrigh"
## [19] "Afissa" "Apriod"
## [21] "Bangul" "Bcalyc"
## [23] "Bplica" "Bquadr"
## [25] "Burceo" "Cgibba"
## [27] "Edilat" "Flongi"
## [29] "Kcochl" "Kquadr"
## [31] "Ktropi" "Lbulla"
## [33] "Lclost" "Lhamat"
## [35] "Lluna" "Llunar"
## [37] "Lovali" "Lpatel"
## [39] "Lquadr" "Mventr"
## [41] "Ppatul" "Pquadr"
## [43] "Pvulga" "Specti"
## [45] "Tpatin" "Tsimil"
## [47] "Ttetra"
You can see that predictor variables are in columns 5 to 17 and species presence/absence data are in columns 18 to 47. You can get distribution models for multiple species simultaneously with the multGLM function of fuzzySim. You must specify the name of the dataset, the names or index numbers of the columns containing the species data and the variables, and optionally the name or index number of the column containing the row identifiers. There are a number of additional options on how to select variables for the models; the following command will create an object named rotif.mods containing models built for each of these species with the default multGLM settings:
rotif.mods <- multGLM(data = rotif.env, sp.cols = 18:47, var.cols = 5:17, id.col = 1)
The object output by multGLM is a list containing three
elements: a dataframe with the resulting predictions (of which
you can see the first rows by typing
head(rotif.mods$predictions)
), a list named models
with the model objects (which you can call by typing
rotif.mods$models
), and a list of vectors naming the
variables included in each model
(rotif.mods$variables
). Let’s check out one of the models,
for example the first one in the list:
summary(rotif.mods$models[[1]])
##
## Call:
## glm(formula = Abrigh ~ HabitatDiversity + TemperatureSeasonality +
## Area + Precipitation + PrecipitationSeasonality + HumanPopulation,
## family = binomial, data = model$model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7438 -0.9705 -0.5980 1.0993 1.9609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.224e+00 1.003e+00 -2.218 0.026539 *
## HabitatDiversity 3.558e-01 9.173e-02 3.879 0.000105 ***
## TemperatureSeasonality -2.197e-04 5.342e-05 -4.112 3.93e-05 ***
## Area 6.668e-07 2.735e-07 2.439 0.014744 *
## Precipitation -6.174e-04 2.554e-04 -2.418 0.015611 *
## PrecipitationSeasonality -1.208e-02 4.935e-03 -2.448 0.014350 *
## HumanPopulation 1.141e-08 5.400e-09 2.112 0.034660 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.85 on 290 degrees of freedom
## Residual deviance: 344.63 on 284 degrees of freedom
## AIC: 358.63
##
## Number of Fisher Scoring iterations: 4
You can also call a model by the name of the species in the modelled dataset (result will be the same as above):
summary(rotif.mods$models[["Abrigh"]])
##
## Call:
## glm(formula = Abrigh ~ HabitatDiversity + TemperatureSeasonality +
## Area + Precipitation + PrecipitationSeasonality + HumanPopulation,
## family = binomial, data = model$model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7438 -0.9705 -0.5980 1.0993 1.9609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.224e+00 1.003e+00 -2.218 0.026539 *
## HabitatDiversity 3.558e-01 9.173e-02 3.879 0.000105 ***
## TemperatureSeasonality -2.197e-04 5.342e-05 -4.112 3.93e-05 ***
## Area 6.668e-07 2.735e-07 2.439 0.014744 *
## Precipitation -6.174e-04 2.554e-04 -2.418 0.015611 *
## PrecipitationSeasonality -1.208e-02 4.935e-03 -2.448 0.014350 *
## HumanPopulation 1.141e-08 5.400e-09 2.112 0.034660 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.85 on 290 degrees of freedom
## Residual deviance: 344.63 on 284 degrees of freedom
## AIC: 358.63
##
## Number of Fisher Scoring iterations: 4
Let’s now take a closer look at the different modelling options
available in multGLM. If you type help(multGLM)
,
under Usage, you can see which are the default parameters:
multGLM(data, sp.cols, var.cols, id.col = NULL, family = "binomial",
test.sample = 0, FDR = FALSE, correction = "fdr", FDR.first = TRUE,
corSelect = FALSE, cor.thresh = 0.8, cor.method = "pearson", step = TRUE,
trace = 0, start = "null.model", direction = "both", select = "AIC",
trim = TRUE, Y.prediction = FALSE, P.prediction = TRUE, Favourability = TRUE,
group.preds = TRUE, TSA = FALSE, coord.cols = NULL, degree = 3, verbosity = 2,
test.in = "Rao", test.out = "LRT", p.in = 0.05, p.out = 0.1, ...)
So, the command above should produce the same results as the multGLM command executed before, where most of these arguments were not specified explicitly. The first three arguments (data, sp.cols and var.cols) do not have default values, so they always need to be specified by the user; but the remaining parameters have their default values set, so for example you can keep id.col as NULL if you don’t have or don’t want to use an ID column. The family argument currently has only one option available in multGLM, so the function will produce an error message if you try to specify a different one.
The test.sample argument is 0 by default, but it can be increased if you want part of the data to be reserved for testing the model, and thus not used for model training. You can specify either a value between 0 and 1, for a proportion of the data to choose randomly (e.g. 0.2 for 20%); an integer number, for a particular number of cases to choose randomly among the rows in data; a vector of integers, for the index numbers of the particular rows to set aside; or “Huberty”, for his rule of thumb on how many data should be set aside based on the number of variables.
The FDR argument, which is FALSE by
default, indicates whether there should be a pre-selection of variables
based on the significance of their bivariate relationship with the
species’ occurrence. If you set it to TRUE, the FDR function is
called automatically – see help(FDR)
for more info on the
procedure, which you may also try directly on your dataset.
The corSelect argument, which is also FALSE
by default, indicates whether there should be a pre-selection of
variables with pair-wise correlations above the threshold given in the
next argument (cor.thresh). If you set it to TRUE, the
corSelect function is called automatically – see
help(corSelect)
for more info on the procedure, which you
may also try directly on your dataset.
The step argument, which is TRUE by
default, defines whether variables should be included in the models with
a stepwise selection procedure, by default based on Akaike’s Information
Criterion (AIC), using the step function of R. The three
following arguments are relevant only when step = TRUE:
trace shows (or not, if FALSE) the
intermediate results of the stepwise inclusion of model variables;
start defines whether the inclusion of the
variables should start forward (with “null.model”) or backward (with
“full.model)”; and direction specifies in
which direction the variable selection should proceed (“forward”,
“backward”, or “both”; see help(step)
for more info.)
Arguments Y.prediction,
P.prediction and
Favourability define the type of predictions
you want in the output predictions table. Y (FALSE by
default) is the prediction in the scale of the predictor variables
(i.e. the logit equation); P is the prediction in the scale of
the response variable (i.e. probability, varying between 0 and 1); and
Favourability is the prevalence-independent version of
probability (also between 0 and 1), which can be directly compared
across species, regions and time periods (see Details in
help(Fav)
for more info).
Argument TSA, which is FALSE by default,
lets you define whether you want a trend surface analysis (calculated
individually for each species) to be automatically added as an
additional spatial variable in each model. The two following arguments,
coord.cols and
degree, are used when TSA = TRUE. See
help(multTSA)
for more information.
Other arguments have been added to this function along the years. You
can refer to the help(multGLM)
help file for detailed info
and references on each of them.
You can analyse these models and evaluate their performance with the modEvA R package, which is also available on CRAN and on R-Forge together with another short tutorial.
You can use all these models at once to predict onto a
different dataset (which can be either a table, a raster stack
or a SpatRaster
object) containing the same variables (with
the same names) but for another region or time period: see the Examples
in help(getPreds)
for how to do this.
If you’ve calculated Favourability (which is provided by default with
the multGLM()
function, but which you can get from any
other presence probability model with the Fav()
function),
which is directly comparable among species with different prevalences,
you can then use fuzzy logic to combine the predictions
of different models (with no need to choose a threshold and convert them
to binary predictions) using intersection,
union and other logical operations.
These can get you, for example, the favourability for
the simultaneous occurrence of a set of species, or the
favourability for occurrence of at least one of the
species in a given set (see e.g. this
article for illustrated details). You can also calculate the
fuzzy consensus between a bunch of models, i.e., the
fuzzy equivalent of the proportion of models that agree that the species
can potentially occur at each site.
names(rotif.mods$predictions)
# get the favourability (_F) columns:
fav_cols <- grep("_F", names(rotif.mods$predictions))
rotif.mods$predictions$Fav_all <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "intersection")
rotif.mods$predictions$Fav_any <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "union")
rotif.mods$predictions$consensus <- fuzzyOverlay(rotif.mods$predictions[ , fav_cols], op = "consensus")
You can also quantify overall similarity between models, e.g. with niche comparison metrics (such as Schoener’s D and Warren’s I, which quantify how similar two sets of model predictions are) available in the modOverlap function, or with fuzzy versions of traditionally binary similarity indices (Jaccard, Baroni, Sorensen, Simpson) with the fuzSim function. Jaccard and Baroni’s indices have the advantage of having associated tables of significance, so you can check if the similarity between your models is higher, lower, or similar to what could be expected by chance given your sample size.
fav_current <- rotif.mods$predictions$Ttetra_F
# imagine you have a model prediction for this species in a future time
# (here we'll simulate one by randomly jittering the current predictions)
fav_imag <- jitter(fav_current, amount = 0.2)
fav_imag[fav_imag < 0] <- 0
fav_imag[fav_imag > 1] <- 1
# now compute similarity between current and imaginary predictions:
modOverlap(fav_current, fav_imag)
## $SchoenerD
## [1] 0.9037397
##
## $WarrenI
## [1] 0.9896593
##
## $HellingerDist
## [1] 0.1438103
fuzSim(fav_current, fav_imag, method = "Jaccard")
## [1] 0.8229377
fuzSim(fav_current, fav_imag, method = "Baroni")
## [1] 0.9038297
You can also quantify the overall range change (the fuzzy equivalent of expansion area, contraction area, maintenance area and overall balance) based on the predictions of two models for different time periods:
fuzzyRangeChange(fav_current, fav_imag) # results in the plot are proportions
## Number Proportion
## Gain 14.70 0.10
## Loss -13.01 -0.09
## Stable positive 2.43 0.02
## Stable negative 2.57 0.02
## Balance 1.69 0.01
# unless otherwise specified:
fuzzyRangeChange(fav_current, fav_imag, prop = FALSE)
## Number
## Gain 14.70
## Loss -13.01
## Stable positive 2.43
## Stable negative 2.57
## Balance 1.69
You can save your model predictions to disk, for example in CSV format, with the following command:
write.csv(rotif.mods$predictions, file = "predictions.csv")
If you want to try this out with your own species data, get the data
into R – for example, save your table in a text file,
with column names in the first row and columns
separated by tabulators, name the file
mydata.txt, save it in your R working directory (type
getwd()
to find out where it is), and then import it to R
with the following command:
mydata <- read.table("mydata.txt", header = TRUE, sep = "\t")
As noted above, if your data are point occurrences and raster maps, you can check out this blog post on how to convert them into tabular presence-absence data.
Then reproduce the operations above, but replacing rotif.env with mydata (or whatever name you’ve assigned in the above command) and specifying your column index numbers accordingly. If you use fuzzySim in publications, please cite the paper:
Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858 (DOI: 10.1111/2041-210X.12372).
That’s it! You can contact me with any suggestions or concerns, but first remember to check for updates to the package or to this tutorial at http://fuzzysim.r-forge.r-project.org. This tutorial was built with RStudio + rmarkdown + knitr. Thanks!