fuzzySim is an R package for computing fuzzy similarity in species distributions. It can also produce the underlying fuzzy distribution data, e.g. through interpolating or modelling species presence-absence data.

Installing and loading fuzzySim

fuzzySim works within the free and open-source R statistical software, so you first need to download, install and open R (available at http://www.r-project.org). In this tutorial, in monospaced font are the commands that you need to type (or copy and paste) into the R console (and then press the enter key to execute them). For commands that generate visible results in R, these are usually shown below them, preceded by hash marks (##). Note that all commands are case-sensitive, so you must respect upper- and lower-case letters; that you must always use straight (', ") rather than curly quotes and apostrophes; and that R is only ready to receive a new command when there’s a prompt sign (>) at the end of the R console; if not, it’s still waiting for an operation to be finished or for you to complete a previous command – watch out for unclosed parentheses or such, and press Esc for a fresh prompt.

You can install the latest CRAN version of fuzzySim by pasting the command below in the R console (when connected to the internet). You may need to provide additional information if R asks you, such as selecting a CRAN mirror to download from (choose any).

install.packages("fuzzySim")

If you instead need the latest development version of the package, install it from R-Forge:

install.packages("fuzzySim", repos = "http://R-Forge.R-project.org")

You only need to install the package once (unless a new version becomes available), but you need to load it every time you open a new R session in which you intend to use fuzzySim (no need for an internet connection anymore), by pasting the following command in R:

library(fuzzySim)

Preparing the species occurrence data

NOTE: This package works mainly with species occurrence data in presence-absence tables. If your data are point occurrences and raster maps, check out this blog post on how to convert them into tabular presence-absence data.

Load the rotifers sample dataset that comes with the fuzzySim package, to use as an example:

data(rotifers)

This dataset was originally published by Fontaneto, Barbosa, Segers & Pautasso in Ecography. You can get more information on the data (the following command should open an R Documentation window):

help(rotifers)

Show the first 10 rows of the rotifers dataset:

head(rotifers, 10)
##     TDWG4                          species
## 1  DEN-OO Brachionus_plicatilis_plicatilis
## 11 DEN-OO  Keratella_cochlearis_cochlearis
## 15 DEN-OO      Keratella_quadrata_quadrata
## 16 DEN-OO             Asplanchna_priodonta
## 17 DEN-OO   Brachionus_angularis_angularis
## 19 DEN-OO                Filinia_longiseta
## 24 DEN-OO          Brachionus_calyciflorus
## 32 FIN-OO             Asplanchna_priodonta
## 37 FIN-OO  Keratella_cochlearis_cochlearis
## 46 FIN-OO          Brachionus_calyciflorus

The first column contains the identifiers of the spatial units, which are TDWG level 4 region codes, and the second column contains the (sub)species names. These are a bit long, especially if we intend to use them as column names further on, so use the spCodes function to add to the rotifers dataset a column named spcode with species name abbreviations, consisting of the first letter of the genus + the first 5 letters of the specific name. Specify that the character separating words in the input species names is an underscore (_, as you’ve seen in the table above), and that the character you want separating the genus from the specific name code is empty (no separator):

rotifers$spcode <- spCodes(rotifers$species, sep.species = "_", nchar.gen = 1, nchar.sp = 5, nchar.ssp = 0, sep.spcode = "")
## OK - no duplicated spcodes found.

You can try the above with different options for nchar.gen, nchar.sp and nchar.ssp; the function will return an error message if the resulting codes are not unique for each species. Find out more details on this function with help(spCodes).

Now show the first 10 rows of the rotifers dataset after you’ve added the spcode column:

head(rotifers, 10)
##     TDWG4                          species spcode
## 1  DEN-OO Brachionus_plicatilis_plicatilis Bplica
## 11 DEN-OO  Keratella_cochlearis_cochlearis Kcochl
## 15 DEN-OO      Keratella_quadrata_quadrata Kquadr
## 16 DEN-OO             Asplanchna_priodonta Apriod
## 17 DEN-OO   Brachionus_angularis_angularis Bangul
## 19 DEN-OO                Filinia_longiseta Flongi
## 24 DEN-OO          Brachionus_calyciflorus Bcalyc
## 32 FIN-OO             Asplanchna_priodonta Apriod
## 37 FIN-OO  Keratella_cochlearis_cochlearis Kcochl
## 46 FIN-OO          Brachionus_calyciflorus Bcalyc

The rotifers dataset is in long format, listing in the same column the species that are present in each spatial unit. For analyzing distributional relationships with fuzzySim, we need a presence-absence table with species in separate columns (wide format). So, create a new table called rotifers.presabs with the rotifers presence-absence data converted to wide format and using spcodes as column names:

rotifers.presabs <- splist2presabs(rotifers, sites.col = "TDWG4", sp.col = "spcode", keep.n = FALSE)

Show the first rows of the result:

head(rotifers.presabs)
##    TDWG4 Abrigh Afissa Apriod Bangul Bcalyc Bplica Bquadr Burceo Cgibba Edilat
## 1 ABT-OO      0      0      1      1      0      0      0      0      0      0
## 2 AFG-OO      1      0      1      1      1      1      1      1      1      0
## 3 AGE-BA      1      1      0      1      1      1      1      1      0      1
## 4 AGE-CD      0      0      0      1      0      1      1      0      0      1
## 5 AGE-CH      0      0      0      1      0      0      1      0      1      0
## 6 AGE-CN      0      0      0      0      1      1      1      0      0      1
##   Flongi Kcochl Kquadr Ktropi Lbulla Lclost Lhamat Lluna Llunar Lovali Lpatel
## 1      1      1      1      0      0      0      0     1      1      1      0
## 2      0      1      1      1      1      1      1     1      0      1      1
## 3      1      1      1      1      1      1      1     1      1      1      0
## 4      1      1      0      0      1      1      0     1      1      1      0
## 5      1      0      0      1      1      1      1     0      1      1      0
## 6      1      0      1      0      0      0      0     1      0      1      0
##   Lquadr Mventr Ppatul Pquadr Pvulga Specti Tpatin Tsimil Ttetra
## 1      0      0      1      0      1      1      0      0      1
## 2      1      0      1      1      1      0      1      1      0
## 3      1      1      1      1      1      0      1      0      1
## 4      1      1      0      0      1      0      0      1      0
## 5      1      1      1      1      1      0      0      1      1
## 6      0      1      1      1      1      0      0      1      0

Mapping the species occurrence data

You can map these data if you have a map of the same spatial units and with the same unit identifiers; the TDWG maps are available online. You can use R commands to download the TDWG level 4 map into your current R session. This requires the terra package; the following command will install terra within your R installation if it’s not there already (and if you’re connected to the internet).

if (!("terra" %in% .packages(all.available = TRUE)))
  install.packages("terra")

Now load the terra package and create a map named tdwg by importing the online file, using the vect function of terra:

library(terra)
tdwg <- vect("https://raw.githubusercontent.com/tdwg/wgsrpd/master/geojson/level4.geojson")

The map is now in your R session. Now add the rotifers.presabs data to the tdwg map’s attribute table within R, matching them by the name of the column containing the region identifiers in each table, and keeping all rows of tdwg in the final result:

tdwg <- merge(tdwg, rotifers.presabs, 
              by.x = "Level4_cod", by.y = "TDWG4", 
              all.x = TRUE)

You are now ready to map the presence-absence of particular species from the rotifers.presabs table:

plot(tdwg, y = "Abrigh", col = rev(heat.colors(2)), main = "Abrigh occurrence records")

Try this also with y = other species in names(tdwg). As you can see in the maps and in the rotifers.presabs table, these are binary (either 0 or 1) presence-absence data, with missing values in regions with no rotifer occurrence records in this dataset.

Converting binary to fuzzy occurrence data

You can get a fuzzy (continuous between 0 and 1) version of presence-absence data using e.g. trend surface analysis or inverse distance interpolation. These require the spatial coordinates of each study unit, which the rotifers table doesn’t have. So, load the rotif.env sample dataset that also comes with the fuzzySim package, which is already in wide format and contains the geographical coordinates of the centroid of each TDWG4 unit:

data(rotif.env)

Take a look at its first rows (results not shown here):

head(rotif.env)

Show the column names of this dataset, to see which columns contain the species data and which contain the coordinates:

names(rotif.env)
##  [1] "TDWG4"                    "LEVEL_NAME"              
##  [3] "REGION_NAME"              "CONTINENT"               
##  [5] "Area"                     "Altitude"                
##  [7] "AltitudeRange"            "HabitatDiversity"        
##  [9] "HumanPopulation"          "Latitude"                
## [11] "Longitude"                "Precipitation"           
## [13] "PrecipitationSeasonality" "TemperatureAnnualRange"  
## [15] "Temperature"              "TemperatureSeasonality"  
## [17] "UrbanArea"                "Abrigh"                  
## [19] "Afissa"                   "Apriod"                  
## [21] "Bangul"                   "Bcalyc"                  
## [23] "Bplica"                   "Bquadr"                  
## [25] "Burceo"                   "Cgibba"                  
## [27] "Edilat"                   "Flongi"                  
## [29] "Kcochl"                   "Kquadr"                  
## [31] "Ktropi"                   "Lbulla"                  
## [33] "Lclost"                   "Lhamat"                  
## [35] "Lluna"                    "Llunar"                  
## [37] "Lovali"                   "Lpatel"                  
## [39] "Lquadr"                   "Mventr"                  
## [41] "Ppatul"                   "Pquadr"                  
## [43] "Pvulga"                   "Specti"                  
## [45] "Tpatin"                   "Tsimil"                  
## [47] "Ttetra"

You can see that species are in columns 18 to 47 and geographical coordinates are in columns 10 and 11. You can use either the names or the index numbers of these columns in the multTSA and distPres functions below. Mind that the coordinates must be specified to the function in the order x, y, i.e. Longitude, Latitude!

First, try a multiple trend surface analysis (TSA) for all species using a 3rd-degree polynomial with stepwise selection of terms:

spcols <- 18:47

rotifers.tsa <- multTSA(rotif.env, sp.cols = spcols, coord.cols = c("Longitude", "Latitude"), id.col = 1, degree = 3, step = TRUE)

You can find out more about trend surface analysis and the different options of the multTSA function by reading the help file that appears when you type help(multTSA). Now look at the first rows of the rotifers.tsa table created just above:

head(rotifers.tsa)

Results (not shown here) are continuous values representing the spatial trend in each species’ occurrence. You can confirm that they are bounded between 0 and 1 (so that they can be used in fuzzy logic) by checking the range of values in all columns except the first one (which contains region identifiers rather than species data):

range(rotifers.tsa[ , -1])
## [1] 9.408071e-05 9.996004e-01

Now add the TSA results to the tdwg map table:

tdwg <- merge(tdwg, rotifers.tsa, 
              by.x = "Level4_cod", by.y = "TDWG4", 
              all = TRUE)

Plot the results for the first species (and then others as you like; check names(tdwg) for available y options):

plot(tdwg, y = "Abrigh_TS", col = rev(heat.colors(256)), main = "Abrigh TSA")

The TSA depicts a general spatial trend in a species’ occurrence, but this may not be a faithful representation of its (fuzzy) occurrence area (compare with the presence-absence map shown before for the same species). You can try multTSA again with different polynomial degrees, with or without stepwise selection; or you can compute inverse distance to presence to use instead of TSA:

rotifers.invdist <- distPres(rotif.env, sp.cols = spcols, coord.cols = c("Longitude", "Latitude"), id.col = 1, p = 1, inv = TRUE, suffix = "_D")
## Warning in distPres(rotif.env, sp.cols = spcols, coord.cols = c("Longitude", :
## Distances may be inaccurate when CRS is not provided.

You can check help(distPres) for more information and options for this function (for example, you may want to use p=2 for a more conservative squared distance, especially if your spatial units are smaller). Check out the first rows of the resulting table (results not shown here):

head(rotifers.invdist)

You can check that the values in this table (excluding the first column, which has the region identifiers) also range between 0 and 1, so they can be used with fuzzy logic:

range(rotifers.invdist[ , -1])
## [1] 0 1

Note that inverse distance to presence is computed only for absence localities; distPres maintains the value 1 for presences. Now add these distances to the tdwg map table and plot the first species (then try other species as well):

tdwg <- merge(tdwg, rotifers.invdist, 
              by.x = "Level4_cod", by.y = "TDWG4", 
              all = TRUE)

plot(tdwg, y = "Abrigh_D", col = rev(heat.colors(256)), main = "Abrigh inverse distance")

This seems a more faithful portrait of our species’ distribution than the TSA. However, note that distance is also not always a good fuzzy representation of a species’ occurrence area, as geographical and environmental barriers may cause sharp local variations in species’ occurrence patterns.

Another way of obtaining fuzzy versions of species occurrence is to build distribution models based on the relationship between species presence/absence and a set of geographical, environmental and/or human variables. This can be done for multiple species simultaneously with the multGLM function and the variables in rotif.env. You must specify the name of the dataset and the names or index numbers of the columns containing the species data and the variables. There are a number of options on how to select variables for the models – you can type help(multGLM) or read this modelling tutorial for further details. Here, we’ll just see how to create models with the multGLM default options:

rotifers.fav <- multGLM(data = rotif.env, sp.cols = spcols, var.cols = 5:17, id.col = 1)

The object returned by multGLM is a list containing three elements: a dataframe with the resulting predictions, a list of models, and a list of vectors naming the variables included in each model. Check out the first rows of the predictions dataframe (results not shown here):

head(rotifers.fav$predictions)

Predicted presence probability is in the columns with suffix P, but here we’re more interested in the columns with suffix F, containing environmental favourability values, which are directly comparable among species (whereas probability is affected by species prevalence; see Details in help(Fav) for more info).

names(rotifers.fav$predictions)
##  [1] "TDWG4"    "Abrigh_P" "Afissa_P" "Apriod_P" "Bangul_P" "Bcalyc_P"
##  [7] "Bplica_P" "Bquadr_P" "Burceo_P" "Cgibba_P" "Edilat_P" "Flongi_P"
## [13] "Kcochl_P" "Kquadr_P" "Ktropi_P" "Lbulla_P" "Lclost_P" "Lhamat_P"
## [19] "Lluna_P"  "Llunar_P" "Lovali_P" "Lpatel_P" "Lquadr_P" "Mventr_P"
## [25] "Ppatul_P" "Pquadr_P" "Pvulga_P" "Specti_P" "Tpatin_P" "Tsimil_P"
## [31] "Ttetra_P" "Abrigh_F" "Afissa_F" "Apriod_F" "Bangul_F" "Bcalyc_F"
## [37] "Bplica_F" "Bquadr_F" "Burceo_F" "Cgibba_F" "Edilat_F" "Flongi_F"
## [43] "Kcochl_F" "Kquadr_F" "Ktropi_F" "Lbulla_F" "Lclost_F" "Lhamat_F"
## [49] "Lluna_F"  "Llunar_F" "Lovali_F" "Lpatel_F" "Lquadr_F" "Mventr_F"
## [55] "Ppatul_F" "Pquadr_F" "Pvulga_F" "Specti_F" "Tpatin_F" "Tsimil_F"
## [61] "Ttetra_F"

You can see that favourability values are in columns 32 to 61 of the rotifers.fav$predictions table. Now add these values to the tdwg map table and plot e.g. favourability for the first species:

tdwg <- merge(tdwg, rotifers.fav$predictions, 
              by.x = "Level4_cod", by.y = "TDWG4", 
              all = TRUE)

plot(tdwg, y = "Abrigh_F", 
     col = rev(heat.colors(256)), 
     main = "Abrigh favourability")

Computing (fuzzy) similarity among species occurrence patterns

Let’s now use these favourability model predictions as our fuzzy occurrence values. First, get a matrix of pair-wise fuzzy similarity among these fuzzy species’ distributions, by comparing these columns with e.g. the fuzzy Jaccard similarity index. Since fuzzySim version 4.24, we can use the argument plot = TRUE to get also a quick viz of which species pairs are more and less similar in distribution. You can optionally provide some extra arguments to customize this plot, such as main = "Fuzzy similarity matrix" or col = hcl.colors(100, palette = "blues").

fuz.sim.mat <- simMat(rotifers.fav$predictions[ , 32:61], method = "Jaccard", plot = TRUE)

Let’s take a look at the first rows of the resulting fuzzy similarity matrix (results not shown here):

head(fuz.sim.mat)

You can also build a cluster dendrogram from the similarity matrix, using the hclust R function. The clustering method requires a distance matrix, so our similarity matrix is subtracted from 1 in the command below. The method=“average” argument means that UPGMA is the clustering algorithm, but you can check for other options with help(hclust):

fuz.dendro <- hclust(as.dist(1 - fuz.sim.mat), method = "average")

plot(fuz.dendro, main = "Fuzzy dendrogram", xlab = "", ylab = "", sub = "Fuzzy Jaccard index\nUPGMA clustering")

You can also build a similarity matrix from the original binary presence-absence data, to compare with the fuzzy similarity results. First, check again the column names of rotif.env, to see where the species columns are, so that you can specify them correctly to the simMat function:

names(rotif.env)
##  [1] "TDWG4"                    "LEVEL_NAME"              
##  [3] "REGION_NAME"              "CONTINENT"               
##  [5] "Area"                     "Altitude"                
##  [7] "AltitudeRange"            "HabitatDiversity"        
##  [9] "HumanPopulation"          "Latitude"                
## [11] "Longitude"                "Precipitation"           
## [13] "PrecipitationSeasonality" "TemperatureAnnualRange"  
## [15] "Temperature"              "TemperatureSeasonality"  
## [17] "UrbanArea"                "Abrigh"                  
## [19] "Afissa"                   "Apriod"                  
## [21] "Bangul"                   "Bcalyc"                  
## [23] "Bplica"                   "Bquadr"                  
## [25] "Burceo"                   "Cgibba"                  
## [27] "Edilat"                   "Flongi"                  
## [29] "Kcochl"                   "Kquadr"                  
## [31] "Ktropi"                   "Lbulla"                  
## [33] "Lclost"                   "Lhamat"                  
## [35] "Lluna"                    "Llunar"                  
## [37] "Lovali"                   "Lpatel"                  
## [39] "Lquadr"                   "Mventr"                  
## [41] "Ppatul"                   "Pquadr"                  
## [43] "Pvulga"                   "Specti"                  
## [45] "Tpatin"                   "Tsimil"                  
## [47] "Ttetra"

Now compute the binary similarity matrix from the columns containing the species occurrence data:

bin.sim.mat <- simMat(rotif.env[ , spcols], method = "Jaccard") 

You can try and repeat the operations exemplified before, but replacing fuz.sim.mat with bin.sim.mat, to visualize the binary similarity matrix and the resulting dendrogram. You can also compare the fuzzy and binary similarity matrices using the mantel function of the vegan package. If you want to do this, the command below will install vegan if you don’t already have it within your R installation:

if (!("vegan" %in% .packages(all.available = TRUE))) install.packages("vegan")

Now load vegan into the current R session and compute the Mantel correlation between the two matrices (type help(mantel) for more info and options):

library(vegan) 
mantel(bin.sim.mat, fuz.sim.mat, method = "spearman")
## 
## Mantel statistic based on Spearman's rank correlation rho 
## 
## Call:
## mantel(xdis = bin.sim.mat, ydis = fuz.sim.mat, method = "spearman") 
## 
## Mantel statistic r: 0.6778 
##       Significance: 0.001 
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.177 0.229 0.270 0.358 
## Permutation: free
## Number of permutations: 999

Computing (fuzzy) similarity among regional species pools

Besides comparing species according to their (fuzzy) occurrence patterns, you can also compare regions according to their (fuzzy) species composition. For this, you only need to transpose the (fuzzy) species occurrence matrix, so that regions go in columns and species in rows. With the transpose function of fuzzySim, you can do this directly from the complete tables, specifying which columns contain the data to transpose, and which column contains the region names to use as column names in the transposed table:

names(rotif.env)

bin.reg <- transpose(rotif.env, sp.cols = spcols, reg.names = 1)

names(rotifers.fav$predictions)

fuz.reg <- transpose(rotifers.fav$predictions, sp.cols = 32:61, reg.names = 1)

Look at the first rows of the resulting tables (results not shown here):

head(bin.reg)

head(fuz.reg)

Now create the pair-wise similarity matrices for both binary and fuzzy species composition in these regions. These matrices will take longer to compute because there are (in this dataset) many more regions than species, so there are many more pair-wise comparisons to make now:

bin.reg.sim.mat <- simMat(bin.reg, method = "Jaccard")

fuz.reg.sim.mat <- simMat(fuz.reg, method = "Jaccard")

Then you can proceed as you did before with the species distributional similarity matrices, to plot, compare and build cluster dendrograms of these data.

Further steps

Fuzzy similarity matrices can be entered in the RMACOQUI package for a systematic analysis of chorotypes (significant clusters of species distribution types) or of biotic / biogeographic regions (significant clusters of regional species compositions).

Using your own data

If you want to try this out with your own data, get your table (with column names in the first row) in a text file named mydata.txt and separated by tabs, save it in your R working directory (type getwd to find out where it is), and then import it to R using the following command:

mydata <- read.table("mydata.txt", header = TRUE, sep = "\t")

The sep = "\t" argument above indicates that the columns in the text file are separated by tabulators, but you can specify other separators as necessary, such as spaces (sep = " "), commas (sep = ",") or semicolons (sep = ";").

As noted above, if your data are point occurrences and raster maps, check out this blog post on how to convert them to tabular presence-(pseudo)absence data.

Then reproduce all the operations above, but replacing rotifers or rotif.env (depending on if your data are in long or wide format, respectively) with mydata (or whatever name you’ve assigned in the command above) and specifying column names or numbers accordingly.

If you use fuzzySim in publications, please cite the paper:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858 (DOI: 10.1111/2041-210X.12372).

That’s it! You can contact me if you have any suggestions or concerns, but first remember to check for updates to the package or to this tutorial at http://fuzzysim.r-forge.r-project.org. This tutorial was built with RStudio + rmarkdown + knitr. Thanks!