Category Archives: statistics

Adjusted R2 in partial constrained ordination: the difference between R (vegan) and CANOCO 5

Recently I used R package vegan to recalculate several partial RDA analyses, which were originally calculated in CANOCO 5 (C5). Partial RDA (pRDA) is a constrained ordination which contains both explanatory variables and covariables. To my surprise, the values of explained variation (R2) and adjusted explained variation (adjR2) differ between vegan and C5, with values in vegan being systematically lower. Note that this difference will occur only in case of partial constrained ordination – if you do not have covariables, then R2 and adjR2 in vegan will be the same as R2 and adjR2 in C5.

After some searching and a help from Petr Šmilauer, I realized that vegan and C5 differ in the way how they calculate R2 and adjR2. In R, vegan is calculating so called semipartial R2, in which the denominator is total inertia only (i.e. pure variation explained by explanatory variables / total inertia). On the other hand, C5 calculates so called partial R2, in which variation explained by explanatory variables is divided by total inertia without the effect of covariables (i.e. pure variation explained by explanatory variables / (total inertia – variation explained by covariables)). Note that “pure variation explained by explanatory variables” is the amount of variation explanatory variables explain after removing the effect of covariables. More on this e.g. in Legendre & Legendre (2012), page 651.

In case of adjusted R2, the situation is also different, and slightly more complex. In both programs adjusted R2 is calculated using Ezekiel’s formula adjR2 = 1 – (1 – R2)*(n – 1)/(nm – 1), where n is number of samples in the dataset, and m is number of variables (note that if some of these variables is a factor, it brings “number of factor levels – 1” degrees of freedom).

In vegan, adjusted R2 of explained variation after removing the effect of covariables is calculated as fraction [a] in variance partitioning between explanatory variables and covariables, where [a + b] is variation explained by explanatory variables, and [b + c] variation explained by covariables. Then, adjR2[a] = adjR2[a+b+c] – adjR2[b+c], i.e. adjR2 of model containing both explanatory variables and covariables minus adjR2 of model containing only covariables. If p is the number of covariables (or number of degrees of freedom relevant to covariables, in case that some of them is a factor), than

adjR2[a] = adjR2[a+b+c] – adjR2[b+c] = (1 – (1 – R2[a+b+c])*(n – 1)/(nmp – 1)) – (1 – (1 – R2[b+c])*(n – 1)/(np – 1))

In C5, the adjusted R2 explained by explanatory variables after considering covariables is not calculated by subtracting adjR2 of the covariables from the full model, but directly by adjusting the original R2 values using modified number of degrees of freedom in the formula. Hence,

adjR2 = 1 – (1 – R2)*(np – 1)/(nmp – 1)

which means that the number of degrees for covariables is subtracted at both numerator and denominator of the right hand formula side.

Let’s see the difference on the real example. We will use Vltava dataset, and build the pRDA in both vegan and C5.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
library (vegan)
# download Vltava dataset from http://www.davidzeleny.net/anadat-r/
vltava.spe <- read.delim ('http://www.davidzeleny.net/anadat-r/data-download/vltava-spe.txt', row.names = 1)
vltava.env <- read.delim ('http://www.davidzeleny.net/anadat-r/data-download/vltava-env.txt')
vltava.spe.hell <- decostand (vltava.spe, method = 'hell')
 
# Build partial RDA model with
pRDA <- rda (vltava.spe.hell ~ pH + ASPSSW + Condition (SOILDPT + SLOPE),
data = vltava.env)
 
pRDA
# Call: rda(formula = vltava.spe.hell ~ pH + ASPSSW + Condition(SOILDPT + SLOPE),
#           data = vltava.env)
#
# Inertia Proportion Rank
# Total         0.71576    1.00000
# Conditional   0.04264    0.05957    2
# Constrained   0.06245    0.08725    2
# Unconstrained 0.61067    0.85318   92
# Inertia is variance
#
# Eigenvalues for constrained axes:
#   RDA1    RDA2
# 0.03810 0.02435
#
# Eigenvalues for unconstrained axes:
#   PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
# 0.06648 0.06138 0.04468 0.03336 0.02994 0.02693 0.02328 0.02114
# (Showed only 8 of all 92 unconstrained eigenvalues)
 
RsquareAdj (pRDA)
# $r.squared
# [1] 0.0872482
#
# $adj.r.squared
# [1] 0.07016251
 
# First, let's define variables with explained variations:
constrained <- pRDA$CCA$tot.chi  # 0.06244857 - variation explained by
# explanatory variables after removing the effect of covariables
conditional <- pRDA$pCCA$tot.chi # 0.04263834 - variation explained by
# covariables
total <- pRDA$tot.chi # 0.7157577 - total variation of the compositional dataset
 
# In R, the R2 is calculated as constrained variation / total inertia:
constrained / total
# [1] 0.0872482
 
# and adjusted R2 is calculating by subtracting adjR2 explained by covariables
# from adjR2 explained by both explanatory variables and covariables:
R2abc <- (constrained + conditional) / total
R2bc <- conditional / total
n <- nrow (vltava.spe) # number of samples in the dataset
m <- 2 # number of explanatory variables
p <- 2 # number of covariables
 
adjR2abc <- 1 - (1 - R2abc)*(n - 1)/(n - m - p - 1)
adjR2bc <- 1 - (1 - R2bc)*(n - 1)/(n - p - 1)
adjR2a <- adjR2abc - adjR2bc
adjR2a
# [1] 0.07016251

In CANOCO 5, if the same dataset and the same variables and covariables are used in the same pRDA analysis on Hellinger transformed data, the results are different (if you have CANOCO 5 installed on your computer, download and open this file):

Analysis 'Constrained-partial'
Method: partial RDA
Partial variation is 64.61974, explanatory variables account for   9.3%
(adjusted explained variation is   7.3%)

Summary Table:
  Statistic    Axis 1    Axis 2    Axis 3    Axis 4
Eigenvalues    0.0532    0.0340    0.0929    0.0858
Explained variation (cumulative)    5.66    9.28    19.15    28.27
Pseudo-canonical correlation    0.7624    0.6491    0.0000    0.0000
Explained fitted variation (cumulative)    61.01    100.00        

Permutation Test Results:
  On All Axes    pseudo-F=4.7, P=0.001

As you can see, R2 is 9.3%, while adjR2 is 7.3%, in contrast to vegan‘s 8.7% and 7.0% for R2 and adjR2, respectively. The sum of eigenvalues for the two constrained axes is 0.0532+0.0340 = 0.0872, the same value as the R2 in vegan, but in the header, C5 reports higher values. These are calculated in the following way (continuing the R code above):

1
2
3
4
5
6
7
8
# R2 in CANOCO as
R2.C5 <- constrained / (total - conditional)
R2.C5
# [1] 0.09277488 - after rounding 9.3%
 
# adjR2 in CANOCO as
1 - (1 - R2.C5)*(n - p - 1)/(n - m - p - 1)
# [1] 0.0730526 - after rounding 7.3

Note that the other parameters are identical in both vegan and CANOCO (eigenvalues, F-value and also the significance).

In the end, I wrote this quick-and-dirty hack of the vegan‘s RsquareAdj function, which includes argument type with the values either “partial” (vegan’s version) or “semipartial” (C5 version):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
RsquareAdj2.rda <- function (x, type = 'semipartial', ...)
{
m <- x$CCA$qrank
n <- nrow(x$CCA$u)
if (is.null(x$pCCA)) {
R2 <- x$CCA$tot.chi/x$tot.chi
radj <- RsquareAdj(R2, n, m)
}
else if (type == 'semipartial') {
R2 <- x$CCA$tot.chi/x$tot.chi
R2p <- x$pCCA$tot.chi/x$tot.chi
p <- x$pCCA$rank
radj <- RsquareAdj(R2 + R2p, n, m + p) - RsquareAdj(R2p, n, p)
} else if (type == 'partial'){
R2 <- x$CCA$tot.chi/(x$tot.chi - x$pCCA$tot.chi)
p <- x$pCCA$rank
radj <- 1 - (1 - R2)*(n - p - 1)/(n - m - p - 1)
if (any(na <- m >= n - 1)) radj[na] <- NA
}
list(r.squared = R2, adj.r.squared = radj, type = type)
}

Quick check whether it works:

1
2
RsquareAdj2.rda (pRDA, type = 'semipartial')  # 0.08724 vs 0.0701
RsquareAdj2.rda (pRDA, type = 'partial')      # 0.09277 vs 0.07305

Looks like it does. Note, however, if you want to use it, it needs to be further tested (e.g. with datasets of environmental variables include also factors as explanatory variables or covariables).

TWINSPAN in R

TWINSPANTWINSPAN is perhaps one of the most popular clustering methods (at least among vegetation ecologists), which is not implemented in R. R-sig-eco forum has several posts (mostly from Jari Oksanen and Dave Roberts) on the topic of TWINSPAN in R, where they described difficulties with importing the original TWINSPAN code (written in FORTRAN) into R. Seems that both Jari and Dave spent considerable effort trying to implement the method into R, but seems like there is some problem which is not easy to crack.

From my experience (and I guess also from experience of many others, not only vegetation ecologists), TWINSPAN does sometimes give rather nice and ecologically meaningful results, since it is based on cutting the data along the main compositional gradients. There is yet another divisive method, somewhat analogous to TWINSPAN, called DIANA (DIvisive ANAlysis clustering), which was proposed by Macnaughton-Smith et al. (1964), described in detail by Kaufman & Rousseeuw (1990), and made available in series of Fortran written programs with poetic names like AGNES, CLARA, DAISY, DIANA, or FANNY (later implemented into R package cluster). But it seems to me that this method for some reason never gained so much popularity like TWINSPAN did, mostly perhaps due to its sensitivity to outlying samples (which will be separated first into one-item clusters). This is why I think it still would be nice to have TWINSPAN handy in R, also because of its modified version, which we introduced couple years ago (Roleček et al. 2009) and which up to now is available only in JUICE, program for editing and analysis of vegetation data developed by Lubomír Tichý (Tichý 2002). Modified TWINSPAN is basically a sequence of divisions calculated by standard TWINSPAN, each time applied on the most heterogeneous group – here is again a similarity with DIANA, which also in each step divides the group which is the most compositionally heterogeneous.

Recently, I got a simple idea, how to make TWINSPAN work in R, and how to extend this implementation also for modified TWINSPAN. The way I used is somewhat clumsy, but it seems to work. I used the twinspan.exe file, which is an executable program based on original FORTRAN library written by Mark O. Hill (author of the algorithm and the original FORTRAN code) and compiled in 2003 by Stephan M. Hennekens into a stand-alone program for use in MEGATAB, software which before was used together with TURBOVEG for editing and classification of vegetation data (Hennekens & Schaminée 2001, Schaminée & Hennekens 2001). I created R package twinspanR, which includes twinspan.exe, and added bunch of supporting functions to maintain import and export of data to twinspan.exe and back to R. So basically the TWINSPAN is calculated by the original Hill’s algorithm, and R functions in twinspanR package are for handling the whole thing conveniently in R (see the notes below for more technical details how the library exactly works).

For details how to install the library from GitHub see the Readme.md file. Some more examples how to use twinspan package will be (hopefully) made soon available on my website for analysis of community ecology data in R.

The following is an example code, using example dataset danube with Ellenberg’s meadow dataset (used also as an example in the first publication of TWINSPAN and DCA by Hill 1979 and Hill & Gauch 1980, respectively). In this example I used modified TWINSPAN (Roleček et al. 2009) with division into 4 groups and heterogeneity of clusters measured by Bray-Curtis dissimilarity measure. I projected the results into DCA ordination diagram, along to the original tabular classification made manually by Ellenberg (from Mueller-Dombois & Ellenberg 1974):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
## Modified TWINSPAN on traditional Ellenberg's Danube meadow dataset,
## projected on DCA and compared with original classification into
## three vegetation types made by tabular sorting:
library (twinspanR)
library (vegan)
data (danube)
res <- twinspan (danube$spe, modif = TRUE, clusters = 4)
k <- cut (res)
 
dca <- decorana (danube$spe)
par (mfrow = c(1,2))
ordiplot (dca, type = 'n', display = 'si', main = 'Modified TWINSPAN')
points (dca, col = k)
for (i in c(1,2,4)) ordihull (dca, groups = k, show.group = i, col = i,
 draw = 'polygon', label = TRUE)
ordiplot (dca, type = 'n', display = 'si', main = 'Original assignment\n (Ellenberg 1954)')
points (dca, col = danube$env$veg.type)
for (i in c(1:3)) ordihull (dca, groups = danube$env$veg.type,
 show.group = unique (danube$env$veg.type)[i], col = i,
 draw = 'polygon', label = TRUE)

example twinspan figure

Some technical details how the twinspanR package works

Some further details how it works. The executable file, twinspan.exe, is stored in \exec subdirectory of the library. There is also tw.bat file, which launches twinspan.exe and feeds it with data from R (I have shamelessly stolen this idea from the way how Lubomír Tichý executes TWINSPAN in JUICE, and it is also similar to the way how Tom August implemented another Hill’s FORTRAN library, FRESCALO (Hill 2011), into R as a function frescalo in package sparta). I take compositional data in R, transform them to required *.cc! format (luckily there is a function write.CEP in rioja package, written by Steve Juggins) and save them to \exec subdirectory (where also twinspan.exe is located). Then, I create file tw.dat with input parameters for twinspan.exe (using function create.tw.dat), and use the shell function in R to launch the tw.bat file. All calculations are done using original twinspan.exe, R just reads its output from tw.PUN file. It’s in a no way an elegant approach, but works just fine.

The twinspan.exe file used for calculation of TWINSPAN in R is taken from the distribution of JUICE program (Tichý 2002). It is the original FORTRAN code written by Mark O. Hill, compiled (cca in 2002) by Stephan M. Hennekens into twinspan.exe for use in his program MEGATAB (which was used together with Turboveg for analysis of community data). The version of twinspan.exe used here implements the changes by Petr Šmilauer, related mainly to problems with algorithm convergence, which cause the results being dependent on the order of samples in the input data table. Note that this algorithm is slightly different from TWINSPAN implemented in WinTWINS software (Hill & Šmilauer 2005), which implements also other modifications by Birks and ter Braak. In fact, there are at least four different versions of TWINSPAN recently used for analysis, and in certain circumstances they differ in the results how they classify the samples (when I have time, I will elaborate this topic further).

Unfortunately, the presence of twinspan.exe file in the twinspanR library is a problem for its portability – seems like CRAN doesn’t allow packages with executables inside (for understandable security reasons), and R-Forge allows it to be uploaded, but fails to build it. For now, the channel for distribution of this package is GitHub, and it will perhaps remain there until some other solution will show up. Simply, it’s just a quick and dirty way how to get TWINSPAN functionality in R without too much hassle, before somebody manages to write fully functional implementation of TWINSPAN in R.

References

  • Hennekens S.M. & Schaminée J.H.J. (2001): TURBOVEG, a comprehensive data base management system for vegetation data. Journal of Vegetation Science, 12: 589-591.
  • Hill M.O. (1979): TWINSPAN: A FORTRAN Program for Arranging Multivariate Data in an Ordered Two-way Table by Classification of the Individuals and Attributes. Cornell University, Ithaca, NY.
  • Hill M.O. & Šmilauer P. (2005): TWINSPAN for Windows version 2.3. Centre for Ecology and Hydrology & University of South Bohemia, Huntingdon & České Budějovice.
  • Hill M.O. & Gauch H.G. (1980): Detrended correspondence analysis: An improved ordination technique. Vegetatio, 42: 47-58.
  • Kaufman L. & Rousseeuw P. J. (2009): Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
  • Macnaughton-Smith P., Williams W. T., Dale M. B. & Mockett L. G. (1964): Dissimilarity analysis: a new technique of hierarchical sub-division. Nature, 202: 1034-1035.
  • Mueller-Dombois D. & Ellenberg H. (1974): Aims and Methods of Vegetation Ecology. John Wiley & Sons, New York, Chichester, Brisbane, Toronto.
  • Roleček J., Tichý L., Zelený D. & Chytrý M. (2009): Modified TWINSPAN classification in which the hierarchy respects cluster heterogeneity. Journal of Vegetation Science, 20:596-602.
  • Schaminée J.H.J & Hennekens S.M. (2001): TURBOVEG, MEGATAB und SYNBIOSYS: neue Entwicklungen in der Pflanzensoziologie. Ber. Reinhold-Tüxen Ges., 13: 21-34.
  • Tichý L. (2002): JUICE, software for vegetation classification. Journal of Vegetation Science, 13: 451-453.