Simpson’s similarity index vs Simpson’s diversity index

Since I started to teach multivariate methods in community ecology and additionally to chapters about ordination I also included the chapter about diversity, I found it confusing that there are two “Simpson’s indices,” one for similarity, and one for diversity. Indeed, each index has completely different meaning and use, and I may well be the only one confused by them, but for a while, I had an impression that both indices are somehow related and maybe could be derived from each other. I made some search to clarify it for myself, and it turned out that although both indices have been published in roughly the same time (in the 40’s), each one is attributed to different Mr. Simspon and appeared in quite a different context.

A bit of theory first. Simpson’s similarity index is used to calculate the similarity between a pair of community samples, to quantify whether their species composition is similar (they share most or all the species) or different. It comes in the company with other, similar indices, like Jaccard and Sørensen, which do the same job but with slightly different logic. If we have two samples, there will be some number of species shared between both, and some species occurring only in the first or only in the second sample. Jaccard or Sørensen divides the number of species both samples share by the number of all species occurring in both samples together (and Sørensen cares about shared species more than Jaccard, so it multiplies it twice in both numerator and denominator). The problem occurs if one of the samples has much higher species richness than the other; the number of shared species (in the numerator) may be the same, but the high richness of one of the samples will increase the denominator and decrease the overall index value, making the index losing the resolution power. Here comes the Simpson’s similarity index, which has the same nominator (number of shared species), but in the denominator, it takes only the number of species occurring in the smaller sample. Its use is recommended in situation when there are large differences in species richness between the samples, and in fact it turned out that this index is also the one quantifying the nestedness of two community samples (if the first sample contain the same species as the other one plus some more, the second sample is perfectly nested within the first one, and the value of the Simpson’s similarity index is equal to one).

Simpson’s diversity index is completely different story. It aims to quantify diversity of a single community sample, while considering both species richness (the number of species recorded in the sample) and also the relative species abundances (some species within the sample can be dominant, represented by high number of individuals/higher biomass/larger cover, while some other are less common or rare). Two communities may have the same species richness (i.e. number of species), but if they differ by relative species abundance (i.e. evenness), their overall diversity will be different. If we have two forests, both with 10 species (richness), but first with all species represented by the same number of individuals (perfectly even), while the second with one or two dominant trees (highly uneven), the first will intuitively (and also functionally) be more diverse – a walk through such forest will give you a feeling of diversity, since you keep meeting different trees, while the forest dominated by a single or few species will feel species poor (rare species are also there, so richness is the same, but they are not likely to be encountered as frequently as and also they are not likely to have an important ecological function in the community). Simpson’s diversity index, when calculating the diversity, is weighing the importance of species richness and relative species abundances. It comes in the company of his famous sibling, Shannon index, which does the same thing in a slightly different way. While Simpson’s index cares more about relative abundances, the Shannon index cares more about species richness; or, put in another way, the importance of rare species decreases in order species richness > Shannon index > Simpson index.

Now about the name. Both indices share the author’s surname, but in each case, it was different Mr. Simspon. Simpson’s similarity index was introduced by an American paleontologist George Gaylord Simpson (1902-1984) to solve the situation often encountered in his practice with fossil samples, where some samples are much richer than the others, meaning that conventional similarity indices (like Jaccard and Sørensen) do not really work. In 1943, G.G. Simpson wrote a paper Mammals and the nature of continents and published it in American Journal of Science; later, in 1960, he published (in the same journal) paper Notes on the measurement of faunal resemblance, where he also provided the formula for his index. The index has been rediscovered by Jack J. Lennon et al. in 2001 as βsim (often called beta sim; I suspect “sim” stems from “Simpson”). Note that Lennon et al. (2001) does not talk about “similarity”, but instead they used Simpson’s formula to calculate mean pairwise dissimilarities between focal sample and the other samples; in Koleff et al. (2003), however, beta sim is already considered as a pairwise dissimilarity measure based on the original Simpson’s index (1 – Simpson).

Simpson’s diversity index (also called concentration index) was published by British statistician Edward Hugh Simpson (born 1922, living in Oxfordshire), who is mostly famous for his formulation of Simpson’s paradox. E.H.Simpson published the index in the 1949’s Nature’s paper entitled Measurement of diversity. It quantifies the probability that two individuals took independently (without replacement) from the sample will be of the same species. With increasing diversity the index actually decreases (two individuals have the highest chance to be of the same species if the sample actually has only one species), so more often the one complement (1-Simpson) or reciprocal (1/Simpson) values are used, also known as Gini-Simpson index or Hurlbert’s PIE (probability of interspecific encounter, Hurlbert 1971).

So here we go – no reason to confuse them anymore!


  • Hurlbert, S.H. 1971. The nonconcept of species diversity: A critique and alternative parameters. Ecology, 52: 577–586.
  • Koleff, P. Gaston, K.J. & Lennon, J.J. 2003. Measuring beta diversity for presence-absence data. Journal of Animal Ecology, 72: 367-382.
  • Lennon, J.J., Koleff, P., Greenwood, J.J.D. & Gaston, K.J. 2001. The geographical structure of British bird distributions: diversity, spatial turnover and scale. Journal of Animal Ecology, 70: 966-979.
  • Simpson, E.H. 1949 Measurement of diversity. Nature, 163: 688.
  • Simpson, G.G. 1943. Mammals and the nature of continents. American Journal of Science, 241: 1-31.
  • Simpson, G.G. 1960. Notes on the measurement of faunal resemblance. American Journal of Science, 258-A: 300-311.

VegLab trip to Hehuanshan

October 13-15, 2016, we made our first VegLab field trip – a three-day trip to Hehuanshan to see the high-mountain non-forest vegetation. At the elevation around 3,200 m a.s.l., there is a mosaic of forest (dominated mostly by Abies kawakamii and Tsuga chinensis) and non-forest vegetation, mostly dominated by a bamboo Yushania niitakayamensis, with scattered Juniperus and Rhododendron shrubs, and a number of high-elevation herb species. Along the way, we discussed whether this is really a climatic timberline and what we see are actually alpine grasslands, or whether the non-forest vegetation is more likely due to some major disturbances (erosion, fire?). Several recent studies in this region (actually exactly at the place we hiked) studied the uplift of the timberline as a result of climate change (e.g. Greenwood et al. 2014 in Global Change Biology). But isn’t the advancing forest just a result of succession after disturbance? The key argument would be to know how frequent and how intensive are fires in this region. Is there a way to find details about them?

We slept for two days at the bottom of the valley, hiking along the stream to see the valley vegetation, and hiking uphill to the Yushania grasslands, still with some beautiful flowers. Although the weather was not really user-friendly (foggy the first night, raining hard the last day), I think the trip was successful; now we have a better idea how the timberline in the Hehuanshan region actually looks like.

Vegetation along the stream. Photo: D. Zelený.

Yen-Cheng and Tsung-Yi crossing the stream. The water was cold like hell… Photo: D. Zelený.

Campsite. Nice valley, full moon at night, foggy mornings. Photo: D. Zelený.

Cushion-like Artemisia. Not sure which species. On the gravel bar along the stream. Photo: D. Zelený.

Making relevé. We made a bet how many species will be in the plot 2×2 m large. Yen-Cheng bet 5, Tsung-Yi 10 and David 15. We found 12 – Tsung-Yi was the most close (although David spent considerable time trying to find at least one more species to beat him :). Photo: D. Zelený.

Cooking dinner – rice with beef and vegetable. Photo: D. Zelený.

Finito! Totally wet and hungry, but we made it! Photo: D. Zelený.

Adjusted R2 in partial constrained ordination: the difference between R (vegan) and CANOCO 5

Recently I used R package vegan to recalculate several partial RDA analyses, which were originally calculated in CANOCO 5 (C5). Partial RDA (pRDA) is a constrained ordination which contains both explanatory variables and covariables. To my surprise, the values of explained variation (R2) and adjusted explained variation (adjR2) differ between vegan and C5, with values in vegan being systematically lower. Note that this difference will occur only in case of partial constrained ordination – if you do not have covariables, then R2 and adjR2 in vegan will be the same as R2 and adjR2 in C5.

After some searching and a help from Petr Šmilauer, I realized that vegan and C5 differ in the way how they calculate R2 and adjR2. In R, vegan is calculating so called semipartial R2, in which the denominator is total inertia only (i.e. pure variation explained by explanatory variables / total inertia). On the other hand, C5 calculates so called partial R2, in which variation explained by explanatory variables is divided by total inertia without the effect of covariables (i.e. pure variation explained by explanatory variables / (total inertia – variation explained by covariables)). Note that “pure variation explained by explanatory variables” is the amount of variation explanatory variables explain after removing the effect of covariables. More on this e.g. in Legendre & Legendre (2012), page 651.

In case of adjusted R2, the situation is also different, and slightly more complex. In both programs adjusted R2 is calculated using Ezekiel’s formula adjR2 = 1 – (1 – R2)*(n – 1)/(nm – 1), where n is number of samples in the dataset, and m is number of variables (note that if some of these variables is a factor, it brings “number of factor levels – 1” degrees of freedom).

In vegan, adjusted R2 of explained variation after removing the effect of covariables is calculated as fraction [a] in variance partitioning between explanatory variables and covariables, where [a + b] is variation explained by explanatory variables, and [b + c] variation explained by covariables. Then, adjR2[a] = adjR2[a+b+c] – adjR2[b+c], i.e. adjR2 of model containing both explanatory variables and covariables minus adjR2 of model containing only covariables. If p is the number of covariables (or number of degrees of freedom relevant to covariables, in case that some of them is a factor), than

adjR2[a] = adjR2[a+b+c] – adjR2[b+c] = (1 – (1 – R2[a+b+c])*(n – 1)/(nmp – 1)) – (1 – (1 – R2[b+c])*(n – 1)/(np – 1))

In C5, the adjusted R2 explained by explanatory variables after considering covariables is not calculated by subtracting adjR2 of the covariables from the full model, but directly by adjusting the original R2 values using modified number of degrees of freedom in the formula. Hence,

adjR2 = 1 – (1 – R2)*(np – 1)/(nmp – 1)

which means that the number of degrees for covariables is subtracted at both numerator and denominator of the right hand formula side.

Let’s see the difference on the real example. We will use Vltava dataset, and build the pRDA in both vegan and C5.

[sourcecode language=”r”]
library (vegan)
# download Vltava dataset from
vltava.spe <- read.delim (‘’, row.names = 1)
vltava.env <- read.delim (‘’)
vltava.spe.hell <- decostand (vltava.spe, method = ‘hell’)

# Build partial RDA model with
pRDA <- rda (vltava.spe.hell ~ pH + ASPSSW + Condition (SOILDPT + SLOPE),
data = vltava.env)

# Call: rda(formula = vltava.spe.hell ~ pH + ASPSSW + Condition(SOILDPT + SLOPE),
# data = vltava.env)
# Inertia Proportion Rank
# Total 0.71576 1.00000
# Conditional 0.04264 0.05957 2
# Constrained 0.06245 0.08725 2
# Unconstrained 0.61067 0.85318 92
# Inertia is variance
# Eigenvalues for constrained axes:
# 0.03810 0.02435
# Eigenvalues for unconstrained axes:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
# 0.06648 0.06138 0.04468 0.03336 0.02994 0.02693 0.02328 0.02114
# (Showed only 8 of all 92 unconstrained eigenvalues)

RsquareAdj (pRDA)
# $r.squared
# [1] 0.0872482
# $adj.r.squared
# [1] 0.07016251

# First, let’s define variables with explained variations:
constrained <- pRDA$CCA$tot.chi # 0.06244857 – variation explained by
# explanatory variables after removing the effect of covariables
conditional <- pRDA$pCCA$tot.chi # 0.04263834 – variation explained by
# covariables
total <- pRDA$tot.chi # 0.7157577 – total variation of the compositional dataset

# In R, the R2 is calculated as constrained variation / total inertia:
constrained / total
# [1] 0.0872482

# and adjusted R2 is calculating by subtracting adjR2 explained by covariables
# from adjR2 explained by both explanatory variables and covariables:
R2abc <- (constrained + conditional) / total
R2bc <- conditional / total
n <- nrow (vltava.spe) # number of samples in the dataset
m <- 2 # number of explanatory variables
p <- 2 # number of covariables

adjR2abc <- 1 – (1 – R2abc)*(n – 1)/(n – m – p – 1)
adjR2bc <- 1 – (1 – R2bc)*(n – 1)/(n – p – 1)
adjR2a <- adjR2abc – adjR2bc
# [1] 0.07016251
In CANOCO 5, if the same dataset and the same variables and covariables are used in the same pRDA analysis on Hellinger transformed data, the results are different (if you have CANOCO 5 installed on your computer, download and open this file):

Analysis 'Constrained-partial'
Method: partial RDA
Partial variation is 64.61974, explanatory variables account for   9.3%
(adjusted explained variation is   7.3%)

Summary Table:
  Statistic    Axis 1    Axis 2    Axis 3    Axis 4
Eigenvalues    0.0532    0.0340    0.0929    0.0858
Explained variation (cumulative)    5.66    9.28    19.15    28.27
Pseudo-canonical correlation    0.7624    0.6491    0.0000    0.0000
Explained fitted variation (cumulative)    61.01    100.00        

Permutation Test Results:
  On All Axes    pseudo-F=4.7, P=0.001

As you can see, R2 is 9.3%, while adjR2 is 7.3%, in contrast to vegan‘s 8.7% and 7.0% for R2 and adjR2, respectively. The sum of eigenvalues for the two constrained axes is 0.0532+0.0340 = 0.0872, the same value as the R2 in vegan, but in the header, C5 reports higher values. These are calculated in the following way (continuing the R code above):

[sourcecode language=”r”]
# R2 in CANOCO as
R2.C5 <- constrained / (total – conditional)
# [1] 0.09277488 – after rounding 9.3%

# adjR2 in CANOCO as
1 – (1 – R2.C5)*(n – p – 1)/(n – m – p – 1)
# [1] 0.0730526 – after rounding 7.3

Note that the other parameters are identical in both vegan and CANOCO (eigenvalues, F-value and also the significance).

In the end, I wrote this quick-and-dirty hack of the vegan‘s RsquareAdj function, which includes argument type with the values either “partial” (vegan’s version) or “semipartial” (C5 version):

[sourcecode language=”r”]
RsquareAdj2.rda <- function (x, type = ‘semipartial’, …)
m <- x$CCA$qrank
n <- nrow(x$CCA$u)
if (is.null(x$pCCA)) {
R2 <- x$CCA$tot.chi/x$tot.chi
radj <- RsquareAdj(R2, n, m)
else if (type == ‘semipartial’) {
R2 <- x$CCA$tot.chi/x$tot.chi
R2p <- x$pCCA$tot.chi/x$tot.chi
p <- x$pCCA$rank
radj <- RsquareAdj(R2 + R2p, n, m + p) – RsquareAdj(R2p, n, p)
} else if (type == ‘partial’){
R2 <- x$CCA$tot.chi/(x$tot.chi – x$pCCA$tot.chi)
p <- x$pCCA$rank
radj <- 1 – (1 – R2)*(n – p – 1)/(n – m – p – 1)
if (any(na <- m >= n – 1)) radj[na] <- NA
list(r.squared = R2, adj.r.squared = radj, type = type)

Quick check whether it works:

[sourcecode language=”r”]
RsquareAdj2.rda (pRDA, type = ‘semipartial’) # 0.08724 vs 0.0701
RsquareAdj2.rda (pRDA, type = ‘partial’) # 0.09277 vs 0.07305

Looks like it does. Note, however, if you want to use it, it needs to be further tested (e.g. with datasets of environmental variables include also factors as explanatory variables or covariables).

An experiment with

Today I just made an experiment – I submitted preprint of finished manuscript to bioRχiv (pronounced bioarchive) on, the preprint repository which aims to be the biological twin of

Everything just happened at once – last morning I clicked through the Friday’s linkfest in Dynamic Ecology to the post of Zen Faulkner about his experience with uploading preprint to bioRxiv – in his eyes not too successful, since there were not extra views and comments on the paper, so he conclude that it may be just a waste of time if you are not scientific superstar and Nobel prize nominee. But I got inspired by the comments below that post – sure, to upload the preprint has the function of getting out fast a citable version of your work. At the same time, I finally finished working on a manuscript which is actually a kind of baby I cared about for way too long – it came already through two rounds of reviews (both rejected), and this is the third completely rewritten version prepared to be submitted for review again. But I really hope that at this moment it will be out – for the feeling that things are moving, and that I can link to that paper if I want, or even cite it in the follow up studies.

So I used bioRxiv for that – it’s pretty fast, the submission procedure is a bit similar to journal submission, but doesn’t contain so many questions. Before I submitted the manuscript for review to the journal I plan to, I first uploaded the pre-print here and note this in the cover letter for the submission to the “real” journal. There are options to upload edited or corrected versions any time, so it should be no problem to fix mistakes if there are some (sure there will be). As I understand the pre-print submission has no effect on the success of review in the peer-reviewed journal or potential acceptance/rejection of the paper there, so it sounds like just an added value of having something you can wave in the air.

The only thing which left me wondering was that actually once you upload the paper there, the paper is citable and – it cannot be removed. I spent a while hesitating what this may cause in the future – for example if I found that the paper is just a complete nonsense, I can’t just hide it away, it will be hanging there forever (whatever “forever” means). Hope this won’t be true, but let’s see what will happen.

PS: there is a Nature News article featuring almost three years ago when it was launched (you may click through here).


TWINSPANTWINSPAN is perhaps one of the most popular clustering methods (at least among vegetation ecologists), which is not implemented in R. R-sig-eco forum has several posts (mostly from Jari Oksanen and Dave Roberts) on the topic of TWINSPAN in R, where they described difficulties with importing the original TWINSPAN code (written in FORTRAN) into R. Seems that both Jari and Dave spent considerable effort trying to implement the method into R, but seems like there is some problem which is not easy to crack.

From my experience (and I guess also from experience of many others, not only vegetation ecologists), TWINSPAN does sometimes give rather nice and ecologically meaningful results, since it is based on cutting the data along the main compositional gradients. There is yet another divisive method, somewhat analogous to TWINSPAN, called DIANA (DIvisive ANAlysis clustering), which was proposed by Macnaughton-Smith et al. (1964), described in detail by Kaufman & Rousseeuw (1990), and made available in series of Fortran written programs with poetic names like AGNES, CLARA, DAISY, DIANA, or FANNY (later implemented into R package cluster). But it seems to me that this method for some reason never gained so much popularity like TWINSPAN did, mostly perhaps due to its sensitivity to outlying samples (which will be separated first into one-item clusters). This is why I think it still would be nice to have TWINSPAN handy in R, also because of its modified version, which we introduced couple years ago (Roleček et al. 2009) and which up to now is available only in JUICE, program for editing and analysis of vegetation data developed by Lubomír Tichý (Tichý 2002). Modified TWINSPAN is basically a sequence of divisions calculated by standard TWINSPAN, each time applied on the most heterogeneous group – here is again a similarity with DIANA, which also in each step divides the group which is the most compositionally heterogeneous.

Recently, I got a simple idea, how to make TWINSPAN work in R, and how to extend this implementation also for modified TWINSPAN. The way I used is somewhat clumsy, but it seems to work. I used the twinspan.exe file, which is an executable program based on original FORTRAN library written by Mark O. Hill (author of the algorithm and the original FORTRAN code) and compiled in 2003 by Stephan M. Hennekens into a stand-alone program for use in MEGATAB, software which before was used together with TURBOVEG for editing and classification of vegetation data (Hennekens & Schaminée 2001, Schaminée & Hennekens 2001). I created R package twinspanR, which includes twinspan.exe, and added bunch of supporting functions to maintain import and export of data to twinspan.exe and back to R. So basically the TWINSPAN is calculated by the original Hill’s algorithm, and R functions in twinspanR package are for handling the whole thing conveniently in R (see the notes below for more technical details how the library exactly works).

For details how to install the library from GitHub see the file. Some more examples how to use twinspan package will be (hopefully) made soon available on my website for analysis of community ecology data in R.

The following is an example code, using example dataset danube with Ellenberg’s meadow dataset (used also as an example in the first publication of TWINSPAN and DCA by Hill 1979 and Hill & Gauch 1980, respectively). In this example I used modified TWINSPAN (Roleček et al. 2009) with division into 4 groups and heterogeneity of clusters measured by Bray-Curtis dissimilarity measure. I projected the results into DCA ordination diagram, along to the original tabular classification made manually by Ellenberg (from Mueller-Dombois & Ellenberg 1974):

[code language=”r”]

## Modified TWINSPAN on traditional Ellenberg’s Danube meadow dataset,
## projected on DCA and compared with original classification into
## three vegetation types made by tabular sorting:
library (twinspanR)
library (vegan)
data (danube)
res <- twinspan (danube$spe, modif = TRUE, clusters = 4)
k <- cut (res)

dca <- decorana (danube$spe)
par (mfrow = c(1,2))
ordiplot (dca, type = ‘n’, display = ‘si’, main = ‘Modified TWINSPAN’)
points (dca, col = k)
for (i in c(1,2,4)) ordihull (dca, groups = k, = i, col = i,
draw = ‘polygon’, label = TRUE)
ordiplot (dca, type = ‘n’, display = ‘si’, main = ‘Original assignment\n (Ellenberg 1954)’)
points (dca, col = danube$env$veg.type)
for (i in c(1:3)) ordihull (dca, groups = danube$env$veg.type, = unique (danube$env$veg.type)[i], col = i,
draw = ‘polygon’, label = TRUE)


example twinspan figure

Some technical details how the twinspanR package works

Some further details how it works. The executable file, twinspan.exe, is stored in \exec subdirectory of the library. There is also tw.bat file, which launches twinspan.exe and feeds it with data from R (I have shamelessly stolen this idea from the way how Lubomír Tichý executes TWINSPAN in JUICE, and it is also similar to the way how Tom August implemented another Hill’s FORTRAN library, FRESCALO (Hill 2011), into R as a function frescalo in package sparta). I take compositional data in R, transform them to required *.cc! format (luckily there is a function write.CEP in rioja package, written by Steve Juggins) and save them to \exec subdirectory (where also twinspan.exe is located). Then, I create file tw.dat with input parameters for twinspan.exe (using function, and use the shell function in R to launch the tw.bat file. All calculations are done using original twinspan.exe, R just reads its output from tw.PUN file. It’s in a no way an elegant approach, but works just fine.

The twinspan.exe file used for calculation of TWINSPAN in R is taken from the distribution of JUICE program (Tichý 2002). It is the original FORTRAN code written by Mark O. Hill, compiled (cca in 2002) by Stephan M. Hennekens into twinspan.exe for use in his program MEGATAB (which was used together with Turboveg for analysis of community data). The version of twinspan.exe used here implements the changes by Petr Šmilauer, related mainly to problems with algorithm convergence, which cause the results being dependent on the order of samples in the input data table. Note that this algorithm is slightly different from TWINSPAN implemented in WinTWINS software (Hill & Šmilauer 2005), which implements also other modifications by Birks and ter Braak. In fact, there are at least four different versions of TWINSPAN recently used for analysis, and in certain circumstances they differ in the results how they classify the samples (when I have time, I will elaborate this topic further).

Unfortunately, the presence of twinspan.exe file in the twinspanR library is a problem for its portability – seems like CRAN doesn’t allow packages with executables inside (for understandable security reasons), and R-Forge allows it to be uploaded, but fails to build it. For now, the channel for distribution of this package is GitHub, and it will perhaps remain there until some other solution will show up. Simply, it’s just a quick and dirty way how to get TWINSPAN functionality in R without too much hassle, before somebody manages to write fully functional implementation of TWINSPAN in R.


  • Hennekens S.M. & Schaminée J.H.J. (2001): TURBOVEG, a comprehensive data base management system for vegetation data. Journal of Vegetation Science, 12: 589-591.
  • Hill M.O. (1979): TWINSPAN: A FORTRAN Program for Arranging Multivariate Data in an Ordered Two-way Table by Classification of the Individuals and Attributes. Cornell University, Ithaca, NY.
  • Hill M.O. & Šmilauer P. (2005): TWINSPAN for Windows version 2.3. Centre for Ecology and Hydrology & University of South Bohemia, Huntingdon & České Budějovice.
  • Hill M.O. & Gauch H.G. (1980): Detrended correspondence analysis: An improved ordination technique. Vegetatio, 42: 47-58.
  • Kaufman L. & Rousseeuw P. J. (2009): Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
  • Macnaughton-Smith P., Williams W. T., Dale M. B. & Mockett L. G. (1964): Dissimilarity analysis: a new technique of hierarchical sub-division. Nature, 202: 1034-1035.
  • Mueller-Dombois D. & Ellenberg H. (1974): Aims and Methods of Vegetation Ecology. John Wiley & Sons, New York, Chichester, Brisbane, Toronto.
  • Roleček J., Tichý L., Zelený D. & Chytrý M. (2009): Modified TWINSPAN classification in which the hierarchy respects cluster heterogeneity. Journal of Vegetation Science, 20:596-602.
  • Schaminée J.H.J & Hennekens S.M. (2001): TURBOVEG, MEGATAB und SYNBIOSYS: neue Entwicklungen in der Pflanzensoziologie. Ber. Reinhold-Tüxen Ges., 13: 21-34.
  • Tichý L. (2002): JUICE, software for vegetation classification. Journal of Vegetation Science, 13: 451-453.

Mean Ellenberg indicator values: too good to be true

Zeleny-David-EVS-2012 2I started to think more intensively about Ellenberg indicator values issue at one conference while listening to the presentation, where a colleague used mean Ellenberg indicator values as explanatory variables in constrained ordination. I considered this as a kind of statistical heresy, perfect example of circularity of reasoning – you take your vegetation data, calculate mean Ellenberg indicator values for each plot, and in turn use these mean values to explain the original data. But it’s tempting – mean Ellenberg values are often considered as good proxies for measured environmental variables, and they are easy to calculate, so using them as explanatory variables is attractive. I tried that – I took a dataset with measured soil pH and calculated mean Ellenberg values for soil reaction, and compared how much variation in species data will be explained by pH and how much by mean Ellenberg; Ellenberg was a way better predictor than measured pH. Ok, so here we have the consequence of circularity. Thinking it through, I concluded that the reason is that mean Ellenberg values carry legacy of the species composition, from which they were calculated – if two plots have the same species composition, their mean Ellenberg values will be identical (considering mean not weighted by species covers), and if the species composition differ a bit, Ellenberg will change just slightly (changing one or a few numbers while calculating the mean doesn’t change the result too much).

I wondered what would happen if I reshuffle species Ellenberg values among species before calculation of mean for the vegetation sample, or if I replace the original species values by randomly generated ones. This would remove ecological meaning of the values, but keep the side effects of calculating the mean. No wonder – mean of randomized species Ellenberg values explains still more variation, when used in constrained ordination, then does random numbers (just keep in mind that every, even randomly generated variable, explains some variation – if this doesn’t make you happy, consider using adjusted R2 instead). There is no ecological information in these randomized values, so the extra explained variance is the legacy of species composition imprinted in mean Ellenberg values. I used randomization of species values as a base for modified permutation test, which can be applied for correcting the issue – not necessarily in constrained ordination, in which mean Ellenberg values are rarely used (scanning JVS and AVS journals through last ten years returned two papers), but also in unconstrained ordination, when mean Ellenberg values are correlated with ordination axes and the correlation is tested (actually this treatment is fairly common, although the problem with circularity of reasoning is exactly the same, yet not so obvious, as in constrained ordination).

I wrote an R code and later also a simple clicking program (MoPeT) running in R and calculating modified permutation tests, which is otherwise not easy to do. I presented the results in 2009 IAVS in Crete and later published a paper in JVS, together with André Schaffers, who helped me a lot in sorting the ideas and writing the manuscript. Still, I am not sure if ever somebody will really use this routine – mean Ellenberg values are great for description, but it’s perhaps better to keep them away from more sophisticated statistical treatments.

The story has actually an unexpected follow up, although I hoped that I won’t touch Ellenberg values any more. As a parody of life, I am just working on a paper describing statistical way how to justify the use of mean Ellenberg indicator values as explanatory variables in constrained ordination, and even do such things like partitioning the variation among different Ellenberg values or between Ellenberg values and measured variables. I presented this topic on EVS workshop in Vienna this spring, where it went through without any feedback – I guess the audience was even a little disgusted by such an overly technical talk. I really don’t feel like convincing somebody to use mean Ellenberg values as explanatory variables in constrained ordination, but I can’t help feeling quite fascinated by the imagination that something like this is actually possible.

Gap Light Analyser for 64 bit Windows

UPDATE on November 15, 2016: Gap Light Analyser became newly available also for 64-bit Windows machines; the installation files can be downloaded from the original GLA website. Thanks to David Wojtowicz of the University of Illinois for updating the original installer and sharing the news! As David mentions  in the comment below, he became an unofficial maintainer of this update.

On the last terrain excursion from geobotany, we were facing the problem with Gap Light Analyser – the freeware program for analysis of hemispherical photographs of tree canopy, taken by camera with fish-eye lens. It can’t be installed on 64-bit version of Windows 7, although there was no problem with 32-bit version. In the end one student had still an old 32-bit machine, so we managed to analyse the photographs, but I was wondering how to solve the problem for future. I wrote to Gordon Frazer, the author of GLA, for some hack about how to solve the situation, and he recommended to install Windows Virtual PC and with Windows XP, and run installation of GLA on it. I tried it, and seems to work fine, although it’s a bit heavy-weight, because the installation takes some time (it took me one hour to complete the whole process). You need to visit Microsoft website (, choose your system (suppose you have Windows 7 Professional, I haven’t tried for other versions) and download the installation packages for Virtual PC and Windows XP (for free if you have genuine Windows system, which needs to be checked online). Install it, following the instructions on the website, and finally you can run virtual PC and install GLA on it. It’s a bit slower than ordinary PC, but it works. Unfortunately it’s not an ‘easy’ solution in a sense that you can’t make this somewhere in the field without good connection to internet (the installation files are over 400 MB large). But seems like there is no other easy (and free) solution – I checked two other freeware programs for analysis of canopy images, namely  hemIMAGE and Winphot*, but both has the same issue like GLA – they can’t run on 64-bit machine (at least I have not succeeded to install them). There is also commercial software (e.g., HemiView), but I didn’t purchase it, so I have no idea if it works (I guess it should).

* There is a paper in Waldökologie from Alvaro Promis and colleagues, which compares these two packages with HemiView and GLA – that’s how I found them…

Species response curves in JUICE and R

Today I spent some time in attempt to fix the function of Species response curves in JUICE. The script for calculation of species response curves is written in R and it has problems from the whole beginning, but as my interest in using species response curves decreased, the same decreased my will for fixing the bugs. To make it short – I haven’t manage to fix it, and there are several reasons for this.

One is package gravy, which is used for calculation of HOF curves. Jari Oksanen has written this package almost 6 years ago (at least the newest version is from the year 2004) and didn’t updated it since that – and in the email two years ago he said he is not going to mantain it any more. This wouldn’t be so big trouble, but since R version 2.11 this packages no longer runs, which makes it imposible to run the script for HOF curves in newer versions of R. I tried to recompile gravy for newer versions of R, but didn’t succeed (mainly because the source code for gravy.dll dynamic library is not publically available, and I didn’t want to bother Oksanen about that).

Anyway, gravy didn’t really perform too good. The algorithm often results into ‘sharp shapes’ (which I documented in details here). I don’t really have ability to fix this, as this goes beyond my mathematical skills. However, seems that Florian Jansen was working this topic through and found some remedy – he reported it two years ago in IAVS in Stellenbosch (presentation could be found here) and he has also written R package, which contains updated version of HOF functions. The packages is rather new and under development (I have seen the latest update from August 2010), so it still needs perhaps some time until it will be reliable. But this would be option to go. The packages could be downloaded from Florian’s website (basic package is vegdata, and the package contains additional function, which are under development – such as those HOF curves).

So, the recent situation is like this: the function about species response curves in JUICE works fine only under R version 2.8 or lower, and it’s perhaps not going to change in recent time. In the future, I hope to remake it from gravy to vegdata package. For those seriously thinking about playing with species response curves, I would recommend use R directly, not the wrapper from JUICE; if you are not R experienced, try to go for CanodDraw for Windows, which offeres GAM and GLM models for species response curves (however, it doesn’t offer HOF). That’s it!

UPDATE (September 2012): There is an alternative, which works right now – I rewrote the original R package for JUICE into JUICE-R function, which can be run with data from JUICE, but it’s not so problematic like the previous solution. It can be found on JUICE-R website in the section Available R scripts as Species response curves. It’s based, however, on the same algorithm as the previous function in JUICE, so the problem with HOF models remains and I would not really recommend to use them…