# Unconstrained ordinations in vegan: an online Shiny app

The R package “vegan”, written and maintained by Jari Oksanen, Gavin Simpson and others, contains the whole set of functions needed for unconstrained ordination (and many other things) in R. I also experimented with other packages (“labdsv” by Dave Roberts or “ade4” by Stephane Dray et al.), but “vegan” is unbeatable when it comes to ordinations. One a little annoying thing in “vegan” is the naming convention of various ordination functions. It’s relatively easy to get used to that PCA (principal component analysis) and CA (correspondence analysis), respectively, are calculated by “rda” and “cca”, respectively, which, if one also supplies environmental variables, also calculate constrained versions (RDA aka redundancy analysis and CCA aka canonical correspondence analysis). DCA (detrended correspondence analysis) is calculated by “decorana”, which also makes sense, if you know the history of the method and the fact that first program written by Mark Hill in Fortran to calculate DCA was called exactly in this way (and in fact the “decorana” function in R is based on the original Fortran code ported into R by Jari). Somewhat harder is to remember that NMDS (non-metric multidimensional scaling) is calculated by a function called “metaMDS”, and the definite killer for me is PCoA (principal coordinate analysis), calculated by “cmdscale”. When you are teaching these to students, it may get a bit frustrating, so some handouts come handy. That’s partly why I got this idea: an interactive online tool, which would allow using few example community ecology dataset, analyse them by a set of standard ordination methods, draw resulting ordination diagram, print summary output, and mainly – show also the R code which was used to do all that. The result is here, welcome to click through!

Previously I used Shiny apps to play with visualisation of random drift, and I got quite hooked on the concept of a dynamic website with R running behind. In the case of this project, it got more complex, and I spent the whole day learning various Shiny hacks (e.g. how to modify dynamically also the clickable menu at the left panel). Although the result is far from perfect (and perhaps will stay like that for a while), it’s functional. You may choose one of the datasets (short description and the link to the website with details will appear), then choose one of six ordination methods, transform or not the community data, and decide what to display in the ordination diagram. The figure above shows the PCA on a simulated community dataset structured by one long gradient, and a lovely horse-shoe artefact which PCA chronically produces on datasets with many zeros.

Don’t get me wrong – this is not an attempt to provide an online, clickable alternative interface to “vegan”, I have no intention to do something like this. But I think that Shiny apps are a great teaching aid and visualisation tool. If you have a new R package and want to show how it works, a small Shiny example may be much better than a long online tutorial (at least as an appetiser with the aim to hook users actually to try the package).

Welcome to play, and thanks for any feedback!

# Piping tibbles in tidyverse

A couple of weeks ago, while preparing for the R for ecologist class, I found tidyverse, a suite of packages (mostly written by Hadley Wickham) for various ways how to tweak data in R (thanks to Viktoria Wagner for wonderful online materials for managing vegetation data!). Somewhen in 2009 I attended useR! conference in Rennes, France, where Hadley held a workshop on using one of these packages (I think that time still called plyr). I attended it, but to be honest, I remember nothing from that workshop, it was a way too abstract for me. Since then several times when preparing data for analysis in R, I thought that I should check it again and finally learn it. And here we are, it is coming, preparation of the R class pushed me to do that finally. I also found the piping of data through the functions, with an entirely different logic of sending data from function to function*, and a new form of data frames called tibbles (no idea what the word means). All the new fancy things; R moves forward quite wildly and erratically. But this post is not about piping, tibbles, and tidyverse. It is a quick thought about me using R, how did I change, and whether I should do something with that or not.

For the first time, I used R somewhen around the year 2005 when I started to work in Brno, my previous university, and a colleague asked me to calculate and draw species curves which were hard to do anywhere else. I recall S-PLUS, a program I used as a master student when I took the class of Modern regression methods while studying in České Budějovice. Petr Šmilauer that time taught it in S-PLUS, commercial system, which turned to be R’s ancestor. Knowledge of S-PLUS came handy, and I was able to do quite pretty figures in R. That time there were just a few R packages, and no RStudio (I think I was using Tinn R editor, it still exists actually). R was not a sweet candy at the beginning, I remember quite some time spent by frustration from always occurring annoying error messages, but eventually I kind of started to like it. Later I found that the skill of using R is a powerful advantage – I was able to calculate or draw almost anything, things that would otherwise need to be done in a cumbersome way in some clickable software. That time there was just a couple of other people fancy in R. I started to teach R, which was a great way to push myself to improve, and to spread the knowledge of R among others. How different is it from today, when R is considered a lingua franca of ecologists, scientists publish papers with R codes appended, and common requirements for a newly hired research assistant, postdoc or even faculty member is a skill of using R.

But I also changed. There was a time, couple of years ago, when I was eager to learn all different new developments in R, study fancy packages, try new analytical or visualisation methods. At one time, together with Víťa, another R guy from Brno, we even taught seminaR, a class focused purely on trying new fancy things in R. But it’s gone. When it comes to R, I start to be somewhat conservative, stick to what I know and feel quite reluctant to discover new things. Partly perhaps I got lazy, but partly it has reasons. Some of those new packages, methods and “new trends” in using R, in the end, turned to be just ephemeral matter, packages were not maintained, their developers deserted them and turned interest into something else. Also, now the use of R is so broad, that it is hard to keep track of all new and exciting things. And my time is also not what it used to be; I can’t spend a week or more by trying to tweak the R code in this and that way, thinking about it day and night, excited and barely sleeping. It is not that I don’t use R anymore – actually I use it on a daily basis, at any time I can be sure that some of my computers have RStudio on, with some half-written script or some long code running. But I use it as a tool to solve some problem, and I focus on the problem itself, instead of keeping polishing my skills of using that tool. Learning R was, without doubt, one of the best investment I did in my professional life, and now I hope to move further while keep using it to do something else.

Since now every winter semester I teach R for ecologists, I keep myself somehow fit in using R in a way that I sort things to be able to explain them to students. The class is actually focused on pretty basic R stuff. Half of it we mostly draw figures, because this does not require any added theoretical background which would be needed if I use R to teaching, e.g. statistics. The second half of the class we focus on “business as usual, with R”. Loops, functions, importing, exporting and manipulating data, and some simple calculations, mostly to show students the benefit R may bring them if they use it. But that is actually the point; I do not focus on bringing new fancy things to the class, but instead, teach R oldies goldies in a way as smooth and digestible as possible.

But, piping tibbles in tidyverse let me think that it may be time to sit down for a while and recheck how R changed in the past couple of years I did not follow it. Here comes my point. Do you have some suggestions what to look for? What in R do you find useful recently that you cannot breathe without and you would suggest me to learn it? No specialised packages, something for “everyday business”. For example, I still wonder whether I should learn ggplot2 or be happy with my base R graphics combined with lattice – do you have some opinion about that? Any comments welcome!

* If you are familiar with R, than piping (library magrittr) can replace (for example) sum (log (sqrt (abs (1:10)))) by 1:10 %>% abs %>% sqrt %>% log %>% sum. Quite a revolutionary way how to assemble script together. If not familiar with R, just don’t worry about that 🙂

# Visualizing random drift in ecological communities. Part 2.

In the first part of this post, I focused on a simple illustration of random drift without acknowledging any other process. In the case of trees growing on a small island with only 16 individuals, it does not matter which individual died and which reproduced to replace it, because all offsprings could disperse anywhere and the whole process looked more like swapping the colours on the balls in the bag. Here, I explicitly include dispersal in two forms: as a movement of the offspring limited to some short distance, and as an immigration of species from outside. Then, in the second step, I also add the process of selection, which means that I include environmental heterogeneity into the model (elevation), and let species differ from each other in their ecological preference. While in the first part I was interested how changes in the number of species, number of individuals and number of “generations” influences the outcome of random drift, here I will ask how important is ecological drift when interacting with immigration and selection. For all simulations I used an R package developed by Tyler Smith and published in Smith & Lundholm (2010); for more technical details, see the last chapter below.

### Artificial landscape: it matters if the individual has a neighbour and how far its offspring can disperse

For all further simulations in this post, I will use artificial landscape of certain size (e.g. 50×50 or 100×100 cells, each cell accommodating maximally one individual) and fixed number of species (10). At the beginning, each cell in the landscape is occupied by one individual of randomly chosen species, which means that each species is represented by a similar number of individuals. In each cycle, a fixed proportion of individuals will die, and a fixed proportion of individuals will produce one offspring. The dispersal of the offspring is spatially limited in that it can get only to one of the neighbouring cells, rarely a bit further. Starting from the second generation, there could be vacant places in the landscape after inidividuals which died; they remain vacant until offspring of some neighbouring individual occupies them. See the figure below showing this dynamic on the example of a landscape with only four species. In the scenario without immigration this is all; if we add immigration, then in every generation, individuals of randomly selected species will be added to the marginal cells of the landscape. Note that all individuals of all species are demographically equal (i.e. all have the same probability to die or reproduce) and all cells within the landscape have the same “environmental conditions”. In this sense, the model shows “neutral” community where only neutral processes (drift and dispersal) operate.

Example of neutral dynamic in the landscape with only 4 species. Each circle is an individual, and each color is a species. At the beginning (1), all cells are filled by individuals of randomly chosen species. The next window shows the result after the first generation; some individuals died, and some other replaced them, but only within the neighbouring cells; white spots are empty. Panels (3) to (9) shows results of next generations.

An interesting property of this model is that abundances of species, which at the beginning were almost equal, will become highly uneven. I added diagram with species abundance distribution (see the figure below, the right panel), in which species are ordered according to the number of individuals (from high to low), and each species is represented by a bar with relevant colour. Order of species in this distribution dynamically changes, and in the course of time, some species become rare and eventually extinct. As a result, only one species dominates the whole landscape.

Artificial landscape, with 50×50 cells each occupied by max one individual, 10 species. Left panel shows distribution of species in the landscape, right barplot shows species abundance distribution (bars are ordered decreasingly by the number of individuals, and scaled relatively to the most dominant species). This is the initial situation: landscape with all cells occupied by one individual of randomly selected species.

Artificial landscape after the first generation. Some cells became empty (white), and species abundances start to differentiate from each other.

After 100 generations, abundances of species even more differentiated. No clear spatial pattern yet, no species extincted yet.

After 200 generations. Clear spatial pattern arises, one species becomes dominant, one species extincted.

After 500 generations. Only three species left, from which one (red) clearly going to extinct soon. Abundances are highly uneven.

To know the whole sequence from beginning to the end, see the video below.

In the video you see that after the red species extinct (time stamp 1:06), the two remaining species keep silently fighting for a long time; in fact, I haven’t had the patience to run the simulation long enough, but eventually, one of the two species will for sure win.

The simulation above is quite similar to the one done in Part 1, just the landscape is “spatially explicit”, meaning that it matters where in the landscape the individual occurs and what are its neighbours.

### Including immigration: constant supply of individuals from outside

Now let’s add immigration. The following simulation has the same parameters as the one before (the landscape of 50×50 cells with ten species), except that after each cycle, the individuals of randomly selected species will be added to marginal cells (that’s why the landscape keeps colourful margins all the time). The initial situation looks very similar to the one above, so below I show only the situation after 200 and 500 generations.

Artificial landscape with immigration, after 200 generations. Species abundance distribution highly uneven (ligh blue species recently dominating), but no species extincted.

Artificial landscape with immigration after 500 generations. Nothing much happening, species happily coexist.

Again, see the video below for the whole time sequence. Just note that it is actually pretty boring; drift makes some of the species rather rare, but due to constant immigration no species is going to extinct (and if yes, they will be most likely added in the next generation into the marginal cells). If the speed of immigration is high enough as in this simulation, the species actually happily coexist together.

Some conclusions from what we learned so far. In the isolated landscape without immigration (like island far in the ocean, or patch of some natural habitat in the landscape isolated from other patches by human activities), the random drift earlier or later results into extinction of all but one species; how long it takes depends on the number of individuals (size of the island) and initial number of species. Immigration can revert this result by returning the species to the game.

### The end of neutrality: adding environmental heterogeneity and selection

So far all cells within the artificial landscape were equal, and all species have an equal chance to survive in any of them. The next step is to include environmental heterogeneity and selection between species. Environmental heterogeneity means that different cells have different environmental conditions (or habitat properties).

In our model there is only one environmental gradient; let’s call it elevation. The spatial distribution of elevation is not random – it forms four “humps” which we will call hills. Species, so far, were also equal in that they did not have any environmental preferences (even if they did, so as so the environment was perfectly “flat”). Now, each species gets ecological optima in one elevation zone, but it can (with lower success) grow also in the adjacent elevation zones. For example, species 5 has optima in elevation zone 5, but its offspring can also survive in the zones 6 or 4 (but not much higher or lower). This is the selection by environment: species 5 has an advantage within the cells of elevation zone 5, while species 4 has an advantage within the cells of elevation zone 4; if species 5 gets to elevation zone 4, it may still survive there, but with a lower probability than species 4. Our model has four hills, ten elevation zones and ten species with each species specialised on one elevation zone. The proportion of cells is roughly equal for each elevation zone. All other parameters are identical to the previous simulation, except that the landscape is 100×100 cells large (to squeeze all four “hills”).

To visualise this, each snapshot has three different panels (see below). I removed species abundance distribution and included 3D visualisation of our four hill landscape instead. I also added the stacked barplots showing relative abundances of individual species within each elevation zone within each mountain. As it will turn out, this visualisation will become quite important to understand what is happening.

Artificial landscape with four hills representing elevation (min 1, max 10). This is the initial situation, with individuals randomly dispersed across the landscape. The left panel shows the aerial view of this landscape, with contours for elevation, and individual peaks marked according to geographical position (NW, NE, SW, SE for the north-west, north-east, south-west and south-east, respectively). Upper right panel shows the same in 3D (note that the hills have been “smoothed”, otherwise they would have “stairs” on the slopes). The lower right panel shows four stacked barplots, each showing relative abundances of each species within each elevation zone. In this initial step, all species are everywhere and proportions of species in each elevation in each hill are similar. Species elevation optima shows which species (colour) has optima in which elevation zone (the number).

Artificial landscape with topography after 100 generations. Spatial pattern starts to emerge (left and upper right panel). All species are still in most elevation zones (barplots in lower right panel), but there start to be differentiation according to species ecological preferences (e.g. dark violet species stats to prevail in lower elevations, while light and dark blue species in higher elevations).

After 500 generations. Spatial pattern is rather clear, and species differentiation along elevations also. Note one interesting thing: different peaks start to be dominated by different species (dark blue vs light blue ones). Peaks are like small islands in the sea of lowland, and ecological drift here works faster than in other parts of the landscape.

After 3000 generations. Dark blue species dominates NE and SE peak, while light blue dominates NW and SW peaks (with light green in SW).

Video with the whole sequence is below.

Perhaps the most interesting is that different peaks of the hills become dominated by different species, as a result of random drift (dark blue species dominates eastern peaks, while light blue western peaks). Lowland is also partly differentiated, but since lowland of each mountain is interconnected with each other, this differentiation is arbitrary. In this way, random drift increased the beta diversity of along elevation; if we put a sample in each peak, the differences in species composition (beta diversity) will be high, while not so high in the lowland.

With immigration, the pattern is similar, just more dynamic. Lowland cells have most of the species, but this is mostly the result of the immigration into the marginal cells of the landscape (which are, due to the construction of the model, all in the lowland).

Adding selection to the mix of drift and dispersal has an interesting outcome. Stochastic ecological drift acts against a deterministic effect of selection via “environmental filtering”, and as a result, ecologically identical habitats (elevation zones) can have similar or somewhat different species composition, depending on their isolation. Hill peaks in our model are the most isolated, and as a result of drift and dispersal limitation, different peaks became dominated by different species. Immigration can revert the effect of drift, but only if the habitats are not isolated (influence of immigration is high in the lowland which is interconnected, but not in more isolated higher elevations).

This leads me to the question: how important is ecological drift in real communities? How much is the pattern which we see in nature actually result of predictable species ecological preferences and biotic interactions, and how much it results from upredictable effect of drift and dispersal? This “niche-neutral” riddle is around already for couple of years, but in my impression the methodological problems and lack of appropriate data about ecological communities make it difficult to answer. But that’s a topic for another post somewhen in future.

### Technical details: how I made it

The simulation is done in R by packages neutral.vp written by Tyler Smith and published in Smith & Lundholm’s 2010 Ecography paper. The original library published as an appendix to the paper is already not working, but I fixed it to be able to compile it in newer versions of R (details here). I have been playing with this library couple of years ago, when I was deeply in love with variance partitioning of beta diversity among environmental and spatial components (the love is gone, luckily). There are some other (even more advanced) models simulating spatially explicit communities (one overview is here), but for the purpose of this blog, neutral.vp is perfect and rather fast (at least for reasonably small landscapes). Visualisations were done in R (the squared landscape using functions image and contour, the 3D models of a landscape with four hills using the function persp). Figures were assembled into a video in Movie Maker (Windows). The R scripts are a bit messy, so I don’t provide them here, but if you wish, I will be happy to share.

### References

• Smith, T.W. & Lundholm, J.T. 2010. Variation partitioning as a tool to distinguish between niche and neutral processes. Ecography 33: 648-655.

# Visualizing random drift in ecological communities. Part 1.

Ecological drift is one of the four main processes maintaining species coexistence in ecological community, along with selection, dispersal and speciation (Vellend 2010, 2016). To me, it seems that from these four processes, the effect of ecological drift on diversity and species composition of ecological communities is the least intuitive. Here I come with a few interactive illustrations and videos showing how the ecological drift works and what it does.

I separated this post into two parts, forming a logical sequence. Here, in part 1, I focus purely on the random drift, without any other process in action. Destiny of all simulated communities is the same: eventually, only one species will prevail, and all other will go extinct, it is just a matter of time when. In part 2, I will also add dispersal and create a virtual landscape in which it matters who is your neighbour and how far your offspring can disperse. Then, I will also add selection, making species not ecologically equal and letting them search for their ecological optima. Combination of drift with dispersal and selection will be the most closely resembling the community assembly in real ecological communities, although being still very far from the situation in any real community. And hopefully, it will convince you that random drift is a process which can get more important than you would think.

But let’s start slowly.

### Island with 16 trees of two species: how the ecological drift works

Imagine a small, tiny island – so tiny, that only 16 trees could squeeze in, not a single one more. The island is surrounded by sea, far from any mainland, perfectly isolated. In our example, we start with eight individuals of each species (red and black, see the figure below). A single process is happening: at any given moment, one individual dies and empties its place, and another individual reproduces, and its offspring fills the vacant place. The selection of individual which dies and which reproduces is random; it does not matter to which species the individual belongs or where on the island it occurs. See the illustration below.

Random drift on remote island with 16 trees of two species. Yellow circle indicates species which has been selected to reproduce and whose offspring fills the vacant place.

What happens after several cycles of die-reproduce-replace? Watch below.

The population is small, so even a tiny fluctuations in units of individuals, in the end, lead to the extinction of one species (in this case the red one). If both species have an equal proportion of individuals at the beginning, the probability to extinct is equal for both since there is no difference between species and between individuals (species are ecologically equal, meaning that they have no fitness differences – they have the same probability to reproduce and survive to the next generation).

The same thing can be drawn with the x-axis as time and y-axis as the individuals, with different species displayed by different colour. Because the position of individuals in the space of island does not play any role in this example, we can stack the individuals of the same species (the same colour) together to see overall trends of species abundances in time:

Trend of abundances of two species in time, influenced by random drift. Less than 100 generations were needed in this case for one species go extinct from the island.

Now, let’s rerun the same scenario several times (nine times in this case), to see what happens:

Random drift on island with 16 individuals and two species, running for 300 generations.

Note that I increased the number of generations from 100 to 300 in this multiple run example, to show that for most of the replicates this is enough time for one species to disappear from the island. Result in this case: in eight cases, one species extinct (in four cases it was red, in four black), and in one case still both species co-exist on the island. Note that the first panel (top left) is the same as the single figure above (with 300 instead of 100 generations).

We can play further with this visualisation. How about a bigger island, say with 100 individuals?

Random drift on island with 100 individuals and two species, running for 300 generations.

Not enough generations for any of the two species to extinct. But if we increase considerably the time (into 5000 generations), things happen:

Random drift on island with 100 individuals and two species, running for 5000 generations.

The size of the population decides how fast the random drift will cause one of the species go extinct; more individuals need more time. But anyway, it is just a matter of time when the species will “drift away” of the community.

Now, how about more species? If the community of 100 individuals is cut into ten species, the population of each species becomes rather small, prone to extinction.

Random drift on island with 100 individuals and 10 species, running for 300 generations.

Even if the time is short (300 generations), in three out of nine replicates, one species extincted from the community (check the text in the titles of each plot: Start: indicates how many species were in the community at the start of the simulation, end: counts how many were in the end). If we take the same communities and increase the time to 5000 generations, again, things happen:

Random drift on island with 100 individuals and 10 species, running for 5000 generations. First 300 generations in each panel are identical with previous figure.

After 5000 generations, between one to three species left from what was community of 10 species at the beginning. A little hard to imagine that this figure is showing the same communities as the previous one, just “much later”, right? See the animation below (whcih focuses on the first panel (A) from the nine):

From the rainbow flag at the initial situation (after 100 generations, all species still have rather a similar number of individuals), after 5000 generations, all but one species eventually extinct.

### Forest fragment with ~5000 trees and 74 species: getting community parameters closer to something real

Ok, so let’s start to get more real. First, the populations of organisms are usually much larger than 100 individuals. Also, different species have different relative abundance, some are common, and some are rare (this is called species abundance distribution, which is usually highly uneven). Obviously, more common species seems to have a lower probability of extinction compared to rare species. To get more realistic parameters of our island community, I borrowed the data from the real forest in Taiwan, 25-ha large forest permanent plot in Lienhuachih (LHC). As in other forest dynamics plots established following the standards of Smithsonian Institute, all tree individuals with DBH (diameter in breast height) larger than 1 cm are measured, determined to species, their exact location is recorded, and each is permanently marked. Altogether, during the census of LHC plot in 2005 researchers found 153,255 woody individuals of 144 species. To make things simple here, I will consider only individuals with DBH > 20 cm; we get 5449 stems of 74 species. Their absolute abundances are rather variable: two the most abundant species, Engelhardia roxburghiana and Schefflera octophylla, have 599 and 532 individuals each, respectively, while seven species have each only one individual. The two figures below illustrate the flavor of these data:

Distribution of individual trees (DBH > 20 cm) in the 25ha plot of Lienhuachih forest dynamics plot, Taiwan. The size of the circle is proportional to DHB of individual trees, and different color means different species.

Liehnuachih forest dynamics plot. Photo credit: David Zelený.

Let’s put these data into the random drift visualisation, directly with 5000 generations:

Random drift in forest with 5449 trees, 74 species and realistic species abundance distribution, running for 5000 generations.

Seven species went extinct (mostly the rarest ones), but otherwise nothing too much is happening in such short time for such a big community. Note the new visualisation tool here: the left panel is the same type of figure you have seen already above (x-axis is time, y-axis is absolute number of individuals for each species), while the right panel shows species abundance distribution at the beginning (gray bars) and the end of 5000 generations (colorful transparent bars). In the barplot, species are sorted from the most to the least abundant at the start of the simulation (i.e. the tallest bar is Engelhardia roxburghiana with 599 individuals, the second tallest is Schefflera octophylla with 532 individuals).

The figures below show the same community, progressing further (50,000, 500,000 and 5,000,000 generations):

Random drift in forest with 5449 trees, 74 species and realistic species abundance distribution, running for ~50,000 generations.

Random drift in forest with 5449 trees, 74 species and realistic species abundance distribution, running for ~500,000 generations.

Random drift in forest with 5449 trees, 74 species and realistic species abundance distribution, running for ~5,000,000 generations.

The result is as expected; rare species became rather quickly extinct, and few dominant species survived. The expected probability that species will extinct is negatively related to its initial abundance, being highest for rare species. But that is “expected probability”; what we see in this simulation is one of many possibilities how the random drift changes species composition in time. In the figure with 500,000 generations, you can see that dark blue species, which has initially been rather rare, became quite abundant (almost 200 individuals); eventually, however, it still extinct. Video below shows the whole sequence between 5,000 and 5,000,000 generations with steps of 10,000 generations:

### Play with random drift: interactive application

Time to try things by yourself. Below is an interactive application, in which you can play with random drift by yourself. By default, it creates a community of 10 species, 100 individuals and 1000 generations, and displays abundances of individual species by colours of the rainbow. Feel free to modify different parameters; for example, set two species, 16 individuals and 100 generations to get the same scenario as displayed in the first figure of this website. Whenever you modify some parameter, the whole simulation is recalculated with new randomly generated values. If you want to stick to some version, fix the seed of the pseudorandom generator (set seed to YES); this will return the same generated community. You can calculate more than one (four or nine) scenarios in parallel.

Feel free to play! (click here to open the app in a new windows)

### Technical details: how I made it

Everything on this website was done in R, using a relatively simple random.drift function. Originally I created this function as an exercise with students in my R for Ecologists class; it has been a simple function with two species from which one will eventually drift away, and the purpose was for students to practice how to do a loop and write a function. Eventually, I found it interesting to add an option to have more species and include more arguments to improve flexibility. See the definition of the function on my AnaDat-R website. Videos are not fully R-made – I used R to generate the sequence of *.png images but merged them into a video using Window’s Movie Maker. Hope in the future this can be replaced by some function in ”magick” package, which is a recent fresh import of ImageMagick into R (but seems not having all the functionality ImageMagick has). Yeh and GIF animations were done with package ”animate”, function ”saveGIF”; there are some shortcomings in this package, for example, while I can modify the width and height of the figure, I cannot modify the resolution and pointsize argument; or maybe it just needs to study more.

Additionally to the R code, I also made Shiny application for the random.drift function, see the previous section. The background calculation is run in R on Shiny servers, and if you choose larger values, it may take a while to display the outcome. I got quite fascinated by the possibility to create such interactive apps; it is actually quite simple. More will for sure come in future.

### References

• Vellend, M. 2010. Conceptual synthesis in community ecology. The Quarterly Review of Biology 85: 183–206.
• Vellend, M. 2016. The Theory of Ecological Communities. Princeton University Press.

# Simpson’s similarity index vs Simpson’s diversity index

Since I started to teach multivariate methods in community ecology and additionally to chapters about ordination I also included the chapter about diversity, I found it confusing that there are two “Simpson’s indices,” one for similarity, and one for diversity. Indeed, each index has a completely different meaning and use, and I may well be the only one confused by them, but for a while, I had an impression that both indices are somehow related and maybe could be derived from each other. I did some search to clarify it for myself, and it turned out that although both indices have been published in roughly the same time (in the ’40s), each one is attributed to different Mr Simspon and appeared in quite a different context.

A bit of theory first. Simpson’s similarity index is used to calculate the similarity between a pair of community samples, to quantify whether their species composition is similar (they share most or all the species) or different. It comes in the company with other, similar indices, like Jaccard and Sørensen, which do the same job but with slightly different logic. If we have two samples, there will be some number of species shared between both, and some species occurring only in the first or only in the second sample. Jaccard or Sørensen divides the number of species both samples share by the number of all species occurring in both samples together (and Sørensen cares about shared species more than Jaccard, so it multiplies it twice in both numerator and denominator). The problem occurs if one of the samples has much higher species richness than the other. The number of shared species (in the numerator) may be the same, but the high richness of one of the samples will increase the denominator and decrease the overall index value, making the index losing the resolution power. Here comes the Simpson’s similarity index, which has the same nominator (number of shared species), but in the denominator, it takes only the number of species occurring in the smaller sample. Its use is recommended in a situation when there are large differences in species richness between the samples, and eventually, it turned out that this index is also the one quantifying the nestedness of two community samples (if the first sample contains the same species as the other one plus some more, the second sample is perfectly nested within the first one, and the value of the Simpson’s similarity index is equal to one).

Simpson’s diversity index is a completely different story. It aims to quantify the diversity of a single community sample, while considering both species richness (the number of species recorded in the sample) and also the relative species abundances (some species within the sample can be dominant, represented by a high number of individuals/higher biomass/larger cover, while some other are less common or rare). Two communities may have the same species richness (i.e. number of species), but if they differ by relative species abundance (i.e. evenness), their overall diversity will be different. If we have two forests, both with ten species (richness), but first with all species represented by the same number of individuals (perfectly even), while the second with one or two dominant trees (highly uneven), the first will intuitively (and also functionally) be more diverse – a walk through such forest will give you a feeling of diversity since you keep meeting different trees. In contrast, the forest dominated by a single or few species will feel species-poor (rare species are also there, so richness is the same, but they are not likely to be encountered as frequently as and also they are not likely to have an important ecological function in the community). Simpson’s diversity index, when calculating the diversity, is weighing the importance of species richness and relative species abundances. It comes in the company of his famous sibling, Shannon index, which does the same thing in a slightly different way. While Simpson’s index cares more about relative abundances, the Shannon index cares more about species richness; or, put in another way, the importance of rare species decreases in order species richness > Shannon index > Simpson index.

Now about the name. Both indices share the author’s surname, but in each case, it was a different Mr Simspon. Simpson’s similarity index was introduced by an American palaeontologist George Gaylord Simpson (1902-1984) to solve the situation often encountered in his practice with fossil samples, where some samples are much richer than the others, meaning that conventional similarity indices (like Jaccard and Sørensen) do not work properly. In 1943, G.G. Simpson wrote a paper “Mammals and the nature of continents” and published it in American Journal of Science; later, in 1960, he published (in the same journal) paper “Notes on the measurement of faunal resemblance”, where he also provided the formula for his index. The index has been rediscovered by Jack J. Lennon et al. in 2001 as βsim (often called beta sim; I suspect “sim” stems from “Simpson”). Note that Lennon et al. (2001) does not talk about “similarity”, but instead they used Simpson’s formula to calculate mean pairwise dissimilarities between the focal sample and the other samples; in Koleff et al. (2003), however, the beta sim is already considered as a pairwise dissimilarity measure based on the original Simpson’s index (1 – Simpson).

Simpson’s diversity index (also called concentration index) was published by British statistician Edward Hugh Simpson (born 1922, living in Oxfordshire), who is mostly famous for his formulation of Simpson’s paradox. E.H.Simpson published the index in the 1949’s Nature’s paper entitled “Measurement of diversity”. It quantifies the probability that two individuals took independently (without replacement) from the sample will be of the same species. With increasing diversity the index actually decreases (two individuals have the highest chance to be of the same species if the sample has only one species), so more often the one complement (1-Simpson) or reciprocal (1/Simpson) values are used, also known as Gini-Simpson index or Hurlbert’s PIE (probability of interspecific encounter, Hurlbert 1971).

So here we go – no reason to confuse them any more!

References

• Hurlbert, S.H. 1971. The nonconcept of species diversity: A critique and alternative parameters. Ecology, 52: 577–586.
• Koleff, P. Gaston, K.J. & Lennon, J.J. 2003. Measuring beta diversity for presence-absence data. Journal of Animal Ecology, 72: 367-382.
• Lennon, J.J., Koleff, P., Greenwood, J.J.D. & Gaston, K.J. 2001. The geographical structure of British bird distributions: diversity, spatial turnover and scale. Journal of Animal Ecology, 70: 966-979.
• Simpson, E.H. 1949 Measurement of diversity. Nature, 163: 688.
• Simpson, G.G. 1943. Mammals and the nature of continents. American Journal of Science, 241: 1-31.
• Simpson, G.G. 1960. Notes on the measurement of faunal resemblance. American Journal of Science, 258-A: 300-311.

# VegLab trip to Hehuanshan

October 13-15, 2016, we made our first VegLab field trip – a three-day trip to Hehuanshan to see the high-mountain non-forest vegetation. At the elevation around 3,200 m a.s.l., there is a mosaic of forest (dominated mostly by Abies kawakamii and Tsuga chinensis) and non-forest vegetation, mostly dominated by a bamboo Yushania niitakayamensis, with scattered Juniperus and Rhododendron shrubs, and a number of high-elevation herb species. Along the way, we discussed whether this is really a climatic timberline and what we see are actually alpine grasslands, or whether the non-forest vegetation is more likely due to some major disturbances (erosion, fire?). Several recent studies in this region (actually exactly at the place we hiked) studied the uplift of the timberline as a result of climate change (e.g. Greenwood et al. 2014 in Global Change Biology). But isn’t the advancing forest just a result of succession after disturbance? The key argument would be to know how frequent and how intensive are fires in this region. Is there a way to find details about them?

We slept for two days at the bottom of the valley, hiking along the stream to see the valley vegetation, and hiking uphill to the Yushania grasslands, still with some beautiful flowers. Although the weather was not really user-friendly (foggy the first night, raining hard the last day), I think the trip was successful; now we have a better idea how the timberline in the Hehuanshan region actually looks like.

Vegetation along the stream. Photo: D. Zelený.

Yen-Cheng and Tsung-Yi crossing the stream. The water was cold like hell… Photo: D. Zelený.

Campsite. Nice valley, full moon at night, foggy mornings. Photo: D. Zelený.

Cushion-like Artemisia. Not sure which species. On the gravel bar along the stream. Photo: D. Zelený.

Making relevé. We made a bet how many species will be in the plot 2×2 m large. Yen-Cheng bet 5, Tsung-Yi 10 and David 15. We found 12 – Tsung-Yi was the most close (although David spent considerable time trying to find at least one more species to beat him :). Photo: D. Zelený.

Cooking dinner – rice with beef and vegetable. Photo: D. Zelený.

Finito! Totally wet and hungry, but we made it! Photo: D. Zelený.

# Adjusted R2 in partial constrained ordination: the difference between R (vegan) and CANOCO 5

Recently I used R package vegan to recalculate several partial RDA analyses, which were originally calculated in CANOCO 5 (C5). Partial RDA (pRDA) is a constrained ordination which contains both explanatory variables and covariables. To my surprise, the values of explained variation (R2) and adjusted explained variation (adjR2) differ between vegan and C5, with values in vegan being systematically lower. Note that this difference will occur only in case of partial constrained ordination – if you do not have covariables, then R2 and adjR2 in vegan will be the same as R2 and adjR2 in C5.

After some searching and a help from Petr Šmilauer, I realized that vegan and C5 differ in the way how they calculate R2 and adjR2. In R, vegan is calculating so called semipartial R2, in which the denominator is total inertia only (i.e. pure variation explained by explanatory variables / total inertia). On the other hand, C5 calculates so called partial R2, in which variation explained by explanatory variables is divided by total inertia without the effect of covariables (i.e. pure variation explained by explanatory variables / (total inertia – variation explained by covariables)). Note that “pure variation explained by explanatory variables” is the amount of variation explanatory variables explain after removing the effect of covariables. More on this e.g. in Legendre & Legendre (2012), page 651.

In case of adjusted R2, the situation is also different, and slightly more complex. In both programs adjusted R2 is calculated using Ezekiel’s formula adjR2 = 1 – (1 – R2)*(n – 1)/(nm – 1), where n is number of samples in the dataset, and m is number of variables (note that if some of these variables is a factor, it brings “number of factor levels – 1” degrees of freedom).

In vegan, adjusted R2 of explained variation after removing the effect of covariables is calculated as fraction [a] in variance partitioning between explanatory variables and covariables, where [a + b] is variation explained by explanatory variables, and [b + c] variation explained by covariables. Then, adjR2[a] = adjR2[a+b+c] – adjR2[b+c], i.e. adjR2 of model containing both explanatory variables and covariables minus adjR2 of model containing only covariables. If p is the number of covariables (or number of degrees of freedom relevant to covariables, in case that some of them is a factor), than

adjR2[a] = adjR2[a+b+c] – adjR2[b+c] = (1 – (1 – R2[a+b+c])*(n – 1)/(nmp – 1)) – (1 – (1 – R2[b+c])*(n – 1)/(np – 1))

In C5, the adjusted R2 explained by explanatory variables after considering covariables is not calculated by subtracting adjR2 of the covariables from the full model, but directly by adjusting the original R2 values using modified number of degrees of freedom in the formula. Hence,

adjR2 = 1 – (1 – R2)*(np – 1)/(nmp – 1)

which means that the number of degrees for covariables is subtracted at both numerator and denominator of the right hand formula side.

Let’s see the difference on the real example. We will use Vltava dataset, and build the pRDA in both vegan and C5.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 library (vegan) # download Vltava dataset from http://www.davidzeleny.net/anadat-r/ vltava.spe <- read.delim ('http://www.davidzeleny.net/anadat-r/data-download/vltava-spe.txt', row.names = 1) vltava.env <- read.delim ('http://www.davidzeleny.net/anadat-r/data-download/vltava-env.txt') vltava.spe.hell <- decostand (vltava.spe, method = 'hell')   # Build partial RDA model with pRDA <- rda (vltava.spe.hell ~ pH + ASPSSW + Condition (SOILDPT + SLOPE), data = vltava.env)   pRDA # Call: rda(formula = vltava.spe.hell ~ pH + ASPSSW + Condition(SOILDPT + SLOPE), # data = vltava.env) # # Inertia Proportion Rank # Total 0.71576 1.00000 # Conditional 0.04264 0.05957 2 # Constrained 0.06245 0.08725 2 # Unconstrained 0.61067 0.85318 92 # Inertia is variance # # Eigenvalues for constrained axes: # RDA1 RDA2 # 0.03810 0.02435 # # Eigenvalues for unconstrained axes: # PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 # 0.06648 0.06138 0.04468 0.03336 0.02994 0.02693 0.02328 0.02114 # (Showed only 8 of all 92 unconstrained eigenvalues)   RsquareAdj (pRDA) # \$r.squared # [1] 0.0872482 # # \$adj.r.squared # [1] 0.07016251   # First, let's define variables with explained variations: constrained <- pRDA\$CCA\$tot.chi # 0.06244857 - variation explained by # explanatory variables after removing the effect of covariables conditional <- pRDA\$pCCA\$tot.chi # 0.04263834 - variation explained by # covariables total <- pRDA\$tot.chi # 0.7157577 - total variation of the compositional dataset   # In R, the R2 is calculated as constrained variation / total inertia: constrained / total # [1] 0.0872482   # and adjusted R2 is calculating by subtracting adjR2 explained by covariables # from adjR2 explained by both explanatory variables and covariables: R2abc <- (constrained + conditional) / total R2bc <- conditional / total n <- nrow (vltava.spe) # number of samples in the dataset m <- 2 # number of explanatory variables p <- 2 # number of covariables   adjR2abc <- 1 - (1 - R2abc)*(n - 1)/(n - m - p - 1) adjR2bc <- 1 - (1 - R2bc)*(n - 1)/(n - p - 1) adjR2a <- adjR2abc - adjR2bc adjR2a # [1] 0.07016251```

In CANOCO 5, if the same dataset and the same variables and covariables are used in the same pRDA analysis on Hellinger transformed data, the results are different (if you have CANOCO 5 installed on your computer, download and open this file):

```Analysis 'Constrained-partial'
Method: partial RDA
Partial variation is 64.61974, explanatory variables account for   9.3%

Summary Table:
Statistic    Axis 1    Axis 2    Axis 3    Axis 4
Eigenvalues    0.0532    0.0340    0.0929    0.0858
Explained variation (cumulative)    5.66    9.28    19.15    28.27
Pseudo-canonical correlation    0.7624    0.6491    0.0000    0.0000
Explained fitted variation (cumulative)    61.01    100.00

Permutation Test Results:
On All Axes    pseudo-F=4.7, P=0.001```

As you can see, R2 is 9.3%, while adjR2 is 7.3%, in contrast to vegan‘s 8.7% and 7.0% for R2 and adjR2, respectively. The sum of eigenvalues for the two constrained axes is 0.0532+0.0340 = 0.0872, the same value as the R2 in vegan, but in the header, C5 reports higher values. These are calculated in the following way (continuing the R code above):

```1 2 3 4 5 6 7 8 # R2 in CANOCO as R2.C5 <- constrained / (total - conditional) R2.C5 # [1] 0.09277488 - after rounding 9.3%   # adjR2 in CANOCO as 1 - (1 - R2.C5)*(n - p - 1)/(n - m - p - 1) # [1] 0.0730526 - after rounding 7.3```

Note that the other parameters are identical in both vegan and CANOCO (eigenvalues, F-value and also the significance).

In the end, I wrote this quick-and-dirty hack of the vegan‘s RsquareAdj function, which includes argument type with the values either “partial” (vegan’s version) or “semipartial” (C5 version):

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 RsquareAdj2.rda <- function (x, type = 'semipartial', ...) { m <- x\$CCA\$qrank n <- nrow(x\$CCA\$u) if (is.null(x\$pCCA)) { R2 <- x\$CCA\$tot.chi/x\$tot.chi radj <- RsquareAdj(R2, n, m) } else if (type == 'semipartial') { R2 <- x\$CCA\$tot.chi/x\$tot.chi R2p <- x\$pCCA\$tot.chi/x\$tot.chi p <- x\$pCCA\$rank radj <- RsquareAdj(R2 + R2p, n, m + p) - RsquareAdj(R2p, n, p) } else if (type == 'partial'){ R2 <- x\$CCA\$tot.chi/(x\$tot.chi - x\$pCCA\$tot.chi) p <- x\$pCCA\$rank radj <- 1 - (1 - R2)*(n - p - 1)/(n - m - p - 1) if (any(na <- m >= n - 1)) radj[na] <- NA } list(r.squared = R2, adj.r.squared = radj, type = type) }```

Quick check whether it works:

```1 2 RsquareAdj2.rda (pRDA, type = 'semipartial') # 0.08724 vs 0.0701 RsquareAdj2.rda (pRDA, type = 'partial') # 0.09277 vs 0.07305```

Looks like it does. Note, however, if you want to use it, it needs to be further tested (e.g. with datasets of environmental variables include also factors as explanatory variables or covariables).

# An experiment with bioRxiv.org

Today I just made an experiment – I submitted preprint of finished manuscript to bioRχiv (pronounced bioarchive) on bioRxiv.org, the preprint repository which aims to be the biological twin of arXiv.org.

Everything just happened at once – last morning I clicked through the Friday’s linkfest in Dynamic Ecology to the post of Zen Faulkner about his experience with uploading preprint to bioRxiv – in his eyes not too successful, since there were not extra views and comments on the paper, so he conclude that it may be just a waste of time if you are not scientific superstar and Nobel prize nominee. But I got inspired by the comments below that post – sure, to upload the preprint has the function of getting out fast a citable version of your work. At the same time, I finally finished working on a manuscript which is actually a kind of baby I cared about for way too long – it came already through two rounds of reviews (both rejected), and this is the third completely rewritten version prepared to be submitted for review again. But I really hope that at this moment it will be out – for the feeling that things are moving, and that I can link to that paper if I want, or even cite it in the follow up studies.

So I used bioRxiv for that – it’s pretty fast, the submission procedure is a bit similar to journal submission, but doesn’t contain so many questions. Before I submitted the manuscript for review to the journal I plan to, I first uploaded the pre-print here and note this in the cover letter for the submission to the “real” journal. There are options to upload edited or corrected versions any time, so it should be no problem to fix mistakes if there are some (sure there will be). As I understand the pre-print submission has no effect on the success of review in the peer-reviewed journal or potential acceptance/rejection of the paper there, so it sounds like just an added value of having something you can wave in the air.

The only thing which left me wondering was that actually once you upload the paper there, the paper is citable and – it cannot be removed. I spent a while hesitating what this may cause in the future – for example if I found that the paper is just a complete nonsense, I can’t just hide it away, it will be hanging there forever (whatever “forever” means). Hope this won’t be true, but let’s see what will happen.

PS: there is a Nature News article featuring bioRxiv.org almost three years ago when it was launched (you may click through here).

# TWINSPAN in R

TWINSPAN is perhaps one of the most popular clustering methods (at least among vegetation ecologists), which is not implemented in R. R-sig-eco forum has several posts (mostly from Jari Oksanen and Dave Roberts) on the topic of TWINSPAN in R, where they described difficulties with importing the original TWINSPAN code (written in FORTRAN) into R. Seems that both Jari and Dave spent considerable effort trying to implement the method into R, but seems like there is some problem which is not easy to crack.

From my experience (and I guess also from experience of many others, not only vegetation ecologists), TWINSPAN does sometimes give rather nice and ecologically meaningful results, since it is based on cutting the data along the main compositional gradients. There is yet another divisive method, somewhat analogous to TWINSPAN, called DIANA (DIvisive ANAlysis clustering), which was proposed by Macnaughton-Smith et al. (1964), described in detail by Kaufman & Rousseeuw (1990), and made available in series of Fortran written programs with poetic names like AGNES, CLARA, DAISY, DIANA, or FANNY (later implemented into R package cluster). But it seems to me that this method for some reason never gained so much popularity like TWINSPAN did, mostly perhaps due to its sensitivity to outlying samples (which will be separated first into one-item clusters). This is why I think it still would be nice to have TWINSPAN handy in R, also because of its modified version, which we introduced couple years ago (Roleček et al. 2009) and which up to now is available only in JUICE, program for editing and analysis of vegetation data developed by Lubomír Tichý (Tichý 2002). Modified TWINSPAN is basically a sequence of divisions calculated by standard TWINSPAN, each time applied on the most heterogeneous group – here is again a similarity with DIANA, which also in each step divides the group which is the most compositionally heterogeneous.

Recently, I got a simple idea, how to make TWINSPAN work in R, and how to extend this implementation also for modified TWINSPAN. The way I used is somewhat clumsy, but it seems to work. I used the twinspan.exe file, which is an executable program based on original FORTRAN library written by Mark O. Hill (author of the algorithm and the original FORTRAN code) and compiled in 2003 by Stephan M. Hennekens into a stand-alone program for use in MEGATAB, software which before was used together with TURBOVEG for editing and classification of vegetation data (Hennekens & Schaminée 2001, Schaminée & Hennekens 2001). I created R package twinspanR, which includes twinspan.exe, and added bunch of supporting functions to maintain import and export of data to twinspan.exe and back to R. So basically the TWINSPAN is calculated by the original Hill’s algorithm, and R functions in twinspanR package are for handling the whole thing conveniently in R (see the notes below for more technical details how the library exactly works).

For details how to install the library from GitHub see the Readme.md file. Some more examples how to use twinspan package will be (hopefully) made soon available on my website for analysis of community ecology data in R.

The following is an example code, using example dataset danube with Ellenberg’s meadow dataset (used also as an example in the first publication of TWINSPAN and DCA by Hill 1979 and Hill & Gauch 1980, respectively). In this example I used modified TWINSPAN (Roleček et al. 2009) with division into 4 groups and heterogeneity of clusters measured by Bray-Curtis dissimilarity measure. I projected the results into DCA ordination diagram, along to the original tabular classification made manually by Ellenberg (from Mueller-Dombois & Ellenberg 1974):

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## Modified TWINSPAN on traditional Ellenberg's Danube meadow dataset, ## projected on DCA and compared with original classification into ## three vegetation types made by tabular sorting: library (twinspanR) library (vegan) data (danube) res <- twinspan (danube\$spe, modif = TRUE, clusters = 4) k <- cut (res)   dca <- decorana (danube\$spe) par (mfrow = c(1,2)) ordiplot (dca, type = 'n', display = 'si', main = 'Modified TWINSPAN') points (dca, col = k) for (i in c(1,2,4)) ordihull (dca, groups = k, show.group = i, col = i, draw = 'polygon', label = TRUE) ordiplot (dca, type = 'n', display = 'si', main = 'Original assignment\n (Ellenberg 1954)') points (dca, col = danube\$env\$veg.type) for (i in c(1:3)) ordihull (dca, groups = danube\$env\$veg.type, show.group = unique (danube\$env\$veg.type)[i], col = i, draw = 'polygon', label = TRUE)```

Some technical details how the twinspanR package works

Some further details how it works. The executable file, twinspan.exe, is stored in \exec subdirectory of the library. There is also tw.bat file, which launches twinspan.exe and feeds it with data from R (I have shamelessly stolen this idea from the way how Lubomír Tichý executes TWINSPAN in JUICE, and it is also similar to the way how Tom August implemented another Hill’s FORTRAN library, FRESCALO (Hill 2011), into R as a function frescalo in package sparta). I take compositional data in R, transform them to required *.cc! format (luckily there is a function write.CEP in rioja package, written by Steve Juggins) and save them to \exec subdirectory (where also twinspan.exe is located). Then, I create file tw.dat with input parameters for twinspan.exe (using function create.tw.dat), and use the shell function in R to launch the tw.bat file. All calculations are done using original twinspan.exe, R just reads its output from tw.PUN file. It’s in a no way an elegant approach, but works just fine.

The twinspan.exe file used for calculation of TWINSPAN in R is taken from the distribution of JUICE program (Tichý 2002). It is the original FORTRAN code written by Mark O. Hill, compiled (cca in 2002) by Stephan M. Hennekens into twinspan.exe for use in his program MEGATAB (which was used together with Turboveg for analysis of community data). The version of twinspan.exe used here implements the changes by Petr Šmilauer, related mainly to problems with algorithm convergence, which cause the results being dependent on the order of samples in the input data table. Note that this algorithm is slightly different from TWINSPAN implemented in WinTWINS software (Hill & Šmilauer 2005), which implements also other modifications by Birks and ter Braak. In fact, there are at least four different versions of TWINSPAN recently used for analysis, and in certain circumstances they differ in the results how they classify the samples (when I have time, I will elaborate this topic further).

Unfortunately, the presence of twinspan.exe file in the twinspanR library is a problem for its portability – seems like CRAN doesn’t allow packages with executables inside (for understandable security reasons), and R-Forge allows it to be uploaded, but fails to build it. For now, the channel for distribution of this package is GitHub, and it will perhaps remain there until some other solution will show up. Simply, it’s just a quick and dirty way how to get TWINSPAN functionality in R without too much hassle, before somebody manages to write fully functional implementation of TWINSPAN in R.

References

• Hennekens S.M. & Schaminée J.H.J. (2001): TURBOVEG, a comprehensive data base management system for vegetation data. Journal of Vegetation Science, 12: 589-591.
• Hill M.O. (1979): TWINSPAN: A FORTRAN Program for Arranging Multivariate Data in an Ordered Two-way Table by Classification of the Individuals and Attributes. Cornell University, Ithaca, NY.
• Hill M.O. & Šmilauer P. (2005): TWINSPAN for Windows version 2.3. Centre for Ecology and Hydrology & University of South Bohemia, Huntingdon & České Budějovice.
• Hill M.O. & Gauch H.G. (1980): Detrended correspondence analysis: An improved ordination technique. Vegetatio, 42: 47-58.
• Kaufman L. & Rousseeuw P. J. (2009): Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
• Macnaughton-Smith P., Williams W. T., Dale M. B. & Mockett L. G. (1964): Dissimilarity analysis: a new technique of hierarchical sub-division. Nature, 202: 1034-1035.
• Mueller-Dombois D. & Ellenberg H. (1974): Aims and Methods of Vegetation Ecology. John Wiley & Sons, New York, Chichester, Brisbane, Toronto.
• Roleček J., Tichý L., Zelený D. & Chytrý M. (2009): Modified TWINSPAN classification in which the hierarchy respects cluster heterogeneity. Journal of Vegetation Science, 20:596-602.
• Schaminée J.H.J & Hennekens S.M. (2001): TURBOVEG, MEGATAB und SYNBIOSYS: neue Entwicklungen in der Pflanzensoziologie. Ber. Reinhold-Tüxen Ges., 13: 21-34.
• Tichý L. (2002): JUICE, software for vegetation classification. Journal of Vegetation Science, 13: 451-453.

# Mean Ellenberg indicator values: too good to be true

I started to think more intensively about Ellenberg indicator values issue at one conference while listening to the presentation, where a colleague used mean Ellenberg indicator values as explanatory variables in constrained ordination. I considered this as a kind of statistical heresy, perfect example of circularity of reasoning – you take your vegetation data, calculate mean Ellenberg indicator values for each plot, and in turn use these mean values to explain the original data. But it’s tempting – mean Ellenberg values are often considered as good proxies for measured environmental variables, and they are easy to calculate, so using them as explanatory variables is attractive. I tried that – I took a dataset with measured soil pH and calculated mean Ellenberg values for soil reaction, and compared how much variation in species data will be explained by pH and how much by mean Ellenberg; Ellenberg was a way better predictor than measured pH. Ok, so here we have the consequence of circularity. Thinking it through, I concluded that the reason is that mean Ellenberg values carry legacy of the species composition, from which they were calculated – if two plots have the same species composition, their mean Ellenberg values will be identical (considering mean not weighted by species covers), and if the species composition differ a bit, Ellenberg will change just slightly (changing one or a few numbers while calculating the mean doesn’t change the result too much).

I wondered what would happen if I reshuffle species Ellenberg values among species before calculation of mean for the vegetation sample, or if I replace the original species values by randomly generated ones. This would remove ecological meaning of the values, but keep the side effects of calculating the mean. No wonder – mean of randomized species Ellenberg values explains still more variation, when used in constrained ordination, then does random numbers (just keep in mind that every, even randomly generated variable, explains some variation – if this doesn’t make you happy, consider using adjusted R2 instead). There is no ecological information in these randomized values, so the extra explained variance is the legacy of species composition imprinted in mean Ellenberg values. I used randomization of species values as a base for modified permutation test, which can be applied for correcting the issue – not necessarily in constrained ordination, in which mean Ellenberg values are rarely used (scanning JVS and AVS journals through last ten years returned two papers), but also in unconstrained ordination, when mean Ellenberg values are correlated with ordination axes and the correlation is tested (actually this treatment is fairly common, although the problem with circularity of reasoning is exactly the same, yet not so obvious, as in constrained ordination).

I wrote an R code and later also a simple clicking program (MoPeT) running in R and calculating modified permutation tests, which is otherwise not easy to do. I presented the results in 2009 IAVS in Crete and later published a paper in JVS, together with André Schaffers, who helped me a lot in sorting the ideas and writing the manuscript. Still, I am not sure if ever somebody will really use this routine – mean Ellenberg values are great for description, but it’s perhaps better to keep them away from more sophisticated statistical treatments.

The story has actually an unexpected follow up, although I hoped that I won’t touch Ellenberg values any more. As a parody of life, I am just working on a paper describing statistical way how to justify the use of mean Ellenberg indicator values as explanatory variables in constrained ordination, and even do such things like partitioning the variation among different Ellenberg values or between Ellenberg values and measured variables. I presented this topic on EVS workshop in Vienna this spring, where it went through without any feedback – I guess the audience was even a little disgusted by such an overly technical talk. I really don’t feel like convincing somebody to use mean Ellenberg values as explanatory variables in constrained ordination, but I can’t help feeling quite fascinated by the imagination that something like this is actually possible.