From the Mendelarium in the Mendel Museum, an exhibition about Mendel's life and achievements. Actually, not too many personal items remained after him: glasses, microscopes, a telescope, some notes written inside books and on pieces of paper, and that's it.
Mendel University in Brno. Mendel Square in Brno. The Mendel Museum in Brno, with the lawn and remnants of a greenhouse where Gregor Johann Mendel did his experiments with more than 30,000 pea plants. I lived in Brno for years, but Mendel's importance somehow passed me by, perhaps as it did for many other Czech people. I mean, I knew that Mendel is said to be the Father of Genetics, but I never really paid attention to the details of his life. Until this summer. My colleagues from a fellow institute at National Taiwan University asked me to record a "message from Brno" for their celebration of Mendel's 200th birthday. And I started to discover something I had been ignoring for so many years. I visited the Museum in Old Brno Town, where Mendel did his experiments while being an Augustinian monk in St. Thomas Abbey. Mendel Square just beside the museum, where a statue of dancing peas celebrating Mendel's discoveries should stand, is currently under noisy and delayed reconstruction, as is the greenhouse where Mendel grew his plants, which underwent archaeological excavation and is now heading toward reconstruction. I saw the lawn where Mendel grew the plants, now equipped with information boards about his experiments, and also with tables belonging to the adjacent coffee shop. I visited the exhibition in the Museum, which shows historical artefacts, including the manuscript of Mendel's famous (and only) article about his hybridization studies.
Detail of Mendel’s precise handwriting (a page from the original manuscript of Versuche, displayed in the museum).
During Mendel's life, his scientific explorations remained largely ignored or misunderstood, partly due to the combination of biology and math, still unusual at that time, and partly due to the way Mendel presented his findings to others, reportedly full of self-questioning. I also visited the room where Mendel lived as a monk, which is now equipped with historical furniture and includes some of his personal items like glasses and a microscope. And finally, I saw the beautifully decorated library of the Abbey, even if only the public part (the private part with many more books is behind secret doors in a wall). I bought a few magazines reporting on Mendel's life and also on recent discoveries related to his DNA. Researchers quietly exhumed his body from the grave of the Augustinians, undertook detailed examinations of his remains and DNA, and then (again quietly) buried him back, with only a handful of people knowing the details, before they released everything in scientific publications and a follow-up conference. I put all this together into a 20-minute-long "message from Brno", which I recorded while still having the brain fog caused by COVID. The recording was fun, done partly in Brno and partly in the yard of my grandma's house. But as my colleague who saw it told me, I seemed to be pretty "subdued" (I had to look the word up in a dictionary, but yeah, it quite fits).
Message from Brno, the video I prepared for my colleagues about Mendel. I didn't know how to end it, so I…
All this gave me plenty of opportunities to think about Mendel, what kind of person he could have been (an obese and neurotic monk vs a friendly and generous genius), and gave me a chance to follow up a bit on his life. I attended the event organized in Brno for his 200th birthday, three days of celebrations, talks, exhibitions, and also concerts and food. Since I was still quite "subdued", I was mostly hiding in the back, but I enjoyed the repeated visit to the Mendel Museum and also the gallery in the Old Brno Monastery, which breathed the atmosphere of old times. For example, I saw an outstanding painting of Jerusalem, with a blue mosque close to the centre attracting attention, and reminded myself of a rule I had read in an old book about the composition of photographs: however scattered the content of the image is, it should have a central theme to which everything eventually draws the observer's eyes. I also attended a public discussion with Simon Mawer, the British author of several books related to Czechia, including "Mendel's Dwarf". He is retired, and the discussion (organized in the Museum's yard) was relaxed and inspiring. I bought Mendel's Dwarf and another of his books the very next morning in the bookshop and dived deep into it. Simon Mawer was originally a researcher who later turned into a writer, and the book is an interesting combination of rather scientific parts with some very personal (and even some explicitly sexual) passages about the life of Benedict Lambert, a dwarf who is remotely related to Mendel and who spends his scientific career searching for the gene causing dwarfism. I am not quite sure how much the parts about Mendel in the book (including some rather personal details) are based on true evidence and how much is literary licence, but perhaps it just doesn't matter.
Interview with Simon Mawer, the author of Mendel's Dwarf. I asked him how he manages to keep himself busy writing. He said every day is different, sometimes up and sometimes down, no golden recipe.
It was interesting to think about Mendel, someone considered an unlucky genius who remained misunderstood until his death. A guy who should have been a farmer but instead ended up as a monk in the Augustinian order, which offered him access to the latest scientific knowledge and allowed him to pursue further education. I wondered what he himself thought about his experiments with peas: whether he was ambitious and then disappointed that people did not understand and appreciate them, or whether, on the contrary, he was flooded by never-ending questioning of his own results, eventually leading him to abandon the pea experiments altogether and focus on his career as abbot. Mendel was born in 1822 and published his "Versuche über Pflanzenhybriden" article in 1866, when he was 44 years old. I also turned 44 this year, and for reasons not entirely clear, I found myself weighing my personal and scientific past and hesitating about my future. It could be because of that long COVID, a surprise of this summer back in Czechia. After three years of hiding in COVID-free Taiwan, I came back to Europe and, as expected, almost immediately caught it. Symptoms were mild, like flu. But then the sobering tiredness lasted for another three to four weeks, and most annoyingly, the brain fog, a weird feeling of limited attention span. I got scared. What if this is it, and from now on I will have intelligence close to that of broccoli? I was always slow when it came to doing science, but I felt lucky that I didn't suffer a shortage of interest or ideas, and really enjoyed thinking. And all this was suddenly gone. Eventually, I recovered, almost completely I guess, although I still have some troubles, especially when speaking and trying to express my thoughts, but I believe full recovery is just a matter of time. But it was a lesson. Nothing is given. Everything can change at any moment in a direction I cannot really influence. Better to prepare for it, at least mentally, so as not to be completely devastated when it eventually comes.
My COVID tests – the first positive one on Thursday, June 30, and the first negative again on Sunday, July 10, 2022.
Taiwan was a COVID-free heaven from the very beginning of the pandemic in 2020. The Taiwanese, having experienced SARS in 2003 with its quite high mortality, became very suspicious when something weird started to happen in Wuhan in the winter of 2019, and started to check and eventually ban flights from China. Thanks to the quick reaction, strict anti-pandemic rules and also the perfect cooperation of the Taiwanese people, Taiwan managed to escape the first several waves which were plaguing the world, with just a few hundred positive cases and a handful of deaths. At the beginning of 2021, Taiwan was even considered a COVID-free heaven, a refugium worth hiding in while the rest of the world was fighting new waves, lockdowns and restrictions. But the heaven eventually turned into a jail. Because there was no COVID, people were not eager to vaccinate, but instead became quite scared of COVID breaking in. At the beginning of 2022, when most of the world opened up and rushed to return to pre-pandemic times, Taiwan remained closed, with strict rules about wearing masks everywhere, disinfection, tracking, a ban on foreigners entering the country, and also strict quarantine rules on arrival for those who could enter. Terrible news from China about their enforcement of the zero-COVID strategy finally led to a slow relaxation of the anti-pandemic rules and eventually to abandoning the zero-COVID strategy, but it was (and still is) a slow process. I was really looking forward to leaving all this behind for a while, after 2.5 years of not being able to go back home or anywhere else. To experience freedom, to finally catch COVID, and to come back.
So I did. I went back to Europe, first to Czechia to see my family, and then to Spain for a conference. I was careful not to catch COVID right away, because I needed to attend the conference and didn't want to bother with it while travelling and socializing. I was careful, used respirators all the time and washed my hands as much as possible, something others around me in Europe had already forgotten (and sometimes evaluated with spicy comments like "see that dick with a towel on his mouth"). But still, on the fourth day of the conference in Spain I felt lazy, tested myself with a rapid test, and bingo, I was positive. I isolated myself in the hotel, with pretty mild symptoms, wrote an email to all the colleagues with whom I had spent time over the last three days to warn them, and started to take a rest.
COVID itself was no biggie: sneezing, a bit of headache, fever, sore muscles and joints, some cough. Tiredness was perhaps the most obvious. It came in waves, and every time I felt as if I would faint instantly and had to nap frequently. Coughing became really annoying and was the reason I visited my doctor when back in Czechia, to get some pills. But the scariest problem, which evolved eventually and lasted for quite a while, was the brain fog. I knew there was "something like that", as I had heard about it in the news and from some friends who had experienced it. In my case, it was an almost complete inability to concentrate, causing a pretty short attention span. At its best, I was not able to concentrate on one thing for more than five to ten minutes. If I tried hard, I felt dizzy and even felt like vomiting. That really frightened me. My brain is perhaps the only part of my body which is useful, at least in the long run. I need to think, concentrate, contemplate, and if I suddenly cannot, what will I do, and will it come back? My job as a teacher and researcher depends on my brain. How will I pretend I am capable of thinking when I am actually not? And what else could I do? I never learnt any other skill, any other profession. Without my brain, I am useless. It was depressing.
These symptoms lasted for three long weeks before they started to improve, and even now, when I am writing this on the flight back to Taiwan more than a month later, I feel I am not quite there. I still have trouble expressing myself, especially under pressure or in English. And I get tired quickly when I am thinking about something, especially if I have to multitask. I think that I need a backup plan. Something for the situation when the brain fog comes back and does not go away. The next COVID wave (although I hope those will be milder and milder), or simply unavoidable ageing.
But maybe I am not alone in that? I was listening to a discussion about artificial intelligence on The Economist the other day and strongly felt as if some of the speakers had the same problem as me, taking longer to find the right words and express themselves. What if, after COVID, the whole of humankind becomes a bit slower and a bit more stupid?
At least the period of long COVID was not entirely useless for me. I rested a lot, because I had to. First in a self-imposed quarantine back in Czechia, so that I wouldn't bring COVID to my parents. Then in Croatia, on holiday with my parents; I tried to go hiking with my Croatian friends twice, but felt like a lazy turtle unable to move. And then another week in a small room under the roof of my grandma's house in the village, alone. It was a period of an exhausting heatwave, but the village house was fine, cool even in the hot summer. I did a lot of sleeping, some reading, a bit of diary writing, and every day I took a bike and went shopping in the neighbouring village for some goodies, an exercise which surely put me out of order for the rest of the day. I could not have had such a rest if I had merely decided to take one, because in ordinary times I simply couldn't afford it (and would perhaps die of boredom).
Everything bad is good for something (a Czech saying). For me, COVID helped me to rest, contemplate my life, and think about a backup plan, what to do if my brain says bye bye. Maybe I should learn how to work with wood and start making furniture…
Permanent vegetation plots allow resurveying of the vegetation, optimally in locations exactly identical to those of the original survey. For simplicity, here I will talk about permanent forest vegetation plots and exclude non-forest ones; there are quite a lot of permanent grassland plots (including e.g. the GLORIA network on mountain tops), but these are practically all of rather small size. Also, I will focus only on observational studies and exclude experimental ones (e.g. adding fertilizer or testing management effects). For logistic reasons, most studies using permanent forest plots face a trade-off between how large the area of an individual plot is and how many plots can be surveyed (see the schema above). I tentatively divided the studies into three categories according to the area of a single plot (grain scale). Small plots (100-400 m2) are relatively easy and fast to do, which allows higher numbers of them to be established and surveyed. They usually come with no (or limited) information about within-plot vegetation structure and environmental heterogeneity, and the main focus is entirely on between-plot variation. Large plots (10-50 ha), on the other hand, are difficult and slow to complete, which makes their overall number within a project (or even a database) rather limited. Most plots in this category follow the ForestGEO (originally Smithsonian Tropical Research Institute) guidelines for establishment and surveying, and include rich within-plot information about vegetation structure and (some) environmental heterogeneity. Very few studies, however, have been able to use several of these plots and thus include also between-plot environmental heterogeneity.
And then there is a third, intermediate category of middle-sized plots (around 1 ha), which are not as difficult and slow to do as the large plots, but not as easy and fast as the small ones. They come in tens or so, and theoretically may allow a focus on both within-plot and between-plot vegetation and environmental parameters. There are, however, surprisingly few studies using this kind of data, and I wonder why that is so. One reason may be that there is a clear niche separation between many-small and few-large study designs, the former focusing entirely on between-plot differences (usually related to large-scale environmental factors such as climate or geology), while the latter focuses (almost) entirely on within-plot characteristics (related to small-scale environmental variables, such as soil characteristics or topographical differences, and also small-scale demographic patterns). The several-intermediate-plots design is a kind of compromise offering both large- and small-scale perspectives, but it may be that each is represented insufficiently to allow for proper conclusions. Another reason may be that there is still some methodological inconsistency in how the several intermediate plots are done, and this prevents wider use of such data in synthesis studies: some plots follow the same protocol as large plots (ForestGEO), but others do not; the rules for which trees should be measured (e.g. trees with DBH > 1 cm, > 10 cm or even larger) differ among studies, as does whether trees are permanently labelled or not. Data analysis may also be more problematic, as the several-intermediate-plots design is strictly hierarchically nested: within-plot variability is nested within between-plot variability, and subplots (made within each intermediate plot) are akin to split-plots within a whole-plot design. The analysis of such data faces a limitation in the error degrees of freedom, as it needs to be analysed as a split-plot design.
For some types of studies, however, the several-intermediate-plots design may be more suitable than many small or few large plots. This may include studies focused on the effect of both large- and fine-scale environmental factors, and possibly their interaction, while including more detailed information about the demography of individual stands collected through resurveys (e.g. the relative growth rate of individual trees). For example, if the question is about the effect of cloud frequency (a large-scale environmental factor, changing across distances of kilometres) and soil nutrient limitation (a fine-scale pattern affected by topography and changing over distances of tens of metres), combining more detailed information from several locations can be useful. If, in addition, this information should be combined with e.g. the trait response of individual species in the context of their demography (approximated e.g. by relative growth rate), the several-intermediate-plots design should offer all that is needed. The benefit is that some data from intermediate permanent plots are already available, usually from small projects focused on individual localities, and what is needed is their resurvey, the extension of the existing network and the compilation of the data together.
Still, there are some important questions that need to be answered. Mainly, what is a reasonable number of intermediate-sized plots that need to be available so that an analysis focused on regional environmental drivers has sufficient power? And is one hectare, the common area of an intermediate plot, large enough to provide sufficient detail about within-plot vegetation structure and environmental heterogeneity? While increasing the number of plots improves our ability to describe larger-scale environmental patterns and generalize the results, increasing the area of individual plots (with more subplots) improves our ability to get a detailed understanding of local patterns and processes, which are, however, very idiosyncratic and valid mostly for the given location and vegetation type. A study combining simulated spatially explicit data with a review (or meta-analysis) of existing studies may help estimate such numbers.
Finally, a more general question, not related only to permanent plots but to many scientific studies: can a compromise solution be better than either of the extreme options? What if you just end up with the worst of both alternatives…
Simply put, pseudoreplications are replicates in a study that increase the sample size (n) but do not increase the amount of information. A typical problem arises if we treat pseudoreplications as independent samples, assuming that each contributes a new piece of information (and hence a full degree of freedom). Here, the problem is equivalent to the issue of spatial autocorrelation in the data: a sample close to the focal one is likely to be more similar to it than a random sample selected from the dataset, and is therefore, to some extent, a pseudoreplication. One of the consequences is an inflated Type I error rate in the significance test, with results more optimistic than would be warranted by the data. Sometimes way more optimistic – see below.
Why did I write this post? Because of my recent experience as a reader of scientific papers, a reviewer and also an editor. I have repeatedly encountered situations where authors analysed data with clear pseudoreplications as if all samples were independent. I even saw a few papers published in nice journals that claim an effect of a specific environmental variable on species richness/composition, while the effect in reality was just a result of an inflated Type I error rate. (How do I know that? The authors provided the original data used for their analysis, and I could show that even if I randomize their environmental variable while keeping its hierarchical structure, I have a high probability of obtaining significant results with otherwise completely irrelevant effects.) Last but not least, I write it also as a potential author of papers with flawed results, because I am not always honest enough (with myself and others) about having a pseudoreplication issue in my own data.
Some background first
In experimental ecological studies, pseudoreplications are not that common, and if they appear, they usually result from a wrong sampling design, and the authors eventually get punished for them by reviewers. Less clear is the situation in observational ecological studies based on field sampling. Look at the figure below for an example. Imagine that you want to study the change in diversity or species composition of some community (sampled within small square plots) along some environmental gradient (with intensity changing from left to right, bright to dark). Panel (a) shows a situation where individual samples are somewhat randomly distributed, in panel (b) they are somewhat clustered, and in panel (c) they are strongly clustered into three groups. We intuitively feel that the number of independent samples in panel (a) is close to 12, in panel (c) close to 3, and in panel (b) somewhere in between. In terms of pseudoreplications, while in panel (a) the samples are practically independent, in panel (c) the samples within the same group are almost complete pseudoreplications; they will likely have similar biological and environmental values since they are sampled in the close vicinity of each other.
Schema of three sampling designs along an environmental gradient. Samples (squares) are located either somewhat randomly or evenly (panel a), somewhat clustered (b) or highly clustered (c). The pseudoreplication issue and level of spatial autocorrelation tend to increase from (a) to (c).
When analyzing data with some level of pseudoreplication, or spatial autocorrelation, we need to consider the hierarchical structure of the data in the model or modify the test of significance accordingly. In the situation of randomly distributed samples (panel a), we may think it is not necessary, especially if the samples are far enough apart (but what is far enough?). In the situation of strong clustering (panel c), we would perhaps directly include the "block" in the model, effectively reducing our sample size to the number of blocks. But most situations are somewhere in the middle: there is some level of clustering, and we are left not knowing how big a problem this could be. Since doing a simple analysis (such as linear regression) is much easier than an analysis corrected for spatial autocorrelation (such as spatial autoregressive models; see the sketch below for one possible correction), we have good reason to slightly close our eyes and ignore the problem.
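For illustration, here is a minimal sketch of what such a correction can look like in R. It assumes a hypothetical data frame dat with a response (richness), a predictor (env) and plot coordinates (x, y); the generalized least squares model with a spatial correlation structure is just one of several possible corrections (the spatial autoregressive models mentioned above are another).

```r
library (nlme)  # provides gls () with spatial correlation structures

# naive analysis, ignoring spatial autocorrelation among samples
m_lm <- lm (richness ~ env, data = dat)

# generalized least squares with an exponential spatial correlation structure,
# based on the plot coordinates; the test of the 'env' effect then accounts
# for the fact that nearby samples are not fully independent
m_gls <- gls (richness ~ env, data = dat,
              correlation = corExp (form = ~ x + y))

summary (m_lm)
summary (m_gls)
```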
Numerical example with richness along elevation studied on one or several mountains
The example below illustrates the problem that pseudoreplications can cause if they are not properly treated in the test of significance. Imagine that we study the pattern of richness (e.g. of tree species) along elevation (see the figure below). Each mountain (triangle) is separated into elevation zones (five in this example), and in each zone we collect one or several samples and in each count the number of tree species. The figure below shows two scenarios, both generating datasets with 25 samples. In the first scenario (panel a) there are five mountains, and each elevation zone on each mountain has only one sample. In the second scenario (panel b) there is only one mountain, and each elevation zone on this mountain has five samples. Samples in the same elevation zone but on different mountains may be considered independent samples: they are far from each other, and their richness can be quite different (e.g. affected by other, regionally specific effects). In contrast, samples in the same elevation zone and on the same mountain are clear pseudoreplications, since they are located close to each other and their richness is probably rather similar (influenced by the same regional factors). For the purpose of this example, let's assume that different elevation zones on the same mountain are far from each other and are not spatially dependent (real mountains are not that sharp).
Two scenarios of studying richness along elevation: (a) one sample per elevation zone, in five mountains, (b) five samples per elevation zone, in only one mountain. Both datasets contain 25 samples, but replicates at the same elevation zone in the same mountain are likely to be pseudoreplications.
Let's extend this example, add one more scenario and some math. We add a third scenario (not in the figure above, since I was too lazy to click in PowerPoint), which again has only one mountain, but each elevation zone has 100 samples, increasing the sample size to 500 samples. And now the math: we will analyze the correlation between richness and elevation while generating the richness values of individual samples randomly (i.e. creating a dataset in which the null hypothesis, that richness and elevation are not related, is true, and our test should not reject it – or at least not reject it too often).
For scenario 1 (five mountains, one sample per each of the five elevation zones, n = 25), I assume that samples in the same elevation zone but on different mountains are independent, and I generated the species richness of each sample as a random value from a uniform distribution between 0 and 100. For scenario 2 (one mountain, five samples per each of the five elevation zones, n = 25), I assume that samples in the same elevation zone on the same mountain will not be completely independent and will have rather similar richness values, since richness (in addition to elevation) may also be affected by some locally specific factors (e.g. deforestation of the locality in the past). I first generated a richness value for each elevation zone (five random numbers from a uniform distribution between 0 and 100); each sample within the same elevation zone then gets the richness of that zone plus a randomly generated value (normal distribution, mean zero, sd = 10); in this way, the richness of samples within the same elevation zone is slightly different (but not by much). For scenario 3 (one mountain, 100 samples per each of the five elevation zones, n = 500), I repeated the same as for scenario 2, but increased the number of samples within the same elevation zone to 100 (see the sketch below).
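Below is a minimal sketch of how such data can be generated in R (a reconstruction following the description above, not the original script; the object names and the elevations of the zones are arbitrary):

```r
elev_zones <- c (200, 400, 600, 800, 1000)  # elevations of the five zones (m asl)

# Scenario 1: five mountains, one sample per zone on each mountain (n = 25);
# richness of each sample is an independent random number between 0 and 100
elev1 <- rep (elev_zones, times = 5)
rich1 <- runif (25, min = 0, max = 100)

# Scenario 2: one mountain, five samples per zone (n = 25);
# first draw one richness value per zone, then add small noise to each sample
zone_rich <- runif (5, min = 0, max = 100)
elev2 <- rep (elev_zones, each = 5)
rich2 <- rep (zone_rich, each = 5) + rnorm (25, mean = 0, sd = 10)

# Scenario 3: one mountain, 100 samples per zone (n = 500)
zone_rich3 <- runif (5, min = 0, max = 100)
elev3 <- rep (elev_zones, each = 100)
rich3 <- rep (zone_rich3, each = 100) + rnorm (500, mean = 0, sd = 10)
```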
One example for each scenario is shown in the figure below (panels a, b and c for scenarios 1, 2 and 3, respectively). The x-axis shows hypothetical elevation (from 200 to 1000 m asl), and the y-axis shows the species richness of individual samples. I calculated the correlation between richness and elevation for each scenario and recorded whether the correlation t-test was significant at P < 0.05 or not. I repeated the simulation 100 times and counted how many significant results I got for each scenario (out of 100). Since we know that the null hypothesis is true (because we generated our data that way), and we decided that a Type I error rate of 5% is acceptable (we consider P-values lower than 0.05 as significant), we can expect that about 5% of the tests for each scenario will be false positives (around 5 out of 100).
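The repeated testing itself can be sketched as follows (shown here only for scenario 2; the other scenarios differ just in how the data are generated):

```r
n_sim <- 100
elev_zones <- c (200, 400, 600, 800, 1000)

signif_scen2 <- replicate (n_sim, {
  zone_rich <- runif (5, min = 0, max = 100)             # richness of each zone
  elev <- rep (elev_zones, each = 5)                     # five samples per zone
  rich <- rep (zone_rich, each = 5) + rnorm (25, sd = 10)
  cor.test (rich, elev)$p.value < 0.05                   # is the test "significant"?
})

sum (signif_scen2)  # significant results out of 100 (expected ~5 if samples were independent)
```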
Panel (d) shows the number of results significant at P < 0.05, compared to the expectation (5%). While for scenario 1 the number fits the expectation (it is around 5%), scenario 2 has almost 50% significant results, and scenario 3 almost 90%.
A few notes at the end
The conclusion from the numerical example above could be: if you want to show that richness is related to elevation (even though it may not be), do not bother with climbing too many mountains, just try hard to make as many samples as you can on just one of them! But good luck convincing reviewers and readers that all your samples are independent replicates and that they can trust your simple correlation results…
The more general suggestion would be: if your sampling design is like panel (a) or (b) in the first schema above, check for autocorrelation in the residuals of your model, and if it is there, be honest and correct for it, or at least report that it may be a problem but that you haven't corrected for it (and why). If your design resembles the situation in panel (c), analyse your data directly as a block design (e.g. using linear mixed effect models with blocks as random effects, as sketched below). If you do not, while possibly having many replicates within each "block", you are likely to hugely lie with your P-values. A highly significant but very small effect size (like a low R2) will for sure cast doubts on your results.
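A minimal sketch of such a mixed-effect analysis in R, using the mountain example from above (assuming a data frame dat with columns richness, elevation and mountain, where mountain plays the role of the block):

```r
library (lme4)      # linear mixed-effect models
library (lmerTest)  # adds P-values to lmer () summaries

# elevation as a fixed effect, mountain (the "block") as a random effect;
# the random effect accounts for the non-independence of samples collected
# on the same mountain when the elevation effect is tested
m <- lmer (richness ~ elevation + (1 | mountain), data = dat)
summary (m)
```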
One obstacle to analysing community ecology data with linear ordination methods (principal component analysis, redundancy analysis) is their reliance on Euclidean distances to quantify compositional differences among communities (sites). Euclidean distance is known to be sensitive to the double zero problem, i.e. to species that are missing from both compared communities. There is a reasonable argument that for species composition data (occurrences of species in different communities), the absence of a species from both compared communities does not really reveal meaningful information about their ecological resemblance. The species may be missing because the environment of each community is outside its ecological niche, but then both communities could be at the same end of the gradient (e.g. both too dry for the species to occur, and hence environmentally quite similar) or each at the opposite end of the gradient (one too dry, the other too wet, hence environmentally quite different). Distances that can be meaningfully applied to species composition data should preferably ignore double zeros (be "asymmetric", i.e. treat double presences and double absences differently). In the case of relatively homogeneous species composition data (with a minimum of double zeros), Euclidean distance can be useful.
In the numerical example below, I compare Euclidean and Hellinger distances in the context of the double zero problem. This comparison shows that Euclidean distance is indeed affected by the double zero problem, and the effect is strongest when the species composition data are transformed (by log or even presence-absence transformation). Hellinger distance, in contrast, is claimed to be insensitive to the double zero problem, and the illustration below clearly demonstrates this feature. For comparison, I also included a third distance metric, Bray-Curtis, which is a quantitative version of the Sørensen similarity and is known to be asymmetric (ignoring double zeros).
I took an example dataset of forest vegetation sampled in the Vltava river valley (Czech Republic), which is compositionally rather heterogeneous (and therefore contains a large number of double zeros). Dissimilarity (Euclidean, Hellinger and Bray-Curtis) was calculated between each pair of the 97 vegetation plots (communities). Abundances in these plots were either used raw (percentage cover of individual plant species estimated in the field), log-transformed as log(x+1), or transformed into presences and absences. The proportion of double zeros was calculated for each pair of communities as the number of species jointly missing from both compared communities, divided by the number of all species in the whole dataset (i.e. not only the species present in the two communities, but all species present in all 97 vegetation plots); see the sketch below.
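A sketch of how these quantities can be calculated in R with the vegan package (spe stands for the community matrix, with the 97 plots in rows and species in columns; the Vltava dataset itself is not included here):

```r
library (vegan)

D_eucl <- dist (spe)                                    # Euclidean distance
D_hell <- dist (decostand (spe, method = "hellinger"))  # Hellinger distance
D_bray <- vegdist (spe, method = "bray")                # Bray-Curtis dissimilarity

# proportion of double zeros for each pair of plots: the number of species
# absent from both plots, divided by the total number of species in the dataset
pa <- as.matrix (decostand (spe, method = "pa"))  # presence-absence version of spe
dz_count <- tcrossprod (1 - pa)                   # pairwise counts of jointly absent species
double_zero <- as.dist (dz_count) / ncol (pa)

plot (double_zero, D_eucl)  # e.g. Euclidean distance vs proportion of double zeros
```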
I displayed the relationship between the proportion of double zeros (x-axis) and each dissimilarity metric (y-axis), with Euclidean, Hellinger and Bray-Curtis distances in the upper, middle and lower rows of the figure below, respectively, applied to data either untransformed (left column), log-transformed (middle column) or transformed into presences-absences (right column). I have not calculated any statistic to quantify the relationship between the distances and the proportions of double zeros, and haven't tested it, since the individual data points represent pairwise comparisons among pairs of communities in the dataset, and some of them share the same community (and are therefore not independent).
The results show that Euclidean distance is related to the proportion of double zeros between communities, and this relationship is very strong if the original species composition data were transformed into presences-absences (but it is also obvious for log-transformed data, and less so for untransformed data). Hellinger distance, in contrast, is not directly related to the proportion of double zeros. In the case of Hellinger distance, the pattern appears a bit triangular: if the proportion of double zeros is low (i.e. most species of the dataset occur in at least one of the two compared communities), the calculated dissimilarity intuitively cannot be too low; on the other hand, if the double zero proportion is high, the dissimilarity can be either low or high (depending on which species are shared between the communities). Bray-Curtis distance behaves very similarly to Hellinger distance; in fact, the Hellinger and Bray-Curtis distances are rather correlated, and this correlation is very strong when they are calculated from presence-absence data (pattern not shown here).
The relationship of the dissimilarity between the pair of communities (y-axis, either Euclidean, D_eucl, or Hellinger, D_hell, or Bray-Curtis, D_bray) and the proportion of double zeros in the compared pair of communities (x-axis, double_zero).
In this context, I also focused on Hellinger standardization (or transformation, call it as you wish). The reason is that the Hellinger distance between a pair of communities can be calculated by taking the raw species composition data (with original or transformed abundances, e.g. by log or sqrt transformation) and applying the Hellinger standardization to it (which involves relativizing the abundances within each community by dividing the abundance of each species by the sum of abundances in that community, and then square-rooting the relativized values). If we apply Euclidean distance to such Hellinger-standardized data, we get the Hellinger distance.
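In R (using the vegan package), this equivalence can be verified directly; a minimal sketch, assuming spe is a community matrix with samples in rows and species in columns:

```r
library (vegan)

# Hellinger standardization done by decostand
spe_hell <- decostand (spe, method = "hellinger")

# the same done "by hand", following the two steps described above
spe_prof  <- decostand (spe, method = "total")  # step 1: species profiles (divide by row sums)
spe_hell2 <- sqrt (spe_prof)                    # step 2: square root
all.equal (as.matrix (spe_hell), as.matrix (spe_hell2), check.attributes = FALSE)

# Euclidean distance applied to Hellinger-standardized data = Hellinger distance
D_hell <- dist (spe_hell)
```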
I visualized the process of standardization in the figure below. Panel (a) shows the concept of a two-dimensional species space, where each dimension is defined by the abundance of one species (species 1 on the x-axis and species 2 on the y-axis). Communities (circles) are located in this space according to the abundances of the two species. Panel (b) shows the distribution of 120 samples (communities) in the two-species space. The absolute values of the abundances are not relevant; what is important is that the units on the x- and y-axes are the same; e.g., in the yellow sample in the second row from the bottom, farthest to the right, the species abundances could be [100, 20], [10, 2], or [1, 0.2]. Panel (c) shows the result of the first step of the Hellinger standardization, the calculation of "species profiles" (the abundance of each species is divided by the row sum, i.e. the sum of species abundances in its community). Samples standardized to species profiles are located on the hypotenuse of the right-angled triangle with corner coordinates [0,0], [0,1] and [1,0]. Panel (d) shows the position of the samples after the second step of the Hellinger standardization, the square-rooting of the species profiles; all samples are now located on the perimeter of a circle with a radius of 1.
What is interesting about the Hellinger-standardized abundance values is that communities which have the same colour in the original two-species space end up with the same abundance values after the standardization.
Visualization of Hellinger standardization. (a) Position of the community sample in the two-dimensional space defined by species 1 and species 2 is defined by the abundances of these species in the community. (b) Distribution of 120 communities with different abundances of species 1 and species 2. (c) Position of communities after standardization to “species profiles” (first step of Hellinger standardization) and (d) square-rooting (the second step of the standardization). The animation of the whole process can be seen here: https://vimeo.com/689244325/339fd4c86e
Sarothamnus scoparius, according to recent nomenclature Cytisus scoparius. Photo by Evka Šmerdová (née Hettenbergerová), borrowed from www.pladias.cz.
The title of this post is aimed at my Taiwanese students, but the content is hopefully relevant also for a broader audience. Since I started to teach at National Taiwan University in 2015, I have constantly been facing one interesting kind of resistance from my students: learning the Latin names (of plants in my case, but it applies in general, also to other organisms). I know it's not easy (well, what is?). Latin is very different from the Chinese language most students use, and it may not seem so useful at first, especially when others around them also do not use it. So why bother? I did my own survey (highly biased and with a very low n), asking students why they think they need to learn Latin names. By far the most common answer (just after "because you force us") was: because of foreigners; they don't know the Chinese names, so if they ask us, we need to tell them the "scientific name" (some directly say "English name", as if Latin and English were the same thing). At that point, I felt a bit guilty, because it means that if it were not me asking for the names in Latin, they would not need to bother and could do fine with the Chinese ones. Fair enough, but maybe that's not how it works.
Let me depart from the topic, but just for a while. Recently I read an opinion piece in a newspaper, commenting on the strategy of the Taiwanese government to turn Taiwan into an officially bilingual country by 2030. The author of that article (a foreigner living in Taiwan for more than 30 years) claimed that this is an impossible task. Not because people cannot learn English. But because – why should they? Just because "the government decided" to show how progressive Taiwan is and how good a place for foreign investors it could be? That does not sound like a good motivation for people to climb to the top of such a steep mountain, or even start thinking about it. Unless people experience the real need for and usefulness of English, it is hard to motivate them to learn it. Just like the Latin names.
I started to learn Latin names in the first year of my undergraduate study at the Biological Faculty. During the first week of the first semester, which we spent at the field station in the middle of the lovely South Bohemian landscape, the teachers told us: now, forget the Czech names of plants, you won't need them anymore. Learn the Latin names. No "discussion" was allowed; in fact, there was nothing to discuss: from the first excursion, the teachers always showed us the plants and gave their Latin name first, the Czech one only if someone asked. When we did the final test for the excursion, we walked with the teacher and had to say the name of any species s/he pointed at; the Latin name counted for 2 points, the Czech name just for 1 point. It was not easy; Czech is a Slavic language, not a Latin one, and those Latin names, with few exceptions, really don't sound like anything familiar. I still remember how I was walking back and forth through the yard of the field station before my first final test, memorizing "Sa-ro-tha-mnus sco-pa-ri-us" and trying to find a similarity with something in Czech (the untranslatable Sáro, tam hnus! Skopárius!). And yes, at the beginning we classmates grumbled a lot about why we should learn something we would never need in the future! How wrong I was, by the way; those who were not kicked out of the school in the first year will agree that it would be impossible to get a Master's degree, or even just finish the undergraduate study, without knowing the Latin names of the organisms we study. Not because the teachers pushed us; later they didn't need to: everyone was using Latin names, and it would be really difficult (and embarrassing) not to know them. Now I teach vegetation ecology at National Taiwan University; how would that work if I had resisted and kept learning only the Czech plant names back then?
For English, my beginnings were even more painful. Until the Velvet Revolution (the end of Communism in Czechoslovakia, 1989), we learned Russian at school as the second language (I was eleven that year). Then, our Russian language teacher almost overnight turned into an English language teacher, and from then on I studied English, through the rest of my elementary, secondary and also university studies. When I decided to travel to New Zealand for a volunteering job in the Department of Conservation (that was 2002, and I was 24, with a fresh Master's diploma in my pocket), I thought I spoke English pretty well, after sooo many years of learning it. How wrong I was I learned already at Auckland's airport, when I got off the flight and (surprisingly) everyone spoke English, with that unmistakable New Zealand accent (Gud daj!). The immigration officer asked me "Do you have some food in your backpack?", and I said "yes", because I thought she was asking about "foot" and I thought that "foot" was the same as "shoes". She said "but you cannot have any food in your backpack" and wanted me to throw it out; I resisted, arguing "but it is hiking food"; after I showed her my hiking boots, it was clear that something was not quite working. But yeah, after half a year in New Zealand, speaking English all the time and Czech only occasionally with my family and friends over the phone (no Skype or Facebook at that time!), I learnt how to use English. I do have a weird accent, though (one British girl told me I sound pretty Indian, whatever she meant by that), but I can survive with that.
It never came to my mind that I would be learning the Latin names of plants or studying English because of "foreigners" who might not know the Czech plant names or might ask me something in English on the street. And yeah, there are pretty many foreigners in Czechia, but who cares. In the beginning, I was pushed to do it, mostly by my annoying teachers, who insisted we needed to learn it (I can never thank them enough for their wisdom and resistance to our ignorance!). Later, no one needed to convince me, because I found out how important it is for me personally. You don't learn something "because of others". You always learn it because of yourself. The knowledge and skills you have are what makes you special and what opens new possibilities for you. So, please, don't waste your energy searching for arguments why you just can't or don't need to do it, and just do it.
When we compare vegetation zonation in smaller and isolated mountains with that in large mountain ranges, we often find that in smaller mountains, the same vegetation zones occur at lower elevations than in larger, more "massive" ones. This pattern, first described from the European Alps by German researchers at the beginning of the 20th century, is known as the Massenerhebung effect, or mass elevation effect (Massen = mass, Erhebung = uprising or elevation). In Taiwan, this pattern was pointed out by Su (1984a,b), who plotted the elevation ranges of the main vegetation zones in Taiwan along latitude (see below). Since the island of Taiwan stretches mainly in the latitudinal direction, one would expect the vegetation zones to shift steadily downward toward the northern, cooler part of the island. Instead, the schema from Su (1984b) shows quite clearly that "[t]he altitudinal level of these zones shift downward towards both northern and southern tips as a result of Massenerhebung", a pattern which is most obvious for the lower Quercus zone (lower cloud forest).
Latitudinal pattern in distribution of vegetation zones in Taiwan. Combination of Fig. 3 from Su (1984b) with part of Fig. 1 from Su (1984a).
Scientific literature recognises three potential reasons why the Massenerhebung effect occurs. The first is the landmass heating effect: large mountainous masses slow down the speed at which temperature decreases with elevation (in other words, they decrease the lapse rate). The temperature in mountains at a certain elevation is always higher than the temperature of the free air at the same elevation; this is because the solar rays reflected by mountains have a considerable heating effect on the surrounding air. A solitary mountain (e.g. a volcano) is not capable of reflecting as much heat as an entire mountain range, and as a result, the temperature at a given elevation on the solitary mountain will be lower than at a comparable elevation in an extensive ("massive") mountain range. The second is the effect of wind: wind cools down the outer mountain margins and isolated mountains more than the inner parts of mountain ranges. Windward parts of the mountains are usually more humid, which means that more heat has to be spent to evaporate the water instead of heating the ground. Inner parts of the mountains will be drier and warmer (having a more "continental" climate). The last is the effect of cloud: isolated mountains and mountains at the outer margins of mountain ranges in coastal regions are more exposed to prevailing cloud formations. High air humidity causes lower air temperature, and high water content in the soil results in slower humus decomposition. A slow decomposition rate causes a lack of nutrients available for vegetation, typical for the mountain cloud forest; this may be partly the reason why on isolated mountains and mountains at the margins of mountain ranges the cloud forest occurs at lower elevations than in the central parts of the ranges. Some authors consider only the first effect (landmass heating) as Massenerhebung, while others include all three together, since their effects are not easy to separate.
On a large scale, the Massenerhebung effect is noticeable, for example, in the Himalayas, where the timberline (the upper climatic boundary of the forest) shifts to a considerably higher elevation than in smaller mountains at comparable latitudes. At the scale of Taiwan, along with the descriptive study of Su (1984b), the pattern can also be recognised in the distribution of the cloud forest from data in the vegetation database (Schulz et al. 2017, their Figure 2), which shows the hump-shaped pattern (red crosses indicate cloud forest plots).
Latitudinal pattern of cloud forest vegetation (red crosses) in Taiwan (Schulz et al. 2017, their Figure 2). Yes, it looks a bit strange; I modified the figure by stretching it five times along the vertical axis to magnify the hump-shaped pattern of the cloud forest (red crosses).
As far as I know, there is only a single study attempting to explain the environmental correlates of the Massenerhebung effect in Taiwan. Chiou et al. (2010) used data from the National Vegetation Database of Taiwan and concluded that the cooling effect of the northeastern monsoon is more important than the Massenerhebung effect caused by landmass heating. The authors selected 76 woody species occurring commonly across the island and estimated the upper altitudinal limit of each of them from the database data. If only data from the eastern part of Taiwan were used, the results showed a decrease in elevation limits with increasing latitude (lowest in the northern part of Taiwan, higher in the central and southern parts), while for the dataset from the western part there was no clear trend. The authors concluded that since they did not detect a decrease in the altitudinal limits in the southern parts of Taiwan, the "[h]eat retention of Massenerhebung is not the mechanism which can best explain the patterns of species altitudinal distribution in Taiwan".
Distribution of vegetation plots used in the study of Chiou et al. (2010).
My modest opinion is that although the authors of Chiou et al. (2010) use a rather large dataset of vegetation plots (1604 vegetation plots, each 400 m2), the analysis itself is not too persuasive, and I would be more careful with bold conclusions of no Massenerhebung in Taiwan. It seems that in their definition, the Massenerhebung effect includes only the landmass heating effect, while the effects of wind and cloud are considered separately. But at the same time, the authors have chosen three areas in Taiwan of comparable elevation span (each reaching elevations well over 3000 m asl), while not including the marginal mountain systems in the northern and southern parts of the island. This means that the massiveness of all three selected regions is comparable, so I am not sure why the authors expect to see a decreasing trend caused by a lower landmass heating effect. Additionally, the authors used only 76 out of almost two thousand woody species in Taiwan, including only the most common species occurring in all three selected regions. So, not including the marginal mountains (the less "massive" far north and far south of the island) and using only a limited number of species may be exactly the reasons why the hump-shaped pattern was not detected. But who knows.
The disadvantage of the schema drawn by Su (1984b) lies in the relatively subjective nature of the data it builds upon. Although it is based on more than 200 vegetation records from different elevation zones and different watersheds, combined with data from published studies, the boundaries of the vegetation zones are still delineated somewhat arbitrarily, without really using any quantitative tool. But since we now have data from the National Vegetation Database of Taiwan and the classification of forest vegetation types according to the study of Li et al. (2013), it is possible to construct a similar schema using real vegetation plots. This study, which Woody wrote together with co-authors as a part of his PhD during his study in Brno (Czech Republic), contains a description of 21 zonal and azonal forest vegetation types in Taiwan (zonal types are those determined by climate, while azonal ones are determined mainly by edaphic conditions). The paper additionally contains one convenient instrument: the Cocktail Determination Key (CoDeK), an R-based application which allows automatic assignment of vegetation plots to the vegetation types described in the paper, based on plot species composition. I used CoDeK to assign plots to vegetation types, chose only plots belonging to zonal vegetation types, and merged analogous tropical and subtropical types into the same units. I plotted the latitudinal-altitudinal distribution of the plots and fitted the lower boundary of each vegetation type with quadratic quantile regression (see the sketch below). Except for the high-elevation conifer-dominated forest types, all other types show a clear hump-shaped pattern, with a decrease not only in the northern but also in the southern part of Taiwan. And since the zonal forest types in Li et al. (2013) are largely analogous to the vegetation types in Su (1984b), the schema shows a nice agreement with the one in Su (1984b) and supports his observation.
Latitudinal-altitudinal pattern of forest vegetation zones (data from National Vegetation Database of Taiwan, with only zonal forest vegetation types according to Li et al. 2013). Lower boundaries of vegetation zones are delineated by quadratic quantile regression curve.
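The fitting of the lower boundaries can be sketched roughly as follows (assuming a data frame plots with columns latitude, elevation and veg_type; the choice of the quantile, here tau = 0.05, is a subjective decision and not necessarily the one used for the figure above):

```r
library (quantreg)  # quantile regression

# quadratic quantile regression of elevation on latitude, fitted separately for
# each zonal vegetation type; a low quantile (tau) approximates the lower
# boundary of the type in the latitude-elevation space
for (vt in unique (plots$veg_type)) {
  m <- rq (elevation ~ poly (latitude, 2), tau = 0.05,
           data = subset (plots, veg_type == vt))
  print (summary (m))
}
```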
Of course, the hump-shaped pattern of vegetation types along latitude is just an empirical observation and does not say anything about the factors causing it. In fact, it is quite likely that the resulting pattern is an outcome of all three factors together: the landmass heating effect may be responsible for the uplift of vegetation zones in the central, drier and less windy parts of the mountain ranges, while the effects of wind and cloud may be responsible for the downslope shift in the marginal areas. Additionally, land-use changes caused by human activities may also play a role. With all the data readily available, it should not be so difficult to find out.
References
Chiou, C.-R., Song, G.-Z.M., Chien, J.-H., Hsieh, C.-F., Wang, J.-C., Chen, M.-Y., Liu, H.-Y., Hsia, Y.-J., & Chen, T.-Y. (2010). Altitudinal distribution patterns of plant species in Taiwan are mainly determined by the northeast monsoon rather than the heat retention mechanism of Massenerhebung. Botanical Studies, 51: 89-97. https://ejournal.sinica.edu.tw/bbas/content/2010/1/Bot511-11.pdf
Li, C.-F., Chytrý, M., Zelený, D., Chen, M.-Y., Chen, T.-Y., Chiou, C.-R., Hsia, Y.-J., Liu, H.-Y., Yang, S.-Z., Yeh, C.-L., Wang, J.-C., Yu, C.-F., Lai, Y.-J., Chao, W.-C., & Hsieh, C.-F. (2013). Classification of Taiwan forest vegetation. Applied Vegetation Science, 16: 698–719. https://doi.org/10.1111/avsc.12025
Schulz, H.M., Li, C.-F., Thies, B., Chang, S.-C., & Bendix, J. (2017). Mapping the montane cloud forest of Taiwan using 12 year MODIS-derived ground fog frequency data. Plos One, 12: e0172663. https://doi.org/10.1371/journal.pone.0172663
Su, H.-J. (1984a). Studies on the climate and vegetation types of the natural forests in Taiwan (I): Analysis of the variations in climatic factors. Quarterly Journal of Chinese Forestry, 17: 57–73.
Su, H.-J. (1984b). Studies on the climate and vegetation types of the natural forests in Taiwan (II): Altitudinal vegetation zones in relation to temperature gradient. Quarterly Journal of Chinese Forestry, 17: 1–14.
The Internet is flooded with explanations related to P-values (To P or not to P?) and problems around multiple testing (the more tests you do, the more significant results you get, even if everything is random). I don't bring anything new; I just wrap it up for myself and add a few examples that hopefully make some concepts clearer. I will talk about the Type I and Type II error rates, the power of a test, the disadvantages of multiple tests, what it looks like when we use a correction for multiple tests (not many significant results left) and what else we could do about it (some global analysis?). Read further if you also think about these things.
P-values
Several decades ago, ecologists were strongly arguing that it is important to start using rigorous statistical approaches and standards in ecology; that was the era when ecological studies were based mostly on natural history observations instead of systematically sampled data analysed by rigorous statistics. Meanwhile, ecology grew into a standard scientific discipline, and some started to argue that P-values are often misused and misunderstood, and should be banned from ecological publications (as has happened in medicine or psychology). Putting this discussion aside, P-values still represent a useful tool telling us how much evidence we have against the null hypothesis we are testing.
The P-value is the probability of obtaining a value of the test statistic equal to or more extreme than the observed one if the null hypothesis is true. The null hypothesis is usually formulated as "there is no relationship between the variables" (or there is some fixed relationship, from which we are testing the difference). For example, if I relate species composition to some environmental factor using constrained ordination, the P-value is the probability of obtaining a test statistic (R2, pseudo-F value) equal to or higher than the one we actually calculated if the environment is in fact not related to species composition (the null hypothesis is true). If the P-value is sufficiently low (e.g. less than 5%, as is the common standard, proposed originally by Ronald Fisher), we may conclude that it is quite unlikely that the result could be obtained in a situation where the null hypothesis is true, and state that the observed value is "significant". That 5% (or, in general, the alpha level) is the probability of a Type I error, and we report such a result as "significant" at the given alpha threshold (e.g. P < 0.05).
Some examples. First, let's generate data for which we are sure that the null hypothesis is true: we generate two random variables, independently of each other; then we calculate the correlation between them (Pearson's r) and test it using a parametric t-test (all this is done by the ''cor.test'' function in R):
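The code block itself seems to be missing from this version of the post; below is a minimal sketch of what the text describes (the exact P-values quoted below depend on the original script, e.g. on the sample size, so this sketch may give slightly different numbers):

```r
set.seed (1234)
x <- rnorm (100)   # first random variable
y <- rnorm (100)   # second random variable, generated independently of x
cor.test (x, y)    # Pearson correlation and its parametric t-test
```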
We obtain some P-value (e.g. if you include set.seed(1234) before the script above, the P-value should be P = 0.8020633, i.e. not significant at alpha = 0.05). Yet, even though these variables are randomly generated, we can occasionally still obtain a “significant” result with a P-value smaller than alpha, with a probability of alpha (try it – you can e.g. set.seed(1257), and you obtain P-value = 0.001678019, well significant at the 0.05 level).
Now, the second situation – we generate data for which we are sure that the null hypothesis is not true; variable x is still generated randomly, but variable y is a function of x, with added random noise. The same as before – calculate the correlation and test it:
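Again a minimal sketch (the exact slope and amount of noise are my assumptions, so the seed-specific P-values quoted below may not be reproduced exactly):

set.seed(12345)
x <- rnorm(100)
y <- 0.2*x + rnorm(100)   # y depends on x, plus random noise -> the null hypothesis is not true
cor.test(x, y)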
If in the code above you set.seed(12345), you get P-value = 0.001676154, well below the 0.05 threshold, so you would conclude that the value is significant at P < 0.05. But of course, it may also happen that the result is not significant, even though it should be (e.g. with set.seed(12355), you obtain P-value = 0.06517154, which would not be significant at the 0.05 significance level).
To summarize, see the possible outcomes in the table (rows are my decision, columns are the true state of the null hypothesis):

                          Null hypothesis is true                  Null hypothesis is not true
Reject null hypothesis    Type I error, false positive (alpha)     true positive (1 – beta, the power of the test)
Do not reject it          true negative (1 – alpha)                Type II error, false negative (beta)
The left column shows the situation where the null hypothesis is, in reality, true (our first example with randomly generated variables). If I still reject it, I commit a Type I error, or a false positive. The probability that this will happen is set by me and equals the alpha level at which I decided to reject the null hypothesis (alpha = 0.05, or 5%, in this case). If I do not reject it, I make a true negative decision (with the probability of 1 – alpha).
The right column shows the situation where the null hypothesis is in reality not true (our second example, where y is dependent on x because I enforced this dependence). If I reject the null hypothesis, I made a correct decision (a true positive); if I did not reject it, I committed a Type II error, or a false negative. The probability of a Type II error is beta, and the probability of a true positive is 1 – beta, also called the power of the test. Beta and the power of the test depend on several factors – the test itself and how powerful it is (some are more and some less powerful), and also the data – the sample size and the amount of variance. In our case, the data are quite noisy and I have only 100 samples, so it may not be so easy to find the relationship which is there (this will improve if we increase the sample size, e.g. to 1000).
Now, in reality, I do not know whether the null hypothesis is true or not (that is why I want to test it), and I need to collect the data and use statistics to get an idea. Both Type I and Type II errors are problematic – they mean that I rejected something I should not have (Type I error) or did not reject something I should have (Type II error). The probability of a Type I error is selected by me (here 0.05), while the probability of a Type II error depends on three factors: the sample size (larger sample size means lower Type II error), the selected value of Type I error (lower alpha means higher beta), and the true difference of the data from the null hypothesis, i.e. the true effect size (the higher the difference, the lower the Type II error rate – it’s easier to reject the null and get a true positive). This means that decreasing our choice of Type I error rate (alpha) increases the probability of Type II error (beta) and decreases the power of the test (1 – beta). Not an easy life; the best solution is to sample more to get a higher sample size, which will reduce the probability of a Type II error (of course, not always possible).
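A quick simulation sketch (my own, with an assumed weak dependence of y on x) showing how the Type II error rate drops when the sample size increases:

beta_for_n <- function(n, n.rep = 1000) {
  p <- replicate(n.rep, {
    x <- rnorm(n)
    y <- 0.2*x + rnorm(n)            # weak true dependence (the effect size is my assumption)
    cor.test(x, y)$p.value
  })
  mean(p >= 0.05)                    # proportion of non-significant results = estimated beta
}
set.seed(1)
beta_for_n(100)    # fairly high beta (low power) with 100 samples
beta_for_n(1000)   # much lower beta (high power) with 1000 samples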
Both the effect size and the number of samples in the dataset also influence the P-value (it decreases with both), so in the case of a large dataset, we are likely to get strongly significant results even if the true effect (e.g. R2) is pretty small. Some argue that for datasets with more than several hundred samples, the P-value no longer represents a useful tool for decisions, because virtually anything can get highly significant. Also, the threshold for rejecting the null hypothesis (often set as P < 0.05) should not be taken too seriously. It does not mean that at P = 0.048 we have evidence against the null hypothesis, but at P = 0.052 we already do not. It is better to consider the P-value as a continuous measure, and use those thresholds (P < 0.05, P < 0.01, P < 0.001) only as a recommendation of which results we should pay more attention to. The other extreme, i.e. interpreting everything, would return us several decades back, to the era when interpretations of ecological studies were based on feelings and intuition rather than on the results of statistical analyses.
Multiple testing issue
Sometimes a study includes an analysis which is based on testing more than one null hypothesis. Imagine this example: you want to know whether plant functional traits (e.g. leaf area, SLA) are related to some of the environmental factors (e.g. elevation, soil P). Let’s keep the technical details of how this can be done aside; just imagine that you have ten different traits and ten different environmental variables, and you collect 100 samples (e.g. vegetation plots) in which you do all these measurements. Then, you take each of the ten traits and regress it against each of the ten environmental variables, test the significance, and interpret the significant results. What is the problem?
Let’s do the same exercise as above. First, simulate the situation when all null hypotheses are true (variables are randomly and independently generated). The R code gets more complex, so I deposited it in the GitHub repository here. What I do: I generate two random variables from the normal distribution, each with 100 values, calculate the regression of y on x, and test its significance (by a parametric F-test in this case; note that if I calculated Pearson’s r correlation, the P-value would be identical). I repeat this 100 times and highlight significant results by adding a red regression line to the scatterplot.
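A condensed sketch of this kind of simulation (my own simplification; the full version is in the GitHub repository mentioned above):

set.seed(1)
n <- 100                                   # number of plots
par(mfrow = c(10, 10), mar = c(0.5, 0.5, 0.5, 0.5))
for (i in 1:100) {
  x <- rnorm(n)                            # "environmental variable"
  y <- rnorm(n)                            # "trait", independent of x -> null hypothesis is true
  P <- anova(lm(y ~ x))$`Pr(>F)`[1]        # parametric F-test of the regression
  plot(x, y, axes = FALSE, ann = FALSE); box()
  if (P < 0.05) abline(lm(y ~ x), col = "red", lwd = 2)   # highlight the (false) positives
}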
You see that I got a few significant results, although I should not have – the null hypothesis is truly correct, since I generated the data that way. These are false positives, and their expected proportion equals the significance level alpha I set up (here 0.05, i.e. 5%). You can also see that if I repeat the same analysis again, I again get some significant hits, just this time in different combinations of traits and environmental variables.
In fact, I can calculate the probability that, for a given number of tests and a given alpha value, I will get at least one significant result (with P < alpha), even if the null hypothesis is true: P{at least one significant result} = 1 – (1 – alpha)^m, where m is the number of tests. In the case of 100 tests, the probability of obtaining at least one result significant at P < 0.05 is 1 – (1 – 0.05)^100 ≈ 0.994; I can be almost sure that I get at least one!
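The same in R, as a one-line check:

1 - (1 - 0.05)^100    # probability of at least one significant result among 100 independent tests
# [1] 0.9940795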
Now, the second scenario, when all 100 relationships are generated with the null hypothesis not true – I created variables which are dependent on each other. The same as above applies – I test each of them, and if significant, I add the red regression line. In this case, I coloured the background of each scatterplot light blue, to indicate that the variables are truly not independent.
The number of hits is high, but not all combinations are significant. Those which are not are false negatives, and those which are significant are true positives; the probability of a true positive equals the power of the test, in this case something around 50% (how do I know that? Repeat the analysis many times, e.g. 1000 times, and see the proportion of true positives; this gives you the power of the test, from which you can calculate the Type II error rate as beta = 1 – power). If you repeat the same thing, again quite a high proportion of relationships will be significant (true positives), but some will not be, even though they should (false negatives).
In reality, we don’t know for which combination of variables the null hypothesis is true and for which it is not. Let’s make a third scenario, in which we mix both options – some of the relationships are generated from independent variables (the null hypothesis is true), and some from related variables (the null hypothesis is not true). For simplicity, the 10 relationships in the first row are those with the null hypothesis not true, and the remaining 90 are those with the null hypothesis true.
The figure shows the whole spectrum of situations: true positives (blue background with a red regression line), true negatives (white background, no regression line), false positives (white background, regression line) and false negatives (blue background, no regression line). The problem is that, in reality, we do not know which of the relationships have the blue background (the null hypothesis is not true), so when we interpret them one by one, we may well be interpreting false positives (not correct!) alongside true positives (correct!).
How to deal with this? One way is to adjust the P-values of individual tests to correct for the multiple testing issue. The simplest is the Bonferroni correction, which multiplies each P-value by the number of tests done (Padj = P * m) and then evaluates Padj instead of the original P for significance. This helps to control the overall (or family-wise) Type I error rate, but at the same time it quite drastically reduces the power of the test: you are unlikely to get false positives, but you are also quite unlikely to get true positives. There are better methods than Bonferroni (e.g. FDR, the false discovery rate, or Holm’s correction), but the drastic reduction of test power is true for all of them.
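In R, these corrections are available in the p.adjust function; a small illustration on a hypothetical vector of raw P-values:

p <- c(0.001, 0.008, 0.02, 0.04, 0.30)   # hypothetical raw P-values from five tests
p.adjust(p, method = "bonferroni")       # multiplies by the number of tests (capped at 1)
p.adjust(p, method = "holm")             # sequential Bonferroni, slightly less conservative
p.adjust(p, method = "fdr")              # Benjamini-Hochberg false discovery rate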
If we increase the power of the test by increasing the sample size, this problem will be less serious, but this is not always possible. Below, I increased the number of samples in each comparison from 100 to 500.
This looks great! No false positives, and only two false negatives. However, with real data, this is quite unlikely to happen, because you won’t have a test power of more than 99%, as in this case.
It seems that the multiple testing strategy is not the best way to go – either you suffer from too many false positives, or you correct and get (almost) nothing significant. If possible, opt for another type of analysis, one which conducts a single global test instead of a series of pairwise ones, or combine the two strategies together. A well-known example of such an approach is one-way ANOVA with post-hoc comparison. ANOVA tests whether the mean of at least one of the groups deviates from the overall mean; if yes, then one can conduct a pairwise post-hoc comparison of groups to identify which of the groups are the different ones (still, this pairwise comparison is not a simple series of t-tests, but includes an implicit correction for multiple testing). In the case of trying to find relationships between two sets of variables (e.g. traits and environment, as in our example), you may opt for constrained ordination, which is in fact multivariate regression (many traits regressed on many environmental variables, more precisely CWM-RDA in this case, but the details are beyond this text). If the ordination is significant, then go for pairwise regressions to see which of the variables are truly related (still, using some of the corrections for multiple testing is advisable).
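A minimal sketch (my own illustration) of the “global test first, pairwise comparisons after” strategy, using one-way ANOVA and Tukey’s HSD post-hoc test on simulated data:

set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 20))
value <- rnorm(60, mean = c(0, 0, 1)[as.numeric(group)])   # group C differs from A and B
fit <- aov(value ~ group)
summary(fit)       # global F-test: does at least one group mean differ?
TukeyHSD(fit)      # if yes, pairwise comparisons with a built-in correction for multiple testing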
You may ask: so if I measure things, relate them to each other and find a significant relationship, does it mean that these variables may in fact have no real meaning, that they could in fact just be random variables? The answer is yes, it may happen that this result is just a false positive. The way to make sure that the relationship between those variables (e.g. a given trait and the environment) truly exists is to replicate the study, preferably at a different location. Or, search the literature to see whether someone has ever detected the same relationship in a similar context. It’s unlikely that the same combination of variables would be a false positive in many studies (you can see above that false positives occur, but in a new replication they occur in different variable combinations). If you repeat the study again (and again and again) and you get the same relationship significant (or you see multiple papers reporting the same thing), that is strong evidence that it may truly mean something.
Where to go next?
A nice readable text on this topic is the chapter The p value and the base rate fallacy in the book Statistics Done Wrong by Alex Reinhart (see the comics there!). Also, a few years ago there was an inspiring series of forum papers in Ecology, started by the paper of Paul A. Murtaugh entitled In defense of P values (Murtaugh 2014), followed up by several others and wrapped together again by Murtaugh. There is also a recent interesting suggestion that we should redefine statistical significance and lower the alpha threshold from 0.05 to 0.005, which would greatly reduce the probability of reporting false positives in scientific studies and increase reproducibility. As it goes, one of the first comments (of many) on this paper is entitled “Remove, rather than redefine, statistical significance”…
Sometime in 2008, I made a surprising finding. I was doing a routine analysis: unconstrained ordination of community composition data (I think it was DCA) and fitting mean Ellenberg indicator values (1) onto the ordination diagram for ease of interpretation (using the function envfit from vegan). What you get is an ordination diagram, where points are samples or species, and vectors are mean Ellenberg indicator values pointing in one or another direction and showing which way nutrients, pH or light increase. To decide which of the six mean Ellenbergs to display, I tested the regression of the axes with mean Ellenbergs using permutation tests and included only those which were significant (often most of them are, mean Ellenbergs are so good!). To my surprise, when I looked at the ordination, it just did not make ecological sense. I knew my data, I knew where to expect more nutrients, higher pH and more light, but the arrows pointed in different directions, although they were almost all significant.
I rechecked everything again and found the mistake – in Excel, where I sorted Ellenberg values for the calculation, I had sorted the species in a different order than in the species composition matrix. I had actually calculated mean Ellenberg values from values of different species than those occurring in the plots. But how come I still got so nicely significant results? I tried several times and found that indeed, even if I calculate mean Ellenbergs from permuted or randomly generated species Ellenberg indicator values, I still have a high chance of getting a significant relationship with ordination axes, although the direction of the vectors just makes no sense. I did more experiments: I correlated mean Ellenbergs calculated from randomized indicator values with species richness, or compared differences between groups of samples in clusters from numerical classification. The same problem repeated: the chance of getting a significant relationship was pretty high, even though the species indicator values were just random.
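A minimal sketch of this effect (my own reconstruction, not the original analysis; it uses the dune dataset shipped with vegan and random values in place of real Ellenberg indicator values):

library(vegan)
data(dune)                     # example community data (20 plots x 30 species)
dca <- decorana(dune)          # unconstrained ordination
set.seed(1)
P <- replicate(100, {
  iv <- runif(ncol(dune), 1, 9)                                  # random "indicator values" for species
  miv <- apply(dune, 1, function(x) weighted.mean(iv, w = x))    # mean indicator value per plot
  envfit(dca, miv, permutations = 199)$vectors$pvals             # test the fit to ordination axes
})
mean(P < 0.05)   # a valid test should give ~0.05; the problem described here shows up as a clearly higher proportion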
I presented this at the IAVS meeting in Crete in spring 2009 and wrote the first draft of a paper for the Journal of Vegetation Science in summer 2010. It was a rather short Forum paper, but the reviewers did not like the structure, so I reshaped it into a standard research article, and after three rounds of reviews (and with one of the reviewers, André Schaffers, becoming a new co-author, since he greatly helped me to make the text much more readable), it was finally published in autumn 2011 as the “too good to be true” paper (Zelený & Schaffers 2012). I was happy like a monkey; this was the first paper where I got my own idea published, and it still remains my most cited paper. In the paper, we suggested that one should be aware of the problem with the inflated Type I error rate, and either not test the relationship of mean Ellenberg values with ordination/species richness/cluster analysis results, or use what we called the “modified permutation test”, which compares the observed relationship with those of mean Ellenberg values calculated from randomized species indicator values. I made a simple R application to do this test (MoPeT, https://davidzeleny.net/wiki/doku.php/eiv:software) and wrote a blog post about it (https://davidzeleny.net/blog/2012/09/23/ellenberg-values-too-good-to-be-true/). The story could have finished there.
But it wasn’t. By playing around, I found that the same problem occurs even if I relate mean Ellenbergs calculated from randomised species indicator values to environmental variables, like soil pH. This started to be more serious, and perplexing at the same time. Why is this happening? When relating mean Ellenbergs to ordination, one can argue that the problem is that two things which are not entirely independent are in fact tested against each other: mean Ellenbergs are calculated from indicator values and species composition, and ordination/richness/clusters are calculated from the same species composition matrix (I call this the “similarity issue” in the 2012 JVS paper). But when relating mean Ellenbergs to environmental variables, there is no such thing; the environment is completely independent of species composition (Brad Hawkins, whom I met later and on whose paper related to community weighted means I became a co-author, calls environmental variables “extrinsic”, in contrast to “intrinsic” ones, which are derived from species composition data; Hawkins et al. 2017).
I wrote a short paper about it and submitted it for the first time in November 2013. I aimed high, because it was clear that this is a serious problem and that many studies using community weighted means (not only of Ellenberg values but also of traits) are affected by overly optimistic results. The first submission went to Ecology Letters (rejected in four days based on editorial reading), then to Journal of Ecology (this took the editors five days to reject – they thought the paper was too statistical) and finally to Methods in Ecology and Evolution, where it went out for review. The result was reject-resubmit, with one of the reviewers being Bill Shipley, who gave me quite interesting suggestions about the different types of null hypotheses tested by the CWM approach. I presented the idea at the European Vegetation Survey workshop in Ljubljana (beautiful city!), Slovenia, in May 2014 (https://www.davidzeleny.net/doku.php/evs2014), and then again in July at the International Statistical Ecology Conference in Montpellier, France (https://isec2014.sciencesconf.org/31446). There I met Stéphane Dray, whose R workshop about the ade4 package I took before the conference, and later also Pedro Peres-Neto. After my presentation, the three of us had quite an emotional discussion in the canteen. Pedro and Stéphane showed me their almost finished manuscript about the same thing I was talking about at the conference – that the relationship of community weighted means with environmental variables has an inflated Type I error rate if tested by standard tests.
I closed myself in the hotel for the rest of the meeting and worked hard on preparing the resubmission of the paper rejected from MEE, with the only breaks being for watching the FIFA World Cup, in which Brazil lost to Germany 1:7 (I never watched football before and never have since, but this one was really an awesome experience: https://youtu.be/RLZUKqpXYzU). In the end, I was not that fast; I submitted the rebuttal to MEE in October and got rejected again a month later, this time with Cajo ter Braak as one of the reviewers. While Cajo was reviewing the paper, he contacted me to clarify some details and we had a lively email exchange (some of the emails I wrote at night from a tent while sleeping at a tea plantation close to central Taiwan during a survey trip to a one-hectare plot in a nearby forest). Again, I got useful hints on how to go further, but I started to feel more and more frustrated about the fact that I could not get the message through.
That was also the time I was about to move to Taiwan permanently to start my new life in Asia, and it took me almost one and a half years to submit the new version, this time to Ecology (April 2016, rejected by the editor) and again to Ecology (August 2016, rejected after reviews). But it became clear that I was late; in July 2016, the paper of Pedro, Stéphane and, newly, also Cajo got published in Ecography (Peres-Neto et al. 2017). At first I thought let’s forget it and move on, but in the end I remade the paper into a Synthesis format, where I mostly reviewed what is known and brought in a few new things from the previous versions (2). I submitted it to the Journal of Vegetation Science in April 2017, got reject-resubmit, remade it again and submitted in June 2018, got a minor revision, and, finally, got acceptance on the first of October 2018. Five years from the first version, eight submissions and five different journals (3), yeah!
I was never a fast guy, at least not when it comes to writing papers. However, this one was really an extreme case and taught me a lesson. I knew the problem, and I knew that the issue is actually quite important, but I could not get it through and publish it. In fact, later, desperately, I started to post preprints of each submitted manuscript to bioRxiv (e.g. Zelený 2016), just to get a feeling that there is at least something out there.
Looking backwards, I see the main reasons why it did not work. I was not able to crack the problem, because I do not have that type of skill. I am not a statistician or mathematician, and although I often dig into methodological problems, in this case it turned out to be a handicap. Although I knew about the problem of the inflated Type I error rate, my solution (the modified permutation test) did not work properly, which at the beginning I was not able to fully understand. Also, I spent quite a lot of time pushing the analogy between weighted means and compositional autocorrelation, which most people clearly could not understand, although I felt it was so clear (no, it was not, but still, even now I cannot get rid of that idea). In addition, I struggled to find a way to structure a manuscript that does not have the standard “report” structure. I studied several other papers trying to figure out how the flow of the text should go and to get inspired, but from the reactions of reviewers (complaining about my “hard to follow” text) I could see I was not able to nail it. And, at a few moments, I got a bit too emotional about the whole situation, not able to see things rationally and from a distance. First, this happened when I read the paper of Otto Wildi (2016), who criticised our 2012 JVS paper. I felt quite frustrated; not because of the critique (some of it justified, some not), but because although Wildi’s paper contains several obvious misunderstandings, it could still get published (4). More emotions surfaced when the Peres-Neto et al. paper appeared online in Ecography. At first I was not even able to read it; later I did (also because I needed to refer to it), and I found that it is actually a pretty good one, pointing clearly to the whole problem and bringing an elegant solution.
Although my interest has already shifted somewhat away, I still have some plans left with weimea (as I call the weighted mean problem for myself). I hope to finish the R package (yes, weimea), which is already pretty functional (https://github.com/zdealveindy/weimea, also appended to the 2018 JVS paper). And there are actually a couple of things in the previous manuscripts which never made it through to the final published version, although they are important. I hope to be faster this time…
(1) Mean Ellenberg indicator values (mEIV) are a kind of European speciality; Heinz Ellenberg, the German plant ecologist, created a set of species indicator values, in which he assigned to most Central European species an ordinal value according to their ecological preference along main environmental gradients like nutrients, moisture or soil pH. When you make a vegetation plot with a list of species occurring at a specific location and calculate the mean of the indicator values of the species occurring in it, you get a quite meaningful estimate of the ecological conditions at that location, as the plants interpret them. Ellenberg was not the only one who got this idea, but his system was perhaps the first one that was reasonably comprehensive, an important criterion for a successful application.
(2) The original acknowledgment, which was in the preprint on BiorXiv but which I later removed, reflected the whole story in short: “My thanks go to Bill Shipley, Cajo ter Braak and several anonymous reviewers for critical comments on previous versions of this manuscript, which motivated me to several times heavily rework it. Thanks also go to Pedro Péres-Neto and Stéphane Dray for (emotional) discussion of differences between my modified permutation test solution and their fourth-corner one during the ISEC 2014 conference in Montpellier.” Acknowledgements are sometimes the only place in the paper recording the background story of the paper, although, of course, seen only from the author’s point of view.
(3) List of the submitted manuscript titles, journals and dates:
Relationship between community-weighted mean trait values and environmental variables or ecosystem properties is biased due to compositional autocorrelation (Ecology Letters, submitted Nov-2013, rejected Nov-2013);
Relationship between community-weighted mean trait values and environmental variables or ecosystem properties is biased (Journal of Ecology, submitted Nov-2013, rejected Dec-2013);
Relationship between community-weighted mean trait values and environmental variables or ecosystem properties is biased (Methods in Ecology and Evolution, submitted Dec-2013, rejected Feb-2014);
Linking weighted-mean of species attributes with sample attributes: the problem of inflated Type I error rate (Methods in Ecology and Evolution, submitted Oct-2014, rejected Nov-2014);
Bias in community-weighted mean analysis relating species attributes to sample attributes: justification and remedy (Ecology, submitted Apr-2016, rejected Apr-2016);
Bias in community-weighted mean analysis relating species attributes to sample attributes: justification and remedy (Ecology, submitted Aug-2016, rejected Oct-2016);
Bias in community-weighted mean analysis of plant functional traits and species indicator values (Journal of Vegetation Science, submitted Apr-2017, reject-resubmit May-2017);
Which results of the standard test for community weighted mean approach are too optimistic? (Journal of Vegetation Science, submitted Jun-2018, minor revision Aug-2018, accepted Oct-2018).
(4) One example of the misunderstandings, standing for all: in Zelený & Schaffers (2012) we did not say anything about Ellenberg indicator values themselves being biased, and still, Wildi spent considerable space arguing that they are indeed not (because they are not measured variables), and even put this argument into the title of the paper (Why mean indicator values are not biased). What we claimed was that “results of analyses based on mean Ellenberg indicator values may be biased”, which is something I still stand for, although I would perhaps not use the term “biased” in this context anymore (and would use inflated or overly optimistic instead). At first, I started to write a reply paper, but gave up when I found that I would have to spend too much energy correcting those misunderstandings, proving that the author perhaps had not read the original paper properly. But not all was negative: Otto was right with the argument that testing the relationship of mEIV with ordination scores/richness/cluster results goes against statistical logic, because by definition these two variables are not independent (both are numerically, in some way, derived from the same species composition matrix), so the null hypothesis of independence is easy to reject. I touched on this in the 2018 JVS paper, arguing that if we see this as a case of spurious correlation, the modified permutation test (testing a modified null hypothesis) is a valid solution.
References
Hawkins, B.A., Leroy, B., Rodríguez, M.Á., Singer, A., Vilela, B., Villalobos, F., Wang, X. & Zelený, D. (2017). Structural bias in aggregated trait variables driven by species co-occurrences: a pervasive problem in community and assemblage data. Journal of Biogeography, 44: 1199–1211. https://doi.org/10.1111/jbi.12953
Peres-Neto, P.R., Dray, S. & ter Braak, C.J.F. (2017). Linking trait variation to the environment: critical issues with community-weighted mean correlation resolved by the fourth-corner approach. Ecography, 40: 806–816. https://doi.org/10.1111/ecog.02302
Wildi, O. (2016). Why mean indicator values are not biased. Journal of Vegetation Science, 27: 40–49. https://doi.org/10.1111/jvs.12336
Zelený, D. (2016). Bias in community-weighted mean analysis relating species attributes to sample attributes: justification and remedy. bioRxiv. https://www.biorxiv.org/content/early/2016/04/05/046946 (this is the first version, later updated several times)
Zelený, D. (2018). Which results of the standard test for community weighted mean approach are too optimistic? Journal of Vegetation Science, 29: 953–966. https://doi.org/10.1111/jvs.12688 (free read-only pdf)
Years ago my computer addiction included playing cards (I think it was called Solitaire), browsing websites (news during the day, porn at night), and later also Facebook and some other social apps. Now I find myself addicted to R. However stupid it may sound, it seems to be true. By addiction I mean: it is not easy to stop when you need to do something really important. All those things before (cards, news, porn, Facebook) are understandable, and I bet I am not the only one who fights with them. But R? Come on. I thought that I had a perfect therapy last year after a semester of teaching R three hours per week (and countless hours preparing the class and checking assignments). Right, I was sick of seeing R for a while, but not for long. It was around that time I also found Shiny, an interactive way to play with R, and coded my first Shiny app. It’s nice – you click, something changes, click again, it changes again, keep clicking, improving, adding new ideas, making new apps… and the addiction is cooked, no escape.
Some time ago I came up with an idea to introduce a “No Computer Day.” A day when you should not touch the computer, at all. Obviously, it cannot be a working day (unless I go to the field), so the weekend is a hot candidate. It would be cheating to set the No Computer Day on a day you go out of the house – then you would not touch it anyway. No, just pick an ordinary Sunday, when you have a plan to relax, but would ordinarily still end up sitting in front of the computer (I may do something useful today) and keep clicking (doing nothing useful at all). It’s similar to “a week without alcohol” (all bottles in the house need to be locked away somewhere, since the late-evening feeling of “just one beer” is surprisingly strong). Just leave your computer switched off (not just sleeping – you may touch the mouse, and it will wake up, with a blinking screen luring your attention). You will see how difficult it is to resist (ok, I will just check emails, what if something important is there!), but how rewarding it can get (clean the house, read a book, fix what should have been fixed a long time ago, go for a walk, sleep – add your own “no-computer” items).
Sometimes I wonder: what did the life of a scientist (let’s say a teacher at a university also doing research, or simply a full-time scientist in academia) look like before computers were standing on her/his desk? Come to the office, make a coffee, check the mail, do paperwork (ok, this sounds almost the same as now). And then? Take a pencil, read a paper and write notes, take a typewriter, write a short piece of manuscript, put it away, take the pencil again, make a phone call, take the typewriter, go for lunch, meet a colleague for coffee, pencil, paper, typewriter, … Anyway, the most important currency of scientists is ideas, and those are not born in the computer, but in the head, often while doing things entirely unrelated to work or when the brain is simply idle. What if sitting the whole day in front of a blinking computer makes the time for these idle moments rare, and thus productivity goes down? Ok, by productivity I don’t mean the number of published papers (although I should, since that’s precisely how my productivity is measured by the school/grant agency/colleagues). By productivity I mean coming up with some interesting idea. What if switching off the computer (or even temporarily moving the computer somewhere out of sight) in fact increases this kind of productivity?
Ok, but I still need to boil these ideas down to a paper…