Introduction to R for Ecologists

R語言在生態學的應用

**EEB5082 (B44U1940)**

Author: David Zelený

Permalink: davidzeleny.net/link/recol

Introduction to R for Ecologists

R語言在生態學的應用

**EEB5082 (B44U1940)**

Author: David Zelený

Permalink: davidzeleny.net/link/recol

recol:r_constructs

for (VAR in VECTOR) { STATEMENT1 STATEMENT2 ETC }

Create an element `VAR`

, which takes (sequentially) values of elements in `VECTOR`

. The length of `VECTOR`

defines how many times the loop will be executed (= number of elements in `vector`

= `length (VECTOR)`

). The body of the function (statements which are executed in each cycle) are wrapped together into curly brackets (`{`

and `}`

); if only one STATEMENT is inside the body of the loop, the curly brackets can be ommited. The `VAR`

can but does not need to be used in the body of the function. Variables defined within the body of the function and also the `VAR`

variable will stay in the Global Environment (this is different from the `function`

, which will create temporary environment, create variables there, and close it after the end of executing the lines of code in the function's body (these variables are not visible from Global Environment).

while (CONDITION) { STATEMENT1 STATEMENT2 ETC }

repeat { STATEMENT1 STATEMENT2 ETC }

if (CONDITION) { STATEMENT1 STATEMENT2 ETC } else { STATEMENT3 STATEMENT4 ETC2 }

Note: the `else`

part does not need to be present. CONDITION needs to be a single logical value (it can be a complicated logical expression, but if more than one logical value is returned by it, only the first one will be used (this is different from `ifelse`

function, in which CONDITION can be a vector (or matrix) with more than one logical value and all are used.

ifelse (CONDITION, VAR_IF_TRUE, VAR_IF_FALSE)

... where CONDITION can be a vector (or matrix) with more than one logical element - than the function “loops” through the whole vector/matrix and returns the object of the same dimension as CONDITION (dimension = the same number of elements in case of vector, or the same number of rows/columns in the case of matrix)

NEW_FUNCTION <- function (ARG1, ARG2, ARG3) { STATEMENT1_possibly_using_some_of_the_arguments_or/and_creating_RESULT STATEMENT2_possibly_using_some_of_the_arguments_or/and_creating_RESULT STATEMENT3_possibly_using_some_of_the_arguments_or/and_creating_RESULT ETC return (RESULT) }

The role of `return`

function in the end is to return the values into the global environment. When you run the function, it will create a temporary environment, which is not visible from the Global Environment. All variables defined inside the function are created only inside this temporary environment (so as they do not mix with those defined in the Global Environment). If you want the function to return something to the global environment, you need to *return* it from the function. This is done either explicitly by using `return`

function, or by typing the name of the variable in the end of the function, as if you want to print it into the console (means that instead of `return (RESULT)`

you may type simply `RESULT`

and it does the same thing; the difference is that `return (RESULT)`

can theoretically be somewhere inside the script, not necessarily in the end, while if you type only the name of variable, function returns the one which is the last one). You can return only a single object, so if more than one object needs to be returned, wrap them e.g. into a list and return the list of objects. If `return`

is used in the body of the function but is not in the end of the script, all lines of code after `return`

are ignored and not executed.

tiff (filename = 'FILENAME.tiff', width = WIDTH, height = HEIGHT, units = UNITS, pointsize = POINTSIZE, ...) PLOT_THE_FIGURE dev.off ()

The logic is the following:

- To plot into a file, you need to first open appropriate graphics device (
`tiff`

opens TIFF graphics device, but there is a number of others, e.g.`bmp`

,`jpeg`

,`png`

,`pdf`

,`postscript`

etc.). Give the file a name with appropriate extension (e.g.`*.tif`

in the case of TIFF format,`*.jpg`

or`*.jpeg`

in the case of JPG, etc) and set up plotting parameters (size of the image by WIDTH or/and HEIGHT, etc.). - Plot the figure itself (use high-level plotting functions like
`plot`

,`boxplot`

,`barplot`

,`hist`

etc., and optionally add low-level plotting functions, such as`points`

,`text`

,`lines`

,`axis`

,`box`

, etc.). - When you finish plotting, close the graphics device by
`dev.off ()`

function, to 1) save the file, 2) to return to the R graphics device. If you forgot to close the device by`dev.off ()`

, you cannot open the saved graphical file (and it gets zero bit size). If this happen, type`dev.off ()`

several times, untill you see the message`null device`

.

## A) Linear regression with linear regression line # 1) fit the model: LM <- lm (Y ~ X, data = DATA.FRAME) FOR.PREDICT <- c(min (DATA.FRAME$X), max (DATA.FRAME$X)) PREDICTED.VALS <- predict (LM, newdata = list (X = FOR.PREDICT)) # 2) plot the data: plot (Y ~ X, data = DATA.FRAME) lines (PREDICTED.VALS ~ FOR.PREDICT) ## B) Linear regression with non-linear (polynomial) regression line # 1) fit the model: LM2 <- lm (Y ~ poly (X, 2), data = DATA.FRAME) FOR.PREDICT <- seq (from = min (DATA.FRAME$X), to = max (DATA.FRAME$X), length = 100) PREDICTED.VALS <- predict (LM2, newdata = list (X = FOR.PREDICT)) # 2) plot the data: plot (Y ~ X, data = DATA.FRAME) lines (PREDICTED.VALS ~ FOR.PREDICT) ## C) "abline" solution (only for linear regression line) # 1) fit the model: LM <- lm (Y ~ X, data = DATA.FRAME) # 2) plot the data: plot (Y ~ X, data = DATA.FRAME) abline (LM)

Combination of `lm`

(linear model `lm (y ~ x, data)`

) and `predict`

(predicts the values of x variables lying on the regression curve), and then high-level graphical function (`plot`

, plots scatterplot) and low-level one (`lines`

, plots the regression line). Although it may feel a bit too complicated, it is the most general option, which allows for plotting any type of fitted curve (not only linear as `abline`

, but also non-linear).

Note: the function `predict`

, if without argument `newdata`

, will return the vector of predicted values for all values in variable x in the model. This may not always be useful, especially if the values in x are not sorted from lowest to highest, and we want to plot the curve (will produce the scatter of lines instead). Also, make sure that the values going to `newdata`

argument of `predict`

function is a `list`

, containing the variable with exactly the same name as in the regression model (`X`

in the example above), otherwise it won't work (it will behave as if the `newdata`

argument is not supplied, without any warning).

The `abline`

solution may seem as the simplest, but it is the least general - it plots only linear regression line (no non-linear shape), and it plots the curve always across the whole figure space, not only across the span of the data points. This may sometimes be fine, sometimes not, depends.

# X - independent (explanatory) variable (e.g. X <- cars$speed) # Y - dependent variable (e.g. Y <- cars$dist) # NPERM - number of permutations (e.g. NPERM <- 999) # 1) Plot observed data, fit the linear regression, and calculate r2 (observed r2) plot (Y ~ X) abline (lm (Y ~ X)) LM <- lm (Y ~ X) R2_OBS <- summary (LM)$r.squared # 2) calculate r2 of regression on the original variables after randomizing one of them (randomized r2); # repeat NPERM-times R2_RAND <- vector ('numeric', length = NPERM) for (perm in seq (1, NPERM)) { LM_TEMP <- lm (sample (Y) ~ X) R2_RAND[perm] <- summary (LM_TEMP)$r.squared } # 3) merge the randomized r2 and observed r2 into one vector R2_ALL <- c (R2_RAND, R2_OBS) # 4) calculate the probability that observed r2 can be generated in case that the null hypothesis # is true (ie by regression on randomized original variables) P <- sum (R2_ALL >= R2_OBS)/(NPERM+1)

Linear regression (`lm`

function in R) is relationship between dependent (Y) and explanatory (X) variable, fitted by least square algorithm. The effect size (how good is the fit) of the regression is the r^{2} (coefficient of determination, see the definition on Wikipedia). Disadvantage of the r^{2} is that it is always larger than zero, even in case that the two variables regressed against each other are random. Thus, to compare whether our observed r^{2} is considerably (“significantly”) higher than r^{2} of regression on randomized variables, we need to test the significance of the relationship (i.e. to test the null hypothesis that there is no relationship between X and Y). This can be done by parametric test (*F*-test in case of linear regression), or by non-parametric, permutation test (Monte Carlo permutation test, detailed here). Parametric test (*F*-test) calculates F-value, and given number of samples it calculates the *P*-value. The limitation of parametric test is that it strictly applies only for situation that Y for each X has normal distribution, which is often not the case. Alternatively, one can calculate significance using Monte Carlo permutation test. This test simulates the null hypothesis situation by randomizing one of the two variables (X or Y), calculating regression, and returning the *F*-value, r^{2} or other test statistic. If this null hypothesis is replicated many times (e.g. 999 times, i.e. number of permutations), we get reasonable estimation of the test statistic (let's talk about r^{2} here) in case that the null hypothesis is true (i.e. the variables are random, without relationship). We can compare this distribution with the real r^{2} (from real data without randomization) and calculate the probability that the observed r^{2} originates from distribution of r^{2} representing the null hypothesis. If this probability (*P*-value) is low enough (e.g. P < 0.05), we can reasonably assume that the null hypothesis is rejected.

recol/r_constructs.txt · Last modified: 2019/10/15 00:07 by david

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International