User Tools

Site Tools


recol:r_constructs

R constructs

"for" loop

for (VAR in VECTOR)
{
  STATEMENT1
  STATEMENT2
  ETC
}

Create an element VAR, which takes (sequentially) values of elements in VECTOR. The length of VECTOR defines how many times the loop will be executed (= number of elements in vector = length (VECTOR)). The body of the function (statements which are executed in each cycle) are wrapped together into curly brackets ({ and }); if only one STATEMENT is inside the body of the loop, the curly brackets can be ommited. The VAR can but does not need to be used in the body of the function. Variables defined within the body of the function and also the VAR variable will stay in the Global Environment (this is different from the function, which will create temporary environment, create variables there, and close it after the end of executing the lines of code in the function's body (these variables are not visible from Global Environment).

"while" loop

while (CONDITION)
{
  STATEMENT1
  STATEMENT2
  ETC
}

"repeat" loop

repeat
{
  STATEMENT1
  STATEMENT2
  ETC
}

"if" and "else" conditional function

if (CONDITION)
{
  STATEMENT1
  STATEMENT2
  ETC
} else {
  STATEMENT3
  STATEMENT4
  ETC2
  }

Note: the else part does not need to be present. CONDITION needs to be a single logical value (it can be a complicated logical expression, but if more than one logical value is returned by it, only the first one will be used (this is different from ifelse function, in which CONDITION can be a vector (or matrix) with more than one logical value and all are used.

"ifelse" conditional function

ifelse (CONDITION, VAR_IF_TRUE, VAR_IF_FALSE)

... where CONDITION can be a vector (or matrix) with more than one logical element - than the function “loops” through the whole vector/matrix and returns the object of the same dimension as CONDITION (dimension = the same number of elements in case of vector, or the same number of rows/columns in the case of matrix)

"function" construct

NEW_FUNCTION <- function (ARG1, ARG2, ARG3)
{
  STATEMENT1_possibly_using_some_of_the_arguments_or/and_creating_RESULT
  STATEMENT2_possibly_using_some_of_the_arguments_or/and_creating_RESULT
  STATEMENT3_possibly_using_some_of_the_arguments_or/and_creating_RESULT
  ETC
  return (RESULT)
}

The role of return function in the end is to return the values into the global environment. When you run the function, it will create a temporary environment, which is not visible from the Global Environment. All variables defined inside the function are created only inside this temporary environment (so as they do not mix with those defined in the Global Environment). If you want the function to return something to the global environment, you need to return it from the function. This is done either explicitly by using return function, or by typing the name of the variable in the end of the function, as if you want to print it into the console (means that instead of return (RESULT) you may type simply RESULT and it does the same thing; the difference is that return (RESULT) can theoretically be somewhere inside the script, not necessarily in the end, while if you type only the name of variable, function returns the one which is the last one). You can return only a single object, so if more than one object needs to be returned, wrap them e.g. into a list and return the list of objects. If return is used in the body of the function but is not in the end of the script, all lines of code after return are ignored and not executed.

Plot into a file

tiff (filename = 'FILENAME.tiff', width = WIDTH, height = HEIGHT, units = UNITS, pointsize = POINTSIZE, ...)
PLOT_THE_FIGURE
dev.off ()

The logic is the following:

  1. To plot into a file, you need to first open appropriate graphics device (tiff opens TIFF graphics device, but there is a number of others, e.g. bmp, jpeg, png, pdf, postscript etc.). Give the file a name with appropriate extension (e.g. *.tif in the case of TIFF format, *.jpg or *.jpeg in the case of JPG, etc) and set up plotting parameters (size of the image by WIDTH or/and HEIGHT, etc.).
  2. Plot the figure itself (use high-level plotting functions like plot, boxplot, barplot, hist etc., and optionally add low-level plotting functions, such as points, text, lines, axis, box, etc.).
  3. When you finish plotting, close the graphics device by dev.off () function, to 1) save the file, 2) to return to the R graphics device. If you forgot to close the device by dev.off (), you cannot open the saved graphical file (and it gets zero bit size). If this happen, type dev.off () several times, untill you see the message null device.

Plot regression line/curve into the scatterplot

## A) Linear regression with linear regression line
# 1) fit the model:
LM <- lm (Y ~ X, data = DATA.FRAME)
FOR.PREDICT <- c(min (DATA.FRAME$X), max (DATA.FRAME$X))
PREDICTED.VALS <- predict (LM, newdata = list (X = FOR.PREDICT))
# 2) plot the data:
plot (Y ~ X, data = DATA.FRAME)
lines (PREDICTED.VALS ~ FOR.PREDICT)

## B) Linear regression with non-linear (polynomial) regression line
# 1) fit the model:
LM2 <- lm (Y ~ poly (X, 2), data = DATA.FRAME)
FOR.PREDICT <- seq (from = min (DATA.FRAME$X), to = max (DATA.FRAME$X), length = 100)
PREDICTED.VALS <- predict (LM2, newdata = list (X = FOR.PREDICT))
# 2) plot the data:
plot (Y ~ X, data = DATA.FRAME)
lines (PREDICTED.VALS ~ FOR.PREDICT)

## C) "abline" solution (only for linear regression line)
# 1) fit the model:
LM <- lm (Y ~ X, data = DATA.FRAME)
# 2) plot the data:
plot (Y ~ X, data = DATA.FRAME)
abline (LM)

Combination of lm (linear model lm (y ~ x, data)) and predict (predicts the values of x variables lying on the regression curve), and then high-level graphical function (plot, plots scatterplot) and low-level one (lines, plots the regression line). Although it may feel a bit too complicated, it is the most general option, which allows for plotting any type of fitted curve (not only linear as abline, but also non-linear).

Note: the function predict, if without argument newdata, will return the vector of predicted values for all values in variable x in the model. This may not always be useful, especially if the values in x are not sorted from lowest to highest, and we want to plot the curve (will produce the scatter of lines instead). Also, make sure that the values going to newdata argument of predict function is a list, containing the variable with exactly the same name as in the regression model (X in the example above), otherwise it won't work (it will behave as if the newdata argument is not supplied, without any warning).

The abline solution may seem as the simplest, but it is the least general - it plots only linear regression line (no non-linear shape), and it plots the curve always across the whole figure space, not only across the span of the data points. This may sometimes be fine, sometimes not, depends.

Monte Carlo permutation test of linear regression

# X - independent (explanatory) variable (e.g. X <- cars$speed)
# Y - dependent variable (e.g. Y <- cars$dist)
# NPERM - number of permutations (e.g. NPERM <- 999)

# 1) Plot observed data, fit the linear regression, and calculate r2 (observed r2)
plot (Y ~ X)
abline (lm (Y ~ X))
LM <- lm (Y ~ X)
R2_OBS <- summary (LM)$r.squared

# 2) calculate r2 of regression on the original variables after randomizing one of them (randomized r2);
# repeat NPERM-times
R2_RAND <- vector ('numeric', length = NPERM)
for (perm in seq (1, NPERM))
{
  LM_TEMP <- lm (sample (Y) ~ X)
  R2_RAND[perm] <- summary (LM_TEMP)$r.squared
}

# 3) merge the randomized r2 and observed r2 into one vector
R2_ALL <- c (R2_RAND, R2_OBS)

# 4) calculate the probability that observed r2 can be generated in case that the null hypothesis
# is true (ie by regression on randomized original variables)
P <- sum (R2_ALL >= R2_OBS)/(NPERM+1)

Linear regression (lm function in R) is relationship between dependent (Y) and explanatory (X) variable, fitted by least square algorithm. The effect size (how good is the fit) of the regression is the r2 (coefficient of determination, see the definition on Wikipedia). Disadvantage of the r2 is that it is always larger than zero, even in case that the two variables regressed against each other are random. Thus, to compare whether our observed r2 is considerably (“significantly”) higher than r2 of regression on randomized variables, we need to test the significance of the relationship (i.e. to test the null hypothesis that there is no relationship between X and Y). This can be done by parametric test (F-test in case of linear regression), or by non-parametric, permutation test (Monte Carlo permutation test, detailed here). Parametric test (F-test) calculates F-value, and given number of samples it calculates the P-value. The limitation of parametric test is that it strictly applies only for situation that Y for each X has normal distribution, which is often not the case. Alternatively, one can calculate significance using Monte Carlo permutation test. This test simulates the null hypothesis situation by randomizing one of the two variables (X or Y), calculating regression, and returning the F-value, r2 or other test statistic. If this null hypothesis is replicated many times (e.g. 999 times, i.e. number of permutations), we get reasonable estimation of the test statistic (let's talk about r2 here) in case that the null hypothesis is true (i.e. the variables are random, without relationship). We can compare this distribution with the real r2 (from real data without randomization) and calculate the probability that the observed r2 originates from distribution of r2 representing the null hypothesis. If this probability (P-value) is low enough (e.g. P < 0.05), we can reasonably assume that the null hypothesis is rejected.

recol/r_constructs.txt · Last modified: 2019/10/15 00:07 by david