Software Tools / R: Homework 4: due Thursday, 3 March, 2011

Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi) by 10.00 on Thursday, 3 March. Use R4 as the subject of your message.


Exercise 1. (Illustrating the concept of a confidence interval by simulation)

The figure was produced by repeating the following experiment 100 times: a sample of size 10 was generated from the standard normal distribution, and the 95% t-confidence interval for the mean was calculated and stored (using t.test(x)$conf.int). Finally, each confidence interval was drawn (with the function segments()), and the intervals which did not cover the true value zero were colored red.

Write R code for producing a similar figure.


Suggested solution:

n <- 10    # sample size
rep <- 100 # number of repetitions
ci.left <- numeric(rep)
ci.right <- numeric(rep)

for (i in 1:rep) {
  x <- rnorm(n)
  ci <- t.test(x)$conf.int
  ci.left[i] <- ci[1]
  ci.right[i] <- ci[2]
}
# TRUE for those intervals which fail to cover the true mean zero:
does.not.cover <- (ci.left > 0) | (ci.right < 0)

# limits for the x-axis:
a <- min(ci.left)
b <- max(ci.right)
# Draw the axes:
plot(0, 0, type = 'n', xlim = c(a, b), ylim = c(0, rep), ann = FALSE)
# Draw line segments; let vector does.not.cover select the color.
segments(ci.left, 1:rep, ci.right, 1:rep,
  col = ifelse(does.not.cover, 'red', 'black'))
# Finally, draw a vertical line to show the population parameter:
abline(v = 0)
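
As a quick sanity check (my own addition, not part of the original exercise), one can also count how many of the 100 intervals missed zero; with 95% intervals, about 5 misses are expected on average:

sum(does.not.cover)    # typically around 5 out of 100
mean(!does.not.cover)  # empirical coverage, typically close to 0.95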

Exercise 2. (Chi squared test for a table)

We continue exploring the data set tn10.csv from the probability theory course. We investigate whether the decision not to come to the first course exam (K11 is missing) and low exercise activity at the beginning of the course (H1+H2+H3+H4+H5 < 6, where missing values should be interpreted as zeros) are associated. Test the null hypothesis that these two classifications of the students are independent with the chi squared test. Do you reject the null hypothesis at significance level 0.05?

Hints: use the function chisq.test() and form its input with the function table(). You can pass table() two logical vectors which correspond to the row and column classifications.


Suggested solution:
results <- read.csv2('tn10.csv')

no.show <- is.na(results$K11)
low.exercise <-
  rowSums(results[ , c('H1', 'H2', 'H3', 'H4', 'H5')], na.rm = TRUE) < 6

(tt <- table(no.show, low.exercise))
chisq.test(tt)
The p-value is very low; the null hypothesis would be rejected not only at the 0.05 significance level but even at the 0.001 significance level.
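
As a small aside (my own addition, not part of the original solution), the decision can also be automated by extracting the p-value from the test object:

p <- chisq.test(tt)$p.value
p < 0.05  # TRUE: reject the null hypothesis at level 0.05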

Exercise 3. (Graphical presentation of grouped data)

The bilirubin data set bilirubin.dat contains bilirubin concentration measurements for three individuals; the measurements from a single individual form one of the three groups. We now draw a figure to investigate informally the assumption of constant variance within each group, which is needed for an ANOVA (analysis of variance) model for these data. First read the data and then draw a stripchart from it with the command

stripchart(bili$Concentration ~ bili$Individual, method = 'jitter',
  vertical = TRUE, pch = 1)
Next, using the command
arrows( ...arguments... , angle = 90, code = 3)
add line segments, which show for each of the three individuals the 1 SEM (standard error of the mean) limits
xbar[i] +- sem[i]
where xbar[i] is the mean of the bilirubin concentrations for the i'th individual, and sem[i] is the corresponding standard error of the mean: the standard deviation of the i'th individual's concentration measurements divided by the square root of their number.

Suggested solution:

bili <- read.table('bilirubin.dat', header = TRUE)
xbar <- tapply(bili$Concentration, bili$Individual, mean)
sem <- tapply(bili$Concentration, bili$Individual,
      function(x) sd(x) / sqrt(length(x)))
stripchart(bili$Concentration ~ bili$Individual, method = 'jitter',
  vertical = TRUE, pch = 1)
arrows(1:3, xbar - sem, 1:3, xbar + sem, angle = 90, code = 3)
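
As an optional embellishment (not required by the exercise), the group means can be marked so that the bars are visibly centered on them:

points(1:3, xbar, pch = 16)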

Exercise 4. (Confidence bands in multiple linear regression)

Using the LifeCycleSavings data set of R, first fit a linear model which tries to predict the variable sr using the other variables of the data frame as explanatory variables. Then calculate the following: a confidence interval for the regression coefficient of ddpi, and a confidence interval for the expected value of sr for a new country with pop15 = 28, pop75 = 2.4, dpi = 1700 and ddpi = 4.3.


Suggested solution:
data(LifeCycleSavings)
m <- lm(sr ~ ., data = LifeCycleSavings)
# confidence interval for beta of ddpi
confint(m, 'ddpi')
# Confidence interval for the expected value of sr for a new country;
# the trick is to predict in a new data frame.
predict(m, interval = 'confidence',
  newdata = data.frame(pop15 = 28, pop75 = 2.4, dpi = 1700, ddpi = 4.3))
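
As a side note (my own addition, not part of the original solution), replacing interval = 'confidence' with interval = 'prediction' would instead give a wider interval for the sr value of a single new country, rather than for its expected value:

predict(m, interval = 'prediction',
  newdata = data.frame(pop15 = 28, pop75 = 2.4, dpi = 1700, ddpi = 4.3))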

Exercise 5. (Checking graphically assumptions of a linear model)

The file trees.dat contains measurements of the diameter (d), height (h) and volume (v) of black cherry trees (given in exotic units). While it is easy to measure the diameter and height of a growing tree, measuring its volume directly (without cutting the tree down) is not. It is therefore of interest to try to predict the volume of a tree from its diameter and height.

First try to explain v with a linear model where the explanatory variables are h and d. Next draw the following residual plots: the residuals against the fitted values, against d, and against h.

Convince yourself that this model is not good, since some of the residual plots exhibit a clear systematic pattern.

Next try to explain log(v) with a linear model where the explanatory variables are log(h) and log(d). Draw the residual plots again and notice that the systematic structure has now disappeared.


Suggested solution:

trees <- read.table('trees.dat', header = TRUE)
m1 <- lm(v ~ h + d, data = trees)
plot(fitted(m1), resid(m1))
# ... suspicious, smile-like pattern
plot(trees$d, resid(m1))
# ... suspicious, smile-like pattern
plot(trees$h, resid(m1))
# ... no obvious pattern

m2 <- lm(log(v) ~ log(h) + log(d), data = trees)
plot(fitted(m2), resid(m2))
plot(log(trees$d), resid(m2))
plot(log(trees$h), resid(m2))
# ... no obvious patterns in any of these plots
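
# As a side note (my own addition, not part of the original solution), the
# residual plots are easier to compare when drawn side by side:
op <- par(mfrow = c(1, 3))
plot(fitted(m2), resid(m2))
plot(log(trees$d), resid(m2))
plot(log(trees$h), resid(m2))
par(op)  # restore the previous plotting layout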

# Last, I demonstrate how one could form prediction intervals for the
# volume based on the linear model for log(v);
# the key is to form a prediction interval first for log(v) and then
# to transform it using exp() which is the inverse function of log();
# predict() applies the transformations to the explanatory variables
# automatically.
ln.v.pred <- predict(m2, data.frame(d = 11, h = 70),
  interval = "prediction")
exp(ln.v.pred)

Exercise 6. (F test)

We continue with the trees data trees.dat. First fit the full linear model, where the response variable is log(v) and the explanatory variables are log(h) and log(d).

A simple model connecting the (expected) values of the three quantities is the following:

v = constant * h * d^2
Taking logarithms, you arrive at a reduced linear model, which is a special case of the full linear model.

Perform the F-test, where you compare the reduced linear model to the full linear model.


Suggested solution: Taking logarithms, and adding an error term, we obtain the reduced linear model

log(v) = log(constant) + log(h) + 2 * log(d) + epsilon
which we can test as follows.
trees <- read.table('trees.dat', header = TRUE)
m.full <- lm(log(v) ~ log(h) + log(d), data = trees)
m.reduced <- lm(log(v) ~ offset(log(h) + 2 * log(d)), data = trees)
anova(m.reduced, m.full)
The p-value of the test is 0.85, so there is no evidence against the reduced model.
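
To illustrate what anova() computes here (a sketch of the standard F-test formula, not part of the original solution), the same p-value can be obtained by hand from the residual sums of squares of the two models:

rss.reduced <- sum(resid(m.reduced)^2)
rss.full <- sum(resid(m.full)^2)
q <- 2                         # number of restrictions (both slopes fixed)
df.full <- df.residual(m.full)
F.stat <- ((rss.reduced - rss.full) / q) / (rss.full / df.full)
pf(F.stat, q, df.full, lower.tail = FALSE)  # should match the anova() p-value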

Last updated 2011-03-03 12:20
Petri Koistinen
petri.koistinen 'at' helsinki.fi