Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi), at the latest on Thu 3 March at 10.00. Use R4 as the subject of your message.

This figure was produced by repeating 100 times the experiment where a sample of size 10 was generated from the standard normal distribution. In each of the repetitions, the 95% t-confidence interval was calculated and stored (using `t.test(x)$conf.int`). Finally, each confidence interval was drawn (with the function `segments()`) so that those confidence intervals which did not cover the true value zero were colored red.

Write R code for producing a similar figure.

Suggested solution:

```r
n <- 10
rep <- 100
ci.left <- numeric(rep)
ci.right <- numeric(rep)
for (i in 1:rep) {
  x <- rnorm(n)
  ci <- t.test(x)$conf.int
  ci.left[i] <- ci[1]
  ci.right[i] <- ci[2]
}
does.not.cover <- (ci.left > 0) | (ci.right < 0)
# Limits for the x-axis:
a <- min(ci.left)
b <- max(ci.right)
# Draw the axes:
plot(0, 0, type = 'n', xlim = c(a, b), ylim = c(0, 100), ann = FALSE)
# Draw line segments; let vector does.not.cover select the color:
segments(ci.left, 1:rep, ci.right, 1:rep,
         col = ifelse(does.not.cover, 'red', 'black'))
# Finally, draw a vertical line to show the population parameter:
abline(v = 0)
```
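As a quick sanity check (my addition, not part of the exercise), the expected behavior can be verified by simulation: on average about 5% of the 95% intervals should miss zero. The seed below is an arbitrary choice for reproducibility.

```r
# Estimate how often the 95% t-interval for a standard-normal
# sample of size 10 misses the true mean zero.
set.seed(1)  # assumed seed, only for reproducibility
miss <- replicate(1000, {
  ci <- t.test(rnorm(10))$conf.int
  ci[1] > 0 || ci[2] < 0
})
mean(miss)  # should be close to 0.05
```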

We continue exploring the data set tn10.csv
from the probability theory course.
We investigate whether the decision not to
come to the first course exam (`K11` is missing) and low exercise
activity at the beginning of the course (`H1+H2+H3+H4+H5` < 6, where
missing values should be interpreted as zeros) are correlated. Test the
null hypothesis that these classifications of the students are
independent with the chi-squared test. Do you reject the null
hypothesis at significance level 0.05?

Hints: use the function `chisq.test()` and form its input by using the
function `table()`. You can use as inputs to `table()` two logical vectors
which correspond to the row and column classifications.
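As a small illustration of the hint (the vectors below are made up, not the course data), `table()` applied to two logical vectors cross-tabulates them into a 2 x 2 contingency table:

```r
# Made-up logical vectors, standing in for the row and column classifications:
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
b <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
(tt <- table(a, b))  # counts of each (a, b) combination
```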

Suggested solution:

```r
results <- read.csv2('tn10.csv')
no.show <- is.na(results$K11)
low.exercise <- rowSums(results[ , c('H1', 'H2', 'H3', 'H4', 'H5')],
                        na.rm = TRUE) < 6
(tt <- table(no.show, low.exercise))
chisq.test(tt)
```

The p-value is very low; the null hypothesis would be rejected not only at the 0.05 significance level but even at the 0.001 significance level.

The bilirubin data file bilirubin.dat contains bilirubin concentration measurements for three individuals. The measurements from a single individual form one of the three groups. We now draw a figure to investigate informally the assumption of constant variance within each group, which is needed for an ANOVA model (analysis of variance model) for these data. First read the data and then draw a stripchart from it with the command

```r
stripchart(bili$Concentration ~ bili$Individual,
           method = 'jitter', vert = TRUE, pch = 1)
```

Next, using the command

```r
arrows( ...arguments... , angle = 90, code = 3)
```

add line segments, which show for each of the three individuals the 1 SEM (standard error of the mean) limits

```
xbar[i] +- sem[i]
```

where xbar[i] is the mean of the bilirubin concentrations for the i'th individual, and sem[i] is the corresponding standard error of the mean, which is equal to the standard deviation of the concentration measurements for the i'th individual divided by the square root of the number of those measurements.
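The SEM formula can be checked on a tiny made-up vector (my example, not the course data): with four observations, the standard deviation is divided by sqrt(4) = 2.

```r
# Standard error of the mean for a small toy sample:
x <- c(1, 2, 3, 4)
sem <- sd(x) / sqrt(length(x))
sem  # sd(x) is about 1.29, so sem is about 0.65
```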

Suggested solution:

```r
bili <- read.table('bilirubin.dat', header = TRUE)
xbar <- tapply(bili$Concentration, bili$Individual, mean)
sem <- tapply(bili$Concentration, bili$Individual,
              function(x) sd(x) / sqrt(length(x)))
stripchart(bili$Concentration ~ bili$Individual,
           method = 'jitter', vert = TRUE, pch = 1)
arrows(1:3, xbar - sem, 1:3, xbar + sem, angle = 90, code = 3)
```

Using the `LifeCycleSavings` data set of R, first fit a linear model which
tries to predict the variable `sr` using the other variables of the
data frame as explanatory variables. Then calculate the following.

- A 95% confidence interval for the regression coefficient of the variable `ddpi`.
- A 95% confidence interval for the expected value of the variable `sr` for a new country with pop15 equal to 28, pop75 equal to 2.4, dpi equal to 1700 and ddpi equal to 4.3.

Suggested solution:

```r
data(LifeCycleSavings)
m <- lm(sr ~ ., data = LifeCycleSavings)
# Confidence interval for the beta of ddpi:
confint(m, 'ddpi')
# Confidence interval for the new country;
# the trick is to predict in a new data frame.
predict(m, interval = 'confidence',
        newdata = data.frame(pop15 = 28, pop75 = 2.4,
                             dpi = 1700, ddpi = 4.3))
```
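As a cross-check (my addition, not part of the exercise), the interval that `confint()` returns for `ddpi` can be reproduced by hand from the coefficient table, using estimate +- t(0.975, df) * standard error:

```r
# Reproduce confint(m, 'ddpi') by hand; LifeCycleSavings ships with base R.
data(LifeCycleSavings)
m <- lm(sr ~ ., data = LifeCycleSavings)
est <- coef(summary(m))['ddpi', ]
ci.by.hand <- est['Estimate'] +
  c(-1, 1) * qt(0.975, df.residual(m)) * est['Std. Error']
ci.by.hand  # agrees with confint(m, 'ddpi')
```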

The file trees.dat contains measurements of the diameter (d) and height (h) and volume (v) of black cherry trees (given in exotic units). While it is easy to measure the diameter and height of a growing tree, measuring its volume directly (without cutting it down) is not easy. Therefore it is of interest to try to predict the volume of a tree on the basis of its diameter and its height.

First try to explain v with a linear model, where the explanatory variables are h and d. Next draw the following residual plots:

- residuals vs. the fitted values
- the following two plots: residuals vs. each of the explanatory variables

Convince yourself that this model is not good, since some of the residual plots exhibit a clear systematic pattern.

Next try to explain log(v) with a linear model, where the explanatory variables are log(h) and log(d). Draw again the residual plots and notice that the systematic structure has now disappeared.

Suggested solution:

```r
trees <- read.table('trees.dat', header = TRUE)
m1 <- lm(v ~ h + d, data = trees)
plot(fitted(m1), resid(m1))   # ... suspicious, smile-like pattern
plot(trees$d, resid(m1))      # ... suspicious, smile-like pattern
plot(trees$h, resid(m1))      # ... no obvious pattern
m2 <- lm(log(v) ~ log(h) + log(d), data = trees)
plot(fitted(m2), resid(m2))
plot(log(trees$d), resid(m2))
plot(log(trees$h), resid(m2)) # ... no obvious patterns
# Last, I demonstrate how one could form prediction intervals for the
# volume based on the linear model for log(v);
# the key is to form a prediction interval first for log(v) and then
# to transform it using exp(), which is the inverse function of log();
# predict() applies the transformations to the explanatory variables
# automatically.
ln.v.pred <- predict(m2, data.frame(d = 11, h = 70),
                     interval = "prediction")
exp(ln.v.pred)
```

We continue with the trees data trees.dat. First fit the full linear model, where the response variable is log(v) and the explanatory variables are log(h) and log(d).

A simple model for connecting the (expected) values of the three quantities is the following:

```
v = constant * h * d^2
```

Taking logarithms, you arrive at a reduced linear model, which is a special case of the full linear model.

Perform the F-test, where you compare the reduced linear model to the full linear model.

Suggested solution: Taking logarithms and adding an error term, we obtain the reduced linear model

```
log(v) = log(constant) + log(h) + 2 * log(d) + epsilon
```

which we can test as follows.

```r
trees <- read.table('trees.dat', header = TRUE)
m.full <- lm(log(v) ~ log(h) + log(d), data = trees)
# The reduced model fixes the coefficients of log(h) and log(d) to 1 and 2;
# offset() includes a term whose coefficient is fixed to one.
m.reduced <- lm(log(v) ~ offset(log(h) + 2 * log(d)), data = trees)
anova(m.reduced, m.full)
```

The p-value of the test is 0.85, and so there is really no evidence for rejecting the reduced model.
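For intuition (my sketch, not part of the course solution), the F statistic that `anova()` reports can be computed directly from the residual sums of squares and residual degrees of freedom of the two models. The helper function below is hypothetical, introduced only for illustration.

```r
# Hypothetical helper: F test comparing a reduced model to a full model,
# given residual sums of squares (rss) and residual degrees of freedom (df).
f_compare <- function(rss_red, df_red, rss_full, df_full) {
  f <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
  c(F = f,
    p.value = pf(f, df_red - df_full, df_full, lower.tail = FALSE))
}
# With the fitted models above one would call, e.g.:
# f_compare(sum(resid(m.reduced)^2), df.residual(m.reduced),
#           sum(resid(m.full)^2),    df.residual(m.full))
f_compare(10, 12, 8, 10)  # toy numbers, just to show the output format
```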

Last updated 2011-03-03 12:20

Petri Koistinen

petri.koistinen 'at' helsinki.fi