Software Tools / R: Homework 3: due Monday, 21 Feb, 2011

Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi), at the latest on Mon 21 Feb at 10.00. Use R3 as the title of your message.


Exercise 1. (Using generic functions)

We continue exploring the data set tn10.csv from the probability theory course. Now we want to produce a nice printout, which shows for each student the values of the following variables

StudentNumber ExamPoints ExtraPoints Grade
Here StudentNumber is the same as the variable opisnro, ExamPoints is the sum of the points from the two course exams, ExtraPoints is calculated as in Exercise 2 from R2, and Grade (one of 0, 1, ..., 5) is determined based on the sum of ExamPoints and ExtraPoints, as follows.
Range Grade
0 <= sum < 21 0
21 <= sum < 27 1
27 <= sum < 32 2
32 <= sum < 37 3
37 <= sum < 42 4
42 <= sum 5

The results should be give so that the student numbers are in ascending order. There should not be anything besides the values of the variables on the row.

Hint: form a data frame with variable names StudentNumber, ExamPoints, ExtraPoints and Grade, sort it according to the value of StudentNumber, and print it. You should give print() some extra argument to prevent it from printing the row names.


Suggested solution:

First reuse code from problem 2 of R2.

results <- read.csv2('tn10.csv')
results$H.sum <- rowSums(results[ , 2:11], na.rm = TRUE)
results$K.sum <- rowSums(results[ , 12:19], na.rm = TRUE)

lims <- c(0, seq(10, 40, by = 5), Inf)
results$Extra.points <-
  (0:7)[cut(results$H.sum,  breaks = lims, right = FALSE)]

Then do the rest. You can find the argument row.names in the help file of the print method print.data.frame.

points <- results$K.sum + results$H.sum
grade <- (0:5)[cut(points, breaks = c(0, 21, 27, 32, 37, 42, Inf), right = FALSE)]
d <- data.frame(StudentNumber = results$opisnro,
  ExamPoints = results$K.sum, ExtraPoints = results$Extra.points,
  Grade = grade)
ind <- order(d$StudentNumber)
d <- d[ind, ]
print(d, row.names = FALSE)

Exercise 2 (Parallel boxplots)

Your instructor suspects that solving lots of exercises is correlated with getting lots of points in the exams. Now we try to explore the validity of this idea using graphical tools. First, remove those students who did not attend one of the two course exams (K11 or K21 is then missing). For the rest of the students, define a factor ExerciseActivity with levels low and high according to whether the sum of the points from the exercise sessions is <= 12 or > 12. Then produce parallel boxplots showing the distribution of the sum of the points from the two course exams within these two groups.


Suggested solution:

no.show <- is.na(results$K11) | is.na(results$K21)
results <- results[!no.show, ]
results$ExerciseActivity <-
  factor(as.numeric(results$H.sum > 12), labels = c('low', 'high'))

boxplot(K.sum ~ ExerciseActivity, data = results)
Well, the situation is not clear. There seem to be more low exam points for the students with low exercise activity, but some students with low exercise activity are still capable of achieving good points in the exams.

Exercise 3. (Several line plots in a single figure)

Write R code for producing this figure. The function sin(x)/x has been plotted on (0, 20) using a wide solid red line. The figures also contains the graphs of the functions 1/x and -1/x drawn with thin blue dashed lines. The range of the y-axis has been set to (-0.5, 1.5). The axis labels are 'x' and 'y'.


Suggested solution:

# Vectors for x and the two functions sin(x)/x and 1/x
x <- seq(0, 20, len = 401)
y1 <- sin(x) / x
y2 <- 1 / x
# One way of drawing the figure: use plot() and lines().
# We need to specify the line type, color, y-axis limits and line width.
plot(x, y1, type = 'l', ylim = c(-0.5, 1.5), lty = 'solid', col = 'red',
lwd = 2, ylab = 'y')
lines(x, y2, lty = 'dashed', col = 'blue')
lines(x, -y2, lty = 'dashed', col = 'blue')

# The figure can also be produced with a single call of matplot().
matplot(cbind(x, x, x), cbind(y1, y2, -y2), type = 'l',
ylim = c(-0.5, 1.5),
lty = c('solid', 'dashed', 'dashed'), col = c('red', 'blue', 'blue'),
lwd = c(2, 1, 1), xlab = 'x', ylab = 'y')

Exercise 4. (Comparing a histogram with a density function)

A textbook claims that the following method simulates n values from the t distribution with nu = 4 degrees of freedom (without using the function rt).

n <- 1000
nu <- 4
y <- rgamma(n, nu/2, nu/2)
x <- rnorm(n, sd = 1 / sqrt(y))

(Actually, it would be much easier to simulate the t distribution with the call rt(n, df=4) instead.) To check this claim, plot the probability density histogram of the simulated x values and the graph of the t density (function dt()) in the same figure.

Draw the histogram with hist(), but use more bins than what hist() gives you by default, by specifying the number of bins with the argument breaks. Specify that you want a probability density histogram instead of a frequency histogram using the argument probability = TRUE. Specify the axis limits so that both the histogram and the density function fit nicely on the plotting area, and give the plot a meaningful title and axis labels.


Suggested solution:
hist(x, prob = TRUE, breaks = 40, main = '',
  xlim = c(-10, 10), ylim = c(0, 0.4))
u <- seq(-10, 10, len = 401)
lines(u, dt(u, df = nu), col = 'red')
title(xlab = 'x', ylab = 'Density',
  main = 'Histogram vs. true density function')

Exercise 5. (Filling in an area under the graph of a function)

This figure demonstrates the critical region of a F-test. It contains the graph of the density of the F-distribution with degrees of freedom parameters df1 = 3 and df2 = 15. You can evaluate the value of the density at x with df(x, df1, df2). The area in the right-hand tail of the distribution, where proportion 0.05 of the probability mass lies, has been colored red. The critical point can be calculated with

qf(0.05, df1, df2, lower.tail = FALSE)

Write R code for producing the figure. Use polygon(..., col = "red") for filling in the tail area. Draw first the graph of the pdf and then define the vertices defining the polygon to be filled in. (It is difficult to use exactly the same (x, y) points for both purposes).


Suggested solution:
df1 <- 3
df2 <- 15
x <- seq(0, 8, len = 401)
u <- qf(0.05, df1 = df1, df2 = df2, lower.tail = FALSE)
# This is the graph of the density:
plot(x, df(x, df1 = df1, df2 = df2), type = 'l', ann = FALSE)
# Now the code for filling the right-hand tail area
xx <- seq(u, 8, len = 401)
polygon(x = c(xx[1], xx), y = c(0, df(xx, df1, df2)), col = 'red')
# The figure looks better, if I fill the empty part of the x-axis:
lines(c(0, u), c(0, 0))

Exercise 6. (Visualizing tables)

We now try to visualize the frequency distributions of the points of the first course exam of the probability theory course. There were four problems graded 0—6, and the file lists for each of the participants the points that person obtaind from the problems.

Calculate an array arr, whose columns you should give names 'p1', ..., 'p4', rows names '0', ..., '6', and, e.g., arr['3', 'p1'] should contain the number of participants who obtained three points from problem one. Once you have done that, try what you get with

barplot(arr, beside = TRUE, legend.text = paste(0:6))

Try also the visualization given by dotchart(arr).

The easiest way to calculate the frequencies for arr is to use nested for-loops (the brute force solution). Alternatively, you can calculate the frequencies with an elegant one-liner where you might use functions such as sapply(), table() and factor() (but this is much more difficult to write).


Suggested solution:
results <- read.csv2('tn10.csv')
# Convert missing values to zeros:
results[is.na(results)] <- 0
# Optionally: remove those students who did not come to the exams
results <- results[!is.na(results$K11) & !is.na(results$K21), ]

# Points from the first course exam are at postions results[ , 12:15]
# Calculating arr with nested for-loops:
arr <- matrix(0, nrow = 7, ncol = 4)
for (i in 0:6)
  for (j in 1:4) arr[i + 1, j] <- sum(results[, 11+j] == i)

arr
rownames(arr) <- paste(0:6)
colnames(arr) <- paste('p', 1:4, sep = '')
arr

# A more elegant solution:
arr <- sapply(results[ , 12:15], function(x) table(factor(x, levels = 0:6)))
colnames(arr) <- paste('p', 1:4, sep = '')
arr
# The simpler call
# sapply(results[, 12:15], function(x) table(x))
# would not work if there was a zero count in some cell of the table.

# Visualizations:
barplot(arr, beside = TRUE, legend.text = paste(0:6))
dotchart(arr)

Last updated 2011-02-21 11:09
Petri Koistinen
petri.koistinen 'at' helsinki.fi