Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi), at the latest on Mon 21 Feb at 10.00. Use R3 as the title of your message.

We continue exploring the data set tn10.csv from the probability theory course. Now we want to produce a nice printout, which shows for each student the values of the following variables

    StudentNumber ExamPoints ExtraPoints Grade

Here StudentNumber is the same as the variable `opisnro`, ExamPoints is the sum of the points from the two course exams, ExtraPoints is calculated as in Exercise 2 from R2, and Grade (one of 0, 1, ..., 5) is determined based on the sum of ExamPoints and ExtraPoints, as follows.

| Range | Grade |
|---|---|
| 0 <= sum < 21 | 0 |
| 21 <= sum < 27 | 1 |
| 27 <= sum < 32 | 2 |
| 32 <= sum < 37 | 3 |
| 37 <= sum < 42 | 4 |
| 42 <= sum | 5 |
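A range-to-grade mapping of this kind can be sketched in R with `cut()`. The point totals below are made up so that they sit on or next to the interval boundaries; they are not taken from tn10.csv.

```r
# Made-up point totals (illustration only, not course data)
total <- c(0, 20.5, 21, 26, 27, 41, 42, 50)

# Interval boundaries from the grade table
breaks <- c(0, 21, 27, 32, 37, 42, Inf)

# right = FALSE makes every interval closed on the left: [0, 21), [21, 27), ...
bin <- cut(total, breaks = breaks, right = FALSE)

# Indexing with a factor uses its integer codes 1, ..., 6
grade <- (0:5)[bin]

data.frame(total, bin, grade)
```

The value 21 falls in the interval [21, 27) and therefore gets grade 1, which is what the table requires.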

The results should be given so that the student numbers are in ascending order. Each row should contain nothing besides the values of the variables.

Hint: form a data frame with variable names StudentNumber, ExamPoints, ExtraPoints and Grade, sort it according to the value of StudentNumber, and print it. You should give `print()` some extra argument to prevent it from printing the row names.

Suggested solution:

First reuse code from problem 2 of R2.

    results <- read.csv2('tn10.csv')
    results$H.sum <- rowSums(results[ , 2:11], na.rm = TRUE)
    results$K.sum <- rowSums(results[ , 12:19], na.rm = TRUE)
    lims <- c(0, seq(10, 40, by = 5), Inf)
    results$Extra.points <- (0:7)[cut(results$H.sum, breaks = lims, right = FALSE)]

Then do the rest. You can find the argument `row.names` in the help file of the print method `print.data.frame`.

    # The grade is based on the exam points plus the extra points
    points <- results$K.sum + results$Extra.points
    grade <- (0:5)[cut(points, breaks = c(0, 21, 27, 32, 37, 42, Inf), right = FALSE)]
    d <- data.frame(StudentNumber = results$opisnro,
                    ExamPoints = results$K.sum,
                    ExtraPoints = results$Extra.points,
                    Grade = grade)
    ind <- order(d$StudentNumber)
    d <- d[ind, ]
    print(d, row.names = FALSE)
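Why suppressing the row names matters can be seen from a small sketch with made-up data (the student numbers and grades below are invented for illustration). After sorting, the row names still show the original row positions, which would only clutter the printout.

```r
# Made-up data for illustration only (not from tn10.csv)
toy <- data.frame(StudentNumber = c(305, 101, 204),
                  Grade = c(2, 5, 0))

# Sort the rows by student number
toy_sorted <- toy[order(toy$StudentNumber), ]

# Default printing shows the original row positions 2, 3, 1 as row names
print(toy_sorted)

# With row.names = FALSE only the values of the variables are printed
print(toy_sorted, row.names = FALSE)
```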

Your instructor suspects that solving lots of exercises is correlated with getting lots of points in the exams. Now we try to explore the validity of this idea using graphical tools. First, remove those students who did not attend one of the two course exams (`K11` or `K21` is then missing). For the rest of the students, define a factor `ExerciseActivity` with levels `low` and `high` according to whether the sum of the points from the exercise sessions is <= 12 or > 12. Then produce parallel boxplots showing the distribution of the sum of the points from the two course exams within these two groups.

Suggested solution:

    no.show <- is.na(results$K11) | is.na(results$K21)
    results <- results[!no.show, ]
    results$ExerciseActivity <- factor(as.numeric(results$H.sum > 12),
                                       labels = c('low', 'high'))
    boxplot(K.sum ~ ExerciseActivity, data = results)

Well, the situation is not clear. There seem to be more low exam points for the students with low exercise activity, but some students with low exercise activity are still capable of achieving good points in the exams.
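A numerical summary per group can complement the boxplots. The sketch below uses simulated stand-in data, since tn10.csv is not reproduced here; the column names `H.sum` and `K.sum` follow the solution, while the value ranges are assumptions made only for this illustration.

```r
set.seed(1)

# Simulated stand-in for the course data (assumption: not the real tn10.csv)
sim <- data.frame(H.sum = sample(0:40, 60, replace = TRUE),
                  K.sum = round(runif(60, 0, 48)))

# Giving the levels explicitly keeps the labels right even if one group is empty
sim$ExerciseActivity <- factor(sim$H.sum > 12,
                               levels = c(FALSE, TRUE),
                               labels = c('low', 'high'))

# Minimum, lower hinge, median, upper hinge and maximum within each group;
# the boxes of a boxplot are built from the hinges and the median
five <- tapply(sim$K.sum, sim$ExerciseActivity, fivenum)
five

# Group sizes
table(sim$ExerciseActivity)
```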

Write R code for producing this figure. The function sin(x)/x has been plotted on (0, 20) using a wide solid red line. The figure also contains the graphs of the functions 1/x and -1/x drawn with thin blue dashed lines. The range of the y-axis has been set to (-0.5, 1.5). The axis labels are 'x' and 'y'.

Suggested solution:

    # Vectors for x and the two functions sin(x)/x and 1/x
    x <- seq(0, 20, len = 401)
    y1 <- sin(x) / x
    y2 <- 1 / x

    # One way of drawing the figure: use plot() and lines().
    # We need to specify the line type, color, y-axis limits and line width.
    plot(x, y1, type = 'l', ylim = c(-0.5, 1.5),
         lty = 'solid', col = 'red', lwd = 2, ylab = 'y')
    lines(x, y2, lty = 'dashed', col = 'blue')
    lines(x, -y2, lty = 'dashed', col = 'blue')

    # The figure can also be produced with a single call of matplot().
    matplot(cbind(x, x, x), cbind(y1, y2, -y2), type = 'l',
            ylim = c(-0.5, 1.5),
            lty = c('solid', 'dashed', 'dashed'),
            col = c('red', 'blue', 'blue'),
            lwd = c(2, 1, 1), xlab = 'x', ylab = 'y')
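The dashed curves enclose the red one because |sin(x)| <= 1, so |sin(x)/x| <= 1/x for x > 0. The sketch below checks this numerically and also shows what happens at x = 0, which is included in the grid of the solution. The grid used here is chosen only for this check.

```r
# At x = 0 the expressions are not finite; plot() and lines() skip such points
is.nan(sin(0) / 0)   # 0/0 gives NaN
is.infinite(1 / 0)   # 1/0 gives Inf

# Check the envelope on a grid that starts slightly above zero
x <- seq(0.05, 20, length.out = 400)
y1 <- sin(x) / x
inside <- all(abs(y1) <= 1 / x)
inside

# The curve touches the envelope where |sin(x)| = 1, for example at x = pi/2
isTRUE(all.equal(sin(pi / 2) / (pi / 2), 1 / (pi / 2)))
```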

A textbook claims that the following method simulates `n` values from the t distribution with nu = 4 degrees of freedom (without using the function `rt`).

    n <- 1000
    nu <- 4
    y <- rgamma(n, nu/2, nu/2)
    x <- rnorm(n, sd = 1 / sqrt(y))
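A short argument for why the claim is plausible, assuming the standard definition of the t distribution as a standard normal variable divided by the square root of an independent chi-squared variable over its degrees of freedom. In `rgamma(n, nu/2, nu/2)` the second argument is the shape and the third is the rate, so

$$
Y \sim \mathrm{Gamma}\!\left(\text{shape} = \tfrac{\nu}{2},\ \text{rate} = \tfrac{\nu}{2}\right)
\;\Longrightarrow\;
V = \nu Y \sim \mathrm{Gamma}\!\left(\text{shape} = \tfrac{\nu}{2},\ \text{rate} = \tfrac{1}{2}\right) = \chi^2_\nu ,
$$

$$
X \mid Y \sim N\!\left(0, \tfrac{1}{Y}\right)
\;\Longrightarrow\;
X \overset{d}{=} \frac{Z}{\sqrt{Y}} = \frac{Z}{\sqrt{V/\nu}},
\qquad Z \sim N(0, 1) \text{ independent of } V ,
$$

which is the defining representation of the $t_\nu$ distribution.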

(Actually, it would be much easier to simulate the t distribution with the call `rt(n, df=4)` instead.) To check this claim, plot the probability density histogram of the simulated `x` values and the graph of the t density (function `dt()`) in the same figure.

Draw the histogram with `hist()`, but use more bins than what `hist()` gives you by default, by specifying the number of bins with the argument `breaks`. Specify that you want a probability density histogram instead of a frequency histogram using the argument `probability = TRUE`. Specify the axis limits so that both the histogram and the density function fit nicely on the plotting area, and give the plot a meaningful title and axis labels.

Suggested solution:

    hist(x, prob = TRUE, breaks = 40, main = '',
         xlim = c(-10, 10), ylim = c(0, 0.4))
    u <- seq(-10, 10, len = 401)
    lines(u, dt(u, df = nu), col = 'red')
    title(xlab = 'x', ylab = 'Density',
          main = 'Histogram vs. true density function')
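As a numerical complement to the histogram, one can compare a few empirical quantiles of the simulated values with the corresponding quantiles from `qt()`. This is only a sketch; the seed and the chosen probabilities are arbitrary.

```r
set.seed(2011)  # arbitrary seed, only to make the sketch reproducible

# The textbook method from the exercise
n <- 1000
nu <- 4
y <- rgamma(n, nu/2, nu/2)
x <- rnorm(n, sd = 1 / sqrt(y))

# Empirical quantiles of the simulated values vs. theoretical t quantiles
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(simulated = quantile(x, p),
                    theoretical = qt(p, df = nu))
comparison
```

If the method works, the two rows should be close to each other, with the agreement limited by simulation noise.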

This figure demonstrates the critical region of an F-test. It contains the graph of the density of the F-distribution with degrees of freedom parameters `df1` = 3 and `df2` = 15. You can evaluate the value of the density at `x` with `df(x, df1, df2)`. The area in the right-hand tail of the distribution, where proportion 0.05 of the probability mass lies, has been colored red. The critical point can be calculated with

    qf(0.05, df1, df2, lower.tail = FALSE)

Write R code for producing the figure. Use `polygon(..., col = "red")` for filling in the tail area. First draw the graph of the pdf and then define the vertices of the polygon to be filled in. (It is difficult to use exactly the same (x, y) points for both purposes.)

Suggested solution:

    df1 <- 3
    df2 <- 15
    x <- seq(0, 8, len = 401)
    u <- qf(0.05, df1 = df1, df2 = df2, lower.tail = FALSE)

    # This is the graph of the density:
    plot(x, df(x, df1 = df1, df2 = df2), type = 'l', ann = FALSE)

    # Now the code for filling the right-hand tail area.
    # The polygon starts and ends on the x-axis so that its lower edge is horizontal.
    xx <- seq(u, 8, len = 401)
    polygon(x = c(xx[1], xx, xx[length(xx)]),
            y = c(0, df(xx, df1, df2), 0), col = 'red')

    # The figure looks better, if I fill the empty part of the x-axis:
    lines(c(0, u), c(0, 0))
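That the tail to the right of the critical point really carries probability 0.05 can be checked numerically. The sketch below integrates the density with `integrate()` and compares the result with `pf()`.

```r
df1 <- 3
df2 <- 15

# Critical point: 5 % of the probability mass lies to its right
u <- qf(0.05, df1, df2, lower.tail = FALSE)

# Area under the density from the critical point to infinity
area <- integrate(function(t) df(t, df1, df2), lower = u, upper = Inf)
area$value

# The same tail probability directly from the distribution function
tail_prob <- pf(u, df1, df2, lower.tail = FALSE)
tail_prob
```

Both values should agree with 0.05 up to numerical error.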

We now try to visualize the frequency distributions of the points of the first course exam of the probability theory course. There were four problems graded from 0 to 6, and the file lists for each of the participants the points that person obtained from the problems.

Calculate an array `arr`, whose columns you should give names 'p1', ..., 'p4' and whose rows you should give names '0', ..., '6', so that, e.g., `arr['3', 'p1']` contains the number of participants who obtained three points from problem one. Once you have done that, try what you get with

    barplot(arr, beside = TRUE, legend.text = paste(0:6))

Try also the visualization given by `dotchart(arr)`.

The easiest way to calculate the frequencies for `arr` is to use nested for-loops (the brute force solution). Alternatively, you can calculate the frequencies with an elegant one-liner where you might use functions such as `sapply()`, `table()` and `factor()` (but this is much more difficult to write).

Suggested solution:

    results <- read.csv2('tn10.csv')
    # Optionally: remove those students who did not come to the exams.
    # This has to be done before the missing values are replaced by zeros,
    # because afterwards there are no NAs left to test for.
    results <- results[!is.na(results$K11) & !is.na(results$K21), ]
    # Convert the remaining missing values to zeros:
    results[is.na(results)] <- 0
    # Points from the first course exam are at positions results[ , 12:15]

    # Calculating arr with nested for-loops:
    arr <- matrix(0, nrow = 7, ncol = 4)
    for (i in 0:6)
      for (j in 1:4)
        arr[i + 1, j] <- sum(results[, 11 + j] == i)
    arr
    rownames(arr) <- paste(0:6)
    colnames(arr) <- paste('p', 1:4, sep = '')
    arr

    # A more elegant solution:
    arr <- sapply(results[ , 12:15],
                  function(x) table(factor(x, levels = 0:6)))
    colnames(arr) <- paste('p', 1:4, sep = '')
    arr
    # The simpler call
    #   sapply(results[, 12:15], function(x) table(x))
    # would not work if there was a zero count in some cell of the table.

    # Visualizations:
    barplot(arr, beside = TRUE, legend.text = paste(0:6))
    dotchart(arr)
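The remark about zero counts can be illustrated with a small made-up example (two problems graded 0 to 2, invented only for this sketch). When a value never occurs in one column, plain `table()` returns tables of different lengths, and `sapply()` can no longer combine them into a matrix.

```r
# Made-up points for two problems; nobody scored 1 on p2
toy <- data.frame(p1 = c(0, 1, 2, 2),
                  p2 = c(0, 2, 2, 2))

# Plain table(): the tables have lengths 3 and 2, so sapply() returns a list
simple <- sapply(toy, function(x) table(x))
is.list(simple)

# Fixing the levels keeps the zero count, and the result is a 3 x 2 matrix
robust <- sapply(toy, function(x) table(factor(x, levels = 0:2)))
robust
```

Each column of `robust` sums to the number of rows in `toy`, which is a quick sanity check that is also available for `arr`.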

Last updated 2011-02-21 11:09

Petri Koistinen

petri.koistinen 'at' helsinki.fi