Send your solutions to the lecturer by email (email@example.com), at the latest on Mon 21 Feb at 10.00. Use R3 as the title of your message.
We continue exploring the data set tn10.csv from the probability theory course. Now we want to produce a nice printout which shows, for each student, the values of the following variables:

StudentNumber, ExamPoints, ExtraPoints, Grade

Here StudentNumber is the same as the variable opisnro, ExamPoints is the sum of the points from the two course exams, ExtraPoints is calculated as in Exercise 2 of R2, and Grade (one of 0, 1, ..., 5) is determined from the sum of ExamPoints and ExtraPoints, as follows.
  sum of points       grade
  0  <= sum < 21        0
  21 <= sum < 27        1
  27 <= sum < 32        2
  32 <= sum < 37        3
  37 <= sum < 42        4
  42 <= sum             5
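The table above maps naturally onto cut(); a minimal sketch with made-up point sums (not the exercise data):

```r
# Map total points to grades 0..5 using the limits in the table.
# right = FALSE makes each interval closed on the left: [21, 27), etc.
sums <- c(5, 21, 26.5, 30, 36, 41, 50)
breaks <- c(0, 21, 27, 32, 37, 42, Inf)
grade <- (0:5)[cut(sums, breaks = breaks, right = FALSE)]
grade  # 0 1 1 2 3 4 5
```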
The results should be given so that the student numbers are in ascending order. Each row should contain nothing besides the values of the variables.
Hint: form a data frame with variable names StudentNumber, ExamPoints, ExtraPoints and Grade, sort it according to the value of StudentNumber, and print it. You should give print() an extra argument to prevent it from printing the row names.
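A toy illustration (made-up data, not the exercise data) of sorting a data frame and suppressing the row names when printing:

```r
# Two students entered out of order; sort by student number and print
# without the automatic row names "2" and "1".
d <- data.frame(StudentNumber = c(3, 1), ExamPoints = c(40, 25))
d <- d[order(d$StudentNumber), ]
print(d, row.names = FALSE)
```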
First reuse code from problem 2 of R2.
results <- read.csv2('tn10.csv')
results$H.sum <- rowSums(results[ , 2:11], na.rm = TRUE)
results$K.sum <- rowSums(results[ , 12:19], na.rm = TRUE)
lims <- c(0, seq(10, 40, by = 5), Inf)
results$Extra.points <- (0:7)[cut(results$H.sum, breaks = lims, right = FALSE)]
Then do the rest. You can find the argument row.names in the help file of the print method.
points <- results$K.sum + results$H.sum
grade <- (0:5)[cut(points, breaks = c(0, 21, 27, 32, 37, 42, Inf), right = FALSE)]
d <- data.frame(StudentNumber = results$opisnro,
                ExamPoints = results$K.sum,
                ExtraPoints = results$Extra.points,
                Grade = grade)
ind <- order(d$StudentNumber)
d <- d[ind, ]
print(d, row.names = FALSE)
Your instructor suspects that solving lots of exercises is correlated with getting lots of points in the exams. Now we try to explore the validity of this claim with graphical tools. First, remove those students who did not attend one of the two course exams (K11 or K21 is then missing). For the rest of the students, define a factor ExerciseActivity with levels low and high according to whether the sum of the points from the exercise sessions is <= 12 or > 12. Then produce parallel boxplots showing the distribution of the sum of the points from the two course exams within these two groups.
no.show <- is.na(results$K11) | is.na(results$K21)
results <- results[!no.show, ]
results$ExerciseActivity <- factor(as.numeric(results$H.sum > 12),
                                   labels = c('low', 'high'))
boxplot(K.sum ~ ExerciseActivity, data = results)

Well, the situation is not clear. There seem to be more low exam point totals among the students with low exercise activity, but some students with low exercise activity are still capable of achieving good points in the exams.
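Besides boxplots, the two groups could also be compared numerically with tapply(); a toy sketch with made-up numbers (not the course data):

```r
# Median exam points within each activity group.
K.sum <- c(10, 35, 8, 40, 22, 30)
ExerciseActivity <- factor(c('low', 'high', 'low', 'high', 'low', 'high'))
tapply(K.sum, ExerciseActivity, median)
```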
Write R code for producing this figure. The function sin(x)/x has been plotted on (0, 20) using a wide solid red line. The figure also contains the graphs of the functions 1/x and -1/x drawn with thin blue dashed lines. The range of the y-axis has been set to (-0.5, 1.5). The axis labels are 'x' and 'y'.
# Vectors for x and the two functions sin(x)/x and 1/x
x <- seq(0, 20, len = 401)
y1 <- sin(x) / x
y2 <- 1 / x
# One way of drawing the figure: use plot() and lines().
# We need to specify the line type, color, y-axis limits and line width.
plot(x, y1, type = 'l', ylim = c(-0.5, 1.5), lty = 'solid', col = 'red',
     lwd = 2, ylab = 'y')
lines(x, y2, lty = 'dashed', col = 'blue')
lines(x, -y2, lty = 'dashed', col = 'blue')
# The figure can also be produced with a single call of matplot().
matplot(cbind(x, x, x), cbind(y1, y2, -y2), type = 'l',
        ylim = c(-0.5, 1.5), lty = c('solid', 'dashed', 'dashed'),
        col = c('red', 'blue', 'blue'), lwd = c(2, 1, 1),
        xlab = 'x', ylab = 'y')
A textbook claims that the following method simulates n values from the t distribution with nu = 4 degrees of freedom (without using the function rt()):

n <- 1000
nu <- 4
y <- rgamma(n, nu/2, nu/2)
x <- rnorm(n, sd = 1 / sqrt(y))

(Actually, it would be much easier to simulate the t distribution with the call rt(n, df = 4) instead.)
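The claim rests on the representation of the t distribution as a scale mixture of normals: rgamma(n, nu/2, nu/2) draws values distributed as chi^2_nu / nu, so x = Z / sqrt(chi^2_nu / nu) has the t_nu distribution. A quick numerical sketch of this check (my own illustration, not part of the exercise):

```r
# Compare empirical quantiles of the simulated values with qt().
set.seed(1)
n <- 1e5
nu <- 4
y <- rgamma(n, nu/2, nu/2)        # distributed as (chi^2_nu) / nu
x <- rnorm(n, sd = 1 / sqrt(y))   # should then follow t_nu
probs <- c(0.25, 0.5, 0.75)
cbind(empirical = quantile(x, probs), theoretical = qt(probs, df = nu))
```

The empirical and theoretical quantiles should agree to about two decimal places for a sample this large.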
To check this claim, plot the probability density histogram of the x values and the graph of the t density in the same figure. Draw the histogram with hist(), but use more bins than hist() gives you by default, by specifying the number of bins with the argument breaks. Specify that you want a probability density histogram instead of a frequency histogram using the argument probability = TRUE. Specify the axis limits so that both the histogram and the density function fit nicely on the plotting area, and give the plot a meaningful title and axis labels.
hist(x, prob = TRUE, breaks = 40, main = '',
     xlim = c(-10, 10), ylim = c(0, 0.4))
u <- seq(-10, 10, len = 401)
lines(u, dt(u, df = nu), col = 'red')
title(xlab = 'x', ylab = 'Density', main = 'Histogram vs. true density function')
This figure demonstrates the critical region of an F-test. It contains the graph of the density of the F-distribution with degrees of freedom parameters df1 = 3 and df2 = 15. You can evaluate the density with df(x, df1, df2). The area in the right-hand tail of the distribution, where proportion 0.05 of the probability mass lies, has been colored red. The critical point can be calculated with

qf(0.05, df1, df2, lower.tail = FALSE)
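As a quick sanity check (a sketch of my own, not required by the exercise), pf() should return exactly the tail probability 0.05 at that point:

```r
# The quantile function qf() and the distribution function pf() are inverses.
df1 <- 3
df2 <- 15
u <- qf(0.05, df1, df2, lower.tail = FALSE)   # critical point
pf(u, df1, df2, lower.tail = FALSE)           # tail mass beyond u: 0.05
```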
Write R code for producing the figure. Use polygon(..., col = "red") for filling in the tail area. First draw the graph of the pdf, and then define the vertices of the polygon to be filled in. (It is difficult to use exactly the same (x, y) points for both purposes.)
df1 <- 3
df2 <- 15
x <- seq(0, 8, len = 401)
u <- qf(0.05, df1 = df1, df2 = df2, lower.tail = FALSE)
# This is the graph of the density:
plot(x, df(x, df1 = df1, df2 = df2), type = 'l', ann = FALSE)
# Now the code for filling the right-hand tail area.
# The polygon starts at (u, 0), follows the density from u to 8,
# and then drops back to the x-axis at (8, 0):
xx <- seq(u, 8, len = 401)
polygon(x = c(u, xx, 8), y = c(0, df(xx, df1, df2), 0), col = 'red')
# The figure looks better if we fill in the empty part of the x-axis:
lines(c(0, u), c(0, 0))
We now try to visualize the frequency distributions of the points of the first course exam of the probability theory course. There were four problems graded 0-6, and the file lists, for each of the participants, the points that person obtained from the problems.
Calculate an array arr whose columns are named 'p1', ..., 'p4' and whose rows are named '0', ..., '6', so that, e.g., arr['3', 'p1'] contains the number of participants who obtained three points from problem one. Once you have done that, try what you get with

barplot(arr, beside = TRUE, legend.text = paste(0:6))
Try also the visualization given by dotchart(arr). The easiest way to calculate the frequencies for arr is to use nested for-loops (the brute force solution). Alternatively, calculate the frequencies with an elegant one-liner, where you might use functions such as sapply(), table() and factor() (but this is much more difficult to write).
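The reason factor() with explicit levels is needed here (a toy illustration, not the course data): plain table() silently drops point values that never occur, while a factor with levels 0:6 keeps a count of zero for them:

```r
pts <- c(0, 2, 2, 6)              # nobody got 1, 3, 4 or 5 points
table(pts)                        # counts only for the values 0, 2, 6
table(factor(pts, levels = 0:6))  # counts for all of 0..6, zeros included
```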
results <- read.csv2('tn10.csv')
# Optionally: remove those students who did not come to the exams.
# (This must be done before the missing values are converted to zeros.)
results <- results[!is.na(results$K11) & !is.na(results$K21), ]
# Convert the remaining missing values to zeros:
results[is.na(results)] <- 0
# Points from the first course exam are at positions results[ , 12:15].
# Calculating arr with nested for-loops:
arr <- matrix(0, nrow = 7, ncol = 4)
for (i in 0:6)
  for (j in 1:4)
    arr[i + 1, j] <- sum(results[ , 11 + j] == i)
rownames(arr) <- paste(0:6)
colnames(arr) <- paste('p', 1:4, sep = '')
arr
# A more elegant solution:
arr <- sapply(results[ , 12:15], function(x) table(factor(x, levels = 0:6)))
colnames(arr) <- paste('p', 1:4, sep = '')
arr
# The simpler call
#   sapply(results[ , 12:15], function(x) table(x))
# would not work if there were a zero count in some cell of the table.
# Visualizations:
barplot(arr, beside = TRUE, legend.text = paste(0:6))
dotchart(arr)
Last updated 2011-02-21 11:09
petri.koistinen 'at' helsinki.fi