Software Tools / R: Homework 3: due Monday, 21 Feb, 2011

Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi), at the latest on Mon 21 Feb at 10.00. Use R3 as the title of your message.


Exercise 1. (Using generic functions)

We continue exploring the data set tn10.csv from the probability theory course. Now we want to produce a clean printout that shows, for each student, the values of the following variables:

StudentNumber ExamPoints ExtraPoints Grade
Here StudentNumber is the same as the variable opisnro, ExamPoints is the sum of the points from the two course exams, ExtraPoints is calculated as in Exercise 2 from R2, and Grade (one of 0, 1, ..., 5) is determined based on the sum of ExamPoints and ExtraPoints, as follows.
Range            Grade
 0 <= sum < 21   0
21 <= sum < 27   1
27 <= sum < 32   2
32 <= sum < 37   3
37 <= sum < 42   4
42 <= sum        5
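
One possible way to turn a numeric sum into a grade according to a table like the one above is cut(). The following is only a sketch: the vector total holds made-up point totals (not values from tn10.csv), and the names total, limits and grade are chosen here for illustration.

```r
# Hypothetical point totals, chosen so that several fall exactly on a boundary
total <- c(0, 20.5, 21, 26, 27, 36.5, 41, 42, 48)

# Lower limits of the grades 0, ..., 5; Inf closes the last interval
limits <- c(0, 21, 27, 32, 37, 42, Inf)

# right = FALSE makes the intervals [a, b), which matches "a <= sum < b"
grade <- cut(total, breaks = limits, labels = 0:5, right = FALSE)

# cut() returns a factor; go through character to recover the numbers 0, ..., 5
grade <- as.integer(as.character(grade))

print(data.frame(total, grade))
```

With the default right = TRUE the intervals would be (a, b], so a sum of exactly 21 would fall into grade 0 instead of grade 1.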

The results should be given so that the student numbers are in ascending order. There should not be anything besides the values of the variables on each row.

Hint: form a data frame with variable names StudentNumber, ExamPoints, ExtraPoints and Grade, sort it according to the value of StudentNumber, and print it. You should give print() some extra argument to prevent it from printing the row names.
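
The hint can be sketched on a small made-up data frame. The student numbers and points below are invented for illustration and do not come from tn10.csv; the object names res and sorted are likewise arbitrary.

```r
# Hypothetical results for three students (invented values)
res <- data.frame(
  StudentNumber = c(30412, 10577, 20938),
  ExamPoints    = c(31, 44, 20),
  ExtraPoints   = c(4, 6, 3),
  Grade         = c(3, 5, 1)
)

# order() gives the row permutation that sorts StudentNumber ascending
sorted <- res[order(res$StudentNumber), ]

# The print method for data frames accepts row.names = FALSE,
# which suppresses the row-name column in the output
print(sorted, row.names = FALSE)
```

Calling print() on a data frame dispatches to print.data.frame(), which is why the exercise is filed under generic functions.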


Exercise 2. (Parallel boxplots)

Your instructor suspects that solving lots of exercises is correlated with getting lots of points in the exams. Now we try to explore the validity of this idea using graphical tools. First, remove those students who did not attend one of the two course exams (K11 or K21 is then missing). For the rest of the students, define a factor ExerciseActivity with levels low and high according to whether the sum of the points from the exercise sessions is <= 12 or > 12. Then produce parallel boxplots showing the distribution of the sum of the points from the two course exams within these two groups.
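
The steps described above (drop incomplete cases, define a two-level factor, draw boxplots by group) might look roughly as follows on toy data. The data frame d and the column ex (standing in for the sum of exercise points) are assumptions made for this sketch; only the names K11, K21 and ExerciseActivity come from the exercise text.

```r
# Hypothetical data: two exam columns and a sum of exercise points
d <- data.frame(
  K11 = c(10, NA, 15, 20,  8, 22),
  K21 = c(12, 14, NA, 18,  9, 21),
  ex  = c( 5, 20, 13, 16, 12, 25)
)

# Keep only the students who attended both exams
d <- d[!is.na(d$K11) & !is.na(d$K21), ]

# Two-level factor; giving levels explicitly fixes the order low, high
d$ExerciseActivity <- factor(ifelse(d$ex <= 12, "low", "high"),
                             levels = c("low", "high"))

# Sum of the two exam scores
d$ExamSum <- d$K11 + d$K21

# The formula interface draws one box per factor level, side by side
boxplot(ExamSum ~ ExerciseActivity, data = d,
        xlab = "Exercise activity", ylab = "Sum of exam points")
```

Without the levels argument, factor() would order the levels alphabetically (high before low), which would also reverse the order of the boxes.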


Exercise 3. (Several line plots in a single figure)

Write R code for producing this figure. The function sin(x)/x has been plotted on (0, 20) using a wide solid red line. The figure also contains the graphs of the functions 1/x and -1/x, drawn with thin blue dashed lines. The range of the y-axis has been set to (-0.5, 1.5). The axis labels are 'x' and 'y'.
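
As an illustration of the general technique (plot() for the first curve, lines() for the others), here is a sketch with a different function, a damped cosine and its envelope. The function and the variable names x and envelope are chosen for this example only and are not part of the exercise.

```r
# Evaluation grid on [0, 20]
x <- seq(0, 20, length.out = 400)
envelope <- exp(-x / 5)

# The first call sets up the axes, labels and y-range, and draws the main curve
plot(x, envelope * cos(x), type = "l",
     lwd = 3, lty = "solid", col = "red",
     ylim = c(-1.2, 1.2), xlab = "x", ylab = "y")

# lines() adds further curves to the existing plot
lines(x,  envelope, lwd = 1, lty = "dashed", col = "blue")
lines(x, -envelope, lwd = 1, lty = "dashed", col = "blue")
```

If a function is not defined at an endpoint of the interval, the grid can start slightly inside the interval instead of exactly at the endpoint.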


Exercise 4. (Comparing a histogram with a density function)

A textbook claims that the following method simulates n values from the t distribution with nu = 4 degrees of freedom (without using the function rt).

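n <- 1000                          # number of simulated values
nu <- 4                            # degrees of freedom
y <- rgamma(n, nu/2, nu/2)         # gamma with shape nu/2 and rate nu/2
x <- rnorm(n, sd = 1 / sqrt(y))    # normal with mean 0 and variance 1/y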

(Actually, it would be much easier to simulate the t distribution with the call rt(n, df=4) instead.) To check this claim, plot the probability density histogram of the simulated x values and the graph of the t density (function dt()) in the same figure.

Draw the histogram with hist(), but use more bins than what hist() gives you by default, by specifying the number of bins with the argument breaks. Specify that you want a probability density histogram instead of a frequency histogram using the argument probability = TRUE. Specify the axis limits so that both the histogram and the density function fit nicely on the plotting area, and give the plot a meaningful title and axis labels.
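
The same kind of check can be sketched for a simpler case: exponential variates generated by the inverse transform method, compared against dexp(). The distribution, the seed and the names sim, rate, h and grid are assumptions made for this illustration.

```r
set.seed(2011)
n <- 1000
rate <- 2

# Inverse transform method: -log(U) / rate is exponential with the given rate
sim <- -log(runif(n)) / rate

# Density histogram; a single number for breaks is treated as a suggestion,
# the actual breakpoints are rounded to "pretty" values
h <- hist(sim, breaks = 40, probability = TRUE,
          xlim = c(0, 4), ylim = c(0, 2.5),
          main = "Simulated values vs. exponential density",
          xlab = "x", ylab = "Density")

# Overlay the theoretical density on the same axes
grid <- seq(0, 4, length.out = 200)
lines(grid, dexp(grid, rate = rate), col = "red", lwd = 2)
```

Because probability = TRUE scales the bars so that their total area is one, the histogram and the density curve are directly comparable on the same y-axis.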


Exercise 5. (Filling in an area under the graph of a function)

This figure demonstrates the critical region of an F-test. It contains the graph of the density of the F-distribution with degrees of freedom parameters df1 = 3 and df2 = 15. You can evaluate the value of the density at x with df(x, df1, df2). The area in the right-hand tail of the distribution, where proportion 0.05 of the probability mass lies, has been colored red. The critical point can be calculated with

qf(0.05, df1, df2, lower.tail = FALSE)

Write R code for producing the figure. Use polygon(..., col = "red") for filling in the tail area. First draw the graph of the pdf, and then define the vertices of the polygon to be filled in. (It is difficult to use exactly the same (x, y) points for both purposes.)
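
The polygon() technique can be sketched with the standard normal density instead of the F density. The choice of distribution, the plotting range and the names x, crit and xs are assumptions for this example.

```r
# Draw the density curve first
x <- seq(-4, 4, length.out = 400)
plot(x, dnorm(x), type = "l", xlab = "x", ylab = "Density")

# Upper 5 % critical point of the standard normal distribution
crit <- qnorm(0.05, lower.tail = FALSE)

# Separate, finer grid that starts exactly at the critical point
xs <- seq(crit, 4, length.out = 100)

# Vertices: start on the axis at crit, follow the curve to the right end,
# then drop back to the axis; polygon() closes the shape automatically
polygon(c(crit, xs, 4), c(0, dnorm(xs), 0), col = "red", border = NA)
```

A separate grid is used for the polygon because the grid used for the curve generally does not contain the critical point itself.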


Exercise 6. (Visualizing tables)

We now try to visualize the frequency distributions of the points of the first course exam of the probability theory course. There were four problems, each graded from 0 to 6, and the file lists for each participant the points that person obtained from the problems.

Calculate an array arr whose columns are named 'p1', ..., 'p4' and whose rows are named '0', ..., '6', so that, e.g., arr['3', 'p1'] contains the number of participants who obtained three points from problem one. Once you have done that, try what you get with

barplot(arr, beside = TRUE, legend.text = paste(0:6))

Try also the visualization given by dotchart(arr).

The easiest way to calculate the frequencies for arr is to use nested for-loops (the brute force solution). Alternatively, you can calculate the frequencies with an elegant one-liner where you might use functions such as sapply(), table() and factor() (but this is much more difficult to write).
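
To show how sapply(), table() and factor() can work together, here is a sketch on a tiny made-up data set with three items scored 0 to 3. The data frame pts, its column names and the object counts are invented for this illustration and differ from the exam data.

```r
# Hypothetical scores of four people on three items, each scored 0..3
pts <- data.frame(a = c(0, 2, 2, 3),
                  b = c(1, 1, 1, 0),
                  c = c(3, 3, 0, 2))

# factor(v, levels = 0:3) makes table() report every possible score,
# including scores that nobody obtained (with count 0)
counts <- sapply(pts, function(v) table(factor(v, levels = 0:3)))

# Rows are named by score, columns by item
print(counts)
print(counts["2", "a"])
```

Without factor(), table() would list only the scores that actually occur, so the columns could have different lengths and sapply() would return a list instead of a matrix.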


Last updated 2011-02-14 10:10
Petri Koistinen
petri.koistinen 'at' helsinki.fi