Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi), at the latest on Mon 21 Feb at 10.00. Use R3 as the title of your message.

We continue exploring the data set tn10.csv from the probability theory course. Now we want to produce a nice printout, which shows for each student the values of the following variables

StudentNumber ExamPoints ExtraPoints Grade

Here StudentNumber is the same as the variable opisnro, ExamPoints is the sum of the points from the two course exams, ExtraPoints is calculated as in Exercise 2 from R2, and Grade (one of 0, 1, ..., 5) is determined based on the sum of ExamPoints and ExtraPoints, as follows.

Range | Grade |
---|---|
0 <= sum < 21 | 0 |
21 <= sum < 27 | 1 |
27 <= sum < 32 | 2 |
32 <= sum < 37 | 3 |
37 <= sum < 42 | 4 |
42 <= sum | 5 |

The results should be given so that the student numbers are in ascending order. There should not be anything besides the values of the variables on each row.

Hint: form a data frame with variable names
StudentNumber, ExamPoints, ExtraPoints and Grade, sort it according to
the value of StudentNumber, and print it.
You should give `print()` some extra argument to prevent it from
printing the row names.
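
The steps in the hint can be sketched on a small made-up data frame. This is only an illustration under assumptions: the values below are invented, `ExtraPoints` is typed in directly instead of being calculated as in R2, and the names `toy`, `limits` and `sorted` are hypothetical. It uses `findInterval()` for the grade limits, which is one of several possible ways to do the mapping.

```r
# Toy data standing in for the real file; all values are made up.
toy <- data.frame(
  StudentNumber = c(1003, 1001, 1002),
  ExamPoints    = c(18, 40, 25),
  ExtraPoints   = c(2, 3, 4)
)

# Lower limits of the grades 1..5. findInterval() counts how many of
# these limits are <= the total, and that count is the grade 0..5.
limits <- c(21, 27, 32, 37, 42)
toy$Grade <- findInterval(toy$ExamPoints + toy$ExtraPoints, limits)

# Sort by student number and print without the row names.
sorted <- toy[order(toy$StudentNumber), ]
print(sorted, row.names = FALSE)
```

Because the intervals in the table are closed on the left, a total of exactly 21 gets grade 1 with this approach.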

Your instructor suspects that solving lots of exercises is correlated
with getting lots of points in the exams.
Now we try to explore the validity of this
idea using
graphical tools. First, remove those students who did not attend one
of the two course exams (`K11` or `K21` is then missing).
For the rest of the students, define a factor `ExerciseActivity`
with levels `low` and `high` according to whether
the sum of the points from the exercise sessions is <= 12 or > 12.
Then produce parallel boxplots showing the distribution of the sum
of the points
from the two course exams within these two groups.
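
A minimal sketch of these steps on made-up data follows. `K11` and `K21` are the exam columns named above, while `ExerciseSum`, `ExamSum`, `toy` and `both` are hypothetical names introduced here, and the real file may well store the exercise points in several columns that first have to be summed.

```r
# Made-up data; ExerciseSum is a hypothetical column holding the
# total of the exercise session points.
toy <- data.frame(
  K11         = c(10, NA, 20, 15, 22, 8),
  K21         = c(12, 14, NA, 18, 21, 9),
  ExerciseSum = c(5, 20, 18, 13, 25, 12)
)

# Keep only the students who attended both exams.
both <- toy[!is.na(toy$K11) & !is.na(toy$K21), ]

# Two-level factor: "low" for <= 12 exercise points, "high" for > 12.
both$ExerciseActivity <- factor(
  ifelse(both$ExerciseSum <= 12, "low", "high"),
  levels = c("low", "high")
)
both$ExamSum <- both$K11 + both$K21

# Parallel boxplots of the exam total, one box per activity group.
boxplot(ExamSum ~ ExerciseActivity, data = both,
        xlab = "Exercise activity", ylab = "Sum of exam points")
```

Giving `levels = c("low", "high")` explicitly fixes the order of the boxes; otherwise the levels would be sorted alphabetically.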

Write R code for producing this figure. The function sin(x)/x has been plotted on (0, 20) using a wide solid red line. The figure also contains the graphs of the functions 1/x and -1/x drawn with thin blue dashed lines. The range of the y-axis has been set to (-0.5, 1.5). The axis labels are 'x' and 'y'.
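
One possible sketch of such a figure is given below. The grid size and the line width 3 are arbitrary choices, not values taken from the original figure.

```r
# Grid on (0, 20); start slightly above 0 to avoid dividing by zero.
x <- seq(0.01, 20, length.out = 1000)

# Wide solid red line for sin(x)/x, with the requested y-range and labels.
plot(x, sin(x) / x, type = "l", col = "red", lwd = 3,
     ylim = c(-0.5, 1.5), xlab = "x", ylab = "y")

# Thin blue dashed lines for the envelopes 1/x and -1/x.
lines(x,  1 / x, col = "blue", lty = "dashed", lwd = 1)
lines(x, -1 / x, col = "blue", lty = "dashed", lwd = 1)
```

Near zero the curves 1/x and -1/x leave the plotting region; they are simply clipped at the limits given by `ylim`.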

A textbook claims that the
following method simulates `n` values from the
t distribution with nu = 4 degrees of freedom
(without using the function `rt`).

    n <- 1000
    nu <- 4
    y <- rgamma(n, nu/2, nu/2)
    x <- rnorm(n, sd = 1 / sqrt(y))

(Actually, it would be much easier to simulate the t distribution
with the call `rt(n, df=4)` instead.)
To check this claim,
plot the probability density histogram of the
simulated `x` values and the graph of the t density
(function `dt()`) in the same figure.

Draw the histogram with `hist()`, but use more bins than
what `hist()` gives you by
default, by specifying the number of bins with the argument `breaks`.
Specify that you want a probability density histogram instead of
a frequency histogram using the argument `probability = TRUE`.
Specify the axis limits so that both the histogram and the
density function fit nicely on the plotting area, and
give the plot a meaningful title
and axis labels.
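
The comparison could be sketched as follows. The seed, the number of bins, the axis limits and the names `h` and `grid` are all arbitrary choices made for this illustration; other values may suit your simulated sample better.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# The textbook method from above.
n  <- 1000
nu <- 4
y <- rgamma(n, nu / 2, nu / 2)
x <- rnorm(n, sd = 1 / sqrt(y))

# Probability density histogram with more bins than the default.
# The axis limits are chosen by eye: the t density peaks below 0.4.
h <- hist(x, breaks = 100, probability = TRUE,
          xlim = c(-6, 6), ylim = c(0, 0.45),
          main = "Simulated values vs. t density (4 df)",
          xlab = "x", ylab = "Density")

# Overlay the t density with 4 degrees of freedom.
grid <- seq(-6, 6, length.out = 400)
lines(grid, dt(grid, df = nu), col = "blue", lwd = 2)
```

Note that `breaks = 100` is only a suggestion to `hist()`, so the actual number of bins may differ, and that a few simulated values may fall outside the chosen `xlim` because the t distribution has heavy tails.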

This figure demonstrates the critical region of an F-test.
It contains the graph of the density of the F-distribution
with degrees of freedom parameters
`df1` = 3 and `df2` = 15. You can evaluate the value of the
density at `x` with `df(x, df1, df2)`.
The area in the right-hand tail of the distribution, where
proportion 0.05 of the probability mass lies, has been colored red.
The critical point can be calculated with

qf(0.05, df1, df2, lower.tail = FALSE)

Write R code for producing the figure. Use polygon(..., col = "red") for filling in the tail area. First draw the graph of the pdf, and then define the vertices of the polygon to be filled in. (It is difficult to use exactly the same (x, y) points for both purposes.)
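
A minimal sketch of this approach is shown below. The plotting range (0, 6), the grid sizes and the names `crit` and `xt` are assumptions made for the illustration.

```r
df1 <- 3
df2 <- 15
crit <- qf(0.05, df1, df2, lower.tail = FALSE)

# Graph of the density on an arbitrarily chosen range (0, 6).
x <- seq(0, 6, length.out = 500)
plot(x, df(x, df1, df2), type = "l", xlab = "x", ylab = "density",
     main = "Critical region of an F-test")

# Separate grid for the tail. The polygon starts on the x-axis at the
# critical point, follows the density curve to the right edge, and
# returns to the x-axis, so that it is closed along the axis.
xt <- seq(crit, 6, length.out = 200)
polygon(c(crit, xt, 6), c(0, df(xt, df1, df2), 0), col = "red")
```

The tail grid `xt` starts exactly at the critical point, which is why it is easier to build it separately than to reuse the points in `x`.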

We now try to visualize the frequency distributions of the points of the first course exam of the probability theory course. There were four problems graded 0 to 6, and the file lists for each of the participants the points that person obtained from the problems.

Calculate an array `arr`, whose columns you should give
names 'p1', ..., 'p4' and rows names '0', ..., '6', so that, e.g.,
`arr['3', 'p1']` contains the number of participants who obtained
three points from problem one.
Once you have done that, try what you get with

barplot(arr, beside = TRUE, legend.text = paste(0:6))

Try also the visualization given by `dotchart(arr)`.

The easiest way to
calculate the frequencies for `arr` is to use nested for-loops
(the brute force solution).
Alternatively, you can
calculate the frequencies with an elegant one-liner where you might
use functions such as
`sapply()`, `table()` and `factor()`
(but this is much more difficult to write).
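
To illustrate the idea behind the one-liner, here is a sketch on made-up points for five participants. The data frame `pts` and its column names are hypothetical; the real file may name the problem columns differently, in which case the columns of `arr` have to be renamed.

```r
# Made-up points of five participants for the four problems.
pts <- data.frame(
  p1 = c(3, 6, 0, 3, 5),
  p2 = c(1, 1, 2, 6, 4),
  p3 = c(0, 0, 0, 6, 6),
  p4 = c(2, 3, 4, 5, 6)
)

# factor(..., levels = 0:6) keeps the scores nobody obtained, so
# table() returns seven counts for every problem, and sapply() binds
# these counts into a 7 x 4 matrix with rows '0'..'6'.
arr <- sapply(pts, function(p) table(factor(p, levels = 0:6)))

barplot(arr, beside = TRUE, legend.text = paste(0:6))
```

Without `factor()` the tables for different problems could have different lengths, and `sapply()` would then return a list instead of a matrix.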

Last updated 2011-02-14 10:10

Petri Koistinen

petri.koistinen 'at' helsinki.fi