Software Tools / R: Homework 2: due Monday, 7 Feb, 2011

Send your solutions to the lecturer by email (firstname.lastname@helsinki.fi) no later than Mon 7 Feb at 10.00. Use R2 as the subject of your message.


Exercise 1. (Reading data)

The data file tn10.csv corresponds to the Excel file tn10.xls. It was written out from Excel using the CSV (comma-separated values) format.

First take a look at the data file and read the help page of the R function read.table(). Then write a command that creates a data frame called results holding the contents of the file. Check that the data frame contains 82 observations of 19 variables.


Suggested solution:

Reading through the help page, we find that the function read.csv2() does the job.

results <- read.csv2('tn10.csv')
str(results)
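
The choice of read.csv2() suggests that the file uses semicolons as field separators and commas as decimal marks, as Excel does in some European locales. The following sketch illustrates this on a small made-up file. The file contents and the names demo_file, demo and demo2 are invented for illustration and are not taken from tn10.csv.

```r
# A tiny made-up file in the "csv2" style:
# ';' separates the fields and ',' is the decimal mark.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("opisnro;H1;H2",
             "101;2,5;3",
             "102;;1,5"),
           demo_file)

# read.csv2() presets header = TRUE, sep = ";" and dec = ","
demo <- read.csv2(demo_file)

# The same data read with read.table() and explicit arguments
demo2 <- read.table(demo_file, header = TRUE, sep = ";", dec = ",",
                    fill = TRUE)

str(demo)          # variable types; the empty field becomes NA
dim(demo)          # number of observations and number of variables
unlink(demo_file)  # remove the temporary file
```

For the real file, dim(results) should report 82 and 19.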

Exercise 2. (Manipulating a data frame)

The following command creates the data frame results from the data in the file tn10.dat, even if you failed to solve the previous problem:

results <- read.table('tn10.dat')

This data pertains to points from the course on probability theory (Todennäköisyyslaskenta) held in 2010. To protect the culprits, the names of the students have been omitted and their student numbers have been replaced with random numbers. The variables H1, ..., H10 contain points from each of the 10 exercise sessions of the course, the variables K11, ..., K14 contain the points from the problems of the first course exam, and the variables K21, ..., K24 contain the points from the problems of the second course exam.

  1. Write commands which add variables H.sum and K.sum to the data frame. H.sum should contain the sum of the points from the exercise sessions, but all the NA values should be counted as zero (hint: rowSums()). K.sum should contain the sum of the points from both of the two course exams.
  2. Write a command which creates the variable Extra.points based on the value of the variable H.sum such that if H.sum is in the range 10 <= H.sum < 15, then Extra.points is 1; for the range 15 <= H.sum < 20 Extra.points is 2; and so on; finally for 40 points or more Extra.points is 7. A one-line solution can be written using the function cut() (but some of you may find it easier to solve the problem with some other approach).
  3. How do you find out the student numbers (variable opisnro) of those students who did not obtain any points from the two course exams? (Watch out for missing values.)

Suggested solution:

# 1:
H.sum <- rowSums(results[ , 2:11], na.rm = TRUE)
results$H.sum <- H.sum
K.sum <- with(results,
  K11 + K12 + K13 + K14 + K21 + K22 + K23 + K24)

# Replace NA sums with zero (the sum is NA if any of the exam problems is NA)
K.sum[is.na(K.sum)] <- 0
results$K.sum <- K.sum

# 2:
lims <- c(0, seq(10, 40, by = 5), Inf)
results$Extra.points <-
  (0:7)[cut(results$H.sum,  breaks = lims, right = FALSE)]


# 3:
# the following works irrespective of whether NA's have been removed or not

ind <- is.na(K.sum) | (K.sum == 0)
results$opisnro[ind]
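
The solution above can be checked on a small made-up data frame. The sketch below uses invented scores and only two exam variables (the names toy, strict, lenient, h and extra are hypothetical). It compares adding the columns directly with rowSums(..., na.rm = TRUE), and it tests the cut() breaks at the boundaries of the ranges. Whether a student with an NA in only one exam should count as having no points depends on how missing exams are coded in the real data, which is not stated here.

```r
# Made-up scores, not taken from tn10.dat
toy <- data.frame(opisnro = c(11, 12, 13, 14),
                  K11 = c(3, NA, 0, NA),
                  K21 = c(4, 5, 0, NA))

# Adding the columns directly: a single NA makes the whole sum NA
strict <- with(toy, K11 + K21)

# rowSums() with na.rm = TRUE counts each NA as zero instead
lenient <- rowSums(toy[, c("K11", "K21")], na.rm = TRUE)

strict    # values: 7 NA 0 NA
lenient   # values: 7 5 0 0

# Students without any exam points under the lenient rule
toy$opisnro[lenient == 0]   # 13 14

# Boundary check for the breaks used in part 2
lims <- c(0, seq(10, 40, by = 5), Inf)
h <- c(0, 9.5, 10, 14.5, 15, 39.5, 40, 60)
extra <- (0:7)[cut(h, breaks = lims, right = FALSE)]
extra     # 0 0 1 1 2 6 7 7
```

Indexing (0:7) with the factor returned by cut() works because a factor used as an index is treated as its integer codes 1, ..., 8.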

Exercise 3. (Function apply and related functions)

Applied to a vector x, the following call calculates quantiles corresponding to the probabilities 0.1, 0.5 and 0.9.

quantile(x, probs = c(0.1, 0.5, 0.9))

Now we want to calculate these quantiles for each of the numeric variables of the iris data set. Give a solution using one of the apply-type functions.

Suggested solution:

First find out which of the variables are numeric, then use lapply() or sapply(). In the call sapply(X, FUN, ...) any additional arguments are passed on to the function FUN.

data(iris)
ind <- sapply(iris, is.numeric)
sapply(iris[ind], quantile, probs = c(0.1, 0.5, 0.9))
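
For comparison, here is a sketch of the same calculation with lapply() and vapply(). lapply() always returns a list, while vapply() checks that every result has the type and length declared in FUN.VALUE. The names probs, num, q_list and q_mat are chosen here for illustration.

```r
data(iris)
probs <- c(0.1, 0.5, 0.9)

# TRUE for the four measurement columns, FALSE for the factor Species
num <- vapply(iris, is.numeric, logical(1))

# lapply() returns a list with one element per numeric variable
q_list <- lapply(iris[num], quantile, probs = probs)

# vapply() returns a matrix and stops with an error unless each
# result is a numeric vector of length 3
q_mat <- vapply(iris[num], quantile, FUN.VALUE = numeric(3), probs = probs)

q_list$Sepal.Length   # the three quantiles of one variable
q_mat["50%", ]        # the medians of all numeric variables
```

In this case the matrix q_mat agrees with the result of the sapply() call above.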

Exercise 4. (Writing functions)

In the U.S. temperatures are usually expressed in degrees Fahrenheit (F) instead of degrees Celsius (C), which are used in the rest of the world. The conversion formula between the two temperature scales is the following.

C = 5 / 9 * (F - 32)

Write a function FtoC which converts temperatures given in degrees Fahrenheit into degrees Celsius. Also write a function CtoF which converts temperatures given in degrees Celsius into degrees Fahrenheit.


Suggested solution:

FtoC <- function(F) 5/9*(F - 32)


CtoF <- function(C) 9/5 * C + 32

# Try them out:

FtoC(seq(0, 100, by = 5))
CtoF(seq(-20, 40, by = 5))
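
Solving C = 5 / 9 * (F - 32) for F gives F = 9 / 5 * C + 32, which is the formula used in CtoF. As a sanity check, the sketch below repeats the two definitions so that it runs on its own, tries a few well-known temperatures, and verifies that converting back and forth returns the input (the name temps is chosen here for illustration).

```r
FtoC <- function(F) 5/9 * (F - 32)
CtoF <- function(C) 9/5 * C + 32

FtoC(c(32, 212))   # freezing and boiling points of water: 0 and 100
CtoF(-40)          # the two scales coincide at -40

# A round trip should return the input, up to rounding error
temps <- seq(-20, 40, by = 5)
all.equal(FtoC(CtoF(temps)), temps)
```

all.equal() is used instead of == because the comparison involves floating point arithmetic.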

Exercise 5. (For-loop)

Supposedly, a school teacher gave C. F. Gauss, at the age of seven, the problem of summing the integers 1, 2, ..., 100. Gauss found the answer almost instantly (without having been told about the arithmetic series).

1. How do you find that sum by using the function sum()?

2. For the sake of practice, do the same calculation using a for-loop. Of course, your first solution, with the sum-function, is much clearer and shorter.


Suggested solution:

# 1.
sum(1:100)
# 2.
s <- 0
for (i in 1:100) s <- s + i
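
The story goes that Gauss paired the terms as 1 + 100, 2 + 99, ..., 50 + 51, which gives 50 pairs of 101, that is, n * (n + 1) / 2 with n = 100. The sketch below compares the loop, sum() and this closed form. The variable n is introduced here for illustration.

```r
n <- 100

s <- 0
for (i in seq_len(n)) s <- s + i   # seq_len(n) is also safe when n is 0

s                 # 5050
sum(seq_len(n))   # 5050
n * (n + 1) / 2   # 5050, the closed form
```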

Exercise 6. (Merging data from several sources)

Sometimes the data must be combined from several sources, and there is a key which tells which experimental unit each data item comes from.

The names of the students of the probability theory course are kept in a separate register, from which they have been extracted to the file tn10nimet.dat (but the real names have been changed). Unfortunately, the names are not listed in the same order as in the file tn10.dat where the points can be found. Instead, the student number (variable opisnro in the file tn10.dat and variable nro in the file tn10nimet.dat) identifies the students uniquely.

Produce a data frame whose first three variables contain the student number (opisnro), the first name (etunimi) and the family name (sukunimi), and whose remaining variables hold the points from the exercise sessions and the two course exams. (Hint: merge().)


Suggested solution:

results <- read.csv2('tn10.csv')
nimet <- read.table('tn10nimet.dat', as.is = TRUE)
d <- merge(nimet, results, by.x = 'nro', by.y = 'opisnro')
str(d)
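
The behaviour of merge() is easy to see on two small made-up data frames. In the sketch below the variable names follow the exercise, but the values and the names toy_names, toy_points, inner and outer are invented for illustration. Note that the key column of the result takes its name from by.x, so it has to be renamed if it should be called opisnro.

```r
# Made-up registers, listed in different orders and with
# one student number missing from each
toy_names <- data.frame(nro = c(3, 1, 2),
                        etunimi = c("Aino", "Eero", "Liisa"),
                        sukunimi = c("Virtanen", "Korhonen", "Nieminen"),
                        stringsAsFactors = FALSE)
toy_points <- data.frame(opisnro = c(1, 2, 4),
                         H1 = c(2, NA, 3))

# Default: keep only the keys which occur in both data frames
inner <- merge(toy_names, toy_points, by.x = "nro", by.y = "opisnro")

# all = TRUE keeps the unmatched rows as well and fills the gaps with NA
outer <- merge(toy_names, toy_points, by.x = "nro", by.y = "opisnro",
               all = TRUE)

names(inner)[1] <- "opisnro"   # rename the key column
inner
outer
```

Comparing nrow(inner) with nrow(outer) is a quick way to detect student numbers which occur in only one of the two sources.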

Last updated 2011-02-11 17:33
Petri Koistinen
petri.koistinen 'at' helsinki.fi