Licensed under a Creative Commons BY-NC-SA 4.0 License.
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners, and several studies show that R’s popularity has increased substantially in recent years.
We plan to use R from the graphical tool RStudio, an IDE that provides a friendly working environment, very similar to that of MATLAB. Its main window is divided into four areas (source editor, console, environment/history, and files/plots/packages/help), arranged from top to bottom and from left to right.
Clicking Run executes the selected code in the area below (the console):
# This is a comment
# Simple operation
2^10 - 24
## [1] 1000
Assignment operator: <- (= is also valid):
# Assignment
x <- 1
x = 1
To put two commands on the same line, a semicolon must be used. Writing just the name of a variable makes the console print its value:
# Two commands in the same line
x <- 2^10 - 24; x
## [1] 1000
Creating a sequence:
# Simple sequence creation
x <- 1:10; x
## [1] 1 2 3 4 5 6 7 8 9 10
# More complex sequence creation by using seq(first, last, step) function
y <- seq(1, 2, 0.1); y
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
# Sequence made up as a concatenation of different parts (1, 5, and seq(100, 110, 2)).
z <- c(1, 5, seq(100, 110, 2)); z
## [1] 1 5 100 102 104 106 108 110
Accessing the values of a sequence:
# Sequence creation
x <- 1:10;
# Accessing first value of the sequence
x[1]
## [1] 1
To display help about a function, use ?:
# Display help about concatenation function
? c
## starting httpd help server ... done
Function definition:
# - n: number of values of fibonacci sequence to be calculated (required)
# - full: indicates whether the function returns the full sequence or just the n-th value (optional, default FALSE)
fibonacci <- function(n, full=FALSE) {
  # We reserve some memory for a numerical vector of length n
  fibvals <- numeric(n)
  # First two values of Fibonacci's sequence
  fibvals[1] <- 1
  fibvals[2] <- 1
  # Loop to compute the next values
  if (n > 2) {
    x <- 3:n
    for (i in x) {
      fibvals[i] <- fibvals[i-1] + fibvals[i-2]
    }
  }
  if (!full) {
    # Default behaviour: return just the n-th value
    return(fibvals[n])
  } else {
    # If full=TRUE, return the whole sequence
    return(fibvals)
  }
}
# Just the 10-th element
fibonacci(10)
## [1] 55
# The whole series until the 10-th element
fibonacci(10, TRUE)
## [1] 1 1 2 3 5 8 13 21 34 55
The packages base and stats are available with the default R installation and provide many basic functions that are typically not available in other languages.
To take a look at the set of available distributions, just check the help:
? distributions
For each distribution xxx there are four functions available:
- rxxx(n, ...): generates \(n\) samples following the xxx distribution.
- dxxx(x, ...): value of the density function \(f(x)\) at the points specified by x (value of \(\Pr(X=x)\) for the case of discrete random variables).
- pxxx(q, ...): value of the cumulative distribution function (CDF) \(F(q) = \int_{-\infty}^q f(x)dx\).
- qxxx(p, ...): value of the p-th quantile, i.e., the \(q\) such that \(F(q) = p\) (inverse of the cumulative distribution function).

Example: Generation of 1000 samples of a \(N(0,1)\) (Gaussian with zero mean and variance 1) and representation of the results:
# Random number generation from a normal distribution
x <- rnorm(1000, 0, 1)
# Representation of the results
plot(x)
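Besides rnorm, the corresponding density, CDF and quantile functions can be evaluated directly; a quick sketch (the numerical values in the comments are approximate):
# Density, CDF and quantile functions of the N(0,1) at single points
dnorm(0, 0, 1)       # density at x = 0 (peak of the bell), ~0.3989
pnorm(1.96, 0, 1)    # F(1.96), ~0.975
qnorm(0.975, 0, 1)   # 0.975-quantile (inverse of the CDF), ~1.96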
Values of the mean, median, variance and standard deviation:
mean(x)
## [1] -0.02482999
median(x)
## [1] -0.01313418
var(x)
## [1] 1.009967
sd(x)
## [1] 1.004971
The summary function is very versatile. It takes as input a large number of different types of data and provides a lot of information:
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.37174 -0.68834 -0.01313 -0.02483 0.66443 3.49530
A density function represents how a random variable is distributed. For example, the density function \(f(x)\) of an \(N(0,1)\) distribution has the classical Gaussian bell shape, centered at zero. This is the empirical distribution of a generated data set:
# Empirical
## Random number generation (from a normal distribution)
x <- rnorm(1000, 0, 1)
## Representation of the density function of the generated values
plot(density(x))
# Theoretical
## We now overlay the theoretical density function of an N(0,1) using the dnorm function.
## - add=TRUE overlays the curve on the latest plot generated (instead of creating a new one)
## - col="red" colour of the line
## - lty=2 line type (dashed)
curve(dnorm(x, 0, 1), add=TRUE, lty=2, col="red")
Similarly, we can plot the empirical CDF together with the theoretical one:
# Empirical CDF
x <- rnorm(1000, 0, 1)
Fx <- ecdf(x)
plot(x, Fx(x))
# And overlay the theoretical F(x) using the pnorm function
curve(pnorm(x, 0, 1), add=TRUE, lty=2, col="red")
Another tool to analyse and compare two distributions is the Q-Q plot. Q stands for quantile, and the analysis consists of visually comparing the values of the same quantiles for two different sets of data; in other words, it compares how the two data sets grow. If the result is a line with zero offset and slope 1, then both sets have a similar shape, and therefore they probably follow the same random variable.
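For instance, a minimal sketch checking a normal sample against the theoretical normal quantiles (using the base functions qqnorm and qqline):
# Q-Q plot of a normal sample against the theoretical normal quantiles
x <- rnorm(1000, 0, 1)
qqnorm(x)
qqline(x, lty=2, col="red")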
Generate 1000 samples of a uniform distribution between 0 and 1 (square pulse) and compare the empirical and theoretical density functions:
# Empirical
x <- runif(1000, 0, 1)
plot(density(x))
# Theoretical
curve(dunif(x, 0, 1), add=TRUE, lty=2, col="red")
Looking at the figure, it is difficult to conclude whether our data set follows the theoretical distribution just from the density functions. Let us now check the Q-Q plot:
# Q-Q plot generation
qqplot(x, runif(length(x), 0, 1))
# Plots a line with zero offset and slope 1
abline(0, 1, lty=2, col="red")
In this case, we can see that both data sets grow in a similar way. Hence, they probably follow the same random variable.
Generate two sets of 1000 samples from an exponential distribution and compare them using a Q-Q plot.
x <- rexp(1000)
y <- rexp(1000)
qqplot(x, y)
abline(0, 1, lty=2, col="red")
When comparing distributions with a Q-Q plot, we sometimes see some noise towards the edges. This is due to the lack of samples in those regions (and therefore the quantiles are less accurate). To increase the accuracy in those regions, we need a higher number of samples:
# Higher number of samples
x <- rexp(1000000)
y <- rexp(1000000)
qqplot(x, y)
abline(0, 1, lty=2, col="red")
When exploring large data sets, none of the previous metrics/tools is enough on its own. To illustrate this, let us take the synthetic dataset called Datasaurus Dozen, available on CRAN:
if (!requireNamespace("datasauRus", quietly=TRUE))
  install.packages("datasauRus")
library(datasauRus)
unique(datasaurus_dozen$dataset)
## [1] "dino" "away" "h_lines" "v_lines" "x_shape"
## [6] "star" "high_lines" "dots" "circle" "bullseye"
## [11] "slant_up" "slant_down" "wide_lines"
head(datasaurus_dozen)
## dataset x y
## 1 dino 55.3846 97.1795
## 2 dino 51.5385 96.0256
## 3 dino 46.1538 94.4872
## 4 dino 42.8205 91.4103
## 5 dino 40.7692 88.3333
## 6 dino 38.7179 84.8718
We have 13 datasets (identified by the dataset column), and every dataset consists of two columns, \(x\) and \(y\). Let us compute averages and standard deviations for both coordinates and per dataset:
for (name in unique(datasaurus_dozen$dataset)) {
  dataset <- subset(datasaurus_dozen, dataset==name)
  print(sprintf(
    "mean_x = %.3f, sd_x = %.3f, mean_y = %.3f, sd_y = %.3f <- %s",
    mean(dataset$x), sd(dataset$x), mean(dataset$y), sd(dataset$y), name))
}
## [1] "mean_x = 54.263, sd_x = 16.765, mean_y = 47.832, sd_y = 26.935 <- dino"
## [1] "mean_x = 54.266, sd_x = 16.770, mean_y = 47.835, sd_y = 26.940 <- away"
## [1] "mean_x = 54.261, sd_x = 16.766, mean_y = 47.830, sd_y = 26.940 <- h_lines"
## [1] "mean_x = 54.270, sd_x = 16.770, mean_y = 47.837, sd_y = 26.938 <- v_lines"
## [1] "mean_x = 54.260, sd_x = 16.770, mean_y = 47.840, sd_y = 26.930 <- x_shape"
## [1] "mean_x = 54.267, sd_x = 16.769, mean_y = 47.840, sd_y = 26.930 <- star"
## [1] "mean_x = 54.269, sd_x = 16.767, mean_y = 47.835, sd_y = 26.940 <- high_lines"
## [1] "mean_x = 54.260, sd_x = 16.768, mean_y = 47.840, sd_y = 26.930 <- dots"
## [1] "mean_x = 54.267, sd_x = 16.760, mean_y = 47.838, sd_y = 26.930 <- circle"
## [1] "mean_x = 54.269, sd_x = 16.769, mean_y = 47.831, sd_y = 26.936 <- bullseye"
## [1] "mean_x = 54.266, sd_x = 16.769, mean_y = 47.831, sd_y = 26.939 <- slant_up"
## [1] "mean_x = 54.268, sd_x = 16.767, mean_y = 47.836, sd_y = 26.936 <- slant_down"
## [1] "mean_x = 54.267, sd_x = 16.770, mean_y = 47.832, sd_y = 26.938 <- wide_lines"
which are nearly identical. But are the distributions similar? In this case, it is enough to simply plot each dataset to verify that they are actually very different:
library(ggplot2)
ggplot(datasaurus_dozen, aes(x=x, y=y)) +
  geom_point() +
  facet_wrap(~dataset, ncol=4)
The distribution of the addition of two independent random variables is the convolution of their individual distributions.
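That is, if \(Z = X + Y\) with \(X\) and \(Y\) independent, the density of \(Z\) is
\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(z-t)\,f_Y(t)\,dt \]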
# Empirical
x <- rnorm(1000, 0, 1)
y <- rnorm(1000, 10, 3)
z <- x + y
# Plot density function
plot(density(z))
# To confirm that the addition results in a convolution, we follow these steps
# to plot the theoretical function on top:
# Density function of x
fx <- function(x){
  dnorm(x, 0, 1)
}
# Density function of y
fy <- function(x){
  dnorm(x, 10, 3)
}
# Integrand of the convolution (evaluated at the integration variable x, for a fixed point y)
fadd <- function(x, y){
  return(fx(y-x)*fy(x))
}
# Density function of the addition
fz <- function(x){
  # integrate() returns a single value, so fz takes a single point x
  value <- integrate(fadd, -Inf, Inf, x)$value
  return(value)
}
# Vectorize the function that calculates the integral
# (curve() will call fz with a vector of points, so fz must accept a vector
#  and return a vector of the same length)
fz <- Vectorize(fz)
curve(fz(x), add=TRUE, lty=2, col="red")
We now repeat the procedure adding two uniform distributions between 0 and 1 (two square pulses), which should result in a triangular pulse with double width:
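Indeed, convolving two unit square pulses gives the triangular density
\[ f_Z(z) = \begin{cases} z & 0 \le z \le 1 \\ 2-z & 1 < z \le 2 \\ 0 & \text{otherwise} \end{cases} \]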
# Empirical
x <- runif(1000, 0, 1)
y <- runif(1000, 0, 1)
z <- x + y
plot(density(z))
# Theoretical
fx <- function(x){
  dunif(x, 0, 1)
}
fy <- function(x){
  dunif(x, 0, 1)
}
fadd <- function(x, y){
  return(fx(y-x)*fy(x))
}
fz <- function(x){
  value <- integrate(fadd, -Inf, Inf, x)$value
  return(value)
}
fz <- Vectorize(fz)
curve(fz(x), add=TRUE, lty=2, col="red")
The conditional probability of A given B is defined as
\[\Pr(A | B) = \frac{\Pr(A \cap B)}{\Pr(B)}\]
Correspondingly,
\[\Pr(B | A) = \frac{\Pr(A \cap B)}{\Pr(A)}\]
Based on the above, the probability of both events A and B can be expressed as
\[ \Pr(A \cap B) = \Pr(A | B)\Pr(B) = \Pr(B | A)\Pr(A)\]
which results in the well-known Bayes’ theorem:
\[ \Pr(A | B) = \frac{\Pr(B | A)\Pr(A)}{\Pr(B)}\]
Bayes’s theorem is to the theory of probability what Pythagoras’s theorem is to geometry. (Sir Harold Jeffreys)
We next illustrate the usefulness of the theorem with an example. Assume there is a test to detect a certain illness which affects 1 in every 10,000 people. The sensitivity (probability of detection) of the test is 0.99 and its specificity (i.e., the probability of a negative result in healthy individuals) is 0.99.
Given that the result of the test is positive, what is the probability of being ill? To sum up, the following data is available so far:
- \(\Pr(\mathrm{ill}) = 0.0001\)
- \(\Pr(+ | \mathrm{ill}) = 0.99\) (sensitivity)
- \(\Pr(- | \mathrm{healthy}) = 0.99\) (specificity)
The objective is to compute the probability of having the illness given that the result of the test was positive. According to Bayes’ theorem:
\[ \Pr(\mathrm{ill} | +) = \frac{\Pr(+ | \mathrm{ill})\Pr(\mathrm{ill})}{\Pr(+)}\]
To compute the probability of the test resulting positive:
\[\begin{aligned} \Pr(+) &= \Pr(+ | \mathrm{ill})\Pr(\mathrm{ill}) + \Pr(+ | \mathrm{healthy})\Pr(\mathrm{healthy}) \\ &= \Pr(+ | \mathrm{ill})\Pr(\mathrm{ill}) + \left(1-\Pr(- | \mathrm{healthy})\right)\left(1-\Pr(\mathrm{ill})\right) \\ & = 0.99\cdot0.0001 + (1-0.99)\cdot(1-0.0001) = 0.010098 \end{aligned}\]
Therefore the result is
\[ \Pr(\mathrm{ill} | +) = \frac{0.99\cdot0.0001}{0.010098} \approx 0.01 \]
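The same computation can be done directly in R (a quick numerical check; the variable names below are ours):
# Direct numerical application of Bayes' theorem
p_ill <- 0.0001
sensitivity <- 0.99
specificity <- 0.99
p_positive <- sensitivity * p_ill + (1 - specificity) * (1 - p_ill)
sensitivity * p_ill / p_positive  # ~0.0098, i.e., about 1%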
To confirm this (counter-intuitive but) low value, we will run some simulations with the help of R.
# Function to generate healthy/ill data
population <- function(n) {
  # Logical vector (TRUE/FALSE): each individual is sick with probability 0.0001
  sick <- runif(n) < 0.0001
  people_df <- data.frame(sick)
  return(people_df)
}
# Next, we specify a function that emulates the result of a test
# with the performance described above
test <- function(people_df, sensitivity, specificity) {
  random <- runif(nrow(people_df))
  people_df$testedPositive <- (people_df$sick & (random < sensitivity)) |
    (!people_df$sick & (random > specificity))
  return(people_df)
}
# We generate a population similar in size to Madrid
madrid <- population(3000000)
# The result is a one-column data frame storing who is sick
head(madrid)
## sick
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
# Everybody is tested
madrid <- test(madrid, 0.99, 0.99)
# A new column indicates whether each test was positive
head(madrid)
## sick testedPositive
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE
# And now we compute the proportion of truly ill individuals among those
# that got a positive result from the test
positive <- subset(madrid, testedPositive)
positiveAndSick <- subset(madrid, testedPositive & sick)
nrow(positiveAndSick) / nrow(positive)
## [1] 0.009713275
And this is the reason why these tests are repeated.
It is left as an exercise for the reader to repeat the test multiple times over those identified as positive. What is the probability of having the illness if a second test is also positive? And a third test? What is more important, sensitivity or specificity? Why?
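A possible starting point for the first repetition (a sketch that reuses the test() function defined above with the same sensitivity and specificity):
# Re-test only the individuals that tested positive the first time
second <- test(positive, 0.99, 0.99)
nrow(subset(second, testedPositive & sick)) / nrow(subset(second, testedPositive))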
Assume that in a given network, traffic generation follows the models described in the table below, where the first column describes the traffic type, the second column the distribution of the length of the datagrams, and the third column the relative amount of each type of datagram. Compute the expected length of a datagram picked at random.
| Application | Length (B) | % |
|---|---|---|
| Skype | \(U(50,150)\) | 5 |
| P2P | \(U(1000, 1500)\) | 60 |
| Web | \(exp(1/1000)\) | 25 |
| | \(N(800,100)\) | 10 |
It is relatively straightforward to solve this problem with R. Furthermore, one can estimate the density function of the datagram length, which is defined by four different random variables. To do this, we generate a number of samples from each random variable, with the corresponding weight (third column), and mix them with sample():
# Total number of samples to be generated
N <- 1000
# Generate a vector of samples
pkts <- sample(c(
  runif(0.05 * N, 50, 150),
  runif(0.60 * N, 1000, 1500),
  rexp (0.25 * N, 1/1000),
  rnorm(0.10 * N, 800, 100)
))
With the above, we plot the estimated density function and the average:
plot(density(pkts))
abline(v=mean(pkts), lty=2, col="red")
mean(pkts)
## [1] 1059.478
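For reference, the theoretical expectation follows from the law of total expectation (the uniforms have means 100 and 1250 B, and the exponential has mean 1000 B):
\[ E[L] = 0.05\cdot100 + 0.60\cdot1250 + 0.25\cdot1000 + 0.10\cdot800 = 1085 \]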
According to the figure, the density function cannot be related to any other classical continuous random variable (e.g., it is multi-modal). Concerning the sample average, by the central limit theorem it should follow a normal distribution, so if we repeat the above procedure many times the distribution of the obtained sample averages should resemble a Gaussian distribution.
While the above could be done with a for statement, R provides tools to replace loops with functions. In this case, we can use sapply to apply the same function over a vector and produce a vector of results. To repeat the above procedure 1000 times, we can use the following code:
# Calculates the average of the sample
gen <- function(i) {
  pkts <- sample(c(
    runif(0.05 * N, 50, 150),
    runif(0.60 * N, 1000, 1500),
    rexp (0.25 * N, 1/1000),
    rnorm(0.10 * N, 800, 100)
  ))
  return(mean(pkts))
}
# sapply() calls gen() 1000 times, storing all the results in a vector
pkts_avgs <- sapply(1:1000, gen)
The resulting variable pkts_avgs contains the averages of the 1000 repetitions. We can confirm via visual inspection that the estimated density function resembles the Gaussian one:
plot(density(pkts_avgs))
Furthermore, we can run a Student t-test to obtain an estimate of the average value with a confidence interval:
t <- t.test(pkts_avgs)
t
##
## One Sample t-test
##
## data: pkts_avgs
## t = 2079.7, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1083.589 1085.636
## sample estimates:
## mean of x
## 1084.613
# Solution: 1085
plot(density(pkts_avgs))
abline(v=t$estimate, lty=2, col="red")
abline(v=t$conf.int)
Given two independent random variables \(\mu_1\) and \(\mu_2\), uniformly distributed between 0 and 1, compute the expectation of the random variable defined as their minimum:
\[E[\min(\mu_1, \mu_2)]\]
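Analytically, since \(\Pr(\min(\mu_1,\mu_2) > t) = (1-t)^2\) for \(t \in [0,1]\),
\[ E[\min(\mu_1, \mu_2)] = \int_0^1 \Pr(\min(\mu_1,\mu_2) > t)\,dt = \int_0^1 (1-t)^2\,dt = \frac{1}{3} \]
which we can confirm by simulation: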
f <- function(i){
  u1 <- runif(1000, 0, 1)
  u2 <- runif(1000, 0, 1)
  return(mean(pmin(u1, u2)))
}
out <- sapply(1:1000, f)
t <- t.test(out)
t
##
## One Sample t-test
##
## data: out
## t = 1435, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.3325567 0.3334675
## sample estimates:
## mean of x
## 0.3330121
# Solution: 1/3
plot(density(out))
abline(v=t$estimate, lty=2, col="red")
abline(v=t$conf.int)
40% of the network packets suffer a network delay that can be modeled with a random variable uniformly distributed between \(10\ ms\) and \(70\ ms\), and the remaining 60% suffer a delay modeled by an exponential random variable with mean \(30\ ms\). Compute the average delay of the packets that experience more than \(50\ ms\) of delay.
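Before simulating, the result can be derived analytically. For the uniform part, \(\Pr(U > 50) = 1/3\) and \(E[U | U > 50] = 60\ ms\); for the exponential part, by memorylessness, \(\Pr(E > 50) = e^{-50/30}\) and \(E[E | E > 50] = 50 + 30 = 80\ ms\). Hence
\[ E[D | D > 50] = \frac{0.4\cdot\frac{1}{3}\cdot 60 + 0.6\,e^{-5/3}\cdot 80}{0.4\cdot\frac{1}{3} + 0.6\,e^{-5/3}} \approx 69.2\ ms \]
which we now confirm by simulation: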
delay_threshold <- 50
lambda <- 1/30
get_avg <- function(x){
  u1 <- runif(4000, 10, 70)
  e1 <- rexp(6000, lambda)
  u1 <- u1[u1 > delay_threshold]
  e1 <- e1[e1 > delay_threshold]
  # combine and randomize the generated data
  samples <- sample(c(u1, e1))
  return(mean(samples))
}
average <- sapply(1:1000, get_avg)
mean(average)
## [1] 69.20544