2 The parallel R package

Although R was not built for parallel computing, multiple ways of parallelizing your R code exist. One of these is the parallel package. This R package, shipped with base R, provides various functions to parallelize R code using embarrassingly parallel computing, i.e., a divide-and-conquer-type strategy. The basic idea is to start multiple R sessions (usually called child processes), connect the main session with those, and send them instructions. This section goes over a common workflow to work with R’s parallel.

2.1 Parallel workflow

(Usually) We do the following:

Create a PSOCK/FORK (or other) cluster using makePSOCKCluster/makeForkCluster (or makeCluster)
Copy/prepare each R session (if you are using a PSOCK cluster):
1. Copy objects with clusterExport
2. Pass expressions with clusterEvalQ
3. Set a seed
Do your call: parApply, parLapply, etc.
Stop the cluster with clusterStop

2.2 Types of clusters: PSOCK

Can be created with makePSOCKCluster
Creates brand new R Sessions (so nothing is inherited from the master), e.g.
```
# This creates a cluster with 4 R sessions
cl <- makePSOCKCluster(4)
```
Child sessions are connected to the master session via Socket connections
Can be created outside the current computer, i.e., across multiple computers!

2.3 Types of clusters: Fork

Fork Cluster makeForkCluster:
Uses OS Forking,
Copies the current R session locally (so everything is inherited from the master up to that point).
Data is only duplicated if altered (need to double check when this happens!)
Not available on Windows.

Other makeCluster: passed to snow (Simple Network of Workstations)

2.4 Ex 1: Parallel RNG with `makePSOCKCluster`

Caution

Using more threads than cores available on your computer is never a good idea. As a rule of thumb, clusters should be created using parallel::detectCores() - 1 cores (so you leave one free for the rest of your computer.)

# 1. CREATING A CLUSTER
library(parallel)
nnodes <- 4L
cl     <- makePSOCKcluster(nnodes)    
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`
# 3. DO YOUR CALL
ans <- parSapply(cl, 1:nnodes, function(x) runif(1e3))
(ans0 <- var(ans))

              [,1]          [,2]          [,3]          [,4]
[1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
[2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
[3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
[4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Making sure it is reproducible

# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:nnodes, function(x) runif(1e3)))
# 4. STOP THE CLUSTER
stopCluster(cl)
all.equal(ans0, ans1) # All equal!

[1] TRUE

2.5 Ex 2: Parallel RNG with `makeForkCluster`

In the case of makeForkCluster

# 1. CREATING A CLUSTER
library(parallel)
# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
cl     <- makeForkCluster(nnodes)    
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123)
# 3. DO YOUR CALL
ans <- do.call(cbind, parLapply(cl, 1:nnodes, function(x) {
  runif(nsims) # Look! we use the nsims object!
               # This would have fail in makePSOCKCluster
               # if we didn't copy -nsims- first.
  }))
(ans0 <- var(ans))

              [,1]          [,2]          [,3]          [,4]
[1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
[2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
[3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
[4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Again, we want to make sure this is reproducible

# Same sequence with same seed
clusterSetRNGStream(cl, 123)
ans1 <- var(do.call(cbind, parLapply(cl, 1:nnodes, function(x) runif(nsims))))
ans0 - ans1 # A matrix of zeros

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0

# 4. STOP THE CLUSTER
stopCluster(cl)

Well, if you are a Mac-OS/Linux user, there’s a more straightforward way of doing this…

2.6 Ex 3: Parallel RNG with `mclapply` (Forking on the fly)

In the case of mclapply, the forking (cluster creation) is done on the fly!

# 1. CREATING A CLUSTER
library(parallel)
# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
# cl     <- makeForkCluster(nnodes) # mclapply does it on the fly
# 2. PREPARING THE CLUSTER
set.seed(123) 
# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims)))
(ans0 <- var(ans))

             [,1]        [,2]         [,3]         [,4]
[1,]  0.085384184 0.002390790  0.006576204 -0.003998278
[2,]  0.002390790 0.081142190  0.001846963  0.001476244
[3,]  0.006576204 0.001846963  0.085175347 -0.002807348
[4,] -0.003998278 0.001476244 -0.002807348  0.082425477

Once more, we want to make sure this is reproducible

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims))))
ans0 - ans1 # A matrix of zeros

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0

# 4. STOP THE CLUSTER
# stopCluster(cl) no need of doing this anymore

2.1 Parallel workflow

2.2 Types of clusters: PSOCK

2.3 Types of clusters: Fork

2.4 Ex 1: Parallel RNG with makePSOCKCluster

2.5 Ex 2: Parallel RNG with makeForkCluster

2.6 Ex 3: Parallel RNG with mclapply (Forking on the fly)

2.4 Ex 1: Parallel RNG with `makePSOCKCluster`

2.5 Ex 2: Parallel RNG with `makeForkCluster`

2.6 Ex 3: Parallel RNG with `mclapply` (Forking on the fly)