Although R was not built for parallel computing, there are several ways to parallelize R code. One of them is the parallel package. This package, shipped with base R, provides functions to parallelize R code using embarrassingly parallel computing, i.e., a divide-and-conquer-type strategy in which the work splits into independent tasks. The basic idea is to start multiple R sessions (usually called child processes), connect the main session to them, and send them instructions. This section goes over a common workflow for R's parallel package.
2.1 Parallel workflow
We usually do the following:
1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)
2. Copy/prepare each R session (if you are using a PSOCK cluster):
   - Copy objects with clusterExport
   - Pass expressions with clusterEvalQ
   - Set a seed
3. Do your call: parApply, parLapply, etc.
4. Stop the cluster with stopCluster
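The steps above can be sketched end to end as follows. This is a minimal illustration, not a template: the object scale_by, the two-worker cluster, and the toy computation are assumptions made up for the example.

```r
library(parallel)

# Hypothetical object that the workers will need
scale_by <- 10

# 1. Create a cluster (two fresh R sessions)
cl <- makePSOCKcluster(2L)

# 2. Prepare the workers
clusterExport(cl, "scale_by", envir = environment()) # copy an object
invisible(clusterEvalQ(cl, library(stats)))          # run an expression on each worker
clusterSetRNGStream(cl, 123)                         # set a seed

# 3. Do the call: each task draws 100 uniforms and rescales their mean
res <- parSapply(cl, 1:4, function(i) mean(runif(100)) * scale_by)

# 4. Stop the cluster
stopCluster(cl)

res
```

Stopping the cluster matters: otherwise the child R sessions keep running until the main session ends.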
2.2 Types of clusters: PSOCK
Can be created with makePSOCKcluster
Creates brand new R sessions (so nothing is inherited from the master), e.g.
# This creates a cluster with 4 R sessions
cl <- makePSOCKcluster(4)
Child sessions are connected to the master session via socket connections
Can be created outside the current computer, i.e., across multiple computers!
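To see that a PSOCK worker really starts as a fresh session, one can ask each worker whether it knows about an object defined in the main session, before and after exporting it. A small sketch, assuming a local two-worker cluster; the object name nsims is only illustrative.

```r
library(parallel)

nsims <- 1e3 # lives in the main session only
cl <- makePSOCKcluster(2L)

# Fresh sessions: the workers have never heard of nsims
before <- unlist(clusterEvalQ(cl, exists("nsims")))

# Copy it over explicitly
clusterExport(cl, "nsims", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("nsims")))

# Now a call that relies on nsims works on the workers
clusterSetRNGStream(cl, 123)
ans <- do.call(cbind, parLapply(cl, 1:2, function(x) runif(nsims)))

stopCluster(cl)

before # FALSE on every worker
after  # TRUE on every worker
dim(ans)
```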
2.3 Types of clusters: FORK
Can be created with makeForkCluster
Copies the current R session locally (so everything is inherited from the master up to that point).
Data is only duplicated if altered, i.e., copy-on-write (need to double check when this happens!)
Not available on Windows.
Other cluster types: makeCluster passes any other type to the snow package (Simple Network of Workstations)
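Since FORK clusters are not available on Windows, a script meant to run everywhere can pick the cluster type at run time through the type argument of makeCluster. The following is a sketch under that assumption; big_object and the two-worker size are hypothetical.

```r
library(parallel)

# Hypothetical object defined before the cluster is created
big_object <- rnorm(1e4)

# FORK where available, PSOCK otherwise
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
cl <- makeCluster(2L, type = cluster_type)

# Forked workers inherit big_object; PSOCK workers need an explicit copy
if (cluster_type == "PSOCK") {
  clusterExport(cl, "big_object", envir = environment())
}

sizes <- parSapply(cl, 1:2, function(i) length(big_object))
stopCluster(cl)

sizes
```

Note that a forked worker only sees objects that existed when the cluster was created; anything defined afterwards still has to be exported.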
2.4 Ex 1: Parallel RNG with makePSOCKcluster
Caution
Using more threads than the cores available on your computer is rarely a good idea. As a rule of thumb, create clusters with parallel::detectCores() - 1 workers, so you leave one core free for the rest of your computer.
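The rule of thumb can be written defensively, because detectCores() returns NA when the number of cores cannot be determined. A minimal sketch; the names n_detected and n_workers are illustrative only.

```r
library(parallel)

n_detected <- detectCores()

# Leave one core free, but never go below one worker
n_workers <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

n_workers
```

The result can then be passed as the first argument to makePSOCKcluster or makeForkCluster.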
# 1. CREATING A CLUSTER
library(parallel)
nnodes <- 4L
cl <- makePSOCKcluster(nnodes)
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`
# 3. DO YOUR CALL
ans <- parSapply(cl, 1:nnodes, function(x) runif(1e3))
(ans0 <- var(ans))
# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:nnodes, function(x) runif(1e3)))
# 4. STOP THE CLUSTER
stopCluster(cl)
all.equal(ans0, ans1) # All equal!
[1] TRUE
2.5 Ex 2: Parallel RNG with makeForkCluster
In the case of makeForkCluster, the child sessions inherit the objects of the main session, so nothing needs to be copied:
# 1. CREATING A CLUSTER
library(parallel)
# The fork cluster will copy the -nsims- object
nsims <- 1e3
nnodes <- 4L
cl <- makeForkCluster(nnodes)
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123)
# 3. DO YOUR CALL
ans <- do.call(cbind, parLapply(cl, 1:nnodes, function(x) {
  runif(nsims) # Look! we use the nsims object!
  # This would have failed with makePSOCKcluster
  # if we didn't copy -nsims- first.
}))
(ans0 <- var(ans))
# Same sequence with same seed
clusterSetRNGStream(cl, 123)
ans1 <- var(do.call(cbind, parLapply(cl, 1:nnodes, function(x) runif(nsims))))
ans0 - ans1 # A matrix of zeros
# 4. STOP THE CLUSTER
stopCluster(cl)
Well, if you are a macOS/Linux user, there’s a more straightforward way of doing this…
2.6 Ex 3: Parallel RNG with mclapply (Forking on the fly)
In the case of mclapply, the forking (cluster creation) is done on the fly!
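# 1. CREATING A CLUSTER
library(parallel)
# The forked children will see the -nsims- object
nsims <- 1e3
nnodes <- 4L
# cl <- makeForkCluster(nnodes) # mclapply does it on the fly
# 2. PREPARING THE CLUSTER
# With forked children, set.seed() only gives reproducible streams
# under the L'Ecuyer-CMRG generator
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
# 3. DO YOUR CALL
# mc.cores defaults to getOption("mc.cores", 2L), so we set it explicitly
ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims), mc.cores = nnodes))
(ans0 <- var(ans))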
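As in the previous examples, one may want to check that the same seed gives the same draws. With mclapply this relies on the L'Ecuyer-CMRG generator: under R's default generator, each forked child is seeded differently on every call. The sketch below assumes two cores (one on Windows, where mclapply cannot fork); the names ncores and old_kind are illustrative.

```r
library(parallel)

nsims <- 1e3
# mclapply cannot fork on Windows, so fall back to a single core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Switch to the generator that supports parallel streams, keeping the old kind
old_kind <- RNGkind("L'Ecuyer-CMRG")

set.seed(123)
ans0 <- do.call(cbind, mclapply(1:4, function(x) runif(nsims), mc.cores = ncores))

# Same seed, same number of cores, same tasks
set.seed(123)
ans1 <- do.call(cbind, mclapply(1:4, function(x) runif(nsims), mc.cores = ncores))

# Restore the previous generator
RNGkind(old_kind[1])

all.equal(ans0, ans1)
```

Reproducibility here depends on keeping the number of cores and the task list fixed between runs, since both determine which child process handles which task.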