Install RStudio Desktop on Ubuntu 20.04
In this tutorial, we will install RStudio Desktop on Ubuntu 20.04 (Focal Fossa).
Install the RStudio Desktop deb
You can find all the available versions on this page: https://posit.co/download/rstudio-desktop/#download
# install R
sudo apt-get install r-base
# gdebi lets you install a local .deb package while resolving and installing its dependencies
sudo apt-get install gdebi-core
# some dependency packages
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
# get the rstudio desktop deb file
# note: the desktop version is built with Electron, so the difference between the server and desktop builds is small
wget https://download1.rstudio.org/electron/focal/amd64/rstudio-2023.09.1-494-amd64.deb
# install the deb file
sudo gdebi rstudio-2023.09.1-494-amd64.deb
# launch the desktop
rstudio
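Once RStudio opens, a quick way to confirm which R installation it picked up is to print the version string in the console:

# print the version of the R interpreter RStudio is using
R.version.string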
Install R packages
Run the commands below in the R console:
# install devtools
install.packages("devtools")
# install sparklyr
install.packages("sparklyr")
Troubleshooting
If you encounter errors such as No package 'libxml-2.0' found while installing sparklyr, run the command below. The missing piece is a system library, not an R package.
sudo apt-get install libxml2-dev
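After the system library is in place, retry the package installation in the R console:

# retry the install now that libxml2-dev is available
install.packages("sparklyr")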
Load packages into the current R session
library(sparklyr)
library(dplyr)
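If you don't have a Spark distribution on the machine yet, sparklyr can download and manage one locally. This is an optional alternative to pointing SPARK_HOME at an existing installation; the version below is only an example:

# optional: let sparklyr install and track a local Spark distribution
spark_install(version = "3.3.0")
# list the Spark versions sparklyr has installed locally
spark_installed_versions()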
Connect to a remote Spark cluster
# set these env vars if sparklyr can't locate Spark and Hadoop;
# an R session does not inherit them from the shell by default
Sys.setenv(SPARK_HOME = "/home/pengfei/opt/spark-3.3.0")
Sys.setenv(HADOOP_CONF_DIR = "/home/pengfei/opt/hadoop/etc/hadoop")
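# sanity check (not in the original steps): confirm the variables
# are now visible to this R session
Sys.getenv("SPARK_HOME")
Sys.getenv("HADOOP_CONF_DIR")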
# custom spark session config
conf <- spark_config()
# if we add nothing to conf, the Spark default configuration will be used
# note: Spark rejects executor memory below ~450 MB, so a value like 300M would fail
conf$spark.executor.memory <- "1g"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
conf$spark.submit.deployMode <- "client"
conf$spark.eventLog.enabled <- "true"
conf$spark.eventLog.dir <- "hdfs://10.50.5.67:9000/spark-logs"
conf$spark.history.provider <- "org.apache.spark.deploy.history.FsHistoryProvider"
conf$spark.history.fs.logDirectory <- "hdfs://10.50.5.67:9000/spark-logs"
conf$spark.history.fs.update.interval <- "10s"
conf$spark.history.ui.port <- 18080
# create a Spark session; the version should match the installation in SPARK_HOME
sc <- spark_connect(master = "yarn", version = "3.3.0", config = conf)
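Once connected, you can inspect the session with the standard sparklyr helpers:

# report the Spark version of the live connection
spark_version(sc)
# open the Spark web UI in a browser
spark_web(sc)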
Run some queries
# read a CSV file from HDFS into a Spark DataFrame
airports <- spark_read_csv(sc, name = "test", path = "hdfs://10.50.5.67:9000/user/rstudio/flights/airports.csv")
# do some analysis
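# a minimal example analysis (not from the original): count the rows
# and preview a few records; collect() pulls results back into R
sdf_nrow(airports)
airports %>% head(5) %>% collect()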
# close the spark session
spark_disconnect(sc)