Tuesday, December 2, 2014

RStudio in the cloud for dummies, 2014/2015 edition

In 2012, we presented a post showing how to run RStudio in the cloud on an Amazon server. There were 7 steps, including one with 7 sub-steps, one of which had 6 sub-sub-steps. It was still pretty easy, for what it was-- an effectively free computer in the cloud to run R on.

Today, we show the modern-- 3 years later!-- way to get the same result, only this approach is much easier, and the resulting installation includes all the best goodies of RStudio, including Markdown -> PDF and Hadley Wickham's packages pre-installed. Update, 2016: Digital Ocean has changed their set-up slightly. Check out the first step or two of this post in place of the first two steps below, if you're just starting out.

The approach builds on Docker, an infrastructure that saves start-up time and overhead, as well as efforts led by Dirk Eddelbuettel and Carl Boettiger to develop a Docker application of R. This project is called Rocker, and interested readers are encouraged to read the details. But if you want to just get up and running, here are the simple steps to get going.



1. Go to Digital Ocean and sign up for an account. By using this link, you will get a $10 credit. (Full disclosure: Ken will also get a $25 credit once you spend $25 of real money there.) The reason to use this provider is that they have a system ready to run with Docker already built in. In addition, their prices are quite reasonable. You will need to use a credit card or PayPal to activate your account, but you can play for a long time with your $10 credit-- the cheapest machine is $0.007 per hour, up to a $5 per month maximum.
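
As a rough check on how far the credit goes, the arithmetic can be sketched in the shell. The function names below are made up for illustration, and the rates are simply the ones quoted in step 1, which may have changed since.

```shell
# hours_of_credit CREDIT RATE
# Whole hours that CREDIT dollars buy at RATE dollars per hour.
hours_of_credit() {
    awk -v c="$1" -v r="$2" 'BEGIN { printf "%d\n", c / r }'
}

# monthly_cost RATE CAP HOURS
# Cost of running HOURS in one month, never more than the monthly CAP.
monthly_cost() {
    awk -v r="$1" -v cap="$2" -v h="$3" \
        'BEGIN { cost = r * h; if (cost > cap) cost = cap; printf "%.2f\n", cost }'
}

hours_of_credit 10 0.007     # prints 1428
monthly_cost 0.007 5 720     # prints 5.00 (720 hours is a 30-day month)
```

Left running around the clock, the cheapest machine hits the $5 monthly maximum, so the $10 credit covers about two months.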

2. On your Digital Ocean page, click "Create droplet". Then choose an (arbitrary) name, a size (meaning cost/power) of machine, and the region closest to you. You can ignore the settings. Under "Select Image", choose the "Applications" tab and select "Docker (1.3.2 on 14.04)". (The numbers in the parentheses are the Docker and Ubuntu version, and might change over time.) Then click "Create Droplet" at the bottom of the page.

3. It takes about a minute for the machine to start up. When it's ready, click the "Console Access" button. This opens a text terminal to your Ubuntu machine, inside your web page. Press enter to get a prompt, and log in (your username is root) using the password that was sent to your e-mail. You'll have to change the password.

4a. To start a terminal session of R, type
docker run --rm -ti rocker/r-base
You should see a bunch of messages about pulling and downloading, but eventually you will get the ">" prompt-- you can do R in here, but who would want to?

4b. To get RStudio server running, type
docker run -d -p 8787:8787 rocker/rstudio
But this is really not where you want to be. Instead, run the following command to get a set-up with more useful packages pre-installed alongside R.
docker run -d -p 8787:8787 rocker/hadleyverse
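
For readers wondering what the flags in these commands do, here is a small sketch that only prints the command rather than running it. The function name rocker_cmd and its arguments are hypothetical, not part of Docker or Rocker.

```shell
# rocker_cmd [IMAGE] [HOST_PORT]
# Print the docker command for an RStudio image without executing it.
#   -d                  run the container in the background (detached)
#   -p HOST_PORT:8787   publish the container's port 8787 on the droplet
rocker_cmd() {
    image="${1:-rocker/hadleyverse}"
    port="${2:-8787}"
    echo "docker run -d -p ${port}:8787 ${image}"
}

rocker_cmd                      # prints docker run -d -p 8787:8787 rocker/hadleyverse
rocker_cmd rocker/rstudio 80    # prints docker run -d -p 80:8787 rocker/rstudio
```

Once a container has actually been started, "docker ps" lists the running containers, which is a quick way to confirm that the command worked.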


5. Use it! The IP address of your server is displayed below the terminal where you typed in your docker command. Open a new browser tab and go to the address http://(ip address):8787. For example: http://135.104.92.185:8787. You'll see the RStudio login screen, and can enter "rstudio" (without the quotes) as both the username and the password. The system is well tuned enough that you can open a new file --> markdown --> PDF and immediately click "Knit PDF", and see the example document beautifully presented back to you in moments.
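
If the login screen does not appear right away, the container may still be starting. A minimal sketch for building the address and polling it, assuming curl is installed on the machine you run it from; both function names are invented for this example.

```shell
# rstudio_url IP [PORT]: the address to paste into the browser.
rstudio_url() {
    echo "http://${1}:${2:-8787}"
}

# wait_for_rstudio IP: make up to 12 attempts, 5 seconds apart, until the
# server answers. Defined here for reference but not called below.
wait_for_rstudio() {
    url=$(rstudio_url "$1")
    tries=0
    while [ "$tries" -lt 12 ]; do
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "RStudio is answering at $url"
            return 0
        fi
        tries=$((tries + 1))
        sleep 5
    done
    echo "No answer from $url" >&2
    return 1
}

rstudio_url 135.104.92.185    # prints http://135.104.92.185:8787
```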

That's it. It's still way cooler than sliced bread. Let us know if you try it, and if you run into any trouble. Oh, and if you're feeling creeped out by the standard username and password in your RStudio, you can set them up from your docker command as follows.
docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse
Other customization details and further information can be found on this Rocker page.
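
A password like "ken" is only a little better than the default. One possible way to generate something harder to guess and drop it into the same command is sketched below; gen_password and custom_cmd are hypothetical helper names, and the sketch only prints the command instead of running it.

```shell
# gen_password: 16 random hexadecimal characters read from /dev/urandom.
gen_password() {
    od -An -tx1 -N8 /dev/urandom | tr -d ' \n'
}

# custom_cmd USER PASSWORD: print the docker command with those credentials.
# Values are not shell-quoted here, so keep them free of spaces and quotes.
custom_cmd() {
    echo "docker run -d -p 8787:8787 -e USER=$1 -e PASSWORD=$2 rocker/hadleyverse"
}

pw=$(gen_password)
custom_cmd ken "$pw"
```

Copy the printed line into the droplet's console, and keep a note of the password, since you will need it at the RStudio login screen.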

Update
I should perhaps have noted that what you are running here is in fact RStudio Server, and that you can allow additional users on your RStudio using instructions found here.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

13 comments:

nick said...

Seems like really cool technology. What are the use cases that you envision for this system? Will it be a place for me to run my computationally intensive R jobs? A place for teaching R? A place where (eventually?), I could run Shiny or RStudio Server with low overhead?

Ken Kleinman said...

Hi Nick--

All of the above? I don't know what a 64GB/20CPU machine costs these days, but you can have one for 10 hours for $10. This is already RStudio Server, and you can easily open it to more users, if you want to teach and don't have access to a university IT department. And I think Shiny can't be far behind, if it's not available yet.

ben said...

It worked well! It was quick too, even on the smallest box they offer. Thank you for doing this.

Ken Kleinman said...

My pleasure, Ben! Have fun.

Anonymous said...

It's pretty important to set up a swapfile because Digital Ocean droplets don't have swap space preconfigured. The following will do the trick:

# Create a 2GB swapfile and activate it
sudo install -o root -g root -m 0600 /dev/null /swapfile
sudo dd if=/dev/zero of=/swapfile bs=1k count=2048k
sudo mkswap /swapfile
sudo swapon /swapfile
# Re-enable the swapfile at every boot
echo "/swapfile swap swap auto 0 0" | sudo tee -a /etc/fstab
# Set how readily the system swaps, now and after reboots
sudo sysctl -w vm.swappiness=10
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf

Ken Kleinman said...

Thanks, anonymous-- can you explain briefly why this is important? My *nix experience is too long ago.

I'll update the entry to reflect this pointer once I understand.

Anonymous said...

Hi,

The swap space is like virtual memory or the page file on Windows. It's usually provided as a separate partition on the hard drive, but Digital Ocean doesn't provide this by default, so you can designate a file to be used for swap space.

R does lots of things really well, but memory management isn't one of them. The cheapest droplet size on Digital Ocean gives you 512 MB of RAM. This is not a lot. Yes, if all you wanted to do was load up R objects for reference and then write them back to disk, you would not have a problem, but if you want to do anything useful with the data, you need more space than just storage. The guideline I read in a programming book (I'm afraid I can't remember which one at the moment, but it has been empirically true for me) is that you need about 10 times the space of an R object to actually manipulate it. Swap space gives you that breathing room.

It means that if you do need more than 512 MB (or whatever fraction of that R gets) to move stuff around, the system will just slow down instead of crashing, and with an SSD providing the space, this is a more modest slowdown than it would be with a spinning disk.

Anonymous said...

Part 2 (sorry):

A general guideline is that you should have a swap space about twice the size of the RAM on your system. I went higher for R cloud systems because of how important RAM is for R (and any packages you might use that need Java). The "swappiness" parameter controls how readily the system uses the swap space. I have not spent any time at all optimizing this; I just set it to 10, which is on the low side, so the system only swaps when it has to. That may or may not have been the best default, but in practice it works well for me.
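
For anyone sizing a swapfile for a different droplet, the count given to dd can be worked out as below. This is only a sketch of the arithmetic; swap_blocks is a made-up helper name, and the factor of 2 is just the rule of thumb mentioned above.

```shell
# swap_blocks RAM_MB [FACTOR]
# Number of 1k blocks for a swapfile FACTOR times the size of RAM (default 2).
swap_blocks() {
    echo $(( $1 * ${2:-2} * 1024 ))
}

swap_blocks 512       # prints 1048576 (1GB, twice the 512 MB of RAM)
swap_blocks 512 4     # prints 2097152 (2GB, the same as count=2048k)
```

After the swapfile is active, "swapon -s" or "free -m" should show it.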

Unknown said...

It worked great! Thanks for that.
Still I have some questions:

I shut down the system via the console and restarted the "Droplet". After starting the Droplet, RStudio Server was not accessible via the web browser. Do I always have to log in as root and start the RStudio server via the console? Or do I have to start the docker instance with a specific command after each reboot? I would prefer to shut down the system to save online time, and therefore money (even if it is quite cheap).

Hadley Wickham said...

You might want to try https://github.com/sckott/analogsea which lets you run all those commands from R

Ken Kleinman said...

Interesting, Hadley. Thanks.

xiaodai said...

This is really good and awesome for dummies like myself.

However I want to save money so I made a snapshot of the droplet and destroyed the droplet.

Later I created a new droplet from the snapshot I had taken. But when I started the new droplet, I found that all the packages I had installed were gone.

All I did was run this command: docker run -d -p 8787:8787 rocker/hadleyverse. I am a Windows user so I am not familiar with Linux at all.

What should I have done instead to keep all the packages I installed when I took the snapshot?

Ken Kleinman said...

I think you might have to design your own docker/rocker container to do that. But I don't know enough about how they work to be sure.