Recently I’ve been asked several times, “Which environment should I install for machine learning?” The question comes in many other forms, including:
“Do I need to buy specific hardware for deep learning projects?”
“Do I need to install Linux for applying deep learning?”
“I have a laptop with/without a GPU, what should I do?”
“What is the quickest/cheapest way to get started?”
It seems that in the area of deep learning, the algorithms and math side of things is sometimes easier to understand and practice than the engineering and IT counterparts. Here at Body Vision Medical we are applying deep learning to solve computer vision problems like segmentation and localization on large datasets of medical images. While doing so we have gained experience with several setups, which we will share.
In this blog, we will go over several popular options for deep learning setup and when you should choose each one. I hope this will give you some sense of when to apply each approach and what is possible. However, this is by no means a comprehensive list, just a survey of the most popular and relevant options out there.
The Deep Learning Stack
The deep learning stack consists of four layers, and for each layer there are several options to choose from. The space of possible environments is essentially a Cartesian product of the items in each layer of the stack:
Compute platform - local machine, cloud service, on-site server.
Processing unit - CPU, GPU.
Operating system - Windows, macOS, Linux.
Compartmentalization - bare metal, VM, Docker.
* Note that we intentionally don’t consider the different DL libraries and frameworks (TensorFlow, PyTorch, Theano, etc.) as part of the stack; you should be able to use most, if not all, of them regardless of the choices you make in the lower layers of the stack.
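The Cartesian-product framing above can be made concrete in a few lines of Python; the layer values are exactly the options listed, and nothing else is assumed:

```python
from itertools import product

# The four layers of the deep learning stack, as listed above.
compute = ["local machine", "cloud service", "on-site server"]
processor = ["CPU", "GPU"]
os_choice = ["Windows", "macOS", "Linux"]
compartment = ["bare metal", "VM", "Docker"]

# Every candidate environment is one combination across the layers.
environments = list(product(compute, processor, os_choice, compartment))
print(len(environments))  # 3 * 2 * 3 * 3 = 54 candidate setups
```

Of course, only a handful of those 54 combinations are practical, which is exactly why the rest of this post walks through the popular ones.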
Now, let’s go over several setups. For each setup we’ll cover:
- The specific combination from each layer of the stack.
- When this setup is most relevant.
- Additional comments.
Quick and Dirty
- Local Machine
- Available hardware, whether it's a CPU or a GPU.
- Currently installed OS, whatever is already there (Windows/Linux/macOS).
When to Use
This setup provides the fastest way to get started. If you want to get your hands dirty for the first time or want to run a quick small scale experiment, this is the way to go.
If you hit a wall with compute power and model capacity or training job duration, it’s time to consider a fancier setup.
Take note that you can go a long way without a GPU on your own laptop or PC.
Cloud Service VM
- Cloud Service - Google Cloud Platform, Amazon Web Services, Azure, etc.
- Optional - Docker
When to Use
You've got a model kind-of running, but want to train it on larger datasets or want to increase the number of parameters of the neural network.
You or your organization are willing to make a small monetary investment to advance the DL project but are not yet ready to commit to an expensive DL server, so the pay-per-use model of the cloud providers is a good fit for testing the usefulness of high-end GPUs.
As far as the cloud service provider choice goes, I recommend considering the following:
- Are there any onboarding discounts or credits for new users?
- Are the available GPU cards suitable for your needs?
Currently I would recommend the Google Cloud Platform, for two reasons:
- It offers $300 in credits for new users - this can last a few weeks, enough to complete an entire project.
- It’s the only platform that offers both Nvidia K80 and Nvidia P100 cards as of this writing. For comparison, AWS offers only K80s (though that will probably change in the future).
Since development is going to happen on a remote machine, we’ll need to address that as well. If you are using Python (and if you're doing DL, you should be), the recommended setup is PyCharm's remote development support. It lets you run PyCharm on your local machine while executing processes remotely, uploading code files, and even debugging over SSH, which gives the most “seamless” experience. Other remote development options include RDP, TeamViewer, and SSH (in either console or GUI mode).
On Site Server
- On Site Server
image from https://youtu.be/VAkupsKQlaY?t=115
When to Use
You are part of a DL team running models on large datasets. The cloud service costs are high enough to make the investment in an expensive DL server worthwhile and cost-effective.
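The "high enough" threshold is just a break-even calculation. Here is a minimal sketch; every price below is a hypothetical placeholder, not a quote from any provider, so plug in your own numbers:

```python
# Rough break-even sketch: renting cloud GPUs vs. buying an on-site server.
# All figures are hypothetical assumptions for illustration only.
cloud_rate_per_gpu_hour = 0.90    # assumed hourly price of one cloud GPU
gpu_hours_per_month = 4 * 160     # e.g. a 4-GPU workload running ~160 h/month
server_cost = 15_000.0            # assumed up-front cost of a DL server

monthly_cloud_cost = cloud_rate_per_gpu_hour * gpu_hours_per_month
months_to_break_even = server_cost / monthly_cloud_cost
print(f"cloud: ${monthly_cloud_cost:.0f}/month, "
      f"break-even after {months_to_break_even:.1f} months")
```

If the break-even point is shorter than the expected lifetime of the project (and the hardware), the on-site server starts to look attractive; remember to also budget for power, cooling, and maintenance, which the sketch ignores.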
- Start off by purchasing a server with a top-of-the-line CPU and free expansion slots for GPUs. As the team grows and activity increases, add more GPU cards to the server.
- When several users share the server, it is customary to assign GPU devices to users, so that each user specifies in their code which devices to use.
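One common way to implement that convention (a sketch of the idea, not a site-specific policy - the user-to-GPU mapping below is made up) is to pin each user's processes to their assigned GPUs via the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA-based frameworks such as TensorFlow and PyTorch honor:

```python
import os

# Hypothetical team assignment: physical GPU indices per user.
ASSIGNED_GPUS = {"alice": "0,1", "bob": "2,3"}

# Must be set before the DL framework initializes CUDA.
user = os.environ.get("USER", "alice")
os.environ["CUDA_VISIBLE_DEVICES"] = ASSIGNED_GPUS.get(user, "")

# Frameworks imported after this point see only the listed devices,
# renumbered from 0 (e.g. physical GPUs 2,3 appear to bob as devices 0,1).
```

The same variable can also be set in the shell before launching a training job, which keeps the assignment out of the code entirely.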
- The Remote Development chapter is relevant here as well.
- Docker - Consider creating a team image for docker and sharing it across the project. This will speed up development and deployment.
As you might have noticed, I have not expanded on the choice of operating system. That's because this issue has much to do with personal preference and tends to start “flame wars”. However, if you are weighing macOS, Linux, and Windows, there are some factors to take into consideration:
Libraries Compatibility - While this wasn’t the case until recently, nowadays most popular DL libraries and frameworks support all major OSs (officially or via a community fork).
Open Source DL Code - There is a lot of great DL code on GitHub: research code, models people used on Kaggle, and more. A good portion of those repositories were developed for Linux and will require adjustments to work on other OSs, either because they use older DL frameworks that run only on Linux or because they use Linux-specific paths and APIs.
Nvidia-docker - If you want to work with Docker and have GPU support inside the container, you’ll need a tool called nvidia-docker. It currently works only on Linux (however, Windows support is planned for the next version).
Remote Development - You will need to match the remote development solution of your choice to the specific OS.
I hope this article helps to clear some of the “IT” fog around DL and how to approach this issue. Of course, there’s always much more to learn and things evolve over time, so you’ll need to keep up-to-date with the latest on the topic.