Some time ago, we published an article called Machine Learning in a Nutshell that became one of the most viewed posts on our blog. As such, we’ve decided to continue our series of publications about what it takes to build and deliver an AI-based solution, where to get talent and resources, how to save costs and streamline development with reusable code, and more, to better guide you along your digital transformation journey.
This article will provide an overview of some of the most effective and widely used libraries and languages for machine learning (ML) and deep learning (DL) solutions engineering and deployment. The tools chosen for this overview have all of the basic features required for ML or DL project development.
The R language was created for solving statistical problems and is very popular among data analysts. Its main limitation is that it is poorly suited for solving problems not related to data analytics and visualization.
Additionally, R has issues with scalability. It is single-threaded and operates on data held in RAM, so it’s memory-constrained, while Python’s ecosystem offers multiprocessing and out-of-core processing options for larger workloads.
Python is a general-purpose language and can be successfully applied to various ML and DL tasks. That explains why Python has gained tremendous popularity and has become the lingua franca of the machine learning community over the past couple of years.
Today, any modern ML or DL library provides a Python API, which makes the decision much easier: just use Python and you’ll be good! The language itself is simple and easy to learn, and you don’t have to know Python inside out to experiment with ML and build ML algorithms.
According to Python Anywhere, there are between 2.8 and 4 million Python developers available globally. As per Stack Overflow, Python is the fastest-growing major programming language of the past few years.
From Q3 2017 to Q2 2018, the number of Python developer jobs nearly doubled worldwide, which should give you a sense of how popular the language is when it comes to ML development.
In traditional software development, you spend most of your programming time in a text editor or IDE, while in data science, most of the code is written in Jupyter Notebook.
This is a simple and powerful data analysis tool that allows you to write code in Python, R and other languages, add text descriptions in Markdown, and embed graphs and charts directly into an interactive web page.
In addition, Google has recently released the free Google Colab service, which is a cloud version of Jupyter Notebook that provides an opportunity to perform calculations on the CPU and GPU. It has all of the necessary Python ML libraries already installed so you can start right there if you are too lazy to install anything locally.
Scikit-learn is one of the most popular ML libraries today. It supports most common ML algorithms, both supervised and unsupervised: linear and logistic regression, support vector machines (SVM), the Naive Bayes classifier, gradient boosting, k-means clustering, k-nearest neighbors (KNN), and many others.
In addition, Scikit-learn contains many useful tools for preparing data and analyzing results. The library is mainly intended for classical machine learning algorithms, so its functionality for neural networks is very limited, and it cannot be used for deep learning tasks at all.
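To give a feel for how little code a classical ML workflow takes, here is a minimal sketch using Scikit-learn’s bundled Iris dataset; the dataset, model, and split settings are illustrative choices, not a prescription.

```python
# A minimal scikit-learn workflow: load data, split it,
# train a logistic regression classifier, and evaluate it.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping in another estimator (say, a gradient boosting classifier) changes only the `model = ...` line; the fit/score interface stays the same across the library.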
Data analysis and preparation often take most of the time when solving ML problems. The data may arrive in CSV, JSON, Excel, or another structured (or not-so-structured) format, and you need to process it before you can use it in ML models.
For these purposes, the Pandas library is used. It is a powerful tool that allows you to quickly analyze, modify and prepare data for future use in other ML and DL libraries, such as Scikit-learn, TensorFlow or PyTorch.
In Pandas, you can load data from various sources such as SQL databases, CSV, Excel and JSON files, as well as other less common formats.
Once the data is loaded into memory, you can perform many operations on it: analyzing, transforming, filling in missing values and cleaning the data set. Pandas supports many SQL-like operations on data sets, such as aggregation and grouping, and also provides a built-in set of popular statistical functions for basic analysis.
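A short sketch of these operations, using a small hypothetical table built inline (in practice the data would come from `pd.read_csv()`, `pd.read_sql()`, and the like):

```python
import pandas as pd

# Hypothetical sales data with a missing value in "units".
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "units":  [10, 7, None, 3, 5],
    "price":  [2.5, 3.0, 2.5, 3.0, 2.5],
})

# Fill the missing value, derive a new column, then run a
# SQL-like GROUP BY aggregation.
df["units"] = df["units"].fillna(0)
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum()
print(summary)
```

The `groupby` call is the Pandas counterpart of SQL’s `GROUP BY ... SUM(...)`, and the same chain of steps scales to real data sets loaded from files or databases.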
Pandas also integrates nicely with Jupyter Notebook, which renders its data structures as beautifully formatted tables. The Pandas website contains very detailed documentation, but you can start with the 10-minute tutorial that walks you through the main library features.
The main functionality of NumPy is to support multidimensional data arrays and fast linear algebra algorithms. That is why NumPy is a key component of Scikit-learn, SciPy, and Pandas.
Usually, NumPy is used as an auxiliary library for performing various mathematical operations with Pandas data structures, so it’s worth exploring its basic capabilities.
An introductory tutorial on NumPy, as well as the NumPy basics guide, are just perfect for this.
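The two core ideas, multidimensional arrays and fast vectorized linear algebra, fit in a few lines; the values below are arbitrary examples.

```python
import numpy as np

# A 2x2 matrix and a length-2 vector.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])

print(a + b)           # broadcasting: b is added to each row of a
print(a @ a)           # matrix multiplication
print(a.mean(axis=0))  # column-wise means
```

All three operations run in optimized native code with no explicit Python loops, which is exactly what Scikit-learn and Pandas rely on under the hood.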
Matplotlib and Seaborn
Matplotlib is a standard tool in the data engineer’s toolkit. It allows you to create a variety of graphs and charts to visualize the results of data analytics.
Charts created in Matplotlib can be easily embedded into Jupyter Notebook, making it possible to visualize the data and the results obtained while working with models.
Many additional packages have been created for this library. One of the most popular ones is Seaborn. Its main feature is a ready-made set of the most frequently used statistical charts and graphs.
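As a minimal sketch of the Matplotlib side (the non-interactive Agg backend and the output file name are illustrative choices for a standalone script; in Jupyter, charts render inline instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import numpy as np

# Plot one period of a sine wave and save it to a file.
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_title("A minimal Matplotlib line chart")
ax.legend()
fig.savefig("sine.png")
```

Seaborn builds on the same figure/axes objects, so its ready-made statistical plots can be customized with these same Matplotlib calls.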
Traditionally, both libraries have a section with tutorials on their sites, but a more efficient approach would be to register on the Kaggle website and look at pre-built examples in the Kernels section (for example, Comprehensive Data Exploration with Python).
Any deep learning library contains three key components: multidimensional arrays (also known as tensors), linear algebra operators, and derivative calculations.
In TensorFlow, Google’s deep learning library, all three components are well implemented. Along with the CPU, it supports computing on GPUs and TPUs (Google’s tensor processing units).
Currently, it is the most popular DL library, and that popularity has spawned many tutorials and online courses on DL. Yet this mature library has a downside: a rather clumsy API and a higher entry threshold compared to PyTorch.
Keras is a high-level add-on to TensorFlow that solves many of its usability issues. Its main feature is the ability to describe a neural network architecture using a Python DSL. There are also a lot of training materials written for Keras, so it’s easy to get started.
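A sketch of that DSL, assuming TensorFlow 2.x with its bundled Keras: a tiny network for the XOR problem, with illustrative layer sizes and training settings.

```python
# A tiny Keras network for XOR: the whole architecture is a
# short declarative list of layers.
import numpy as np
from tensorflow import keras

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([[0], [1], [1], [0]], dtype="float32")

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=300, verbose=0)
print(model.predict(X, verbose=0))
```

The `Sequential` list is the DSL in action: adding a layer is one more entry, and `compile`/`fit` hide the training loop entirely.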
PyTorch is the second most popular DL library after TensorFlow. Created by Facebook and designed specifically for Python, it relies on standard Python idioms.
Compared to Tensorflow, the entry threshold is much lower here, and any neural network can be built using standard OOP classes and objects.
PyTorch is also easier to debug because the code executes as regular Python code: there is no separate graph-compilation step, as in TensorFlow. Therefore, you can simply use the standard Python debugger.
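A sketch of this class-based style (the model name and layer sizes are illustrative): a network is an ordinary Python class, and calling it runs plain Python code you can step through in a debugger.

```python
# A tiny PyTorch network defined with a plain Python class --
# standard OOP, executed eagerly line by line.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 8)
        self.out = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        return torch.sigmoid(self.out(x))

net = TinyNet()
x = torch.rand(4, 2)   # a batch of 4 two-feature samples
print(net(x).shape)    # torch.Size([4, 1])
```

Because `forward` is just a method, you can set a breakpoint inside it, print intermediate tensors, or branch on their values with ordinary `if` statements.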
Compared to Keras, PyTorch is more verbose, but less magical. PyTorch also has its own add-on, the fast.ai library, which lets you accomplish most standard DL tasks with just a couple of lines of code. But what makes fast.ai really special is its incredible Practical Deep Learning for Coders online course.
The languages and libraries reviewed above are perfectly suited for rookies seeking to master their ML and DL skills, since they all come with good user guides and online courses to learn from.
To make the learning process smoother, it makes sense to start experimenting with classic ML tasks, focusing on Scikit-learn and Pandas first, and to move toward deep learning basics and tools afterward.