10 Essential Data Science Packages for Python
Monday, Nov 25, 2019 17:10 · 1289 words · 7 minutes read
Interest in data science has risen remarkably in the last five years. And while there are many programming languages suited for data science and machine learning, Python is the most popular.
Since it’s the language of choice for machine learning, here’s a Python-centric roundup of ten essential data science packages, including the most popular machine learning packages.
Scikit-Learn
Scikit-Learn is a Python module for machine learning built on top of SciPy and NumPy. David Cournapeau started it as a Google Summer of Code project. Since then, it’s grown to over 20,000 commits and more than 90 releases. Companies such as J.P. Morgan and Spotify use it in their data science work.
Because Scikit-Learn has such a gentle learning curve, even the people on the business side of an organization can use it. For example, a range of tutorials on the Scikit-Learn website show you how to analyze real-world data sets. If you’re a beginner and want to pick up a machine learning library, Scikit-Learn is the one to start with.
Here’s what it requires:
- Python 3.5 or higher.
- NumPy 1.11.0 or higher.
- SciPy 0.17.0 or higher.
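To give a flavor of that gentle learning curve, here's a minimal sketch that trains a classifier on the iris data set bundled with Scikit-Learn (model choice and split parameters are illustrative, not prescriptive):

```python
# Train and evaluate a random forest on the bundled iris data set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

The whole fit/predict/score workflow follows the same pattern for nearly every estimator in the library, which is a big part of why it's so approachable.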
PyTorch
PyTorch does two things very well. First, it accelerates tensor computation with strong GPU support. Second, it builds deep neural networks on a tape-based autograd system, which lets you define and modify networks on the fly. If you're an academic or an engineer who wants an easy-to-learn package that does these two things, PyTorch is for you.
PyTorch is excellent in specific cases. For instance, do you want to compute tensors faster by using a GPU, as I mentioned above? Use PyTorch, because you can't do that with NumPy. Want to use RNNs for language processing? Use PyTorch because of its define-by-run feature. Or do you want to use deep learning but you're just a beginner? Use PyTorch, because Scikit-Learn doesn't cater to deep learning.
Requirements for PyTorch depend on your operating system. The installation is slightly more complicated than, say, Scikit-Learn. I recommend using the “Get Started” page for guidance. It usually requires the following:
- Python 3.6 or higher.
- Conda 4.6.0 or higher.
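The tape-based autograd described above can be sketched in a few lines, assuming `torch` is installed (the specific tensor values are just for illustration):

```python
import torch

# Define-by-run: the computation graph is built as the code executes.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()        # y = x0^2 + x1^2
y.backward()              # tape-based autograd computes dy/dx
print(x.grad)             # dy/dx = 2x -> tensor([4., 6.])

# The same tensor math moves to a GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
z = torch.randn(3, 3, device=device) @ torch.randn(3, 3, device=device)
```

Because the graph is rebuilt on every forward pass, control flow like loops and conditionals can vary between iterations, which is what makes PyTorch convenient for RNNs.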
Caffe
Caffe is one of the fastest implementations of convolutional networks, making it ideal for image recognition and other image-processing tasks.
Yangqing Jia started Caffe while working on his PhD at UC Berkeley. It’s released under the BSD 2-Clause license, and it’s touted as one of the fastest-performing deep-learning frameworks out there. According to the website, Caffe’s image processing is quite astounding. They claim it can process “over 60M images per day with a single NVIDIA K40 GPU.”
I should highlight that Caffe assumes you have at least a mid-level knowledge of machine learning, although the learning curve is still relatively gentle.
As with PyTorch, requirements depend on your operating system, so check the installation guide for your platform. I recommend using the Docker version if you can, so it works right out of the box. The compulsory dependencies are below:
- CUDA for GPU mode.
Library version 7 or higher and the latest driver version are recommended, but the 6.x releases work too. Versions 5.5 and 5.0 are compatible but considered legacy.
- BLAS via ATLAS, MKL, or OpenBLAS.
- Boost 1.55 or higher.
TensorFlow
TensorFlow is one of the most famous machine learning libraries for some very good reasons. It specializes in numerical computation using dataflow graphs.
Originally developed by Google Brain, TensorFlow is open source. It applies dataflow graphs and differentiable programming to a range of tasks, making it one of the most flexible and powerful machine learning libraries ever created.
If you need to process large data sets quickly, this is a library you shouldn’t ignore.
The most recent stable version is v1.13.1, but the new v2.0 is in beta now.
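To show what "numerical computation using dataflow graphs" looks like in practice, here's a minimal sketch using the 2.0-style API, where `tf.function` traces ordinary Python into a graph (the values are illustrative):

```python
import tensorflow as tf

@tf.function            # traces this function into a dataflow graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([5.0])
result = affine(x, w, b)   # 1*3 + 2*4 + 5 = 16
print(result)
```

In the 1.x API the same graph would be built explicitly and run inside a `tf.Session`; the 2.0 beta makes the graph construction implicit.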
Theano
Theano is one of the earliest open-source software libraries for deep-learning development. It’s best for high-speed computation.
While Theano announced that it would stop major developments after the release of v1.0 in 2017, you can still study it for historical reasons. It’s made this list of top ten data science packages for Python because if you familiarize yourself with it, you’ll get a sense of how its innovations later evolved into the features you now see in competing libraries.
Pandas
Pandas is a powerful and flexible data analysis library written in Python. While not strictly a machine learning library, it's well-suited to analyzing and manipulating large data sets. In particular, I enjoy its data structures, such as the DataFrame, its tools for time series manipulation and analysis, and its numerical data tables. Many business-side employees at large organizations and startups can easily pick up Pandas to perform analysis. Plus, it's fairly easy to learn, and it rivals competing libraries in its data analysis features.
If you want to use Pandas, here’s what you’ll need:
- Setuptools version 24.2.0 or higher.
- NumPy version 1.12.0 or higher.
- Python dateutil 2.5.0 or higher.
- pytz for cross-platform timezone calculations.
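A minimal sketch of the DataFrame in action, with made-up sales figures for illustration:

```python
import pandas as pd

# Build a small table and summarize it with a group-by.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales":  [100, 150, 200, 50],
})
totals = df.groupby("region")["sales"].sum()
print(totals)   # east: 300, west: 200
```

This split-apply-combine pattern, plus one-line CSV loading via `pd.read_csv`, covers a surprising share of day-to-day analysis work.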
Keras
Keras is a deep learning library built for fast experimentation, and it runs on top of other frameworks such as TensorFlow. It's best for easy, fast prototyping.
Keras is popular among deep learning aficionados for its easy-to-use API. Jeff Hale compiled a ranking of the major deep learning frameworks, and Keras compares very well.
The only requirement for Keras is one of three backend engines: TensorFlow, Theano, or CNTK.
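Here's a minimal sketch of that easy-to-use API, assuming the TensorFlow backend; the layer sizes are arbitrary placeholders:

```python
from tensorflow import keras

# A tiny feed-forward classifier defined in a few declarative lines.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

From here, training is a single `model.fit(X, y)` call, which is exactly the kind of brevity that makes Keras good for prototyping.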
NumPy
NumPy is the fundamental package needed for scientific computing with Python. It’s an excellent choice for researchers who want an easy-to-use Python library for scientific computing. In fact, NumPy was designed for this purpose; it makes array computing a lot easier.
Originally, the code for NumPy was part of SciPy. However, scientists who only needed the array object had to install the entire SciPy package. To avoid that, the array code was split out of SciPy into a new package called NumPy.
If you want to use NumPy, you’ll need Python 2.6.x, 2.7.x, 3.2.x, or newer.
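A minimal sketch of why array computing is easier with NumPy: whole-array arithmetic replaces explicit loops.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
print(a.mean(axis=0))            # column means: [1.5 2.5 3.5]
print(a * 2 + 1)                 # elementwise arithmetic, no loops needed
```

These vectorized operations run in compiled code, which is why nearly every other package on this list builds on NumPy arrays.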
Matplotlib
Matplotlib is a Python 2D plotting library that makes it easy to produce cross-platform charts and figures.
So far in this roundup, we’ve covered plenty of machine learning, deep learning, and even fast computational frameworks. But with data science, you also need to draw graphs and charts. When you talk about data science and Python, Matplotlib is what comes to mind for plotting and data visualization. It’s ideal for publication-quality charts and figures across platforms.
If you need long-term support, the current stable version is v2.2.4, but you can get v3.0.3 for the latest features. Note that v3 requires Python 3, since support for Python 2 is being dropped.
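A minimal sketch of producing a publication-ready figure and saving it to disk (the non-interactive Agg backend is set explicitly so it works in scripts without a display):

```python
import matplotlib
matplotlib.use("Agg")            # render without a display
import matplotlib.pyplot as plt

xs = list(range(10))
fig, ax = plt.subplots()
ax.plot(xs, [x ** 2 for x in xs], label="x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png", dpi=150)
```

The same `fig`/`ax` objects give you fine-grained control over every element of the chart when you need publication-quality output.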
SciPy
SciPy is a gigantic library of data science packages mainly focused on mathematics, science, and engineering. If you’re a data scientist or engineer who wants the whole kitchen sink when it comes to running technical and scientific computing, you’ve found your match with SciPy.
Since it builds on top of NumPy, SciPy has the same target audience. It has a wide collection of sub-packages, each focused on a niche such as Fourier transforms, signal processing, optimization algorithms, spatial algorithms, or nearest-neighbor search. Essentially, this is the companion Python library for the typical data scientist.
As far as requirements go, you’ll need NumPy if you want SciPy. But that’s it.
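As a taste of those sub-packages, here's a minimal sketch using `scipy.optimize` to find the minimum of a simple function (the function itself is just an example):

```python
from scipy import optimize

# f(x, y) = (x - 1)^2 + (y + 2)^2 has its minimum at (1, -2).
def f(v):
    return (v[0] - 1) ** 2 + (v[1] + 2) ** 2

result = optimize.minimize(f, x0=[0.0, 0.0])
print(result.x)   # close to [1, -2]
```

Each sub-package follows the same spirit: hand it a NumPy-friendly problem and get a well-tested numerical routine in return.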
Summary
This brings to an end my roundup of the 10 major data-science-related Python libraries. Is there something else you’d like us to cover that also uses Python extensively? Let us know!
And don’t forget that Kite can help you learn these packages faster with its ML-powered autocomplete as well as handy in-editor docs lookups. Check it out for free as an IDE plugin for any of the leading IDEs.