Python continues to lead in solving data science tasks and challenges. Last year, we published the blog post "Top 15 Python Libraries for Data Science in 2017", which outlined the libraries that had proved most helpful at the time. This year, we expanded the list with new libraries and revisited those discussed last year, focusing on the updates made during the year.
Our selection actually includes more than 20 libraries, because some of them are interchangeable alternatives that solve the same problem, so we put them in the same group.
▌ Core library and statistics
1. NumPy (Commits: 17911, Contributors: 641)
Official website: http: //
NumPy is one of the core packages for scientific computing in Python, built for processing large multi-dimensional arrays and matrices. Its large collection of high-level mathematical functions and implemented methods makes it possible to perform a wide range of operations on these objects.
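To make the idea of operating on whole arrays concrete, here is a minimal sketch. The array contents and variable names are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

# A 2-D array (matrix) of made-up measurements: 3 samples x 4 features.
data = np.arange(12, dtype=float).reshape(3, 4)

# Vectorized arithmetic is applied to every element without an explicit loop.
scaled = data * 2.0 + 1.0

# Reduction along an axis: the mean of each column (feature).
column_means = data.mean(axis=0)

# Broadcasting: the 1-D vector of means is subtracted from every row.
centered = data - column_means

# Matrix product of the 4x3 transpose with the 3x4 original gives a 4x4 matrix.
gram = data.T @ data

print(column_means)
print(gram.shape)
```

The same pattern (build an array, then combine element-wise operations, reductions, and matrix products) underlies most of the libraries listed below.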
2. SciPy (Commits: 19150, Contributors: 608)
Official website: https://scipy.org/scipylib/
Another core library for scientific computing is SciPy. It is built on NumPy and extends its capabilities. SciPy's main data structure is again a multi-dimensional array, implemented by NumPy. The package contains tools for linear algebra, probability theory, integral calculus, and many other tasks. In addition, SciPy wraps many new BLAS and LAPACK functions.
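As a small illustration of the three areas mentioned for SciPy, the sketch below solves a linear system, integrates a function numerically, and evaluates a probability distribution. The matrix and the integrand are arbitrary choices for the example.

```python
import numpy as np
from scipy import integrate, linalg, stats

# Linear algebra: solve the system A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Integral calculus: integrate sin(t) from 0 to pi (the exact value is 2).
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Probability theory: cumulative distribution of the standard normal at 0.
p = stats.norm.cdf(0.0)

print(x, area, p)
```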
3. Pandas (Commits: 17144, Contributors: 1165)
Official website: https://pandas.pydata.org/
Pandas is a Python library that provides high-level data structures and a wide variety of analysis tools. The main feature of this package is the ability to reduce fairly complex data operations to one or two commands. Pandas includes many built-in methods for grouping, filtering, and combining data, as well as time series functionality.
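The grouping, filtering, and combining operations can each be sketched in a single command. The two small tables below are invented for the example.

```python
import pandas as pd

# Two made-up tables: individual sales and a lookup of city regions.
sales = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B"],
    "amount": [100, 80, 150, 60, 120],
})
regions = pd.DataFrame({
    "city": ["A", "B", "C"],
    "region": ["North", "West", "South"],
})

# Filtering: keep only the rows with an amount of at least 80.
large = sales[sales["amount"] >= 80]

# Grouping: total amount per city.
totals = sales.groupby("city")["amount"].sum()

# Combining: attach the region to every sale with a left join.
merged = sales.merge(regions, on="city", how="left")

print(totals)
print(merged)
```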
4. StatsModels (Commits: 10067, Contributors: 153)
Official website: http: //
Statsmodels is a Python module that offers many tools for statistical data analysis, such as estimating statistical models and performing statistical tests. With its help, you can implement many machine learning methods and explore different plotting possibilities.
The library keeps evolving and adding new capabilities. This year brought improvements to time series and new count models, namely GeneralizedPoisson, zero-inflated models, and NegativeBinomialP, as well as new multivariate methods: factor analysis, multivariate analysis of variance (MANOVA), and repeated measures within analysis of variance (ANOVA).
▌ Visualization
5. Matplotlib (Commits: 25747, Contributors: 725)
Official website: https://matplotlib.org/index.html
Matplotlib is a low-level library for creating two-dimensional diagrams and graphs. With its help, you can build a variety of charts, from histograms and scatter plots to graphs in non-Cartesian coordinates. In addition, many popular plotting libraries are designed to work in conjunction with matplotlib.
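A minimal sketch of three chart types follows: a histogram, a scatter plot, and a polar plot as one example of a non-Cartesian coordinate system. The data is random, and the figure is written to an in-memory buffer only so that the example runs without a display; passing a file name to `savefig` works the same way.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Random data for the three panels.
rng = np.random.default_rng(0)
values = rng.normal(size=500)
x, y = rng.uniform(size=(2, 100))
theta = np.linspace(0.0, 2.0 * np.pi, 200)
radius = 1.0 + 0.5 * np.cos(3.0 * theta)

fig = plt.figure(figsize=(12, 4))

# Panel 1: histogram of normally distributed values.
ax_hist = fig.add_subplot(1, 3, 1)
ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Panel 2: scatter plot of two uniform samples.
ax_scatter = fig.add_subplot(1, 3, 2)
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_title("Scatter plot")

# Panel 3: the same plot call, but on polar axes.
ax_polar = fig.add_subplot(1, 3, 3, projection="polar")
ax_polar.plot(theta, radius)
ax_polar.set_title("Polar plot")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```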
6. Seaborn (Commits: 2044, Contributors: 83)
Official website: https://seaborn.pydata.org/
Seaborn is essentially a high-level API based on the matplotlib library. It contains default settings that are better suited to working with charts. In addition, it offers a rich gallery of visualizations, including some complex types such as time series, joint plots (jointplots), and violin plots.
7. Plotly (Commits: 2906, Contributors: 48)
Official website: https://plot.ly/python/
Plotly is a popular library that allows you to build sophisticated graphics easily. The package is well suited to interactive web applications, and its visualizations include contour plots, ternary plots, and 3D charts.
8. Bokeh (Commits: 16983, Contributors: 294)
Official website: https://bokeh.pydata.org/en/latest/
The Bokeh library uses JavaScript widgets to create interactive and scalable visualizations in the browser. It provides a versatile collection of graphs, styling possibilities, interactive capabilities such as linked plots, widgets, and callbacks, and many more useful features.
9. Pydot (Commits: 169, Contributors: 12)
Official website: https://pypi.org/project/pydot/
Pydot is a library for generating complex directed and undirected graphs. It is an interface to Graphviz written in pure Python. With its help, you can display the structure of a graph, which is often needed when building neural networks and decision-tree-based algorithms.
▌ Machine learning
10. Scikit-learn (Commits: 22753, Contributors: 1084)
Official website: http://scikit-learn.org/stable/
This Python module, built on NumPy and SciPy, is one of the best libraries for working with data. It provides algorithms for many standard machine learning and data mining tasks, such as clustering, regression, classification, dimensionality reduction, and model selection.
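A typical classification workflow with this library might look like the sketch below. It uses the Iris data set bundled with scikit-learn; the choice of logistic regression, the split ratio, and the number of cross-validation folds are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled data set and hold out a quarter of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the classifier into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Classification accuracy on the held-out data.
test_accuracy = model.score(X_test, y_test)

# Model selection: 5-fold cross-validation on the training data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(test_accuracy, cv_scores.mean())
```

Because every estimator exposes the same `fit`, `predict`, and `score` methods, swapping in a different algorithm usually means changing one line.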
11. XGBoost / LightGBM / CatBoost (Commits: 3277/1083/1509, Contributors: 280/79/61)
Official website:
http://xgboost.readthedocs.io/en/latest/
http://lightgbm.readthedocs.io/en/latest/Python-Intro.html
https://github.com/catboost/catboost
Gradient boosting is one of the most popular machine learning algorithms. It builds an ensemble of successively refined elementary models, namely decision trees. Special libraries have therefore been designed to implement this method quickly and conveniently. We think XGBoost, LightGBM, and CatBoost deserve special attention. They are competitors that solve a common problem and are used in almost the same way. These libraries provide highly optimized, scalable, and fast implementations of gradient boosting, which makes them very popular among data scientists and Kaggle competitors, many of whom have won contests with the help of these algorithms.
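To show what "successively improving a basic decision tree model" means, here is a toy version of the idea for squared-error regression, written with scikit-learn trees. It is only an illustration under simplifying assumptions (synthetic data, fixed learning rate, no regularization) and is not how XGBoost, LightGBM, or CatBoost are implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinkage applied to each tree's contribution
n_rounds = 50        # number of trees in the ensemble

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    # Fit a shallow tree to what the current ensemble still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a damped version of the tree's output to the ensemble prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

The dedicated libraries add much more on top of this loop, which is why they are preferred in practice.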
12. Eli5 (Commits: 922, Contributors: 6)
Official website: https://eli5.readthedocs.io/en/latest/
Often, the prediction results of machine learning models are not entirely clear, and this is exactly the challenge Eli5 helps with. It is a package for visualizing and debugging machine learning models and for tracking the work of an algorithm step by step. It supports the scikit-learn, XGBoost, LightGBM, lightning, and sklearn-crfsuite libraries, and performs different tasks for each of them.
▌ Deep learning
13. TensorFlow (Commits: 33339, Contributors: 1469)
Official website: https: //
TensorFlow is a popular deep learning and machine learning framework developed by Google Brain. It provides the ability to work with artificial neural networks on multiple data sets. Among the most popular TensorFlow applications are object recognition and speech recognition. There are also various layer helpers built on top of regular TensorFlow, such as tflearn, tf-slim, and skflow.
14. PyTorch (Commits: 11306, Contributors: 635)
Official website: https://pytorch.org/
PyTorch is a large framework that lets you perform tensor computations with GPU acceleration, create dynamic computational graphs, and calculate gradients automatically. On top of this, PyTorch provides a rich API for solving applications related to neural networks. The library is based on Torch, an open source deep learning library implemented in C.
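The automatic gradient calculation can be seen in a few lines. This is a minimal sketch: the function being differentiated is an arbitrary choice, and the code falls back to the CPU when no GPU is available.

```python
import torch

# A tensor that records the operations applied to it.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The computation graph is built dynamically as the expression runs.
y = (x ** 2).sum()

# Backpropagate: x.grad now holds dy/dx, which is 2 * x.
y.backward()
print(x.grad)

# Tensor computations run on the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.ones(2, 3, device=device)
b = a @ a.T  # 2x3 times 3x2 gives a 2x2 matrix
print(b.shape)
```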
15. Keras (Commits: 4539, Contributors: 671)
Official website: https://keras.io/
Keras is a high-level library for working with neural networks. It runs on top of TensorFlow and Theano. With the release of a new version, CNTK and MXNet can now also be used as backends. It simplifies many specific tasks and greatly reduces the amount of repetitive code. However, it may not be suitable for some complex tasks.
▌ Distributed deep learning
16. Dist-keras / elephas / spark-deep-learning (Commits: 1125/170/67, Contributors: 5/13/11)
Official website:
http://joerihermans.com/work/distributed-keras/
https://pypi.org/project/elephas/
https://databricks.github.io/spark-deep-learning/site/index.html
Deep learning problems are becoming increasingly important as more and more use cases demand considerable effort and time. Processing such large volumes of data is much easier with a distributed computing system like Apache Spark, which in turn expands the possibilities for deep learning. As a result, dist-keras, elephas, and spark-deep-learning are rapidly gaining popularity and developing quickly, and it is difficult to single out one of them because they are all designed to solve a common task. These packages let you train neural networks based on the Keras library directly with the help of Apache Spark. Spark-deep-learning also provides tools for creating pipelines with Python neural networks.
▌ Natural language processing
17. NLTK (Commits: 13041, Contributors: 236)
Official website: https: //
NLTK is a set of libraries, a complete platform for natural language processing. With the help of NLTK, you can process and analyze text in various ways, tokenize and tag it, extract information, and so on. NLTK is also used for prototyping and building research systems.
18. SpaCy (Commits: 8623, Contributors: 215)
Official website: https://spacy.io/
SpaCy is a natural language processing library with excellent examples, API documentation, and demo applications. The library is written in Cython, a C extension of Python. It supports nearly 30 languages, provides easy deep learning integration, and promises robustness and high accuracy. Another important feature of SpaCy is an architecture designed for processing entire documents, without breaking them into phrases.
19. Gensim (Commits: 3603, Contributors: 273)
Official website: https://radimrehurek.com/gensim/
Gensim is a Python library for robust semantic analysis, topic modeling, and vector space modeling, built on top of NumPy and SciPy. It provides implementations of popular NLP algorithms, such as word2vec. Although gensim has its own models.wrappers.fasttext implementation, the fasttext library can also be used for efficient learning of word representations.
▌ Data collection
20. Scrapy (Commits: 6625, Contributors: 281)
Official website: https://scrapy.org/
Scrapy is a library for creating web crawlers that scan web pages and collect structured data. In addition, Scrapy can extract data from APIs. The library is very convenient to use thanks to its extensibility and portability.
▌ Conclusion
This concludes our collection of Python libraries for data science in 2018. Compared with the previous year, some new modern libraries are gaining popularity, while those that have become classics for data science tasks continue to improve.
The following table shows detailed statistics of GitHub activity: