Build Pipelines with Pandas Using pdpipe
Posted by Armando Brito Mendes | Filed under programming languages, software
A good description of pipelines with Pandas DataFrames.
Pandas is an amazing library in the Python ecosystem for data analytics and machine learning. It forms the perfect bridge between the data world, where Excel/CSV files and SQL tables live, and the modeling world, where Scikit-learn or TensorFlow perform their magic.
A data science workflow is most often a sequence of steps: datasets must be cleaned, scaled, and validated before they are ready to be used by a machine learning algorithm.
These tasks can, of course, be done with the many single-step functions and methods offered by packages like Pandas, but a more elegant way is to use a pipeline. In almost all cases, a pipeline reduces the chance of error and saves time by automating repetitive tasks.
In the data science world, great examples of packages with pipeline features are dplyr in the R language and Scikit-learn in the Python ecosystem.
The following article is a great introduction to using Scikit-learn pipelines in a machine-learning workflow.
Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction
Are you familiar with Scikit-learn Pipelines? They are an extremely simple yet very useful tool for managing machine…
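For readers who have not seen one, here is a minimal sketch of what a Scikit-learn pipeline can look like. The tiny dataset and the step names ("scale", "classify") are made up for illustration and do not come from the linked article; the point is only that scaling and modeling are chained into a single object that is fitted and used as one unit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for this sketch: two features, one binary label.
X = np.array([
    [1.0, 200.0],
    [2.0, 180.0],
    [3.0, 160.0],
    [8.0, 40.0],
    [9.0, 20.0],
    [10.0, 10.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Chain the preprocessing step and the estimator into one pipeline.
model = Pipeline(steps=[
    ("scale", StandardScaler()),         # standardize each feature
    ("classify", LogisticRegression()),  # fit a classifier on the scaled data
])

# fit() runs every step in order; predict() reuses the fitted scaler.
model.fit(X, y)
predictions = model.predict(X)
```

Because the scaler lives inside the pipeline, the same transformation is applied automatically at prediction time, which is one way a pipeline removes a common source of error.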
Pandas also offers a .pipe method, which can be used for similar purposes with user-defined functions. However, in this article, we are going to discuss a wonderful little library called pdpipe, which specifically addresses this pipelining issue with Pandas DataFrames.
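As a point of comparison, here is a small sketch of chaining user-defined functions with the built-in DataFrame.pipe method. The column names (price, area, notes) and the three helper functions are hypothetical and chosen only for illustration; this uses plain Pandas, not pdpipe.

```python
import pandas as pd


def drop_missing_prices(df):
    """Remove rows where the (hypothetical) 'price' column is missing."""
    return df.dropna(subset=["price"])


def add_price_per_sqft(df):
    """Add a derived column; assign() returns a new DataFrame."""
    return df.assign(price_per_sqft=df["price"] / df["area"])


def keep_columns(df, columns):
    """Keep only the requested columns."""
    return df[columns]


# Made-up sample data for the sketch.
raw = pd.DataFrame({
    "price": [300000.0, None, 450000.0],
    "area": [1500, 1200, 1800],
    "notes": ["a", "b", "c"],
})

# Each .pipe call passes the DataFrame as the first argument of the function;
# extra keyword arguments are forwarded to it.
clean = (
    raw
    .pipe(drop_missing_prices)
    .pipe(add_price_per_sqft)
    .pipe(keep_columns, columns=["price", "price_per_sqft"])
)
```

The steps read top to bottom in the order they run, and the original DataFrame is left untouched because each function returns a new object.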
Tags: data mining, pandas, pipelines, Python