Pandas is a popular library for data analysis built on top of the Python programming language. Pandas can be though as a digital toolbox that holds various tools for working with data. Pandas pairs well with other libraries for statistics, natural language processing, machine learning, visualization, and more. Pandas is comparable to Microsoft’s spreadsheet software Excel or Google’s inbrowser Sheets application. All three of these options allow users to interact with single- or multi-column collections of data organized in a tabular (grid-like) shape. Data can be sorted, filtered, aggregated, split, pivoted, summarized, and more
Pandas was initially developed by data scientist Wes McKinney while working at New York’s AQR Capital Management investment firm in 2008. Dissatisfied with both Excel and the statistical programming language R, McKinney searched for a solution that would make it easy to solve frequently encountered problems in the financial industry, particularly data cleanup and aggregation
Pandas has seen continual, extensive growth since its release to the public in December 2009. User counts are estimated to be between 5 and 10 million12. As of April 2020, pandas has been downloaded over 280 million times from PyPi, the centralized online repository of Python packages
Importing a Dataset
The best way to see the power of pandas is to observe it in action! So let’s take a quick tour of the library by analyzing a dataset of the “Forecasts for Product Demand”. The goal here is to get a birds-eye overview of the features and functionalities of the library. Don’t worry too much about the nitty-gritty details; we’ll dive into all these topics in greater depth later.
Our first step is to import the pandas library to get access to its features. We’ll assign it to the popular community alias pd
The square box to the left of the code marks the cell execution order relative to the launch or restart of the Notebook. Cells can be executed in any order and any cell can be executed multiple times.
Pandas can import a variety of file types include xlsx (Excel), sql (SQL), and hdf (Hierarchical Data Format). Each file type has its associated import method available at the top-level of the library. Think of a method as a command that we can issue, either to the library or an entity within it. For now, we’ll use the read_csv method here to tell Pandas to open up the csv file
Manipulating a DataFrame
Pandas imports the CSV file’s contents into an object called a DataFrame, a two-dimensional grid. There are a variety of perspectives from which we can look at the dataset. We can ask pandas for a few rows from the start of the DataFrame by specifying the numeber of row in “head”
We can find out how many rows are in the DataFrame
We can ask for the shape of the DataFrame (the number of rows and columns) as well as the total number of cell values contained within it
We can extract a row from the dataset by its numeric order in line, which is also called its index position. In most programming languages, the index starts counting at 0 instead of 1. So if we wanted to pull out the 500th movie in the list, we would target index position 500
Pandas returns a new object here called a Series. A Series stores a single column of values. Like a DataFrame, a Series contains an index. Notice that the index labels for the Series (Product_Code, Warehouse, Product_Category, Date, Order_Demand) are the column names from the original demand data-frame
As it currently stands, the dataset is sorted by the values in the Rank column. What if we wanted to see the five entries on the list with the most recent release date? We can sort a DataFrame by a single column, such as
Interestingly the type of order_demand is not numeric so we might need to clean this data and map it to the numeric form
Counting Values in a Series
Let’s try a more sophisticated analysis! What if we wanted to find out the number of products belonged to each warehouse
Filtering a Column by One or More Criteria
A common operation in data analysis is extracting a subset of rows from an original dataset based on one or more criteria. Microsoft Excel offers the Filter tool for this exact purpose. Let’s say we wanted to limit our demand to only product_1687. We can accomplish this in pandas with one line of code.
Our next challenge is the most complex one yet. Let’s find out the aggregate total of Order_Demand for each product
In order to get the correct answer you need to perform the cleaning and mapping type for the Oder_Demand