Introduction to Pandas for Data Science

What is Pandas?

If you wonder where the name comes from, unfortunately, it is not because the creators liked pandas as a species so much. The name is a combination of "panel data", a term with roots in econometrics, and "Python data analysis".

Some of the key features of Pandas are:
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.

Popularity of Pandas

As we learned, Python is the most popular programming language for data analytics, and many popular machine learning and visualization libraries are available for it, including Pandas, NumPy, TensorFlow, Matplotlib, Scikit-learn, and more. In fact, Python ranked 4th among the most popular programming languages in the 2020 Stack Overflow survey, and it is beloved for its simplicity, gentle learning curve, and strong library support.

First Step: Installing Pandas

You can install Pandas with pip, the package installer that ships with Python, by running the following command in your terminal.

$ pip install pandas
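
A quick way to confirm that the installation worked is to import the library and print its version. This is only a small sanity check; the version number you see will depend on your environment.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string varies by setup
print(pd.__version__)
```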

Pandas Data Structures and Data Types

A data type is an internal construct that determines how Python will manipulate, use, or store your data. When doing data analysis, it’s important to use the correct data types to avoid errors. Pandas will often infer data types correctly, but sometimes we need to convert data explicitly. Let’s go over the data types available to us in Pandas, also called dtypes.

  • int64: integer numbers
  • bool: true/false values
  • float64: floating point numbers
  • category: finite list of text values
  • datetime64: Date and time values
  • timedelta[ns]: differences between two datetimes

Values of these dtypes are stored in the two main Pandas data structures, Series and DataFrame.
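
As a minimal sketch of inference and explicit conversion, the example below builds a small table and inspects its dtypes. The column names and values are invented purely for illustration; dtypes, astype() and pd.to_datetime() are standard Pandas.

```python
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    'count': [1, 2, 3],
    'price': [0.5, 1.25, 2.0],
    'in_stock': [True, False, True],
    'size': ['small', 'large', 'small'],
    'added': ['2020-01-01', '2020-02-15', '2020-03-30'],
})

# Pandas infers int64, float64 and bool on its own, but the two
# text columns are not yet category or datetime64 values
print(df.dtypes)

# Explicit conversions
df['size'] = df['size'].astype('category')
df['added'] = pd.to_datetime(df['added'])

# Subtracting datetimes produces timedelta values
df['age'] = df['added'].max() - df['added']
print(df.dtypes)
```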

Series: the most important operations

We can get started with Pandas by creating a Series. We create a Series by calling pd.Series() and passing it a list of values, and we display it with print(). By default, Pandas labels the values with an integer index that starts at 0. We can also define the index labels explicitly.

import pandas as pd

series1 = pd.Series([1, 2, 3, 4])
print(series1)

Assign names to our values

Pandas generates integer indexes automatically, so if we want meaningful labels, we need to define them ourselves. Each index corresponds to its value in the Series object. Let’s look at an example where we assign a country name to population growth rates.
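
A minimal sketch of such a Series might look like this. The country labels and growth rates are made-up placeholders, not real statistics; only the index argument of pd.Series() is the point here.

```python
import pandas as pd

# Placeholder labels and numbers, invented for illustration only
growth = pd.Series(
    [1.2, 0.4, 2.1],
    index=['Country A', 'Country B', 'Country C'],
)

print(growth)
print(growth.index.tolist())
```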

Select entries from a Series

To select entries from a Series, we select elements based on the index name or index number.

  • A single element can be selected by its index number. Keep in mind that index numbers start from 0.
  • Multiple elements can be selected from the Series by passing a list of index names inside the [].
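
Both kinds of selection can be sketched as follows. The data is again made up, and this sketch uses .iloc for position-based access, since plain integer keys on a label-indexed Series are ambiguous in recent Pandas versions.

```python
import pandas as pd

# Placeholder labels and numbers, invented for illustration only
growth = pd.Series(
    [1.2, 0.4, 2.1, 0.9],
    index=['Country A', 'Country B', 'Country C', 'Country D'],
)

first = growth.iloc[0]                        # by index number (position)
by_name = growth['Country C']                 # by index name
subset = growth[['Country A', 'Country D']]   # several index names at once

print(first, by_name)
print(subset)
```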

Drop entries from a Series

Dropping an unwanted index is a common operation in Pandas. Calling drop(index_name) on a Series object returns a new Series without the entry for that index name; the original Series is left unchanged.
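
Here is a small sketch of drop() on a Series with made-up values. Note that the result has to be assigned to a variable if we want to keep it.

```python
import pandas as pd

# Placeholder labels and numbers, invented for illustration only
growth = pd.Series(
    [1.2, 0.4, 2.1],
    index=['Country A', 'Country B', 'Country C'],
)

# drop() returns a new Series; growth itself still has three entries
trimmed = growth.drop('Country B')

print(trimmed)
print(len(growth), len(trimmed))
```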

  • name : str, optional gives a name to the Series
  • copy : bool, default False allows us to copy data we input
  • The notnull() function will return a series object with indexes assigned to False (for NaN or null values), and the remaining indexes are assigned True
  • and much more
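
The name parameter and notnull() from the list above can be tried out with a short sketch like this one. The values and labels are invented; the boolean result of notnull() is used as a mask to keep only the non-missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([10.0, np.nan, 7.5], index=['a', 'b', 'c'], name='scores')

# True where a value is present, False where it is NaN
mask = scores.notnull()

# Keep only the entries that have a value
clean = scores[mask]

print(scores.name)
print(mask)
print(clean)
```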

DataFrame: the most important operations

There are several ways to make a DataFrame in Pandas. The easiest way to create one from scratch is to build it from a dictionary of lists, where each key becomes a column, and then print it.

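import pandas as pd

data = {
    'peppers': [3, 2, 0, 1],
    'carrots': [0, 3, 7, 2]
}
quantity = pd.DataFrame(data)
print(quantity)
# Use month names as the row index instead of the default 0, 1, 2, 3
quantity = pd.DataFrame(data, index=['June', 'July', 'August', 'September'])
print(quantity)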

Get info about your data

One of the first commands you run after loading your data is .info(), which provides all the essential information about a dataset.
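
As a quick sketch, here is .info() called on the vegetable table used in this article. It prints the index type, the column names, the count of non-null values per column, the dtypes, and the memory usage.

```python
import pandas as pd

quantity = pd.DataFrame(
    {'peppers': [3, 2, 0, 1], 'carrots': [0, 3, 7, 2]},
    index=['June', 'July', 'August', 'September'],
)

# Prints a summary to the screen; the method itself returns None
quantity.info()
```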

Searching and selecting in our DataFrame

We also need to know how to access and manipulate the data in our DataFrame, such as selecting, searching, or deleting data values. You can do this either by column or by row. Let’s see how it’s done. The easiest way to select a column of data is by using brackets [ ], and passing a list of column names inside the brackets selects multiple columns. Rows are selected by their index label with .loc. Say we only wanted to look at June’s vegetable quantity.

quantity.loc['June']
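
The different ways of selecting can be sketched side by side. The table is the same vegetable data as before; .iloc and the boolean filter are additions to what the text describes, included here as common companions to .loc.

```python
import pandas as pd

quantity = pd.DataFrame(
    {'peppers': [3, 2, 0, 1], 'carrots': [0, 3, 7, 2]},
    index=['June', 'July', 'August', 'September'],
)

peppers = quantity['peppers']               # one column, returned as a Series
both = quantity[['peppers', 'carrots']]     # list of columns, returned as a DataFrame
june = quantity.loc['June']                 # one row, selected by label
first_row = quantity.iloc[0]                # one row, selected by position
busy = quantity[quantity['carrots'] > 2]    # rows that satisfy a condition

print(june)
print(busy)
```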

Create a new DataFrame from pre-existing columns

We can also grab multiple columns and create a new DataFrame object from them.
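
A minimal sketch of this, assuming a table with a third, hypothetical onions column added for illustration. Calling .copy() makes the new DataFrame independent of the original, so later changes to it do not touch the source table.

```python
import pandas as pd

# The 'onions' column is hypothetical, added only for this example
produce = pd.DataFrame(
    {
        'peppers': [3, 2, 0, 1],
        'carrots': [0, 3, 7, 2],
        'onions': [5, 1, 4, 2],
    },
    index=['June', 'July', 'August', 'September'],
)

# Pick two columns and copy them into a new, independent DataFrame
subset = produce[['peppers', 'onions']].copy()
subset['total'] = subset['peppers'] + subset['onions']

print(subset)
```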

Create a new DataFrame using API

We first need to understand what information can be accessed from the API. To find out, we make an API call for the Free Code Camp channel and check the information we get back.

Create the dataset

Now that we are aware of what to expect from the API response, let’s start with compiling the data together and creating our dataset. For this blog, we’ll consider a list of channels that I collected online.

dataset.sample(5)

  1. None/Null/Blank Values: Some of the rows will have missing values. In such cases, we have two options: remove every row in which any value is blank, or fill the blank spaces with carefully selected values. Here, the status column will be None in some cases. We remove these rows with dropna(axis = 0, how = 'any', inplace = True), which drops rows with blank values from the dataset itself. Then we reset the index so that it runs from 0 up to the length of the dataset by assigning RangeIndex(len(dataset.index)) to it.

Add column headings and update index
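
The missing-value cleanup and index update can be sketched on a tiny stand-in dataset. The channel names and the channel column are hypothetical placeholders; only the status column and the two method calls come from the description in this article.

```python
import pandas as pd

# Stand-in data: channel names are placeholders, not real API results
dataset = pd.DataFrame({
    'channel': ['channel_a', 'channel_b', 'channel_c', 'channel_d'],
    'status': ['public', None, 'public', None],
})

# Drop every row that contains a missing value, modifying dataset in place
dataset.dropna(axis=0, how='any', inplace=True)

# The surviving rows keep their old labels (0 and 2), so renumber them from 0
dataset.index = pd.RangeIndex(len(dataset.index))

print(dataset)
```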

Export Dataset

Our dataset is now ready and can be exported to an external file. We use the to_csv() method and define two parameters. The first parameter is the name of the file. The second parameter is a boolean that controls whether the index is written as the first column of the exported file. We now have a .CSV file with the dataset we created.
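
Here is a hedged sketch of the export step. The data is a stand-in, and the file is written to a temporary directory so the example does not clutter your working folder; in practice you would pass your own file name.

```python
import os
import tempfile

import pandas as pd

# Stand-in data, invented for illustration
dataset = pd.DataFrame({
    'channel': ['channel_a', 'channel_c'],
    'status': ['public', 'public'],
})

# Hypothetical output location inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'dataset.csv')

# index=False leaves the row index out of the exported file
dataset.to_csv(path, index=False)

# Read the file back to check what was written
round_trip = pd.read_csv(path)
print(round_trip)
```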

Dataset.csv

Reindex data in a DataFrame

We can also reindex the data along either the rows or the columns. reindex() returns a new object that conforms to the labels we pass in, so we can make changes without altering the original object.
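
A short sketch of reindex() on the vegetable table shows both directions. The 'October' label is an assumption added to illustrate what happens with a label that does not exist in the original: its row is filled with NaN.

```python
import pandas as pd

quantity = pd.DataFrame(
    {'peppers': [3, 2, 0, 1], 'carrots': [0, 3, 7, 2]},
    index=['June', 'July', 'August', 'September'],
)

# Reorder the rows
reordered = quantity.reindex(['September', 'August', 'July', 'June'])

# Ask for a label that does not exist yet: that row is filled with NaN
extended = quantity.reindex(['June', 'July', 'October'])

# Reorder the columns
swapped = quantity.reindex(columns=['carrots', 'peppers'])

print(reordered)
print(extended)
print(swapped)
```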

How to read or import Pandas data

It is quite easy to read or import data from other files using the Pandas library. In fact, we can use various sources, such as CSV, JSON, or Excel, to load our data and access it. Let’s take a look at an example.

Reading and importing data from CSV files

We can import data from a CSV file, which is common practice for Pandas users. We simply create or open our CSV file, or paste the data into a text editor such as Notepad, and save it in the same directory that houses our Python scripts. We then read the data using the read_csv function built into Pandas.

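import pandas as pd
# Read the CSV file into a DataFrame
data = pd.read_csv('vegetables.csv')
print(data)
# Use the first column of the file as the row index
data = pd.read_csv("data.csv", index_col=0)
# Write the DataFrame out to a new CSV file
data.to_csv('new_vegetables.csv')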

Data Wrangling with Pandas

Once we have our data, we can use data wrangling processes to manipulate and prepare it for analysis. The most common data wrangling processes are merging, concatenation, and grouping. Let’s go over the basics of each.

Merging with Pandas

Merging is used when we want to combine data that shares a key variable but is located in different DataFrames. To merge DataFrames, we use the merge() function. Say we have df1 and df2.

import pandas as pd

d = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'student_name': ['Mark', 'Khalid', 'Deborah', 'Trevon', 'Raven']
}
df1 = pd.DataFrame(d, columns=['subject_id', 'student_name'])
print(df1)

data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'student_name': ['Eric', 'Imani', 'Cece', 'Darius', 'Andre']
}
df2 = pd.DataFrame(data, columns=['subject_id', 'student_name'])
print(df2)

# Inner join on the shared key: only subject_id 4 and 5 appear in both
print(pd.merge(df1, df2, on='subject_id'))
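
By default, merge() keeps only the keys found in both DataFrames. As a hedged sketch with shortened versions of the two tables, the how and suffixes parameters show how to keep unmatched rows and how to tell the two student_name columns apart. The suffix strings are arbitrary choices for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'subject_id': ['3', '4', '5'],
    'student_name': ['Deborah', 'Trevon', 'Raven'],
})
df2 = pd.DataFrame({
    'subject_id': ['4', '5', '6'],
    'student_name': ['Eric', 'Imani', 'Cece'],
})

# Default inner join: only keys present in both tables survive
inner = pd.merge(df1, df2, on='subject_id')

# Outer join: keep every key, filling the gaps with NaN
outer = pd.merge(df1, df2, on='subject_id', how='outer',
                 suffixes=('_df1', '_df2'))

print(inner)
print(outer)
```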

Grouping with Pandas

Grouping is how we categorize our data. If a value occurs in multiple rows of a single column, the data related to that value in other columns can be grouped together. Just like with merging, it’s simpler than it sounds. We use the groupby function.
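
A minimal sketch of grouping, using a small made-up sales table. The column names and numbers are invented for illustration; groupby(), sum() and agg() are standard Pandas.

```python
import pandas as pd

# Hypothetical data: each row is one delivery of a vegetable
sales = pd.DataFrame({
    'vegetable': ['peppers', 'carrots', 'peppers', 'carrots', 'peppers'],
    'quantity': [3, 0, 2, 3, 1],
})

# Split the rows by vegetable, then add up the quantities in each group
totals = sales.groupby('vegetable')['quantity'].sum()

# Several aggregations at once
summary = sales.groupby('vegetable')['quantity'].agg(['sum', 'mean'])

print(totals)
print(summary)
```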

Concatenation

Concatenation is a long word that means adding one set of data to another. We use the concat() function to do so. To clarify the difference between merging and concatenation, merge() combines data based on shared columns, while concat() stacks DataFrames along rows or columns.

print(pd.concat([df1, df2]))
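
The two directions of concat() can be sketched with shortened versions of the student tables. The ignore_index option is an addition to the example above; it renumbers the rows instead of repeating each table's original index.

```python
import pandas as pd

df1 = pd.DataFrame({
    'subject_id': ['1', '2'],
    'student_name': ['Mark', 'Khalid'],
})
df2 = pd.DataFrame({
    'subject_id': ['6', '7'],
    'student_name': ['Cece', 'Darius'],
})

# Stack the rows of df2 under df1 and renumber the index 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# Place the tables next to each other, aligned on the row index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```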

Other data wrangling topics worth exploring include:

  • Finding outliers in data
  • Data Aggregation
  • Reshaping data
  • Replace & rename
  • and more
