1. Introduction: What is Pandas?

Pandas is a powerful Python library for working with structured financial data. It provides intuitive data structures—like Series (for single columns of data) and DataFrames (for full tables)—that make it easy to clean, analyze, and transform datasets such as price histories, income statements, trading logs, or economic indicators.

In financial workflows, Pandas is used for a wide range of tasks: loading time series data (e.g., stock prices), calculating returns and volatility, aggregating trades by day or asset, building factor models, and cleaning large sets of transaction data. Analysts use it to automate reporting, build dashboards, backtest trading strategies, and prepare data for statistical modeling or machine learning. It’s particularly well-suited to working with time-indexed data and integrating multiple sources—such as Bloomberg exports, market data feeds, or financial statements.

What makes Pandas so valuable in finance is its ability to handle real-world data: it deals gracefully with missing entries, irregular timestamps, and mixed data types—all common challenges in financial analysis. Whether you’re managing a hedge fund model, analyzing bank performance, or building an earnings forecast, Pandas helps you turn raw financial data into clear, actionable insights.

In this guide, we’ll walk through the basics of getting started with Pandas — from loading and exploring data in your first DataFrame, to mid-level topics like grouping, merging, reshaping, and working with time series. Whether you’re a beginner looking to understand the fundamentals or an intermediate user aiming to deepen your skills, this article is designed to be your practical companion in learning how to think—and code—with Pandas.

2. Why Use Pandas?

Pandas gives you a powerful, Python-based toolkit for working with structured data.

It’s a library that allows you to load data from external sources like CSVs, Excel, SQL databases, or APIs, then quickly explore, clean, and transform it. With just a few lines of code, you can filter transactions, calculate financial metrics like returns or profit margins, and organize your data into meaningful groups, such as revenue by region or cost per product.

Once your data is structured, Pandas helps you analyze and summarize it. You can sort, rank, and reshape datasets, apply custom functions, or build pivot tables to get deeper insights. Its grouping and aggregation tools make it easy to compute KPIs across time periods, client segments, or asset classes. For financial time series, Pandas offers built-in support for datetime indexing, resampling, and rolling statistics, enabling accurate modeling and historical analysis.

Pandas also integrates seamlessly with other libraries like Matplotlib for quick visualizations, and scikit-learn for machine learning workflows. Whether you’re preparing an earnings report, building a forecasting model, or analyzing trades, Pandas lets you move from raw data to insight efficiently—making it an essential tool for modern financial analysis.

3. Core Concepts and Data Structures

To use Pandas effectively, you need to first understand its two fundamental building blocks: Series and DataFrames.

  • A Series is like a single column of labeled data. You can think of it as a list with names attached—each value has an associated label called an index. This makes it easy to refer to values not just by position, but by a meaningful key (like a date, a stock ticker, or a customer ID). Series are great for 1D data—like a time series of prices or a list of countries with population counts.

  • A DataFrame is a table made of multiple Series. Each column is its own Series, and all the columns are aligned by their index (row labels). This two-dimensional structure is what makes Pandas so powerful for real-world datasets. DataFrames let you work with full tables of information—like balance sheets, trading logs, or multi-column reports—while preserving the relationships between variables.

Series and DataFrames are used to import, hold, and work with data in Pandas. Conceptually, they’re the foundation of everything you do with the library, whether you’re cleaning messy inputs, analyzing trends, or preparing data for machine learning.

3.1 What is a Series in Pandas

A Series in Pandas is like a column in a spreadsheet. It consists of:
  • A set of values (the data)
  • A corresponding index (the labels for each value)

Example of a Simple Series:

Creating a Series in Pandas is as simple as calling the Series() constructor.

To do that, import pandas, then pass the constructor an array of data and its corresponding indices.

    
     import pandas as pd

     s = pd.Series([100, 200, 300], index=['AAPL', 'MSFT', 'GOOG'])

     print(s)
    
   

Output

Calling print(s) results in the following series being printed to the output area.

    
     AAPL    100
     MSFT    200
     GOOG    300
     dtype: int64
    
   

Here’s what’s happening conceptually:

  • pd.Series refers to the Series class or constructor function from the Pandas library.

  • When you write pd.Series(...), you are creating a new Series object by calling that constructor and passing it:

    • A list of values: [100, 200, 300]

    • An index (labels for each value): ['AAPL', 'MSFT', 'GOOG']
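
A Series can also be built straight from a Python dictionary, in which case the keys become the index labels. A small sketch (the prices name is just illustrative):

     prices = pd.Series({'AAPL': 100, 'MSFT': 200, 'GOOG': 300})
     print(prices['MSFT'])   # 200 -- look up a value by its label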

3.2 What is a DataFrame in Pandas

The DataFrame is Pandas’ most important data structure. DataFrames hold 2D labeled data and can be thought of as a table with rows and columns, where:
  • Each column is a Series
  • Each row has a unique index

Example:

    
     data = {
         'Company': ['AAPL', 'MSFT', 'GOOG'],
         'Price': [100, 200, 300],
         'Sector': ['Tech', 'Tech', 'Tech']
     }

     df = pd.DataFrame(data)
     print(df)
    
   

Output

    
       Company  Price Sector
     0    AAPL    100   Tech
     1    MSFT    200   Tech
     2    GOOG    300   Tech
    
   

3.3 The Many Ways to Create a DataFrame

One of the most powerful features of Pandas is the ability to create a DataFrame out of a variety of very common formats that financial data comes in.

You can create a DataFrame from:

  • A dictionary of lists or Series (most common), or a list of dictionaries

  • A CSV, Excel, SQL, or JSON file

  • A NumPy array
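
For instance, here is a quick sketch of the first and last options (the column names are just illustrative):

     import numpy as np

     # From a list of dictionaries -- each dict becomes a row
     df = pd.DataFrame([{'Ticker': 'AAPL', 'Price': 100},
                        {'Ticker': 'MSFT', 'Price': 200}])

     # From a NumPy array -- column labels are supplied separately
     arr = np.random.rand(3, 2)
     df = pd.DataFrame(arr, columns=['Open', 'Close'])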

Example – From a CSV:

Take, for example, the process of turning a CSV file into a DataFrame for analysis in Pandas.

    
     df = pd.read_csv('stocks.csv')
    
   

Yes! It’s that simple.

Pandas will read your CSV and translate it into an indexed group of Series, ready to perform data analysis on.

Example – From SQL:

Pandas also supports many other sources. For example, SQL databases are fully supported (SQLite in the example below, though MySQL, PostgreSQL, and others work via a database connection as well), which allows analysts to take slices of larger datasets and shape a DataFrame directly from a database query.

    
     import sqlite3

     conn = sqlite3.connect('mydb.sqlite')
     df = pd.read_sql_query("SELECT * FROM stocks", conn)
    
   

3.4 Starting to Explore Your DataFrame

We’ve imported our CSV, and we have our first DataFrame – now what? Our first step in working with DataFrames is to understand the most common “worker functions” that allow us to quickly explore, gut-check, and understand our data as we work with it.

Here are several common functions, in no particular order:

Function       Purpose
.head(n)       View first n rows
.tail(n)       View last n rows
.info()        Summary of types, nulls, and memory usage
.describe()    Descriptive statistics for numeric columns
.shape         Get (rows, columns)

.head(n)

.head returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

.tail(n)

.tail returns the last n rows for the object based on position. Similar to .head(), it is useful for quickly checking that operations you performed on your DataFrame worked correctly (e.g., after sorting, appending, or cleaning rows).

.info()

.info() provides metadata about your DataFrame. It prints information about a DataFrame including the index dtype and columns, non-null counts, and memory usage. The memory usage element is particularly useful when working with large datasets.

.describe()

.describe() provides a group of baseline statistics about the data in a Series or DataFrame. It returns different results depending on whether you are describing a Series or a DataFrame, and on the contents of your data (numeric data, categorical data, timestamp data, etc.).

.shape

.shape is an attribute (not a method, so no parentheses) that returns the number of rows and columns of your active DataFrame as a tuple. Again, a very useful check as you merge, join, truncate, filter, and sort data.
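
Assuming the stocks.csv DataFrame from earlier, a quick inspection pass might look like the following sketch (exact output depends on your data):

     df = pd.read_csv('stocks.csv')

     df.head(5)       # first five rows
     df.tail(5)       # last five rows
     df.info()        # dtypes, non-null counts, memory usage
     df.describe()    # summary statistics for numeric columns
     df.shape         # (n_rows, n_columns) -- attribute, no parentheses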

4. Essential Pandas Operations

Now that you understand the two core data structures in Pandas—Series and DataFrames—it’s time to get practical. This section walks you through the most common operations you’ll perform when working with data in Pandas: loading data, inspecting it, filtering it, transforming it, and preparing it for analysis.

4.1 Selecting and Filtering Data with Boolean Indexing

One of the most common first tasks you’ll need to do in Pandas is to take a DataFrame and select (or filter) your data based on a set of criteria. 

Boolean indexing in Pandas allows you to filter rows in a DataFrame based on a condition that returns either True or False. It is one of the most intuitive and commonly used ways to slice a dataset. For example, you might want to isolate all transactions over $1 million, or all deals in the 'Healthcare' sector. You simply pass a condition inside square brackets—such as df[df['Revenue'] > 1_000_000]—and Pandas returns only the rows where the condition is true. This method is highly useful in finance for quickly analyzing outliers, thresholds, or any data that meets certain performance or classification criteria.

Selecting Columns by Name

    
     df['Revenue']            # Single column (as Series)
     df[['Revenue', 'Cost']]  # Multiple columns (as DataFrame)
    
   

Filtering Rows with Boolean Indexing

You can filter rows using boolean conditions, or even functions.

    
     df[df['Revenue'] > 1000]    # Only rows with Revenue > 1000
     df[df['Region'] == 'West']  # Only rows where Region is 'West'
    
   

Combining Conditions

Creating slices of data is easy, and combining multiple conditions can allow analysts to create very useful slices of their data.
    
     df[(df['Revenue'] > 1000) & (df['Region'] == 'West')]
    
   
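
Note that Pandas uses & for AND, | for OR, and ~ for NOT, and that each condition must be wrapped in parentheses. A couple of further sketches:

     df[(df['Revenue'] > 1000) | (df['Region'] == 'West')]  # OR
     df[~(df['Region'] == 'West')]                          # NOT
     df[df['Region'].isin(['West', 'East'])]                # membership test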

4.2 Indexing and Selection in Pandas

As an alternative to filtering a DataFrame by boolean indexing, Pandas offers the ability to select data by integer-based position or by label.

.iloc[] — Integer-position indexing

In a DataFrame, every row has a numeric position. .iloc[] allows you to access rows by this integer-based index, starting with the first row as row 0. .iloc[] accepts a single integer, a list of integers, or a slice of integers. It allows you to select rows and columns by their integer positions in the DataFrame, which is especially helpful when you’re working with data whose column names aren’t known ahead of time or when you’re dealing with time slices.

For instance, if you want to get the first five rows and the second and third columns, you would use df.iloc[0:5, [1, 2]]. In finance, .iloc[] can be helpful for tasks like extracting the last 4 quarters of revenue or the first 10 rows in a screening model output, where the row or column positions are predictable and consistent.

For advanced filtering, .iloc[] can also accept arrays of positions and callables, as sketched after the basic examples below.

    
     df.iloc[0]      # First row
     df.iloc[1:3]    # Rows 1 and 2
     df.iloc[:, 1]   # All rows, second column
    
   
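
For example, a list of positions or a callable can stand in for a single integer (a small sketch):

     df.iloc[[0, 2]]                      # First and third rows
     df.iloc[lambda d: [0, len(d) - 1]]   # First and last rows via a callable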

.loc[] — Label-based indexing

Alternatively, Pandas also allows you to access a group of rows and columns by its label(s) using .loc[].

The .loc[] method is used for label-based indexing, letting you select data by row and column names. It’s more descriptive and flexible than boolean indexing or .iloc[], especially when you need to select specific columns based on a condition. 

For example, df.loc[df['Region'] == 'EMEA', ['Company', 'Revenue']] would return just the Company and Revenue columns for companies in the EMEA region. It also works well with datetime indexes—like selecting data between two dates—and is often used in financial applications where you’re working with named fields such as 'EBITDA', 'Revenue', or 'Close Price'.

    
     df.loc[1]           # Row with index label 1
     df.loc[0:2]         # Range of rows by label (endpoints inclusive)
     df.loc[:, 'Price']  # All rows, 'Price' column
    
   

4.3 Sorting and Ranking

Once you have selected and filtered the data in a DataFrame, you can begin with basic manipulations, such as sorting and ranking.

In finance, sorting and ranking are essential tools for prioritizing and analyzing data within a DataFrame. Sorting allows you to reorder your dataset based on values in one or more columns—such as sorting a list of deals by deal size (df.sort_values('Deal Size', ascending=False)) or sorting securities by return. Ranking, on the other hand, assigns a relative position to each row based on a specific metric. For example, you might use df['Return Rank'] = df['Return'].rank(ascending=False) to identify top-performing assets in a portfolio. This is particularly useful in investment banking and private equity for tasks like identifying the top 10 M&A deals, ranking companies by EBITDA multiples, or preparing league tables. Combined with filtering and aggregation, sorting and ranking turn raw data into actionable insights.

    
     df.sort_values(by='Revenue', ascending=False)    # Sort by column
     df.sort_values(by=['Region', 'Revenue'])         # Multi-column sort
     df['Rank'] = df['Revenue'].rank(ascending=False) # Add a rank column
    
   
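
For "top 10" style questions, .nlargest() and .nsmallest() are convenient shortcuts (column names here are illustrative):

     df.nlargest(10, 'Revenue')    # Ten largest rows by Revenue
     df.nsmallest(5, 'Revenue')    # Five smallest rows by Revenue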

4.4 Reshaping with Pivot and Melt

Pandas makes it easy to pivot, melt, and reshape data from wide to long formats and back.

In finance, reshaping data with pivot and melt is incredibly useful when transforming data between different analytical views. The pivot() method helps convert data from a long format to a wide format—for instance, turning a transaction log into a table where each row is a company and each column is a fiscal quarter with corresponding revenue values. This makes it easier to perform time-series analysis or generate management reports. On the other hand, melt() (also known as unpivoting) is used to convert wide-format data back into long format, which is ideal for running grouped operations, plotting, or exporting to systems that expect tidy data.

For example, in private equity, a pivot table might help an analyst compare EBITDA margins across portfolio companies by quarter. Later, melting the data could help reformat it for regression analysis or visualization. These reshaping operations streamline the workflow when toggling between analysis-ready and presentation-ready data structures.

Pivoting

Pivoting in Pandas refers to transforming or reshaping a DataFrame from a long format to a wide format. It’s a powerful way to reorganize your data so that columns become rows, or vice versa, depending on your analysis needs.

Conceptually:

  • Pivoting turns unique values in one column into new column headers.
  • It’s useful when you want to summarize or display your data in a more readable, structured table format.
    
     df.pivot_table(index='Region', columns='Month', values='Revenue', aggfunc='sum')
    
   

Melting (Unpivoting)

Melting (also known as unpivoting) in Pandas is the reverse of pivoting: it transforms a DataFrame from a wide format into a long format. This is useful when you want to normalize your data or prepare it for analysis or visualization. Conceptually:
  • Melting takes column headers and turns them into row values.
  • It reduces the number of columns and increases the number of rows.
  • Great for preparing time series data or feeding into plotting tools like Seaborn or Matplotlib.
    
     pd.melt(df, id_vars='Region', value_vars=['Jan', 'Feb', 'Mar'])
    
   

4.5 Applying Functions to Columns

Using .apply() to run custom functions

Using .apply() in Pandas allows you to run a custom function across a DataFrame or Series — row-wise or column-wise. When used on a Series, .apply() applies a function to each element. When used on a DataFrame, you specify whether to apply the function row-wise (axis=1) or column-wise (axis=0).

    
     df['Margin'] = df.apply(lambda row: row['Revenue'] - row['Cost'], axis=1)
    
   

Using .map() on a Series

The .map() function in Pandas is used to transform the values of a Series by applying a function or a mapping dictionary to each element. It’s ideal for element-wise transformations when working with a single column. .map() only works on Series (not entire DataFrames). It’s great for tasks like value replacement based on a dictionary, formatting or computing with a function, and cleaning or standardizing values.
    
     df['Region_Code'] = df['Region'].map({'West': 1, 'East': 2, 'North': 3})
    
   
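
.map() also accepts a function, which is handy for formatting (a sketch; the format string is just an example):

     df['Revenue_fmt'] = df['Revenue'].map(lambda x: f'${x:,.0f}')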

5. Data Cleaning and Preprocessing

Before you can analyze or model your data, you often need to clean it. This step is often where data scientists spend most of their time — and Pandas makes it significantly easier with a rich set of tools for wrangling messy, inconsistent, or incomplete data.

This section covers the most common data cleaning and preprocessing techniques in Pandas.

5.1 Handling Missing Data

Missing data is a common issue during financial modeling, valuation, or due diligence—especially when aggregating data from multiple sources like company financial statements, data rooms, or market databases. Pandas has a number of very useful helper functions to identify, remove, and correct missing data in datasets.

Detecting Missing Data

Detecting missing data is an essential part of data cleaning in Pandas. It allows you to identify and handle gaps in your dataset, such as blank cells, NaN values, or missing timestamps.

    
     df.isna()           # True/False for each cell
     df.isnull().sum()   # Count of nulls per column
    
   

Removing Missing Data

Removing missing data in Pandas is a crucial step when preparing your dataset for analysis or modeling. The df.dropna() method helps eliminate rows or columns that contain NaN (missing) values, giving you control over how clean your data needs to be.

    
     df.dropna()                  # Drop any rows with NaNs
     df.dropna(axis=1)            # Drop columns with any NaNs
     df.dropna(subset=['Price'])  # Drop rows where 'Price' is NaN
    
   

Filling Missing Data

Filling missing data in Pandas is a flexible and powerful way to handle NaN values without discarding valuable rows or columns. The .fillna() method lets you replace missing values using a variety of strategies depending on the context of your dataset.

    
     df.fillna(0)   # Replace NaNs with 0
     df.ffill()     # Forward fill (df.fillna(method='ffill') in older pandas)
     df['Revenue'] = df['Revenue'].fillna(df['Revenue'].mean())  # Column mean
    
   

5.2 Removing Duplicates

Duplicate rows can skew results, especially in aggregations. Pandas’ df.duplicated() identifies rows that are duplicates of previous ones by returning True for each repeated row, while df.drop_duplicates(inplace=True) permanently removes those duplicate rows from the DataFrame, keeping only the first occurrence.

    
     df.duplicated()                   # Returns True for duplicated rows
     df.drop_duplicates(inplace=True)  # Removes duplicated rows
    
   

You can also drop based on specific columns:

    
     df.drop_duplicates(subset=['CustomerID', 'OrderID'])
    
   

5.3 Replacing Values

You can clean inconsistent formatting or standardize values:

    
     import numpy as np

     df['Status'] = df['Status'].replace({'pending': 'Pending', 'PENDING': 'Pending'})
     df['Category'] = df['Category'].replace('Misc', np.nan)  # Convert to NaN
    
   

5.4 Type Conversion

Pandas automatically infers data types, but you often need to correct or convert them.

    
     df['Date'] = pd.to_datetime(df['Date'])        # Convert to datetime
     df['Price'] = df['Price'].astype(float)        # Convert to float
     df['Region'] = df['Region'].astype('category') # Convert to categorical type
    
   

Use .dtypes to check current types and .astype() for manual changes.

5.5 Standardizing Text Data

Text data often needs normalization (especially when imported from spreadsheets or manual entry):

    
     df['city'] = df['city'].str.strip().str.lower()
     df['product'] = df['product'].str.replace('-', ' ').str.title()
    
   

5.6 Applying Custom Functions

Use .apply() to run a function across rows or columns:

    
     def categorize(row):
         if row['Revenue'] > 1000:
             return 'High'
         else:
             return 'Low'

     df['Revenue_Level'] = df.apply(categorize, axis=1)
    
   

5.7 Creating New Columns/Variables

One of the amazing features of the DataFrame is the ability to create new columns from existing data, whether through arithmetic on existing columns or through boolean logic:

    
     df['Profit'] = df['Revenue'] - df['Cost']
     df['Revenue_per_Item'] = df['Revenue'] / df['Quantity']
     df['Is_High_Value'] = df['Revenue'] > 500
    
   

5.8 Renaming and Reshaping Columns

Sometimes column names need cleaning.
    
     df.rename(columns={'Rev': 'Revenue'}, inplace=True)
     df.columns = df.columns.str.lower()  # Make all column names lowercase
    
   

6. Grouping and Aggregation

Once your data is clean and structured, you’ll often want to summarize it — to calculate totals, averages, counts, or other metrics across categories or time periods. That’s where Pandas’ powerful groupby functionality comes in.

This section walks you through grouping, aggregating, and summarizing data using real-world patterns.

6.1 The GroupBy Concept

The groupby() method follows a simple, powerful pattern:

Split → Apply → Combine

You split the data into groups based on some key(s), apply a function to each group independently, and then combine the results into a new DataFrame or Series.

6.2 Simple Grouping and Aggregation

Suppose you have a sales dataset and want to find the total revenue by region:
    
     df.groupby('Region')['Revenue'].sum()
    
   

Want the average revenue instead?

    
     df.groupby('Region')['Revenue'].mean()
    
   

Want multiple aggregations?

    
     df.groupby('Region')['Revenue'].agg(['sum', 'mean', 'max'])
    
   

6.3 Grouping by Multiple Columns

You can group by more than one column to get hierarchical summaries.

    
     df.groupby(['Region', 'Product'])['Revenue'].sum()
    
   

This returns a MultiIndex result — a layered index that lets you drill down into combinations of dimensions.

6.4 Resetting the Index

After grouping, your index becomes the grouping column(s). You can convert it back to regular columns using .reset_index():

    
     grouped = df.groupby('Region')['Revenue'].sum().reset_index()
    
   

This is especially helpful for exporting or plotting.

6.5 Filtering Within Groups

You can use .filter() to return only those groups that meet a condition.

Example: keep only regions with more than 100 total orders:

    
     df.groupby('Region').filter(lambda x: x['OrderID'].count() > 100)
    
   

6.6 Custom Aggregation with agg()

The .agg() method lets you apply multiple functions at once, or use custom ones.

    
     df.groupby('Region').agg({
         'Revenue': 'sum',
         'Cost': 'mean',
         'OrderID': 'count'
     })
    
   

You can also define your own functions:

    
     df.groupby('Product')['Revenue'].agg(lambda x: x.max() - x.min())
    
   

6.7 Working with MultiIndex Results

When grouping by multiple columns, Pandas returns a MultiIndex object, which can be flattened if needed:

    
     grouped = df.groupby(['Region', 'Product'])['Revenue'].sum()
     grouped = grouped.unstack()   # Makes 'Product' columns, 'Region' rows
    
   

6.8 Descriptive Statistics per Group

Calling .describe() on a grouped column gives you count, mean, std, min, max, and quartiles for each group — a fast way to summarize distributions.
    
     df.groupby('Region')['Revenue'].describe()
    
   

6.9 Pivot Tables (Alternative to GroupBy)

Pivot tables provide a cleaner syntax for grouped summaries, especially in 2D.

    
     df.pivot_table(index='Region', columns='Month', values='Revenue', aggfunc='sum')
    
   

7. Merging, Joining, and Concatenation

In real-world projects, your data rarely lives in a single file or table. Instead, you’ll often work with multiple datasets that need to be combined—for example, transactions with customer info, or products with pricing tables. 

Pandas offers powerful tools for merging and joining, inspired by SQL but made even more flexible.

7.1 The merge() Function

The merge() function is the most common function for combining data, and is used to join DataFrames together based on one or more common keys.

To understand the merge() function, it helps to first become comfortable with SQL joins. Below is a brief overview of the four common types of joins, and how each leads to very different data in your post-join DataFrame.

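To make the join types concrete, here is a minimal sketch with two toy DataFrames (the customers and orders data are hypothetical):

     customers = pd.DataFrame({'CustomerID': [1, 2, 3],
                               'Name': ['Acme', 'Birch', 'Cobalt']})
     orders = pd.DataFrame({'CustomerID': [2, 3, 4],
                            'Amount': [500, 750, 900]})

     pd.merge(customers, orders, on='CustomerID')               # inner: IDs 2 and 3
     pd.merge(customers, orders, on='CustomerID', how='outer')  # all IDs, NaNs fill gaps
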
Inner Join – The Default Merge

By default, when you use the merge() function, Pandas assumes you are attempting an inner join on your data.

In this case, only rows where your key exists in both DataFrames will be included in your resulting DataFrame. Rows with unmatched keys will be removed from your resulting DataFrame.

    
     pd.merge(customers, orders, on='CustomerID')  # Default merge: assumes inner join
     pd.merge(df1, df2, on='ID', how='inner')      # Explicit inner join
    
   

Left Join

Keeps all rows from df1, the left DataFrame.

  • If there’s a matching ID in df2, the corresponding data is added.
  • If there’s no match, columns from df2 will contain NaN.

Use when: You want to preserve everything from df1 and only add data from df2 where available.

    
     pd.merge(df1, df2, on='ID', how='left')   # Left join
    
   

Right Join

Keeps all rows from df2, the right DataFrame.

  • If there’s a matching ID in df1, the corresponding data is added.
  • If there’s no match, columns from df1 will contain NaN.

Use when: You want to preserve everything from df2 and only add data from df1 where available.

    
     pd.merge(df1, df2, on='ID', how='right')  # Right join
    
   

Outer Join

Keeps all rows from both df1 and df2.

  • Matches on ID where possible.
  • Rows with no match in the other DataFrame will have NaN for the missing columns.

Use when: You want a complete view, including all unmatched records from both sides.

    
     pd.merge(df1, df2, on='ID', how='outer')  # Full outer join
    
   

7.2 The join() Method

In addition to the merge() function, Pandas offers the DataFrame method join() for index-based joins.

DataFrame.join() is a convenient method when you’re joining on a single key or index. The join() method is called on the left-hand DataFrame.

    
     df1.set_index('ID').join(df2.set_index('ID'))

    
   

You can also pass how='left', how='right', etc., just like merge().

7.3 Concatenation with concat()

Use pd.concat() to stack DataFrames vertically or horizontally.

Stack Vertically (like appending rows)

    
     pd.concat([df1, df2], axis=0)

    
   

Stack Horizontally (like adding columns)

    
     pd.concat([df1, df2], axis=1)
    
   

Add ignore_index=True if you want to reset the index during vertical concatenation.

8. Time Series Capabilities

One of Pandas’ most powerful and underrated features is its built-in support for time series data.

This is incredibly important in Finance applications where the vast majority of data is time series. Whether you’re analyzing stock prices, forecasting sales, or aggregating sensor data by minute, Pandas provides the tools to work with time-based data intuitively and efficiently.

8.1 Working with Dates and Times

The first step in preparing your DataFrame for time series work is to convert the appropriate columns to datetime using pd.to_datetime():

    
     df['Date'] = pd.to_datetime(df['Date'])
    
   

Once a column is in datetime format, you can:

  • Extract components (.dt.year, .dt.month, .dt.day)

  • Filter by date

  • Sort chronologically

  • Use it as an index for time-based operations

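For example, a quick sketch of these operations (assuming the df from above with a converted 'Date' column):

     df['Year'] = df['Date'].dt.year     # Extract the year component
     df['Month'] = df['Date'].dt.month   # Extract the month component
     df = df.sort_values('Date')         # Sort chronologically
     df[df['Date'] >= '2023-01-01']      # Filter by date
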
8.2 DatetimeIndex and Time-Based Indexing

Setting a datetime column as the index unlocks time series functionality:

    
     df = df.set_index('Date')
    
   

Now you can slice by date with .loc (a bare df['2023'] would be treated as a column lookup):

    
     df.loc['2023']        # All data from 2023
     df.loc['2023-03']     # All data from March 2023
     df.loc['2023-03-15']  # Exact day
    
   

You can also sort by time:

    
     df = df.sort_index()
    
   

8.3 Resampling: Changing the Frequency of Data

Resampling lets you convert between frequencies—e.g., from daily to monthly, or hourly to weekly.

resample()

    
     df.resample('M').mean()   # Monthly average ('ME' in pandas >= 2.2)
     df.resample('Q').sum()    # Quarterly totals ('QE' in pandas >= 2.2)
    
   
Code          Frequency
'D'           Day
'W'           Week
'M'           Month-end
'Q'           Quarter-end
'Y'           Year-end
'H'           Hour
'T' or 'min'  Minute

8.4 Rolling Windows and Moving Averages

Rolling functions help you smooth out noisy time series data or compute metrics over a trailing window.

Simple Moving Average

This calculates a 7-day moving average.

    
     df['RollingAvg'] = df['Price'].rolling(window=7).mean()
    
   

Rolling with Custom Aggregations

You can also use .apply() with custom functions on rolling windows.

    
     df['RollingStd'] = df['Price'].rolling(30).std()
     df['RollingMin'] = df['Price'].rolling(14).min()
    
   
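
For instance, a trailing 30-day high-low range via a custom function (a sketch):

     df['Range30'] = df['Price'].rolling(30).apply(lambda x: x.max() - x.min())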

8.5 Shifting

Shifting helps compare values across time lags:
    
     df['Yesterday'] = df['Price'].shift(1)
    
   
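
A classic finance use of shifting is computing daily returns by comparing each price to the prior day's (a sketch; .pct_change() is the built-in shortcut):

     df['Return'] = df['Price'] / df['Price'].shift(1) - 1
     df['Return'] = df['Price'].pct_change()   # Equivalent built-in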

8.6 Time Zones and Localization

Pandas supports time zone-aware data.

This is useful when combining global datasets or aligning financial markets across time zones.

    
     df.index = df.index.tz_localize('UTC')              # Add time zone
     df.index = df.index.tz_convert('America/New_York')  # Convert

    
   

9. Visualization with Pandas

Congratulations! You’ve made it to data visualization. 

Data visualization is key to understanding trends, spotting anomalies, and communicating insights. While libraries like Matplotlib, Seaborn, and Plotly are great for custom plots, Pandas provides convenient built-in plotting methods that cover most day-to-day needs—especially for quick exploratory analysis.

These plotting tools are powered by Matplotlib under the hood, so you get the best of both worlds: fast prototyping and customizable output.

9.1 Getting Started

To use plotting in Pandas, you’ll first need to ensure Matplotlib is installed:

    
     pip install matplotlib
    
   

9.2 Line Plots (Default)

When you call .plot() on a time series or numeric column, Pandas defaults to a line plot:

    
     df['Revenue'].plot(title='Revenue Over Time')
    
   

This is especially useful for:

  • Time series (daily prices, sales, etc.)

  • Trends over periods

9.3 Bar Plots

Use bar plots to compare categories:

    
     df.groupby('Region')['Revenue'].sum().plot(kind='bar', title='Revenue by Region')

    
   

Use kind='barh' for a horizontal bar chart.

9.4 Histograms

Histograms help you understand the distribution of a single variable:

    
     df['Revenue'].plot(kind='hist', bins=20, title='Revenue Distribution')

    
   

This is useful for spotting skewed data or outliers.

9.5 Boxplots

Boxplots show median, quartiles, and outliers—great for comparing distributions across groups:

    
     df.boxplot(column='Revenue', by='Region')

    
   

9.6 Scatter Plots

To show relationships between two numeric variables:

    
     df.plot(kind='scatter', x='Revenue', y='Cost')

    
   

9.7 Area Charts

These are less common but available for completeness:

    
     df.plot(kind='area', stacked=False, alpha=0.4)
    
   

9.8 Pie Charts

Pie charts are also less common in day-to-day analysis, but useful for showing composition, such as the share of rows per region:

    
     df['Region'].value_counts().plot(kind='pie', autopct='%1.1f%%')
    
   

9.9 Plot Customization

Since Pandas uses Matplotlib, you can customize your plots extensively:

    
     import matplotlib.pyplot as plt

     ax = df['Revenue'].plot(title='Revenue Over Time', figsize=(10, 5), color='green')
     ax.set_xlabel('Date')
     ax.set_ylabel('Revenue ($)')
     plt.grid(True)
     plt.show()

    
   

9.10 Multiple Series in One Plot

    
     df[['Revenue', 'Cost']].plot()

    
   

This produces a multi-line plot showing both series for easy comparison.

9.11 Saving Plots

You can save any Matplotlib figure using:

    
     plt.savefig('revenue_plot.png')

    
   

Python 1: Core Data Analysis

Learn the essentials of working with Pandas as part of our Python 1: Core Data Analysis Course: 

  • Introduction to Pandas
  • Introduction to NumPy
  • Installing and configuring Python
  • Python Data Types & Functions
  • Logic / Conditional Statements / Loops

Public Courses

Courses are taught in person in over 20 cities around the world, with virtual options available. Click below to browse all upcoming Public Courses.

Corporate Solutions

Training The Street creates customized training programs that deliver the right content at the right time to maximize talent development.