Python Packages 101 — Part 2

In our previous article, Python Packages 101 – Part 1, we introduced the first 10 Python packages of our top 25 list. Those packages focused on data manipulation, web scraping and visualization. In this article we will provide an overview of the remaining 15 packages in the following more advanced categories:

Dashboarding: Dash; Streamlit
File management: OS; Pathlib; Shutil; Pillow; Camelot; Tabula
Statistical analysis: Statsmodels; SciPy
Machine learning: Scikit-learn; NLTK; SpaCy; OpenCV; PyTesseract

Dashboarding Packages

In our last article we discussed several visualization packages, such as Plotly and Bokeh, that can create interactive charts with the “feel” of dashboarding software such as Tableau and Microsoft’s Power BI. Python also has several packages for building more complex dashboards, in which interactive dropdowns, radio buttons and sliders control multiple charts and outputs at the same time.

There are two popular, competing packages for creating dashboards in Python:

Dash, designed by the creators of Plotly: https://plotly.com/dash/
Streamlit, designed by engineers from Google and Twitter: https://www.streamlit.io/

Both packages are free and allow for the creation of fairly complex dashboards with very few lines of code. The dashboards open as stand-alone websites in your browser by running them locally on your computer or by hosting them in an online cloud service such as Microsoft’s Azure or Amazon’s AWS. The functionality and primary uses are similar for both packages and described in the summary table below. The two packages differ in that:

Dash allows for more customization and formatting; however, it has a steeper learning curve and requires some basic knowledge of web design code (HTML tags, and CSS for styling)
Streamlit does not allow for as much customization in formatting; however, it is more streamlined and easier to use for coders with no web design knowledge or experience


Figure 1: Streamlit Dashboard

Dash

Documentation: https://dash.plotly.com/
Gallery: https://dash-gallery.plotly.host/Portal/

Streamlit

Documentation: https://docs.streamlit.io/en/stable/
Gallery: https://streamlit.io/gallery

Functionality

The dashboard is launched as a “web app” in a new browser tab
The web app can be hosted locally on your computer or on a shared drive, or can be uploaded to an online server (e.g. Amazon AWS, Google Colaboratory, Microsoft Azure)
Both allow for “debugging” on the fly: changes made in the code show up in the dashboard without having to relaunch the web app
Integration with visualization packages such as Plotly and matplotlib

Primary Uses

They allow for rapid deployment of a dashboard with very minimal or no web design experience
They allow for creating interactive elements that will filter and update your charts and DataFrames on the fly, such as dropdowns, radio buttons, sliders, checkboxes, buttons, etc.

Use in Finance

Creating a portfolio dashboard to view profits and losses, IRRs and current valuation metrics of investments with capability of filtering by sector, accounts, time periods and currency
Creating a client invoices dashboard to view revenue generated by client, types of product, time periods, and geography

File Management

Python can also be used to automate tedious and repetitive tasks such as creating, opening, copying, renaming and deleting folders and files. Three main packages are used for folder and file management; all three are part of Python’s standard library, so no separate installation is needed:

OS
Pathlib
Shutil

The above packages have very similar uses, with slight differences in how they handle some of the functionality (e.g. the os module can only delete a single, empty folder, whereas shutil can delete a folder together with all its contents, including subfolders).
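
The difference in how the two modules delete folders can be seen in a short sketch. The folder and file names are placeholders, and everything happens inside a temporary directory so that no real files are touched.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so nothing real is touched.
base = tempfile.mkdtemp()

# os.rmdir only removes a single, empty folder.
empty_dir = os.path.join(base, "empty")
os.mkdir(empty_dir)
os.rmdir(empty_dir)

# Build a folder that contains a subfolder and a file.
full_dir = os.path.join(base, "deal_docs")
os.makedirs(os.path.join(full_dir, "legal"))
with open(os.path.join(full_dir, "legal", "nda.txt"), "w") as fh:
    fh.write("placeholder")

# os.rmdir refuses to delete a folder that still has contents ...
try:
    os.rmdir(full_dir)
    rmdir_failed = False
except OSError:
    rmdir_failed = True

# ... whereas shutil.rmtree deletes the folder and everything inside it.
shutil.rmtree(full_dir)

shutil.rmtree(base)  # clean up the temporary directory
```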

Other packages specialize in one specific type of file. For example, Pillow is the go-to package for handling images in Python, and Tabula and Camelot are two powerful packages for extracting tables from PDF files.

File and Directory Management Packages

OS: https://docs.python.org/3/library/os.html

Pathlib: https://pathlib.readthedocs.io/en/pep428/

Shutil: https://docs.python.org/3/library/shutil.html

Functionality

Access and control to operating system files and folders
Opening and closing files
Copying, renaming, moving, and deleting files and folders

Primary Uses

Organizing hundreds of files and folders in an automated way
Grabbing a list of all file names of one file type in a folder (e.g. all CSVs or PDFs)
Cleaning up the names of multiple files

Use in Finance

Creating a data room for a financial transaction (e.g. merger or acquisition) with custom named folders and files using a summary table from Excel
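
As a rough sketch of the data room idea, the snippet below builds a folder structure from a small summary table with pathlib and then lists every CSV file in it. The folder and file names are invented, and the table is an inline CSV string standing in for a sheet exported from Excel.

```python
import csv
import io
import tempfile
from pathlib import Path

# Stand-in for a summary table exported from Excel as CSV (hypothetical names).
SUMMARY_CSV = """folder,file
01 Financials,audited_accounts_2023.pdf
01 Financials,management_accounts.csv
02 Legal,share_purchase_agreement.pdf
03 Commercial,customer_list.csv
"""


def build_data_room(root: Path, summary_csv: str) -> list[Path]:
    """Create one folder per row and an empty placeholder file inside it."""
    created = []
    for row in csv.DictReader(io.StringIO(summary_csv)):
        folder = root / row["folder"]
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / row["file"]
        target.touch()
        created.append(target)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data_room"
    files = build_data_room(root, SUMMARY_CSV)

    # List the folders that were created.
    folder_names = sorted(p.name for p in root.iterdir() if p.is_dir())

    # Grab the names of all CSV files anywhere under the data room.
    csv_names = sorted(p.name for p in root.rglob("*.csv"))

print(folder_names)
print(csv_names)
```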

Pillow

Website: https://python-pillow.org/
Documentation: https://pillow.readthedocs.io/en/stable/handbook/index.html

Functionality

Pillow is the friendlier, easier-to-use fork of the PIL (Python Imaging Library) package
It provides image processing capabilities within Python and can handle an extensive list of image file formats such as PNG, JPEG, GIF, BMP, EPS and others
Cropping, resizing and editing contents of images

Primary Uses

Opening and closing images in Python
Creating thumbnails of multiple images
Applying image filters such as smooth, blur, sharpen, contour and others
Converting multiple images from one file format (e.g. JPG) to another (e.g. PNG)

Use in Finance

Open pictures of deal “tombstones” so that their text can later be extracted and analyzed with OCR packages
Import multiple logos of potential investment companies to be resized and cleaned up for a pitch presentation
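
The logo clean-up described above might look roughly like the sketch below. To keep it self-contained, a plain colored rectangle stands in for a logo and the converted image is written to memory instead of disk; with real files, `Image.open` and `save` would take file paths.

```python
from io import BytesIO

from PIL import Image, ImageFilter

# Stand-in for a company logo; in practice Image.open would load a file.
logo = Image.new("RGB", (800, 400), color=(0, 70, 140))

# thumbnail() resizes in place and preserves the aspect ratio.
thumb = logo.copy()
thumb.thumbnail((200, 200))

# Apply one of the built-in filters.
blurred = thumb.filter(ImageFilter.BLUR)

# Convert to another format by saving under a different format name.
buffer = BytesIO()
blurred.save(buffer, format="PNG")
buffer.seek(0)
converted = Image.open(buffer)

print(thumb.size, converted.format)
```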

PDF Packages

Tabula

Documentation: https://tabula-py.readthedocs.io/en/latest/

Camelot

Documentation: https://camelot-py.readthedocs.io/en/master/
Comparison to Tabula: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools

Functionality

Extract tables from PDFs
Convert PDF tables into Pandas DataFrames
Convert PDFs into CSV, JSON, HTML, Excel and other formats
Dynamically find data based on keywords inside the tables
Visualize the tables being extracted
Fine-tune settings to account for tables with no borders and for whitespace between table cells

Primary Uses

Create data sets from multiple PDF files
Extract, clean and convert tables from PDFs into Excel format in a more efficient and automated manner

Use in Finance

Extract key data from tables in SEC filings and financial reports
Consolidate portfolio transactions from multiple PDF files from brokerage accounts
Capture industry data from tables stored in PDF files
Convert tables in research reports into Excel files

Statistical Analysis

There are two core packages used by data scientists in Python to perform typical statistical analysis: statsmodels and SciPy. A third package, Scikit-learn, is used for more advanced machine learning algorithms and is described in the following section.

Statsmodels

Website: https://www.statsmodels.org/stable/index.html
Documentation: https://www.statsmodels.org/stable/gettingstarted.html

Functionality

Conducting statistical tests
Classes and functions for estimation of many different statistical models
Statistical data exploration
Integration with Pandas DataFrames

Primary Uses

Linear regression and time series analysis
OLS regression results, including R-squared, F-statistics and confidence intervals

Use in Finance

Linear regression to calculate beta for CAPM (Capital Asset Pricing Model)
Time series analysis using ARIMA model
Calculating betas of multiple factors in a portfolio using a multivariate linear regression model

SciPy

Website: https://www.scipy.org/
Documentation: https://docs.scipy.org/doc/scipy/reference/

Functionality

Provides functions in mathematics, science and engineering
Similar functions to numerical computing programs such as MATLAB and Octave
Functions built on top of the NumPy package

Primary Uses

Includes common numerical functions, including integration, optimization, interpolation, Fourier transforms and many others
Linear algebra applications

Use in Finance

Portfolio optimization using a minimization function to find the investment weights that minimize volatility for a given required portfolio return
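
A minimal sketch of this optimization with `scipy.optimize.minimize` follows. The expected returns, the covariance matrix and the 9% target are invented numbers, and the long-only bounds are our own assumption. The code minimizes variance, which yields the same weights as minimizing volatility.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical annual expected returns and covariance matrix for three assets.
expected_returns = np.array([0.06, 0.09, 0.12])
cov = np.array(
    [
        [0.0100, 0.0018, 0.0011],
        [0.0018, 0.0400, 0.0060],
        [0.0011, 0.0060, 0.0900],
    ]
)
target_return = 0.09


def portfolio_variance(weights: np.ndarray) -> float:
    """Variance of the portfolio for a given vector of weights."""
    return float(weights @ cov @ weights)


constraints = [
    # Weights must add up to 100%.
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},
    # The portfolio must earn the required return.
    {"type": "eq", "fun": lambda w: w @ expected_returns - target_return},
]
bounds = [(0.0, 1.0)] * 3  # long-only: no short positions (an assumption)
start = np.full(3, 1 / 3)  # start from equal weights

result = minimize(
    portfolio_variance, start, method="SLSQP", bounds=bounds, constraints=constraints
)
weights = result.x
volatility = np.sqrt(portfolio_variance(weights))
print(weights.round(3), round(float(volatility), 4))
```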

Figure 3: OLS Results from statsmodels

Machine Learning Algorithms

Python has become very popular in the data science community due to the large number of machine learning and AI algorithms available through third-party packages.

Scikit-learn is the most used package for Machine Learning and has algorithms for the following applications:

Classification: identifying which category an object belongs to; e.g. after a model has been trained on examples of what is spam and what is not, the classifier will “classify” new emails
Regression: predicting continuous valued attributes associated with independent variables; e.g. predicting returns of portfolio based on certain factors (market risk premium, size premium, etc.)
Clustering: automatic grouping of similar objects into sets; e.g. allocating customers into different categories based on spending habits and other characteristics
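
To make the clustering case concrete, here is a small sketch that groups simulated customers with Scikit-learn’s KMeans. The two spending profiles and all of the figures are assumptions made for the example; the features are standardized first so that the larger spend numbers do not dominate the distance calculation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Simulated customers: [annual spend, orders per year]; two profiles are assumed.
low_spenders = rng.normal(loc=[500, 5], scale=[50, 1], size=(50, 2))
high_spenders = rng.normal(loc=[5000, 40], scale=[300, 4], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# Put both features on the same scale before measuring distances.
scaled = StandardScaler().fit_transform(customers)

# Ask for two groups; the algorithm is never told which customer is which.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)

print(labels)
```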

Scikit-learn

Website: https://scikit-learn.org/stable/
Documentation: https://scikit-learn.org/stable/getting_started.html

Functionality

One of the core machine learning packages in the Python community
Provides machine learning algorithms and tools for classification, regression, cluster detection, dimensionality reduction, data preprocessing and model selection
Cleaning and preparing datasets for forecasting models: splitting data sets into testing vs training data, creating dummy variables for categorical fields, eliminating outliers
Model evaluation: fine-tuning model parameters and analyzing overfitting, comparing R-squared metrics and other model scores
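
The preparation and evaluation steps listed above can be sketched as follows. The data set is simulated and the column names (`leverage`, `sector`, `spread`) are invented for the example. Comparing the R-squared on the training data with the R-squared on the held-back test data is one simple check for overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Simulated data: one numerical and one categorical driver (all values made up).
leverage = rng.uniform(0.0, 6.0, size=n)
sector = rng.choice(["Tech", "Energy", "Retail"], size=n)
effect = {"Tech": 0.0, "Energy": 1.0, "Retail": 2.0}
sector_effect = np.array([effect[s] for s in sector])
spread = 1.0 + 0.5 * leverage + sector_effect + rng.normal(0.0, 0.3, size=n)

data = pd.DataFrame({"leverage": leverage, "sector": sector, "spread": spread})

# Dummy variables for the categorical field; drop_first avoids a redundant column.
X = pd.get_dummies(
    data[["leverage", "sector"]], columns=["sector"], drop_first=True, dtype=float
)
y = data["spread"]

# Hold back 25% of the rows to test the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R-squared: train = {train_r2:.3f}, test = {test_r2:.3f}")
```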

Primary Uses

Forecasting more complex data that can’t be easily modeled using a linear regression model
Categorizing data in an automatic fashion

Use in Finance

Determining credit rating of a company based on multiple independent variables, both numerical and categorical
Finding the optimal capital structure and debt capacity of a company
Determining the target price of a company using multiple key financial ratios and historical financials of a company
Classifying customers of a company by spending habits to refine revenue buildup assumptions in an operating model

There are also higher level artificial intelligence packages that have been “trained” and perfected over the years with machine learning algorithms that can be used right away in practical applications:

OCR — Optical Character Recognition
NLP — Natural Language Processing

Optical Character Recognition (OCR) is a branch of AI that allows computers to recognize text in images or scanned documents. The steps for using OCR in Python are:

Load an image into Python using an imaging package that processes the picture
Use an OCR package to analyze the image and extract any text

Image processing is usually handled by a package such as OpenCV, and Google’s Tesseract is used for the text recognition.

In addition, Natural Language Processing (NLP) is a branch of machine learning and AI that allows computers to understand human language by classifying and grouping together parts of text to extract key information. NLP is used on a daily basis in interactions with Google Home, Siri, Alexa and chatbots; in the finance and business community it is primarily used to extract key data from press releases and articles. It is also used, to an extent, to determine the “sentiment” of an article, tweet, filing, etc. Two popular Python packages used for NLP are NLTK and SpaCy.

OpenCV

Website: https://opencv.org/
Documentation: https://docs.opencv.org/master/

Functionality

Open source computer vision and machine learning software library
Used to open, process and transform images
Used to identify special objects in pictures (e.g. eyes, faces, trees, etc.)

Primary Uses

Used to open and process images before text is extracted with more advanced packages such as Tesseract

Use in Finance

Open multiple scanned images of legal documents
Open and process logos of companies

PyTesseract

Website: https://opensource.google/projects/tesseract
Documentation: https://github.com/madmaze/pytesseract

Functionality

PyTesseract is a Python wrapper for Google’s Tesseract OCR engine
Supports multiple image formats, including images processed from OpenCV or Pillow packages
Supports multiple languages

Primary Uses

Extracts and converts text from images into Python strings

Use in Finance

Extract all text from scanned purchase agreements
Extract company names and financial figures from hundreds of deal “tombstones”

NLP Packages

NLTK

Website: http://www.nltk.org/
Documentation: https://github.com/nltk/nltk/wiki

SpaCy

Website: https://spacy.io/
Documentation: https://spacy.io/usage

Functionality

Tokenization: Segmenting text into words, punctuation marks, etc.
Part of Speech (POS) Tagging: Assigning word types to tokens, e.g. verb or noun
Named Entity Recognition (NER): Labelling named “real world” people, companies and locations
Text Classification: Assigning categories or labels to a whole document, or parts of a document
Both packages have models of “taught” words that act as starting dictionaries

Primary Uses

Extracting key words from press releases, essays, and text documents
Translating words from one language to another
Analyzing sentiment of articles

Use in Finance

Extracting key information from SEC filings
Summarizing a company’s quarterly earnings press release seconds after it is filed
Extracting all companies mentioned, dates and financial figures from hundreds of articles on an industry website
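
As a simple illustration of tokenization and key word extraction, the sketch below uses NLTK to split a made-up press release sentence into tokens and count the most frequent words. The stop word list is a tiny hand-written one; NLTK’s full stop word lists, and the trained models needed for POS tagging and named entity recognition, require a separate download.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Made-up press release text, used only for illustration.
text = (
    "Acme reported record revenue. Revenue growth beat guidance, "
    "and Acme raised guidance for the year."
)

# A tiny hand-written stop word list; NLTK's full list needs a separate data download.
STOP_WORDS = {"and", "for", "the"}

# Tokenization: split the lower-cased text into words and punctuation marks.
tokens = wordpunct_tokenize(text.lower())

# Keep alphabetic tokens that are not stop words, then count them.
words = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
frequencies = FreqDist(words)
top_words = [word for word, _ in frequencies.most_common(3)]

print(tokens)
print(top_words)
```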

Figure 4: Extracted key words using SpaCy

Cheat Sheet and Next Article

Below is a link to a cheat sheet summarizing all packages discussed in the Part 1 and Part 2 articles. The cheat sheet lists the packages, their categories and the conda commands to install them with Anaconda, as well as links to the documentation and the Anaconda repo.

Python Packages — Cheat Sheet.pdf

We hope you enjoyed this series of articles and if you have any further questions, please reach out to us.

