Python Packages 101 — Part 2

In our previous article, Python Packages 101 – Part 1, we introduced the first 10 Python packages of our top 25 list. Those packages focused on data manipulation, web scraping and visualization. In this article we provide an overview of the remaining 15 packages in the following more advanced categories: dashboarding, file management, statistical analysis and machine learning.

Dashboarding Packages

In our last article we discussed several visualization packages, such as Plotly and Bokeh, that provide the capability of creating interactive charts and the “feel” of dashboarding software programs such as Tableau and Microsoft’s Power BI. Python also has several packages that can allow for the creation of more complex dashboards with interactive dropdowns, radio buttons and sliders that can control multiple charts and outputs at the same time.

Two popular, competing packages for creating dashboards in Python are Dash and Streamlit.

Both packages are free and allow for the creation of fairly complex dashboards with very few lines of code. The dashboards open as stand-alone websites in your browser by running them locally on your computer or by hosting them in an online cloud service such as Microsoft’s Azure or Amazon’s AWS. The functionality and primary uses are similar for both packages and described in the summary table below. The two packages differ in that:

  • Dash allows for more customization and formatting; however, it has a steeper learning curve and requires some basic knowledge of web design coding (HTML tags and CSS for styling)
  • Streamlit does not allow for as much customization in formatting; however, it is more streamlined and easier to use for coders with no web design knowledge or experience

Figure 1: Streamlit Dashboard

 

Dash

Documentation: https://dash.plotly.com/
Gallery: https://dash-gallery.plotly.host/Portal/

Streamlit

Documentation: https://docs.streamlit.io/en/stable/
Gallery: https://streamlit.io/gallery

Functionality

  • Dashboard is launched as a new tab in your browser as a “web app”
  • The web app can be hosted locally on your computer or shared drive, or it can be uploaded to an online server (e.g. Amazon AWS, Google Cloud, Microsoft Azure, etc.)
  • Both allow for “debugging” on the fly: changes to the code show up in the dashboard without having to relaunch the web app
  • Integration with visualization packages such as Plotly and matplotlib

Primary Uses

  • They allow for rapid deployment of a dashboard with very minimal or no web design experience
  • They allow for creating interactive elements that will filter and update your charts and DataFrames on the fly, such as dropdowns, radio buttons, sliders, checkboxes, buttons, etc.

Use in Finance

  • Creating a portfolio dashboard to view profits and losses, IRRs and current valuation metrics of investments with capability of filtering by sector, accounts, time periods and currency
  • Creating a client invoices dashboard to view revenue generated by client, types of product, time periods, and geography
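
The interactive filtering described above can be sketched with Streamlit. This is a minimal illustration under stated assumptions rather than a finished dashboard: the sample portfolio data, the column names and the function names `filter_positions` and `render_dashboard` are all made up for this example.

```python
import pandas as pd

# Hypothetical sample data; a real dashboard would load this from a file or database.
POSITIONS = pd.DataFrame({
    "investment": ["Alpha", "Bravo", "Charlie", "Delta"],
    "sector": ["Tech", "Energy", "Tech", "Health"],
    "pnl": [120.0, -35.0, 60.0, 15.0],
})


def filter_positions(df, sector, min_pnl):
    """Return the rows matching the selected sector and minimum P&L."""
    if sector != "All":
        df = df[df["sector"] == sector]
    return df[df["pnl"] >= min_pnl].reset_index(drop=True)


def render_dashboard():
    """Draw the widgets and the outputs they control."""
    import streamlit as st  # imported here so the filter logic works without Streamlit

    st.title("Portfolio dashboard")
    sectors = ["All"] + sorted(POSITIONS["sector"].unique())
    sector = st.selectbox("Sector", sectors)                    # dropdown
    min_pnl = st.slider("Minimum P&L", -100.0, 200.0, -100.0)   # slider

    view = filter_positions(POSITIONS, sector, min_pnl)
    st.dataframe(view)                                          # filtered table
    st.bar_chart(view.set_index("investment")[["pnl"]])         # chart follows the widgets
```

If the script is saved under an assumed name such as `app.py` with a final `render_dashboard()` call added, running `streamlit run app.py` should open it in a browser tab. Keeping the filtering in a plain function makes it easy to check the numbers separately from the layout.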

File Management

Python can also be used to automate tedious and repetitive tasks such as creating, opening, copying, renaming and deleting folders and files. Three main packages are used for folder and file management, and all three ship with Python as part of its standard library:

  • OS
  • Pathlib
  • Shutil

The above packages have very similar uses, with slight differences in how they handle some of the functionality (e.g. the OS module can delete only a single, empty folder, whereas Shutil can delete a folder and all of its contents, including subfolders).
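
As a rough illustration of the folder-deletion difference mentioned above, the sketch below builds a throwaway folder in a temporary location (the folder and file names are invented for the example) and tries to delete it both ways.

```python
import os
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())            # scratch area so nothing real is touched
folder = root / "deal_docs"
(folder / "drafts").mkdir(parents=True)    # a folder containing a subfolder
(folder / "drafts" / "nda.txt").write_text("placeholder")

try:
    os.rmdir(folder)                       # os.rmdir only removes an empty folder
    removed_by_os = True
except OSError:
    removed_by_os = False                  # fails because the folder still has contents

shutil.rmtree(folder)                      # removes the folder and everything inside it
removed_by_shutil = not folder.exists()

shutil.rmtree(root)                        # clean up the scratch area
```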

Other packages specialize in one specific type of file. For example, Pillow is the go-to package for handling images in Python, and Tabula and Camelot are two powerful packages for extracting tables out of PDF files.

File and Directory Management Packages

OS: https://docs.python.org/3/library/os.html

Pathlib: https://pathlib.readthedocs.io/en/pep428/

Shutil: https://docs.python.org/3/library/shutil.html

Functionality

  • Access and control to operating system files and folders
  • Opening and closing files
  • Copying, renaming, moving, and deleting files and folders

Primary Uses

  • Organizing hundreds of files and folders in an automated way
  • Grabbing a list of all file names of one file type in a folder (e.g. all CSVs or PDFs)
  • Cleaning up the names of multiple files

Use in Finance

  • Creating a data room for a financial transaction (e.g. merger or acquisition) with custom named folders and files using a summary table from Excel
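
The data room use case can be sketched with Pathlib alone. In this example the list `folder_plan` stands in for rows read from an Excel summary table, and all folder and file names are hypothetical; everything is created in a temporary location.

```python
import tempfile
from pathlib import Path

# Stand-in for rows read from an Excel summary table (hypothetical names).
folder_plan = [
    ("01 Financials", "Audited Statements"),
    ("01 Financials", "Management Accounts"),
    ("02 Legal", "Contracts"),
]

data_room = Path(tempfile.mkdtemp()) / "Project Data Room"
for section, subfolder in folder_plan:
    # parents=True also creates the section folder; exist_ok=True makes reruns safe
    (data_room / section / subfolder).mkdir(parents=True, exist_ok=True)

# Drop in two sample files with messy names, then tidy them.
contracts = data_room / "02 Legal" / "Contracts"
for name in ["Supply  Agreement FINAL.pdf", "lease  agreement v2.PDF"]:
    (contracts / name).write_text("placeholder")

for path in list(contracts.iterdir()):
    clean_stem = "_".join(path.stem.split()).lower()   # collapse spaces, lower-case
    path.rename(path.with_name(clean_stem + path.suffix.lower()))

# Grab the names of all files of one type in the folder.
pdf_names = sorted(p.name for p in contracts.glob("*.pdf"))
```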

Pillow

Website: https://python-pillow.org/
Documentation: https://pillow.readthedocs.io/en/stable/handbook/index.html

Functionality

  • Pillow is the friendlier, actively maintained fork of the PIL (Python Imaging Library) package in Python
  • It provides image processing capabilities within Python and can handle an extensive list of image file formats such as PNG, JPEG, GIF, BMP, EPS and others
  • Cropping, resizing and editing contents of images

Primary Uses

  • Opening and closing images in Python
  • Creating thumbnails of multiple images
  • Applying image filters such as smooth, blur, sharpen, contour and others
  • Converting multiple images from one file format (e.g. JPG) to another (e.g. PNG)

Use in Finance

  • Open pictures of deal “tombstones” so that their text can later be extracted with OCR packages
  • Import multiple logos of potential investment companies to be resized and cleaned up for a pitch presentation
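
A minimal Pillow sketch of the thumbnail and format-conversion uses listed above. The image is generated in memory as a stand-in for a real logo file, so the sizes and colour are assumptions chosen only for illustration.

```python
from io import BytesIO

from PIL import Image

# Stand-in for a logo loaded with Image.open(...); generated so the example is self-contained.
logo = Image.new("RGB", (800, 400), color=(20, 60, 120))

thumb = logo.copy()
thumb.thumbnail((200, 200))         # resizes in place and preserves the aspect ratio

buffer = BytesIO()
thumb.save(buffer, format="PNG")    # convert to PNG; a file path works the same way
buffer.seek(0)
converted = Image.open(buffer)      # reopen to confirm the new format
```

Working on a copy keeps the original image untouched, which matters because `thumbnail` modifies the image it is called on.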

PDF Packages

Tabula

Documentation: https://tabula-py.readthedocs.io/en/latest/

Camelot

Documentation: https://camelot-py.readthedocs.io/en/master/
Comparison to Tabula: https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools

Functionality

  • Extract tables from PDFs
  • Convert PDF tables into Pandas DataFrames
  • Convert PDFs into CSV, JSON, HTML, Excel and other formats
  • Dynamically find data based on keywords inside the tables
  • Visualize the tables being extracted
  • Fine-tune settings to account for tables with no borders and whitespace between table cells

Primary Uses

  • Create data sets from multiple PDF files
  • Extract, clean and convert tables from PDFs into Excel format in a more efficient and automated manner

Use in Finance

  • Extract key data from tables in SEC filings and financial reports
  • Consolidate portfolio transactions from multiple PDF files from brokerage accounts
  • Capture industry data from tables stored in PDF files
  • Convert tables in research reports into Excel files
Figure 2: PDF Table Visualization in Camelot
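
The extraction workflow described in this section might look roughly like the sketch below, using Camelot. The function names `extract_pdf_tables` and `save_tables_as_csv` and the output file naming are assumptions made for this example, and the right settings depend heavily on the PDF in question.

```python
from pathlib import Path


def extract_pdf_tables(pdf_path, pages="1", bordered=True):
    """Return the tables found in a PDF as a list of pandas DataFrames."""
    import camelot  # third-party package (camelot-py), imported only when needed

    # "lattice" relies on ruling lines between cells;
    # "stream" infers columns from the whitespace between them.
    flavor = "lattice" if bordered else "stream"
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]


def save_tables_as_csv(frames, out_dir):
    """Write each DataFrame to its own CSV file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, frame in enumerate(frames, start=1):
        path = out_dir / f"table_{i}.csv"
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths
```

A call such as `save_tables_as_csv(extract_pdf_tables("report.pdf", pages="1-3"), "output")` would then process one file, where `report.pdf` is a placeholder name.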

Statistical Analysis

There are two core packages used by data scientists in Python to perform typical statistical analysis: statsmodels and SciPy. A third package, Scikit-learn, is used for more advanced machine learning algorithms and is described in the following section.

Statsmodels

Website: https://www.statsmodels.org/stable/index.html
Documentation: https://www.statsmodels.org/stable/gettingstarted.html

Functionality

  • Conducting statistical tests
  • Classes and functions for estimation of many different statistical models
  • Statistical data exploration
  • Integration with Pandas DataFrames

Primary Uses

  • Linear regression and time series analysis
  • OLS regression results, including R-squared, F-stats, and confidence intervals

Use in Finance

  • Linear regression to calculate beta for CAPM (Capital Asset Pricing Model)
  • Time series analysis using ARIMA model
  • Calculating betas of multiple factors in a portfolio using a multivariate linear regression model

SciPy

Website: https://www.scipy.org/
Documentation: https://docs.scipy.org/doc/scipy/reference/

Functionality

  • Provides functions in mathematics, science and engineering
  • Similar functions to other numerical computing programs such as MATLAB and Octave
  • Functions built on top of the NumPy package

Primary Uses

  • Includes common numerical functions, including integration, optimization, interpolation, Fourier transforms and many others
  • Linear algebra applications

Use in Finance

  • Portfolio optimization: using a minimization function to find the investment weights that minimize volatility for a given required portfolio return
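
A minimal sketch of that optimization with `scipy.optimize.minimize`. The expected returns, covariance matrix and required return are hypothetical numbers, and the long-only bounds are an added assumption.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical annual expected returns and covariance matrix for three assets.
expected_returns = np.array([0.06, 0.09, 0.12])
covariance = np.array([
    [0.010, 0.002, 0.001],
    [0.002, 0.030, 0.004],
    [0.001, 0.004, 0.060],
])
required_return = 0.08


def portfolio_volatility(weights):
    """Standard deviation of portfolio returns for the given weights."""
    return np.sqrt(weights @ covariance @ weights)


constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},                           # fully invested
    {"type": "eq", "fun": lambda w: w @ expected_returns - required_return},  # hit the required return
]
bounds = [(0.0, 1.0)] * 3        # long-only weights (an assumption)
start = np.full(3, 1 / 3)        # start from equal weights

result = minimize(portfolio_volatility, start, method="SLSQP",
                  bounds=bounds, constraints=constraints)
optimal_weights = result.x
```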

Figure 3: OLS Results from statsmodels

Machine Learning Algorithms

Python has become very popular in the data science community due to the large number of Machine Learning and AI algorithms available through third-party packages.

Scikit-learn is the most used package for Machine Learning and has algorithms for the following applications:

  • Classification: identifying which category an object belongs to; e.g. after a model has been trained on examples of spam and non-spam emails, the classifier will “classify” new emails
  • Regression: predicting continuous valued attributes associated with independent variables; e.g. predicting returns of portfolio based on certain factors (market risk premium, size premium, etc.)
  • Clustering: automatic grouping of similar objects into sets; e.g. allocating customers into different categories based on spending habits and other characteristics

Scikit-learn

Website: https://scikit-learn.org/stable/
Documentation: https://scikit-learn.org/stable/getting_started.html

Functionality

  • One of the core machine learning packages in the Python community
  • Provides machine learning algorithms for classification, regression, clustering and dimensionality reduction, along with tools for data preprocessing and model selection
  • Cleaning and preparing datasets for forecasting models: splitting data sets into testing vs training data, creating dummy variables for categorical fields, eliminating outliers
  • Model evaluation: fine-tuning model parameters and analysing overfitting, comparing R-squared metrics and other model scores

Primary Uses

  • Forecasting more complex data that can’t be easily modeled using a linear regression model
  • Categorizing data in an automatic fashion

Use in Finance

  • Determining credit rating of a company based on multiple independent variables, both numerical and categorical
  • Finding the optimal capital structure and debt capacity of a company
  • Determining the target price of a company using multiple key financial ratios and historical financials of a company
  • Classifying customers of a company by spending habits to refine revenue buildup assumptions in an operating model
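
The typical Scikit-learn workflow (split the data, train a model, score it on unseen data) can be sketched as below. The data set is synthetic and merely stands in for company features and a binary rating label, and the choice of a random forest classifier is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for company features (e.g. leverage, coverage, margins)
# and a binary label (e.g. investment grade or not).
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=0)

# Hold back 25% of the rows to test the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
```

A large gap between the training score and the test score is the usual sign of overfitting.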

There are also higher level artificial intelligence packages that have been “trained” and perfected over the years with machine learning algorithms that can be used right away in practical applications:

  • OCR — Optical Character Recognition
  • NLP — Natural Language Processing

Optical Character Recognition (OCR) is a branch of AI that allows computers to recognize text in images or scanned documents. The steps for using OCR in Python are:

  • Load an image into Python using an imaging package that processes the picture
  • Use an OCR package to analyze the image and extract any text

The image processing is usually done with a package such as OpenCV, while Google’s Tesseract is used for the text recognition.
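
The two steps above can be sketched as follows, assuming the `opencv-python` and `pytesseract` packages are installed along with the Tesseract engine itself. The function names and the choice of Otsu thresholding as the clean-up step are assumptions for this example; the best preprocessing depends on the images.

```python
def extract_text(image_path, language="eng"):
    """Load an image, clean it up with OpenCV and return the text Tesseract finds."""
    import cv2           # installed as opencv-python
    import pytesseract   # also needs the Tesseract engine installed separately

    image = cv2.imread(image_path)
    if image is None:    # cv2.imread returns None instead of raising an error
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding converts the picture to pure black and white
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(binary, lang=language)


def normalize_whitespace(text):
    """Collapse the line breaks and repeated spaces that OCR output often contains."""
    return " ".join(text.split())
```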

In addition, Natural Language Processing (NLP) is a branch of machine learning and AI that allows computers to understand human language by classifying and grouping together parts of text to extract key information. NLP is used on a daily basis in interactions with Google Home, Siri, Alexa and chatbots. In the finance and business community it is primarily used to extract key data from press releases and articles, and to some extent to determine the “sentiment” of an article, tweet, filing, etc. Two popular Python packages used for NLP are NLTK and SpaCy.

OpenCV

Website: https://opencv.org/
Documentation: https://docs.opencv.org/master/

Functionality

  • Open source computer vision and machine learning software library
  • Used to open, process and transform images
  • Used to identify special objects in pictures (e.g. eyes, faces, trees, etc.)

Primary Uses

  • Used to open and process images before text is extracted with more advanced packages such as Tesseract

Use in Finance

  • Open multiple scanned images of legal documents
  • Open and process logos of companies

 

PyTesseract

Website: https://opensource.google/projects/tesseract
Documentation: https://github.com/madmaze/pytesseract

Functionality

  • PyTesseract is a Python wrapper for Google’s Tesseract OCR engine
  • Supports multiple image formats, including images processed from OpenCV or Pillow packages
  • Supports multiple languages

Primary Uses

  • Extracts and converts text from images into Python strings

Use in Finance

  • Extract all text from scanned purchase agreements
  • Extract company names and financial figures from hundreds of deal “tombstones”

 

NLP Packages

NLTK

Website: http://www.nltk.org/
Documentation: https://github.com/nltk/nltk/wiki

SpaCy

Website: https://spacy.io/
Documentation: https://spacy.io/usage

Functionality

  • Tokenization: Segmenting text into words, punctuation marks, etc.
  • Part of Speech (POS) Tagging: Assigning word types to tokens, e.g. verb or noun
  • Named Entity Recognition (NER): Labelling named “real world” people, companies and locations
  • Text Classification: Assigning categories or labels to a whole document, or parts of a document
  • Both packages have models of “taught” words that act as starting dictionaries
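
Tokenization and Named Entity Recognition can be sketched with SpaCy as shown below. The sample sentence and the function names are invented for illustration. Tokenization works with a blank English pipeline, while entity recognition assumes a trained model such as `en_core_web_sm` has been downloaded separately.

```python
import spacy


def tokenize(text):
    """Split text into tokens with SpaCy's rule-based English tokenizer."""
    nlp = spacy.blank("en")          # no trained model required for tokenization
    return [token.text for token in nlp(text)]


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (entity text, label) pairs; requires the trained model to be installed."""
    nlp = spacy.load(model_name)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


# Hypothetical press-release sentence.
sentence = "Acme Corp. acquired Beta Ltd. for $2 billion in March 2021."
tokens = tokenize(sentence)
```

With the model installed, `extract_entities(sentence)` would return labelled spans such as organizations, monetary amounts and dates; the exact labels depend on the model version.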

Primary Uses

  • Extracting key words from press releases, essays, and text documents
  • Translating words from one language to another
  • Analyzing sentiment of articles

Use in Finance

  • Extracting key information from SEC filings
  • Summarizing a company’s quarterly earnings press release seconds after it is filed
  • Extracting all companies mentioned, dates and financial figures from hundreds of articles on an industry website

Figure 4: Extracted key words using SpaCy

 

Cheat Sheet and Next Article

Below is a link to a cheat sheet summarizing all packages discussed in the Part 1 and Part 2 articles. The cheat sheet lists all the packages, their categories and the conda commands for installing them with Anaconda, as well as links to the documentation and the Anaconda repository.

Python Packages — Cheat Sheet.pdf

We hope you enjoyed this series of articles and if you have any further questions, please reach out to us.
