AdventureWorks Data Warehouse Series#2 : How to Create Bar Chart Visualization Using Python
Prerequisites To Follow this Exercise :
- Microsoft SQL Server Database Express Edition & Adventure Works DataWarehouse – If you Don’t have a Microsoft SQL Server express Database and want to install it in your system and also install AdventureWorks DW , Follow –https://instrovate.com/2019/05/22/download-install-free-microsoft-sql-server-install-adventureworks-database-data-warehouse/
- Python Installed in your System : If you are a new user to Python and want to know how to install Python via the Anaconda Distribution , You can go through the step by step Blog i have written to install Python via Anaconda Distribution & start using Jupyter Notebook : https://instrovate.com/2019/06/09/python-anaconda-distribution-how-to-download-and-install-it-and-run-the-first-python-program/
Once you have the Microsoft SQL Server Express Edition and Python Installed in your system you are Good to Go ahead and follow the below Use Case and Example.
In this blog we create a visualization in Python over AdventureWorksDW to showcase the Sales of different Products according to their category and sub category. The visualization will contain the bar chart to showcase the sales for different categories and subcategories of the products.
In the data model of AdventureWorksDW the fact table FactInternetSales has the transactions where in we find the sales amount incurred in each order transaction.
The Data Model to fetch the required data is as follows:
The join between FactInternetSales and DimProduct can be made using field ProductKey.
The join between DimProduct and DimProductSubCategory can be made using field ProductSubcategoryKey.
The join between DimProductSubCategory and DimProductCategory can be made using field ProductCategoryKey.
So, the first step is to write a sql query that can fetch the sum of SalesAmount based on ProductSubCategory and ProductCategory. Since ProductCategory is the parent for ProductSubCategory so we will take the sum of SalesAmount based on ProductSubCategory with a column giving the value of parent field ProductCategory. The sql query for the same is as follows:
Now we will write a python code to connect to AdwentureWorksDW database stored in Microsoft SQL server. To learn how to connect python to Microsoft SQL Server please refer to below blog:
https://instrovate.com/2019/04/12/how-to-connect-python-with-sql-server/
So, the python solution for the above problem would begin with making ODBC connection from python to Microsoft SQL server by using the library pyodbc. After connection is established the python code would execute the above query and fetch the results in a python data structure. The code piece for the above solution is as follows:
So, once we have query executed, now the next is to fetch the data. If we are running the same query from Microsoft SQL server the output will looks like as follows:
This is relevant when we process data in python as sequence of fields or columns in each row is important.
To make the visualization the programmer must be familiar with the basic functionality of matplotlib library in python for which the below forum can be referred:
https://instrovate.com/forums/topic/plotting-actual-vs-predicted-sales/
Now to plot the data we would need two lists for each plot i.e. one list having data for the X – Axis and one list having data for the Y – Axis. Also to compute the total sales per Category Wise (Parent Filed) we will initialize one dictionary by the name sales. Below is the code;
In the list sales_cat the sale for each sub-category will be stored.
In product_category the distinct value of each category will be stored.
In product_sub_category the distinct values of each sub category will be stored.
Now we will populate the above data structure with the relevant data that can be used to draw visualization.
In dictionary sales the key used is product_category and the value is the lit containing the SalesAmount for each subcategory corresponding to that parent category that is there in key part.
So, now we have sales for each sub_category in sales_cat list.
Now the next step is to compute the summation of sales across each parent category. The code to compute total sales for each product category is as follows:
So, now we have two types of information:
- Product category with total sales in each product category
- Product SubCategory and total sales in each product SubCategory
We will plot these two informations in two different bar charts.
The barh function of the module plt of matplotlib is used to plot bar chart in python. But before doing that we need to perform some kind of pre-processing to get the visualization as we require. The code for pre processing and creating visualization is as below:
The output visualization is as follows: