A project focused on analyzing unstructured sales data, identifying issues, and deriving Insights. This repository contains SQL and Python code for exploring unstructured data and tableau for visualization
Closed-ended_Question1.sql: SQL query answering the question, What are the top 5 brands by receipts scanned among users 21 and over?
Closed-ended_Question2.sql: SQL query addressing the question, What are the top 5 brands by sales among users that have had their account for at least six months?
Closed-ended_Question3.sql: SQL query providing insights into the percentage of sales in the Health & Wellness category by generation.
Open-ended_Question1.sql: SQL query identifying Fetch’s power users, based on assumptions about user activity.
Open-ended_Question2.sql: SQL query identifying the leading brand in the Dips & Salsa category.
Open-ended_Question3.sql: SQL query calculating Fetch’s growth percentage year over year, based on assumptions about user activity and sales data.
FR_DA.twbx: Tableau workbook visualizing key insights from the datasets. It is a raw TBWX file that can be downloaded for local use, or you can view it publicly here
FR_DataAnalysis.ipynb: Python notebook used for data analysis to identify trends and quality issues.
TransformedProducts.csv: CSV file containing the transformed product data for analysis after data cleaning.
TransformedTransactions.csv: CSV file containing the transformed transaction data for analysis after data cleaning.
TransformedUsers.csv: CSV file containing the transformed user data for analysis after data cleaning.
Subject: Data Quality Findings and Insights
Hi,
I’m Meghana from the Data Analytics team. Our recent analysis of the Users, Products, and Transactions datasets identified several data quality issues and notable insights. And we would like to request your help understanding a few metrics currently in use.
- Significant Missing Details and Duplicates:** All three datasets contain missing information and duplicate entries.
- Improper Formatting:** There are formatting issues, such as the Birth_Date field in Users and the Barcode field in Products.
- Inconsistent Values:** The final_sale metric contains white spaces, while final_quantity includes both numeric and character values, despite being defined as numeric according to the ER Diagram.
- The values in the final_quantity metric are recorded as decimals.
- The same barcodes are associated with different products, and vice versa.
- There are duplicate entries and conflicting values in Final_Sale and Final_Quantity for identical records.
- User growth from 2014 to 2022 was exponential; however, recent years indicate a decline.
- Walmart is the top-performing store across all categories, with "Snacks" being the highest-performing category.
- Could you please provide details on how the values for the final_quantity and final_sale metrics are captured in the transaction data? Additionally, could you clarify why these metrics are labeled ‘final’?
- We would appreciate documentation explaining how each product is differentiated and deemed unique within our data. If duplicates exist, could you elaborate on the rationale for allowing them?
- Please confirm whether we are utilizing a combination of receipt number, user identification number, and barcode to distinguish between each transaction.
Thank you for your attention to these matters. I look forward to your insights.
Best regards,
Meghana
Data Analytics Team