## Project Overview

The Diabetes Prediction Project analyzes patient health data to predict diabetes prevalence and identify the key health indicators associated with diabetic patients. It leverages SQL for database management and data retrieval, enabling efficient exploration and manipulation of patient health records for a range of analytical tasks.
## Table of Contents

- Project Overview
- Features
- Database Setup
- Data Analysis Tasks
- Performance Optimization
- Future Improvements
- Contributing
- License
## Features

- Efficient retrieval and analysis of patient information
- Complex SQL queries for health data insights
- Data schema improvements to ensure data integrity
- SQL query optimization techniques for enhanced performance
## Database Setup

1. Clone the repository: `https://github.com/meabhaykr/Diabetes_Prediction_SQL_Analysis`
2. Connect to your SQL database and create the required tables using the provided schema.
3. Import the patient data into the database to enable the analysis.
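The repository's schema file defines the actual table layout; as a rough illustration, the kind of `patients` table the steps above assume might look like the following. The column names here are guesses based on the analysis tasks, and SQLite (via Python's built-in `sqlite3`) is used only because it requires no server setup:

```python
import sqlite3

# Hypothetical schema -- the repository's actual column names may differ.
SCHEMA = """
CREATE TABLE patients (
    patient_id          INTEGER PRIMARY KEY,
    gender              TEXT,
    birthdate           TEXT,      -- ISO-8601 date string
    age                 INTEGER,
    hypertension        INTEGER,   -- 0/1 flag
    heart_disease       INTEGER,   -- 0/1 flag
    smoking_history     TEXT,
    bmi                 REAL,
    hba1c_level         REAL,
    blood_glucose_level REAL,
    diabetes            INTEGER    -- 0/1 flag
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One sample row to confirm the import path works end to end.
conn.execute(
    "INSERT INTO patients VALUES (1, 'Female', '1980-05-14', 44, 1, 0, "
    "'never', 27.3, 6.1, 140, 0)"
)
print(conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0])
```

In practice the data would be bulk-imported (e.g. from CSV) rather than inserted row by row.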
## Data Analysis Tasks

The following SQL-based tasks were executed for comprehensive analysis:
- Retrieve patient IDs and ages of all patients.
- Select all female patients older than 40.
- Calculate the average BMI of patients.
- List patients by descending blood glucose levels.
- Identify patients with both hypertension and diabetes.
- Determine the count of patients with heart disease.
- Group patients by smoking history and count smokers vs. non-smokers.
- Retrieve patient IDs of patients whose BMI is above the average.
- Find patients with the highest and lowest HbA1c levels.
- Calculate patient age in years from their birthdate.
- Rank patients by blood glucose levels within each gender.
- Update the smoking history of patients older than 50 to "Ex-smoker."
- Insert new patient data with sample values.
- Delete all patients with heart disease.
- Find patients with hypertension but not diabetes using the EXCEPT operator.
- Define a unique constraint on the "patient_id" column to enforce uniqueness.
- Create a view displaying patient IDs, ages, and BMI values.
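A few of the tasks above can be sketched as queries against a toy dataset. This uses SQLite through Python's built-in `sqlite3` purely for illustration; the table and column names are assumptions, not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    gender TEXT, age INTEGER,
    hypertension INTEGER, diabetes INTEGER,
    bmi REAL, blood_glucose_level REAL
)""")
conn.executemany(
    "INSERT INTO patients VALUES (?,?,?,?,?,?,?)",
    [
        (1, "Female", 52, 1, 1, 31.4, 190),
        (2, "Female", 35, 0, 0, 22.0, 110),
        (3, "Male",   61, 1, 0, 28.9, 150),
        (4, "Male",   47, 0, 1, 35.2, 210),
    ],
)

# Female patients older than 40
rows = conn.execute(
    "SELECT patient_id FROM patients WHERE gender = 'Female' AND age > 40"
).fetchall()
print(rows)  # [(1,)]

# Average BMI across all patients
avg_bmi = conn.execute("SELECT AVG(bmi) FROM patients").fetchone()[0]

# Patients with BMI above the average, via a scalar subquery
above = conn.execute(
    "SELECT patient_id FROM patients "
    "WHERE bmi > (SELECT AVG(bmi) FROM patients)"
).fetchall()

# Rank patients by blood glucose within each gender (window function)
ranked = conn.execute("""
    SELECT patient_id, gender,
           RANK() OVER (PARTITION BY gender
                        ORDER BY blood_glucose_level DESC) AS rnk
    FROM patients
""").fetchall()

# Hypertension but not diabetes, via EXCEPT
hyp_not_dia = conn.execute("""
    SELECT patient_id FROM patients WHERE hypertension = 1
    EXCEPT
    SELECT patient_id FROM patients WHERE diabetes = 1
""").fetchall()
print(hyp_not_dia)  # [(3,)]
```

The `RANK() OVER (PARTITION BY ...)` form requires a reasonably recent SQLite (3.25+); on older database engines the same ranking would need a correlated subquery.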
## Performance Optimization

To improve SQL query performance, the following methods were implemented:
- Indexing: Indexes were added to columns frequently used in WHERE clauses and JOIN conditions.
- Selective Columns: Reduced data transfer by selecting only necessary columns in queries.
- Efficient WHERE Clauses: Ensured WHERE clauses were optimized without functions on indexed columns.
- JOIN Optimization: Utilized the most suitable JOIN types and optimized join conditions.
- Statistics Update: Regular updates to table statistics for more effective query planning.
## Future Improvements

- Normalization: Apply normalization principles to reduce data redundancy.
- Foreign Keys: Establish foreign key relationships for data integrity.
- Audit Trails: Implement audit trails to track data changes over time.
- Constraints and Defaults: Use default values and constraints for data integrity.
- Index Optimization: Maintain a balance with indexing for frequently queried fields.
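Several of these improvements can be sketched together: a lookup table for smoking history (normalization), a foreign key, a `CHECK` constraint, and a column default. The schema below is hypothetical, and note that SQLite requires foreign-key enforcement to be switched on explicitly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.executescript("""
-- Normalized: smoking history lives in its own lookup table
CREATE TABLE smoking_history (
    smoking_id INTEGER PRIMARY KEY,
    label      TEXT UNIQUE NOT NULL
);
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    age        INTEGER NOT NULL DEFAULT 0 CHECK (age >= 0),
    smoking_id INTEGER REFERENCES smoking_history(smoking_id)
);
""")
conn.execute(
    "INSERT INTO smoking_history (label) VALUES ('never'), ('Ex-smoker')"
)
conn.execute("INSERT INTO patients (age, smoking_id) VALUES (55, 2)")

# The column default fills in age when it is omitted:
conn.execute("INSERT INTO patients (smoking_id) VALUES (1)")

# The foreign key rejects dangling references:
rejected = False
try:
    conn.execute("INSERT INTO patients (age, smoking_id) VALUES (30, 99)")
except sqlite3.IntegrityError:
    rejected = True
print("dangling reference rejected:", rejected)
```

An audit trail would typically be a separate table populated by `AFTER UPDATE`/`AFTER DELETE` triggers on `patients`, recording the old values and a timestamp.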