The goal of this research is to analyze past housing data in order to predict the sell prices of homes and answer the following questions:
-
Q1 - How are Houses in Kings County Distributed and the effect of Zipcodes on price?
-
Q2 - Does distance from the city have an effect on House Price?
-
Q3 - Are Waterfront properties more expensive?
Data files used:
- cleaned_geodata.json
- data_correlation_gone.csv
- datawithnewcols.csv
- kc_house_data.csv
- Zip_Codes.geojson
By analyzing the first map, we can see that zipcodes surrounding Bellevue have more houses than zipcodes further out. Upon closer look at the second map we can see the most expensive homes are concentrated in zipcodes North of the County closer to Bellevue and the cheaper homes are located south of the county.
The North of the County has some of the most expensive properties so I belive this region is in high demand so one could expect a good amount of return by investing in real estate within ths region.
By looking at the plot we can see houses above $1 milllion are located withing 20km of Bellevue. And for homes within 50km of Seattle, Although the price/house increases the bigger the sqft, the price/sqft reduces the the further away from Seattle a house is.
I'll definitely recommend investing in homes 50km from Seattle which is also about 20km from Bellevue. By doing this you'll get more space for a less price which will also translate into higher returns.
From analyzing the violin plots we can conclude that waterfront properties are more expensive than others with a mean and median price of $1.7 million and $1.5 million. Homes without a waterfront have a mean and median price of $532,000 and $450,000 respectively.
Investing in waterfront properties would definitely be a smart move given there a difference of $1 million between the average home prices of a waterfront and no waterfront property.
For the first model the independent variables I used were:
- Grade
- Area of inside home
- Area of parking lot
- Presence of waterfront
- Age of home
- Distance from Seattle
- Distance from the airport
The equation for this model is: Price = (84890 * Area inside) + (-672.2227 * Area parking lot) + (341400 * Presence of waterfront) + (33230 * Age) + (-85020 * Distance from City) + (84890 * Distance from Airport) + (-44920 * Grade 5) + (-44720 * Grade 6) + (12400 * Grade 7) + (94390 * Grade 8) + (263600 * Grade 9) + (264300 * Grade 10) + (359800 * Grade 11)
The model uses the price of the home based on grade, then a positive adjustment is made for the area inside, presence of a waterfront, age, distance from airport, followed by a negative adjustment for area of the parking lot and distance from the city. The R2 score means theres a 71.7% variance in price.
For the first model the independent variables I used were:
- Grade
- View
- Area of inside home
- Area of parking lot
- Presence of waterfront
- Age of home
- Distance from Seattle
- Distance from the airport
The equation for this model is: Price = (92650 * Area inside) + (-674.9984 * Area parking lot) + (226400 * Presence of waterfront) + (25930 * Age) + (-86480 * Distance from City) + (51940 * Distance from Airport) + (61750 * Grade) + (33280 * View)
The model uses the price of the home based on grade, then a positive adjustment is made for the area inside, presence of a waterfront, age, distance from airport and view, followed by a negative adjustment for area of the parking lot and distance from the city. The R2 score means theres a 95.5% variance in price.
For the first model the independent variables I used were:
- Grade
- Area of inside home
- Area of parking lot
- Number of bedrooms
- Number of bathrooms
- Distance from closest city
The equation for this model is: Price = (84541 * Area inside) + (24190 * Area parking lot) + (-12510 * Number of bedrooms) + (3849.4898 * Number of bathrooms) + (-109100 * Distance from City) + (429646 * Grade 4) + (410804 * Grade 5) + (400781 * Grade 6) + (441474 * Grade 7) + (499282 * Grade 8) + (588835 * Grade 9) + (648431 * Grade 10) + (707515 * Grade 11)
The model uses the price of the home based on grade, then a positive adjustment is made for the area inside, area of parking lot, bathrooms, followed by a negative adjustment for bedrooms and distance from the city. The R2 score means theres a 70.1% variance in price.
-
Zip code Feature - Create a boundary selector to enable the model to predict price depending on zip code.
-
More Data - Collate data for the passed 20 years to see how the house prices have changed and also the distinct trends, such as malls, theatres, etc. in different zip codes and how they effect the prices of homes.
-
Explore the effects of the distance from a city on presence and size of a basement.