A simple overview of the Data Science work process
Hello! Today I will be walking you through a project I worked on using the King County housing dataset. For those of you who would like a link to the dataset so that you can work on it, here is the link:
Below is the information provided for each house in the dataset (the dataset contains 21,597 houses):
- id — unique identifier for a house
- date — Date the house was sold
- price — Price (the prediction target)
- bedrooms — Number of bedrooms
- bathrooms — Number of bathrooms/bedrooms
- sqft_living — Square footage of the home
- sqft_lot — Square footage of the lot
- floors — Total floors (levels) in the house
- waterfront — House which has a view to a waterfront
- view — Has been viewed
- condition — How good the condition is (overall)
- grade — overall grade given to the housing unit, based on King County grading system
- sqft_above — square footage of house apart from basement
- sqft_basement — square footage of the basement
- yr_built — Year the house was built
- yr_renovated — Year when house was renovated
- zipcode — ZIP code
- lat — Latitude coordinate
- long — Longitude coordinate
- sqft_living15 — The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 — The square footage of the land lots of the nearest 15 neighbors
There were two main objectives for this project. The first was to use multiple linear regression to predict the price of a house in King County based on the best combinations of the features shown above.
The second objective was to get more familiar with the process of doing Data Science once you’ve gotten your hands on a dataset you want to work with, and that’s going to be the primary focus of this blog post. The life cycle I followed is OSEMN. OSEMN is not a perfect representation of the daily work of a Data Scientist as I understand it, but it is a rough outline of the work process. Here is what the acronym stands for and what each step means:
Obtain — Gather Data from relevant resources
Scrub — Clean data to formats that machine understands
Explore — Find significant patterns and trends using statistical methods
Model — Construct models to predict and forecast
Interpret — Put the results into good use
Alright, now that we understand the objective and the scope of the post, let’s get into it.
Obtain is the first step of the process and is where we acquire our data so that we can do work. This step was the shortest for this project because I found a good dataset to use online. It is not always so easy to find the data you need; acquiring access to data online or scraping the web for relevant data can be a laborious process. Hats off to Data Engineers for their good work.
Scrub is where data cleaning occurs. Data cleaning is notorious for being the most laborious and time-consuming part of the process. The goal is to fix any issues with the dataset so that it can be used effectively in the later steps. Some common issues are duplicate rows scattered throughout the dataset, or missing entries in a column, known as null values.
For this particular project, there were multiple issues: duplicates appeared throughout the dataset, there were null values, and several rows contained a ‘?’ that needed to be changed. I started by removing the duplicates; when the duplicate rows had differing dates, I kept the latest one. Houses appeared to be listed once for each time they had been sold, so keeping the latest date made sense, since I wanted the dataset to reflect the most recent data. For the null values and the ‘?’ entries, I converted them all to zeroes so that they could be manipulated numerically. Most of the missing values were in columns like waterfront or sqft_basement, so the assumption was that those features were simply absent for those houses.
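A minimal pandas sketch of that cleanup, using a tiny made-up frame in place of the real dataset (the column names match the dataset, but the values here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Tiny synthetic stand-in for the King County data.
df = pd.DataFrame({
    "id":            [1, 1, 2, 3],
    "date":          ["2014-05-01", "2015-02-10", "2014-07-20", "2015-01-05"],
    "price":         [300000, 310000, 450000, 500000],
    "waterfront":    [np.nan, np.nan, 0.0, 1.0],
    "sqft_basement": ["?", "?", "600", "0"],
})

# Keep only the most recent sale for each house id.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").drop_duplicates(subset="id", keep="last")

# Treat '?' and missing values as "feature absent" and encode them as 0.
df["sqft_basement"] = pd.to_numeric(df["sqft_basement"].replace("?", 0))
df["waterfront"] = df["waterfront"].fillna(0)

print(len(df))  # one row per house id remains
```

The `sort_values`/`drop_duplicates(keep="last")` pair is what keeps the latest sale per id.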
Exploration (aka Exploratory Data Analysis, or EDA) is where we finally start using the data to see what insights can be drawn from it. This is usually done by creating graphs from the data. Some insights will lead to manipulating columns to make entirely new features for the houses. Below are some of the questions I sought to answer from the dataset:
Question 1: Does renovation have a noticeable effect on price?
Conclusion 1: Not entirely surprisingly, renovated houses in King County show a mean price increase of $237,423, or 144.0 percent.
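The comparison behind that conclusion boils down to a groupby on renovation status. A sketch on a hypothetical mini-sample (the numbers below are made up, not the project's actual figures):

```python
import pandas as pd

# Hypothetical mini-sample; the real dataset has 21,597 rows.
df = pd.DataFrame({
    "price":        [250000, 400000, 300000, 700000],
    "yr_renovated": [0, 1995, 0, 2004],  # 0 means "never renovated"
})

df["renovated"] = df["yr_renovated"] > 0
means = df.groupby("renovated")["price"].mean()

increase = means.loc[True] - means.loc[False]
pct = increase / means.loc[False] * 100
print(increase, pct)
```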
Question 2: Is there a difference in price between a house built in a given time period versus a house renovated in that same time period?
Conclusion 2: There is a significantly larger average price for houses that were renovated in a time period compared to houses that were newly built in that same period. The price gap appears to grow with each time period until 2010–2015, which might be because that period is shorter than the others.
Question 3: Is there a difference in price based on geographical location in King County?
After looking at the heat map, I thought it would be interesting to compare the houses in the northern region versus the houses in the southern region. I decided that latitude 47.5 seemed to be a good spot to split the county into two.
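The north/south split at latitude 47.5 is a one-liner in pandas. A sketch on a few illustrative points (coordinates and prices here are invented):

```python
import pandas as pd

# Illustrative sample points; latitude 47.5 is the north/south cut.
df = pd.DataFrame({
    "lat":   [47.70, 47.65, 47.30, 47.25],
    "price": [800000, 750000, 400000, 420000],
})

df["region"] = df["lat"].apply(lambda lat: "north" if lat >= 47.5 else "south")
region_means = df.groupby("region")["price"].mean()
print(region_means)
```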
Conclusion 3: There is a tremendous difference (almost double, a 90% difference) in price between Southern King County homes and Northern King County homes.
In the explore step, I also had to account for multicollinearity, since the plan was to use multiple linear regression. The model cannot tolerate multicollinearity, or intercorrelation of features, which means strongly related features need to be removed before building the model. Making a heat map with seaborn gives a good visualization of the correlations among the columns, as well as each column’s correlation with our target, price.
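A sketch of that heat map with seaborn, on synthetic columns built to mimic the real correlations (bathrooms tracking square footage, price tracking both — the data here is generated, not the dataset's):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, 200)
# Synthetic columns mimicking the real correlations.
df = pd.DataFrame({
    "sqft_living": sqft,
    "bathrooms":   sqft / 900 + rng.normal(0, 0.3, 200),
    "price":       sqft * 250 + rng.normal(0, 50000, 200),
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.tight_layout()
plt.savefig("corr_heatmap.png")
print(corr.loc["sqft_living", "bathrooms"])
```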
Looking at the heatmap, we can see that some features will have to go. For example, we can’t use both the number of bathrooms and the square footage of the house, because the two are strongly correlated for obvious reasons: bigger houses tend to have more bathrooms.
Usually in Data Science, we want to use the data we are working with to make predictions. It is in the Model section of the lifecycle that this occurs.
With more exploration and deductive reasoning, many of the original columns were removed, and what seemed to be the best features were kept to build the model. When looking at how each category related to price, however, I kept noticing skew caused by outliers in price. I took a closer look using a box plot, which is great for detecting outliers.
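The upper whisker of a standard box plot sits at Q3 + 1.5 × IQR, so trimming past it amounts to a quantile calculation. A sketch on a hypothetical price sample with one obvious outlier:

```python
import pandas as pd

# Hypothetical price sample with one very expensive outlier.
prices = pd.Series([300000, 350000, 400000, 450000, 500000, 3000000])

# The "upper whisker" of a box plot sits at Q3 + 1.5 * IQR.
q1, q3 = prices.quantile([0.25, 0.75])
upper_whisker = q3 + 1.5 * (q3 - q1)
trimmed = prices[prices <= upper_whisker]

print(upper_whisker, len(trimmed))
```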
As the plot shows, there are many outliers due to expensive houses. I removed everything past the upper whisker and found a better relationship between my features and price. By the end, the features that remained were the latitude of the house and its square footage. I was happy with these features because they made sense: square footage is a rational driver of price, and latitude clearly made a difference based on the earlier exploration. Many other features had to go because of intercorrelation, which was accounted for using the Variance Inflation Factor (VIF). Here are the results from the Ordinary Least Squares (OLS) regression model:
In the interpret step, we take a look at our model and see what improvements can be made. This is also the step where we can use our model to make predictions.
For this model, a train-test split was used to evaluate performance.
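A sketch of that evaluation with scikit-learn: hold out 20% of the data so the model is scored on houses it never saw during fitting (the generated features and targets below are stand-ins for the real ones):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([
    rng.uniform(600, 3500, n),   # sqft_living (hypothetical values)
    rng.uniform(47.2, 47.8, n),  # lat
])
y = 200 * X[:, 0] + 900000 * (X[:, 1] - 47.2) + rng.normal(0, 40000, n)

# Hold out 20% of the rows for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, reg.predict(X_test))
print(r2)
```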
The model produced here leaves a lot to be desired, but I’m happy with what I got from a one-week project. The truth is, real estate companies probably combine many different models to predict housing prices, and this project was really just Data Science practice.
Despite not having the most robust model, I had a lot of fun working on this project. For those of you interested in the nitty gritty of the project, here is a link to the GitHub repository:
I hope you enjoyed reading this write-up, and best of luck to you!