Predicting Real Estate Prices

The dataset includes the following columns:

  • id — unique identifier for a house
  • date — Date the house was sold
  • price — Price (the prediction target)
  • bedrooms — Number of bedrooms per house
  • bathrooms — Number of bathrooms per bedroom
  • sqft_living — Square footage of the home
  • sqft_lot — Square footage of the lot
  • floors — Total floors (levels) in the house
  • waterfront — Whether the house has a view of a waterfront
  • view — Has been viewed
  • condition — How good the condition is (overall)
  • grade — Overall grade given to the housing unit, based on the King County grading system
  • sqft_above — Square footage of the house apart from the basement
  • sqft_basement — Square footage of the basement
  • yr_built — Year built
  • yr_renovated — Year the house was renovated
  • zipcode — Zip code
  • lat — Latitude coordinate
  • long — Longitude coordinate
  • sqft_living15 — Square footage of interior living space for the nearest 15 neighbors
  • sqft_lot15 — Square footage of the land lots of the nearest 15 neighbors


Obtain is the first step of the process: it is where we acquire the data we will work with. This step was the shortest for this project because I found a good dataset online. It is not always so easy to find the data you need, and it can be a laborious process to get access to data online or to scrape the web for relevant data. Hats off to Data Engineers for their good work.


Scrub is where data cleaning occurs. Data cleaning is notorious for being the most laborious and usually most time-consuming part of the process. The goal is to fix any issues with the dataset so that it can be used effectively in the later steps. Common issues include duplicate rows scattered throughout the dataset and missing entries in a column, known as null values.
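A minimal sketch of those two fixes in pandas, using a tiny made-up frame in place of the real King County data (the column names follow the dataset description above):

```python
import pandas as pd

# Toy stand-in for the housing data: one duplicated row, one null price.
df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "price": [221900.0, 221900.0, None, 180000.0],
    "bedrooms": [3, 3, 2, 4],
})

df = df.drop_duplicates()         # remove exact duplicate rows
df = df.dropna(subset=["price"])  # drop rows with a null prediction target
print(len(df))                    # rows remaining after cleaning
```

Whether to drop null rows or impute values instead depends on how much data you can afford to lose; dropping is only safe when the affected rows are a small fraction of the dataset.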


Exploration (aka Exploratory Data Analysis, or EDA) is where we finally start using the data to see what insights can be drawn from it. This is usually done by graphing some of the data. Some insights lead to manipulating columns to create entirely new features for the houses. Below are some of the questions I sought to answer with the dataset:

Question 1: Does renovation have a noticeable effect on price?

Question 2: Is there a difference in price between a house built in a given time period versus a house renovated in that same time period?

Question 3: Is there a difference in price based on geographical location in King County?

Figure: heat map of house prices plotted by latitude and longitude.


In the Explore step, I also had to account for multicollinearity, since the plan was to use multiple linear regression. The model cannot have multicollinearity, or intercorrelation of features, which means that strongly related features need to be removed before building the model. Making a heat map with seaborn gives a good visualization of the correlations between the different columns, as well as each column's correlation with our target, price.
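A sketch of that heat map, using synthetic data in place of the real dataset: `sqft_above` is deliberately constructed as a near-copy of `sqft_living` to mimic the kind of intercorrelated pair that would need pruning.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing data.
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 500, 200)
df = pd.DataFrame({
    "price": sqft_living * 150 + rng.normal(0, 20000, 200),
    "sqft_living": sqft_living,
    "sqft_above": sqft_living + rng.normal(0, 50, 200),  # near-duplicate feature
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Feature pairs correlated well above ~0.75 with each other (here,
# sqft_living vs sqft_above) are candidates for removal.
```

The exact cutoff for "too correlated" is a judgment call; dropping one feature from each highly correlated pair is the usual remedy before fitting a linear model.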


Usually in Data Science, we want to use the data we are working with to make predictions. It is in the Model section of the lifecycle that this occurs.
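For multiple linear regression, that step might look like the following sketch with scikit-learn. The data here is synthetic (price driven mostly by living area plus noise), standing in for the cleaned King County features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature (sqft_living) plus noise.
rng = np.random.default_rng(42)
X = rng.normal(2000, 500, (300, 1))
y = X[:, 0] * 150 + rng.normal(0, 20000, 300)

# Hold out a test set so the model is scored on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # R^2 on held-out data
```

Scoring on a held-out split, rather than on the training data, is what makes the R² an honest estimate of how the model would do on houses it has never seen.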


In Interpret, we take a look at our model and see what improvements can be made. This is also the step where we can use our model to make predictions.
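Both halves of that step can be sketched briefly. Continuing with synthetic data in place of the real model, a residual check hints at improvements, and `predict` handles new inputs (the 2,400 sq ft house below is a made-up example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the fitted housing model.
rng = np.random.default_rng(1)
X = rng.normal(2000, 500, (200, 1))
y = X[:, 0] * 150 + rng.normal(0, 20000, 200)
model = LinearRegression().fit(X, y)

# Interpretation: systematic patterns in the residuals would suggest
# missing features or a non-linear relationship worth addressing.
residuals = y - model.predict(X)
print(round(residuals.mean(), 2))

# Prediction: estimated price for a hypothetical 2,400 sq ft house.
print(model.predict([[2400]])[0])
```

Plotting residuals against each feature (rather than just averaging them) is the usual way to spot where the model is consistently over- or under-predicting.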

Closing Remarks

Despite not having the most robust model, I had a lot of fun working on this project. For those of you interested in the nitty-gritty of the project, here is a link to the GitHub repository:


