Example Data Science Project Following CRISP-DM Workflow
We've been hired by a group of homeowners who are concerned about the value of their homes decreasing before they sell them. They have asked us to use the raw house price data from their hometown, Ames, Iowa, to give them data-driven recommendations on how best to increase the value of their homes. We will follow the CRISP-DM workflow for our analysis:
- Phase 1) Business Understanding
- Phase 2) Data Understanding
- Phase 3) Data Preparation
- Phase 4) Modeling
- Phase 5) Evaluation
- Phase 6) Deployment
Our stakeholders are:
- People who already own homes in Ames, Iowa
Their primary goal is:
- Increase the resale value of their homes.
They plan to:
- Modify/renovate their homes based on our analysis.
What do they need/expect?
- Actionable insights/recommendations for which modifications they can make to increase the price of their homes.
The stakeholders have provided us with two links:
- Share URL to a .csv file
- A spreadsheet of various features of homes in their town, as well as the price of the house at the time of sale.
- A Data Dictionary File
- A data dictionary is a document that lists the name and explanation for every feature in a dataset.
(Note: this is a modified version of the original Ames, Iowa Housing dataset found on Kaggle.)
- The file had 2,959 rows and 38 columns.
- There is a mixture of datatypes:
- 8 float
- 12 int
- 18 object
- Since numeric features are sometimes stored as object dtype, we will inspect the object columns next and look for columns that should be converted.
- Object columns that needed to be converted to numeric:
- Half Bath (had "?" entries, which we replaced with NaN)
- Living Area Sqft (had to remove the characters "sqft" from each row)
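The two conversions above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code; the toy values and the exact cleaning calls are assumptions.

```python
import pandas as pd

# Toy rows standing in for the raw file (values are made up for illustration).
df = pd.DataFrame({
    "Half Bath": ["1", "?", "0"],
    "Living Area Sqft": ["1656 sqft", "896 sqft", "1329 sqft"],
})

# "?" marks a missing value, so blank it out (NaN) before converting.
df["Half Bath"] = pd.to_numeric(df["Half Bath"].mask(df["Half Bath"] == "?"))

# Strip the "sqft" suffix and surrounding whitespace, then convert.
df["Living Area Sqft"] = pd.to_numeric(
    df["Living Area Sqft"].str.replace("sqft", "", regex=False).str.strip()
)

print(df.dtypes)
```

Masking only the known "?" placeholder, rather than coercing every unparseable value, means any other unexpected string would raise an error instead of silently becoming NaN.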
Please see the Data Dictionary File for full details.
Please see the feature inspection section below for the definitions of the features that were included in the model.
After consulting the data dictionary, we noticed there are 2 features not included in the data dictionary:
- "Unnamed: 0": an erroneous index column that is not in the data dictionary and should be dropped.
- "PID": a column that is not included in the data dictionary.
- Based on the preview above it looks like it may be a unique identifier, and can be either dropped or used as the index after checking for duplicates.
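One way to handle these two columns is sketched below on a toy frame. The PID values are invented, and promoting PID to the index is just one of the two options mentioned above.

```python
import pandas as pd

# Toy frame; the PID and SalePrice values are made up for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "PID": [101, 102, 103],
    "SalePrice": [215000, 105000, 172000],
})

# The leftover index column carries no information, so drop it.
df = df.drop(columns="Unnamed: 0")

# Only promote PID to the index if every value is unique.
if df["PID"].is_unique:
    df = df.set_index("PID")

print(df.head())
```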
There were several features with ambiguous column names. The following features were renamed for clarity:
- "Year Remod/Add" -> "Year Remodeled"
- "Bsmt Unf SF" -> "Bsmt Unf Sqft"
- "Total Bsmt SF" -> "Total Bsmnt Sqft"
- "TotRms AbvGrd" -> "Total Rooms"
- "Gr Liv Area" -> "Living Area Sqft"
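The renames listed above map directly onto a dictionary passed to `DataFrame.rename`. The sketch below uses an empty toy frame with only the affected columns plus the target; the variable name `rename_map` is our own.

```python
import pandas as pd

# Old name -> new name, taken from the list above.
rename_map = {
    "Year Remod/Add": "Year Remodeled",
    "Bsmt Unf SF": "Bsmt Unf Sqft",
    "Total Bsmt SF": "Total Bsmnt Sqft",
    "TotRms AbvGrd": "Total Rooms",
    "Gr Liv Area": "Living Area Sqft",
}

# Empty toy frame that only carries the column names.
df = pd.DataFrame(columns=list(rename_map) + ["SalePrice"])
df = df.rename(columns=rename_map)

print(list(df.columns))
```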
- In the missingno matrix plot, we can see that there are only a few columns that have missing values. Of these columns, 2 seem to have primarily null values ("Alley" and "Fence").
- Below, we display the null value counts and percentages for only the columns with null values:
Column | # Null | % Null |
---|---|---|
Alley | 2732 | 93.24 |
Bsmt Unf Sqft | 1 | 0.03 |
Total Bsmnt Sqft | 1 | 0.03 |
Bsmt Full Bath | 2 | 0.07 |
Bsmt Half Bath | 2 | 0.07 |
Half Bath | 3 | 0.10 |
Garage Type | 157 | 5.36 |
Garage Yr Blt | 159 | 5.43 |
Garage Cars | 1 | 0.03 |
Garage Area | 1 | 0.03 |
Garage Qual | 159 | 5.43 |
Garage Cond | 159 | 5.43 |
Fence | 2358 | 80.48 |
- Alley and Fence have a large percentage of null values (93% and 80%, respectively).
- For the garage columns (Garage Type, Garage Yr Blt, Garage Qual, Garage Cond), nearly the same rows are null in all of these columns (157 to 159 rows each).
- This likely indicates that these homes did not have a Garage.
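A null summary like the table above can be built in a few lines. This is a sketch on a made-up four-row frame, so the counts differ from the real data; the column labels "# Null" and "% Null" mirror the table.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Garage Cars": [2.0, np.nan, 1.0, 2.0],
    "SalePrice": [215000, 105000, 172000, 244000],
})

null_counts = df.isna().sum()
null_summary = pd.DataFrame({
    "# Null": null_counts,
    "% Null": (null_counts / len(df) * 100).round(2),
})

# Keep only the columns that actually contain nulls.
null_summary = null_summary[null_summary["# Null"] > 0]
print(null_summary)
```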
- There were 7 duplicate rows that we dropped.
- There are 22 duplicated PIDs (44 rows total).
- In each pair of rows sharing a PID, one of the two rows had a NaN for SalePrice.
- A) We cannot have null values in SalePrice since it is our target, so we will drop null values from SalePrice only.
- B) Also, by dropping the rows with null SalePrice, we may also remove the duplicate PID's.
- So we first dropped the rows with a null SalePrice and then confirmed there were no remaining duplicate PIDs.
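The order of operations described above (drop exact duplicates, count shared PIDs, drop null targets, re-check) might look like this on a toy frame. The PID and price values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: one exact duplicate row and one PID repeated with a null target.
df = pd.DataFrame({
    "PID": [101, 101, 102, 102, 103],
    "SalePrice": [215000.0, 215000.0, 105000.0, np.nan, 172000.0],
})

# 1) Drop rows that are exact duplicates across all columns.
df = df.drop_duplicates()

# 2) Count the rows that still share a PID with another row.
n_dup_pid_rows = int(df.duplicated(subset="PID", keep=False).sum())

# 3) Drop rows with a null target, then confirm no PID repeats remain.
df = df.dropna(subset=["SalePrice"])
pids_unique = df["PID"].is_unique

print(n_dup_pid_rows, pids_unique)
```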
- Central Air:
- There were a small number of values in the Central Air column that had "yes" instead of "Y" and "no" instead of "N."
Value Counts for Central Air
Y 2697
N 191
yes 37
no 5
Name: Central Air, dtype: int64
- We replaced the incorrect values with "Y" and "N."
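A minimal sketch of this fix, using a made-up five-row column rather than the real data:

```python
import pandas as pd

# Toy column with the same four spellings seen in the value counts above.
df = pd.DataFrame({"Central Air": ["Y", "N", "yes", "no", "Y"]})

# Map the inconsistent spellings onto the data dictionary's codes.
df["Central Air"] = df["Central Air"].replace({"yes": "Y", "no": "N"})

counts = df["Central Air"].value_counts()
print(counts)
```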
Statistic | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remodeled | Bsmt Unf Sqft | Total Bsmnt Sqft | Living Area Sqft | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom | Kitchen | Total Rooms | Garage Yr Blt | Garage Cars | Garage Area | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2930.00 | 2930.00 | 2930.00 | 2930.00 | 2930.00 | 2930.00 | 2929.00 | 2929.00 | 2930.00 | 2928.00 | 2928.00 | 2930.00 | 2927.00 | 2930.00 | 2930.00 | 2930.00 | 2771.00 | 2929.00 | 2929.00 | 2930.00 |
mean | 57.48 | 10147.92 | 6.09 | 5.56 | 1971.36 | 1984.27 | 559.26 | 1051.61 | 1499.69 | 0.43 | 0.06 | 1.57 | 0.38 | 2.85 | 1.04 | 6.44 | 1978.13 | 1.77 | 472.82 | 181439.40 |
std | 33.79 | 7880.02 | 1.41 | 1.11 | 30.25 | 20.86 | 439.49 | 440.62 | 505.51 | 0.52 | 0.25 | 0.55 | 0.50 | 0.83 | 0.21 | 1.57 | 25.53 | 0.76 | 215.05 | 86659.68 |
min | -1.00 | 1300.00 | 1.00 | 1.00 | 1872.00 | 1950.00 | 0.00 | 0.00 | 334.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 1895.00 | 0.00 | 0.00 | 12789.00 |
25% | 43.00 | 7440.25 | 5.00 | 5.00 | 1954.00 | 1965.00 | 219.00 | 793.00 | 1126.00 | 0.00 | 0.00 | 1.00 | 0.00 | 2.00 | 1.00 | 5.00 | 1960.00 | 1.00 | 320.00 | 129500.00 |
50% | 63.00 | 9436.50 | 6.00 | 5.00 | 1973.00 | 1993.00 | 466.00 | 990.00 | 1442.00 | 0.00 | 0.00 | 2.00 | 0.00 | 3.00 | 1.00 | 6.00 | 1979.00 | 2.00 | 480.00 | 160000.00 |
75% | 78.00 | 11555.25 | 7.00 | 6.00 | 2001.00 | 2004.00 | 802.00 | 1302.00 | 1742.75 | 1.00 | 0.00 | 2.00 | 1.00 | 3.00 | 1.00 | 7.00 | 2002.00 | 2.00 | 576.00 | 213500.00 |
max | 313.00 | 215245.00 | 10.00 | 9.00 | 2010.00 | 2010.00 | 2336.00 | 6110.00 | 5642.00 | 3.00 | 2.00 | 4.00 | 2.00 | 8.00 | 3.00 | 15.00 | 2207.00 | 5.00 | 1488.00 | 2000000.00 |
- Lot Frontage: has a minimum value of -1.
- This may be a placeholder value.
- We replaced all the -1's with NaN's.
- SalePrice: The max value is much higher than the 75th percentile ($2 million vs. $213,500).
- After inspection, we decided this was not reasonable.
- The Living Area Sqft for the $2 million home is very small compared with the other most expensive homes: 789 sqft vs. 2,400 sqft (for a home priced at $755,000).
- This value is not realistic, so we decided to treat the $2 million as a typo with an extra 0.
- We replaced $2 million with $200,000.
- Garage Yr Blt: has a max value of 2207, which is many years in the future and cannot be correct.
- We replaced it with a null value.
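The three corrections above could be sketched as follows on a toy frame. The cutoff `LATEST_PLAUSIBLE_YEAR` is our own assumption, based on the 2010 maximum of Year Built and Year Remodeled in the summary table; the notebook may have targeted the 2207 value directly.

```python
import numpy as np
import pandas as pd

# Toy rows reproducing the three suspicious values (other values are made up).
df = pd.DataFrame({
    "Lot Frontage": [-1.0, 63.0, 78.0],
    "SalePrice": [2_000_000, 160_000, 213_500],
    "Garage Yr Blt": [2207.0, 1979.0, 2002.0],
})

# -1 looks like a placeholder, so treat it as missing.
df["Lot Frontage"] = df["Lot Frontage"].replace(-1, np.nan)

# Treat the $2 million sale as a typo with an extra zero.
df.loc[df["SalePrice"] == 2_000_000, "SalePrice"] = 200_000

# Assumed cutoff: no garage can be newer than the latest year in the data.
LATEST_PLAUSIBLE_YEAR = 2010
df.loc[df["Garage Yr Blt"] > LATEST_PLAUSIBLE_YEAR, "Garage Yr Blt"] = np.nan

print(df)
```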
- Date Sold: we split into 2 features (Month, Year).
- "Bsmt Half Bath"/"Half Bath": were added together to make "Total Half Baths". The original features were dropped.
- "Bsmt Full Bath"/"Full Bath": were added together to make "Total Full Baths". The original features were dropped. ...
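A rough sketch of these engineered features is shown below. The format of "Date Sold" is not stated here, so the ISO-style date strings are an assumption, as are the toy values.

```python
import pandas as pd

# Toy rows; the date format and all values are assumptions for illustration.
df = pd.DataFrame({
    "Date Sold": ["2010-05-01", "2009-06-01"],
    "Bsmt Half Bath": [0.0, 1.0],
    "Half Bath": [1.0, 0.0],
    "Bsmt Full Bath": [1.0, 0.0],
    "Full Bath": [2.0, 1.0],
})

# Split the sale date into month and year.
sold = pd.to_datetime(df["Date Sold"])
df["Month"] = sold.dt.month
df["Year"] = sold.dt.year

# Combine basement and above-ground bath counts.
df["Total Half Baths"] = df["Bsmt Half Bath"] + df["Half Bath"]
df["Total Full Baths"] = df["Bsmt Full Bath"] + df["Full Bath"]

# Drop the original bath columns now that the totals exist.
df = df.drop(columns=["Bsmt Half Bath", "Half Bath", "Bsmt Full Bath", "Full Bath"])
print(df)
```

Note that plain addition propagates NaN: a row with a null in either bath column ends up with a null total, so those nulls still need to be handled afterwards.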
Note: only the features that were used in the final model are included below. Please see the Jupyter notebook for the full feature exploration.
The following features were dropped from the model for the reasons listed below:
Feature Name | Reason Excluded |
---|---|
'Utilities' | Quasi-constant |
"Street" | Quasi-constant |
'MS Zoning' | Stakeholder can't change |
'Lot Frontage' | Stakeholder can't change |
'Lot Area' | Stakeholder can't change |
'Neighborhood' | Stakeholder can't change |
'Year Built' | Stakeholder can't change |
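A quasi-constant check can be sketched as below. The helper name `quasi_constant_columns` and the 0.98 threshold are our own assumptions; the threshold is simply chosen to sit above features such as Kitchen (95%) and Central Air (93%), which were not treated as quasi-constant.

```python
import pandas as pd


def quasi_constant_columns(df, threshold=0.98):
    """Return columns whose most common value covers at least `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most frequent value (nulls included).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged


# Toy frame with made-up category labels.
toy = pd.DataFrame({
    "Utilities": ["A"] * 99 + ["B"],
    "Central Air": ["Y"] * 93 + ["N"] * 7,
})
flagged = quasi_constant_columns(toy)
print(flagged)
```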
- "Alley": Type of alley access to property
- Grvl Gravel
- Pave Paved
- NA No alley access
- NaN's Found: 2732 (93.24%)
- Unique Values: 3
- Most common value: 'MISSING' occurs 2732 times (93.24%)
Things to check for each feature:
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical (nominal)
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 2732 null values (93.24%)
- Impute with the category shown in the data dictionary (NA)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Low (3).
- Would we know this BEFORE the target is determined?
- Yes
- Is there a business case/understanding reason to exclude based on our business case?
- It may be beyond the homeowner's control.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- I would think paved alleys would get a higher price.
- Does this feature appear to be a predictor of the target?
- Possibly.
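The per-feature summary lines used throughout this section (NaN's Found, Unique Values, Most common value) could be produced by a small helper like the one below. The function name `summarize_feature` and the "MISSING" placeholder handling are a sketch under our own assumptions, not the notebook's actual helper.

```python
import numpy as np
import pandas as pd


def summarize_feature(df, col, placeholder="MISSING"):
    """Summarize nulls, cardinality, and the most common value of one column."""
    n_null = int(df[col].isna().sum())
    # Treat nulls as their own category so they show up in the counts.
    values = df[col].astype(object).where(df[col].notna(), placeholder)
    counts = values.value_counts()
    return {
        "nulls": n_null,
        "pct_null": round(n_null / len(df) * 100, 2),
        "n_unique": int(values.nunique()),
        "top_value": counts.index[0],
        "top_count": int(counts.iloc[0]),
    }


# Toy column (made-up values) standing in for "Alley".
toy = pd.DataFrame({"Alley": [np.nan, np.nan, np.nan, "Pave", "Grvl"]})
summary = summarize_feature(toy, "Alley")
print(summary)
```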
- "BldgType": Type of dwelling:
- 1Fam Single-family Detached
- 2FmCon Two-family Conversion; originally built as one-family dwelling
- Duplx Duplex
- TwnhsE Townhouse End Unit
- TwnhsI Townhouse Inside Unit
- NaN's Found: 0 (0.0%)
- Unique Values: 5
- Most common value: '1Fam' occurs 2425 times (82.76%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical (nominal)
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- No need to impute.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Low (5).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- The homeowner may be able to convert their home to a duplex, etc.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes it does, though there is a wide range of sale prices for some of the building types.
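For a categorical feature, the spread of the target within each category can be checked numerically as well as visually. The sketch below uses made-up prices and assumes the column is named "Bldg Type" in the data.

```python
import pandas as pd

# Toy rows; prices are made up and the column name is an assumption.
df = pd.DataFrame({
    "Bldg Type": ["1Fam", "1Fam", "1Fam", "Duplx", "Duplx", "TwnhsE"],
    "SalePrice": [120000, 180000, 300000, 110000, 140000, 190000],
})

# Typical price and spread per category.
by_type = df.groupby("Bldg Type")["SalePrice"].agg(["count", "median", "min", "max"])
by_type["range"] = by_type["max"] - by_type["min"]
print(by_type)
```

A large range relative to the differences between medians suggests the feature separates prices only weakly on its own.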
- HouseStyle: Style of dwelling
- 1Story One story
- 1.5Fin One and one-half story: 2nd level finished
- 1.5Unf One and one-half story: 2nd level unfinished
- 2Story Two story
- 2.5Fin Two and one-half story: 2nd level finished
- 2.5Unf Two and one-half story: 2nd level unfinished
- SFoyer Split Foyer
- SLvl Split Level
- NaN's Found: 0 (0.0%)
- Unique Values: 8
- Most common value: '1Story' occurs 1481 times (50.55%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical (nominal)
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- No need to impute.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- ~medium cardinality (8)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No, the homeowner could remodel their home to change this.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Possibly, but it is hard to tell due to the range of values within some of the categories.
- "OverallQual": Overall material and finish quality
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor
- NaN's Found: 0 (0.0%)
- Unique Values: 10
- Most common value: '5' occurs 825 times (28.16%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal (but already numeric)
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable.
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes
- Does this feature appear to be a predictor of the target?
- Yes!
- "OverallCond": Overall condition rating
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor
- NaN's Found: 0 (0.0%)
- Unique Values: 9
- Most common value: '5' occurs 1654 times (56.45%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal, already numeric datatype. (No encoding needed).
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- 9, not high, especially since it will be treated as a numeric feature.
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Not really. The trendline is somewhat flat and the correlation is low.
- "Year Remodeled" (renamed from original "YearRemodAdd"):
- Remodel date (same as construction date if no remodeling or additions)
- NaN's Found: 0 (0.0%)
- Unique Values: 61
- Most common value: '1950' occurs 361 times (12.32%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric feature).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No! This is a very helpful feature, since our stakeholders are open to remodeling.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes, there is a positive correlation between this and the target.
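For numeric features, the correlation with the target can be computed directly. The values below are made up purely to show the call; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows with made-up values.
df = pd.DataFrame({
    "Year Remodeled": [1950, 1965, 1993, 2004, 2010],
    "Bsmt Unf Sqft": [400, 800, 200, 900, 300],
    "SalePrice": [100_000, 120_000, 160_000, 200_000, 230_000],
})

# Pearson correlation of every numeric feature with the target.
target_corr = (
    df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(target_corr)
```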
- ExterQual: Exterior material quality
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
- NaN's Found: 0 (0.0%)
- Unique Values: 4
- Most common value: 'TA' occurs 1799 times (61.4%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Low (only 4 unique values).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Very much so!
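Because this is an ordinal feature stored as text, it needs an order-preserving encoding before modeling. One simple option is an explicit mapping, sketched below; the column name "Exter Qual" and the 0 to 4 scale are assumptions, and the notebook may use a different encoder.

```python
import pandas as pd

# Worst-to-best order taken from the data dictionary; the integer scale is assumed.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Toy column with made-up rows.
df = pd.DataFrame({"Exter Qual": ["TA", "Gd", "Ex", "Fa"]})
df["Exter Qual Encoded"] = df["Exter Qual"].map(quality_order)
print(df)
```

Any value missing from the mapping becomes NaN, which makes unexpected categories easy to spot.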
- ExterCond: Present condition of the material on the exterior
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
- NaN's Found: 0 (0.0%)
- Unique Values: 5
- Most common value: 'TA' occurs 2549 times (87.0%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0
- Is the feature constant or quasi-constant?
- No, but TA is very common (87% of feature)
- What is the cardinality? Is it high?
- Low, only 5.
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Possibly, but the range of values for TA is very broad.
- "BsmtUnfSF": Unfinished square feet of basement area
- NaN's Found: 1 (0.03%)
- Unique Values: 1137
- Most common value: '0.0' occurs 244 times (8.33%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 1 null value (0.03%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Possibly, but unfinished basement would likely be less desirable than finished basement.
- Does this feature appear to be a predictor of the target?
- Not really; the correlation is low.
- "TotalBsmtSF": Total square feet of basement area
- NaN's Found: 1 (0.03%)
- Unique Values: 1058
- Most common value: '0.0' occurs 79 times (2.7%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 1 null (0.03%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable, numeric.
- Would we know this BEFORE the target is determined?
- Yes
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes it does.
- "CentralAir": Central air conditioning
- N: No
- Y: Yes
- NaN's Found: 0 (0.0%)
- Unique Values: 2
- Most common value: 'Y' occurs 2734 times (93.31%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0
- Is the feature constant or quasi-constant?
- No, but most homes have central air (93%)
- What is the cardinality? Is it high?
- Very low (2)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, central air is a very desirable trait.
- Does this feature appear to be a predictor of the target?
- Yes.
- "Living Area Sqft" (renamed from original "GrLivArea"):
- Above grade (ground) living area square feet
- NaN's Found: 0 (0.0%)
- Unique Values: 1292
- Most common value: '864.0' occurs 41 times (1.4%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, square footage is a critical house trait.
- Does this feature appear to be a predictor of the target?
- Yes!
- "Bedroom": Number of bedrooms above basement level
- NaN's Found: 0 (0.0%)
- Unique Values: 8
- Most common value: '3' occurs 1597 times (54.51%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, I would think more bedrooms is better.
- Does this feature appear to be a predictor of the target?
- Possibly, though it is surprising that the largest number of bedrooms has a lower price.
- "Kitchen": Number of kitchens
- NaN's Found: 0 (0.0%)
- Unique Values: 4
- Most common value: '1' occurs 2796 times (95.43%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- I would think that having more kitchens would lead to a higher price.
- Does this feature appear to be a predictor of the target?
- Possibly, but there is a negative correlation when we would expect positive.
- "Total Rooms" (renamed from original "TotRmsAbvGrd"):
- Total rooms above grade (does not include bathrooms)
- NaN's Found: 0 (0.0%)
- Unique Values: 14
- Most common value: '6' occurs 844 times (28.81%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes, it has a strong positive correlation with the target.
- "GarageType": Garage location:
- 2Types: More than one type of garage
- Attchd: Attached to home
- Basment: Basement Garage
- BuiltIn: Built-In (Garage part of house - typically has room above garage)
- CarPort: Car Port
- Detchd: Detached from home
- NA: No Garage
- NaN's Found: 157 (5.36%)
- Unique Values: 7
- Most common value: 'Attchd' occurs 1731 times (59.08%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 157 (5.36%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Low-Medium (7)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes, but there is a broad range of values for some of the categories.
- GarageYrBlt: Year garage was built
- NaN's Found: 160 (5.46%)
- Unique Values: 102
- Most common value: 'nan' occurs 160 times (5.46%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 160 null values (5.46%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, more recent garages are probably more appealing.
- Does this feature appear to be a predictor of the target?
- Yes, it does! Recently built garages, especially.
- GarageCars: Size of garage in car capacity
- NaN's Found: 1 (0.03%)
- Unique Values: 6
- Most common value: '2.0' occurs 1603 times (54.71%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 1 (0.03%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, I expect a garage that holds more cars to be more desirable.
- Does this feature appear to be a predictor of the target?
- Yes!
- GarageArea: Size of garage in square feet
- NaN's Found: 1 (0.03%)
- Unique Values: 603
- Most common value: '0.0' occurs 157 times (5.36%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 1 null value (0.03%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, similar to garage cars.
- Does this feature appear to be a predictor of the target?
- Yes.
- "GarageQual": Garage quality:
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor
- NA: No Garage
- NaN's Found: 159 (5.43%)
- Unique Values: 6
- Most common value: 'TA' occurs 2615 times (89.25%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 159 (5.43%)
- Is the feature constant or quasi-constant?
- No, but 89% of the values are TA.
- What is the cardinality? Is it high?
- Low (6)
- Would we know this BEFORE the target is determined?
- Yes
- Is there a business case/understanding reason to exclude based on our business case?
- No
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Possibly, but the range of values in TA is broad.
- "GarageCond": Garage condition:
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor
- NA: No Garage
- NaN's Found: 159 (5.43%)
- Unique Values: 6
- Most common value: 'TA' occurs 2665 times (90.96%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Ordinal
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 159 (5.43%)
- Is the feature constant or quasi-constant?
- No, but 91% of the values are TA.
- What is the cardinality? Is it high?
- Low, 6
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Possibly, but the range of values in TA is broad.
- PavedDrive: Paved driveway
- Y: Paved
- P: Partial Pavement
- N: Dirt/Gravel
- NaN's Found: 0 (0.0%)
- Unique Values: 3
- Most common value: 'Y' occurs 2652 times (90.51%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No, but 90% of the values are "Y"
- What is the cardinality? Is it high?
- Low (3)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No, the homeowner can get their driveway paved.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Yes.
- "Fence": Fence quality:
- GdPrv: Good Privacy
- MnPrv: Minimum Privacy
- GdWo: Good Wood
- MnWw: Minimum Wood/Wire
- NA: No Fence
- NaN's Found: 2358 (80.48%)
- Unique Values: 5
- Most common value: 'MISSING' occurs 2358 times (80.48%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Categorical
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 2,358 null values (80.48%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Low (5)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes.
- Does this feature appear to be a predictor of the target?
- Possibly, but the range of sale prices for homes missing a value for Fence is very wide.
- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict for this challenge.
- NaN's Found: 0 (0.0%)
- Unique Values: 1032
- Most common value: '135000.0' occurs 34 times (1.16%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Our Target (numeric)
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable.
- Would we know this BEFORE the target is determined?
- No! It IS the target!
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Month Sold
- NaN's Found: 0 (0.0%)
- Unique Values: 12
- Most common value: '6.0' occurs 505 times (17.24%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- numeric/ordinal
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No
- What is the cardinality? Is it high?
- Medium (12)
- Would we know this BEFORE the target is determined?
- No, we wouldn't know what month the sale will happen.
- Is there a business case/understanding reason to exclude based on our business case?
- Yes, the homeowner can't control the sale date of their home.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Not sure.
- Does this feature appear to be a predictor of the target?
- No.
- Year Sold
- NaN's Found: 0 (0.0%)
- Unique Values: 5
- Most common value: '2007.0' occurs 694 times (23.69%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 0 null values.
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric)
- Would we know this BEFORE the target is determined?
- No. Also, these years are long past and may not be helpful for predicting prices in current years.
- Is there a business case/understanding reason to exclude based on our business case?
- Yes. The sale year is not fully under the homeowner's control, and the housing market from this time period does not reflect modern yearly price trends.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- No.
- Does this feature appear to be a predictor of the target?
- No.
- Engineered: Combined Full Baths + Bsmnt Full Baths
- NaN's Found: 2 (0.07%)
- Unique Values: 6
- Most common value: '2.0' occurs 1477 times (50.41%)
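An engineered total like this one can be built by adding the two source columns. The sketch below is illustrative only: the source column names `Full Bath` and `Bsmt Full Bath` are assumptions, and the toy rows are made up. It also shows one plausible reason the engineered feature has a few missing values: plain addition propagates NaN from either input.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the source column names are assumptions
toy = pd.DataFrame({
    "Full Bath": [2, 1, 2, np.nan],
    "Bsmt Full Bath": [1, 0, np.nan, 1],
})

# Plain addition propagates NaN: a row missing either input gets a missing total
toy["Total Full Baths"] = toy["Full Bath"] + toy["Bsmt Full Bath"]
print(toy)
```

The same pattern would apply to the combined half-bath feature.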
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 2 (0.07%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric).
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, I would expect the more full baths the higher the price.
- Does this feature appear to be a predictor of the target?
- Yes! Strong positive correlation.
- Engineered: Combined Half Baths + Bsmnt Half Baths
- NaN's Found: 5 (0.17%)
- Unique Values: 5
- Most common value: '0.0' occurs 1706 times (58.23%)
EDA Observations
- What type of feature is it? (Categorical (nominal), ordinal, numeric)
- Numeric.
- How many null values? What percentage? What would you do with the null values (drop the rows? drop the column? impute? if impute, with what?)
- 5 (0.17%)
- Is the feature constant or quasi-constant?
- No.
- What is the cardinality? Is it high?
- Not applicable (numeric)
- Would we know this BEFORE the target is determined?
- Yes.
- Is there a business case/understanding reason to exclude based on our business case?
- No.
- Feature vs. Target Observations:
- Based on your business understanding, would you expect this feature to be a predictor of the target?
- Yes, I would think more of any type of bathroom would increase price.
- Does this feature appear to be a predictor of the target?
- Somewhat, but it looks like having more than 2 half baths may decrease the home value.
- The following features were processed as numeric features:
- 'Overall Qual', 'Overall Cond', 'Year Remodeled', 'Bsmt Unf Sqft', 'Total Bsmnt Sqft', 'Living Area Sqft', 'Bedroom', 'Kitchen', 'Total Rooms', 'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Month', 'Year', 'Total Full Baths', 'Total Half Baths'
- Missing values were imputed with the median.
- The data was scaled using StandardScaler.
- The following features were processed as ordinal features:
- 'Exter Qual', 'Exter Cond', 'Garage Qual', 'Garage Cond'
- Missing values were imputed using a placeholder value "NA".
- All 4 features had the same ordinal categories ('NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex') and were encoded using OrdinalEncoder.
- The ordinally-encoded features were then scaled using StandardScaler.
- The following features were processed as categorical features:
- 'Alley', 'Bldg Type', 'House Style', 'Central Air', 'Garage Type', 'Paved Drive', 'Fence'
- Missing values were imputed using a placeholder value "NA".
- The features were then encoded with OneHotEncoder.
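The three preprocessing branches described above can be sketched with scikit-learn's `ColumnTransformer`. This is an illustrative arrangement under stated assumptions, not the project's actual code: the column lists are shortened subsets, the toy rows are made up, and `handle_unknown="ignore"` and `sparse_threshold=0` are choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Subsets of the feature lists above, kept short for the example
num_cols = ["Overall Qual", "Living Area Sqft"]
ord_cols = ["Exter Qual", "Garage Cond"]
cat_cols = ["Central Air", "Fence"]

# Shared ordinal scale, lowest to highest, with the "NA" placeholder first
qual_order = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
ord_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OrdinalEncoder(categories=[qual_order] * len(ord_cols)),
    StandardScaler(),
)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = ColumnTransformer(
    [("num", num_pipe, num_cols), ("ord", ord_pipe, ord_cols), ("cat", cat_pipe, cat_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Toy rows for illustration only (not the Ames data)
toy = pd.DataFrame({
    "Overall Qual": [5, 7, np.nan, 8],
    "Living Area Sqft": [1200.0, 1800.0, 1500.0, np.nan],
    "Exter Qual": ["TA", "Gd", "TA", "Ex"],
    "Garage Cond": ["TA", np.nan, "Fa", "TA"],
    "Central Air": ["Y", "Y", "N", "Y"],
    "Fence": [np.nan, "MnPrv", np.nan, "GdPrv"],
})

X = preprocessor.fit_transform(toy)
print(X.shape)
```

On real data the preprocessor would normally be fit on the training split only and then applied to the test split, so that medians and scaling statistics do not leak from the test data.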
------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
MAE = 21,158.21
MSE = 1,151,857,608.08
RMSE = 33,939.03
R^2 = 0.83
------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
MAE = 19,928.76
MSE = 850,685,987.08
RMSE = 29,166.52
R^2 = 0.83
------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
MAE = 6,236.45
MSE = 108,781,002.52
RMSE = 10,429.81
R^2 = 0.98
------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
MAE = 17,097.82
MSE = 779,491,429.10
RMSE = 27,919.37
R^2 = 0.84
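For reference, the four metrics reported above follow standard definitions, sketched here in plain Python. The function name `regression_metrics` and the toy values are hypothetical; in practice scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R^2": 1 - ss_res / ss_tot}

# Toy values for illustration only
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(metrics)

# RMSE is the square root of MSE, so the reported pairs can be cross-checked
reported_mse = 1_151_857_608.08  # MSE from the first training block above
print(round(math.sqrt(reported_mse), 2))  # 33939.03
```

Because RMSE is in dollars, it is usually the easiest of the four to explain to stakeholders.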
Which model/result(s) should we provide to our stakeholders?
- While our Random Forest model had the best performance on the training data (R^2 = 0.98, RMSE = $10,429.81), the test performance was much lower (R^2 = 0.84, RMSE = $27,919.37). It was heavily overfit to the training data, even after tuning max_depth.
- While the LinearRegression model performed worse on the training data than the Random Forest (R^2 = 0.83, RMSE = $33,939.03), it performed equally well on the training and test data (test R^2 = 0.83, RMSE = $29,166.52). It was not overfit and had consistent performance.
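The train/test gap discussed above can be measured directly by scoring a model on both splits at several `max_depth` settings. The sketch below uses synthetic data from `make_regression`, so its numbers say nothing about the Ames results; the depth values, sample size, and `n_estimators` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not the Ames data)
X, y = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gaps = {}
for depth in (2, 5, None):  # None places no limit on tree depth
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)  # .score returns R^2 for regressors
    test_r2 = model.score(X_test, y_test)
    gaps[depth] = train_r2 - test_r2
    print(f"max_depth={depth}: train R^2={train_r2:.2f}, test R^2={test_r2:.2f}")
```

A large positive gap between training and test R^2 is the overfitting signal; a gap near zero is what consistent performance looks like.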
Do the results meet the stakeholder’s success criteria?
Not quite yet. Our stakeholders also want insights and recommendations (see Phase 6 below).
- The overall quality of the home's materials was strongly correlated with a higher Sale Price.
- We recommend paying for higher-quality construction materials when remodeling your home to increase its Sale Price.
- The size of the living area (in square feet) was also strongly positively correlated with Sale Price.
- We recommend expanding the living area of your home. This could be done by adding on to the home or by converting unfinished areas of the home into livable space.
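One way to check correlations like the two described above is the Pearson correlation of each feature with the target. This is a minimal sketch on made-up rows, assuming the column names used in the feature lists earlier; the source does not say which correlation measure was used.

```python
import pandas as pd

# Toy rows for illustration only (not the Ames data)
toy = pd.DataFrame({
    "Overall Qual": [4, 5, 6, 7, 8],
    "Living Area Sqft": [900, 1100, 1500, 1800, 2400],
    "SalePrice": [110000, 135000, 160000, 210000, 260000],
})

# Pearson correlation of each feature with the target
corr_with_price = toy.corr()["SalePrice"].drop("SalePrice")
print(corr_with_price.round(2))
```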
Central Air
N 101890.479592
Y 186483.877835
Name: SalePrice, dtype: float64
- Homes that have central air conditioning sell for ~$85,000 more, on average, than homes without it.
- We recommend adding central air conditioning to your home if you do not have it already.
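The ~$85,000 figure is the difference between the two group means printed above. The sketch below redoes that arithmetic with the values copied from the output; on the full data the series would presumably come from a call such as `df.groupby("Central Air")["SalePrice"].mean()`, where `df` is an assumed name for the cleaned DataFrame.

```python
import pandas as pd

# Group means copied from the output above
mean_price = pd.Series({"N": 101890.479592, "Y": 186483.877835}, name="SalePrice")

# Average price difference between homes with and without central air
premium = mean_price["Y"] - mean_price["N"]
print(f"${premium:,.2f}")  # $84,593.40
```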