Chose the W Store dataset, which consists of three sub-datasets.
Extracted column names, data types, and non-null count from each sub-dataset.
Converted data types: "Date" to datetime, and the categorical columns ("IsHoliday" and "Type") to int.
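
A minimal sketch of the type conversion, assuming pandas. The toy values and the A/B/C to 0/1/2 mapping for "Type" are assumptions for illustration, not taken from the original notebook:

```python
import pandas as pd

# Toy frame standing in for the real data; values are made up.
df = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19"],
    "IsHoliday": [False, True, False],
    "Type": ["A", "B", "C"],
})

# Parse the date strings into datetime64 values.
df["Date"] = pd.to_datetime(df["Date"])

# Booleans become 0/1.
df["IsHoliday"] = df["IsHoliday"].astype(int)

# Assumed mapping of store types to integer codes (A -> 0, B -> 1, C -> 2).
type_codes = {"A": 0, "B": 1, "C": 2}
df["Type"] = df["Type"].map(type_codes).astype(int)

print(df.dtypes)
```
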
Plotted each sub-dataset:
- sales.csv:
  - Plotted box plots for weekly sales for 45 stores, showing the distribution.
  - Plotted weekly sales for selected store and department by defining a function.
- features.csv: Plotted the 9 numerical features using subplots.
- stores.csv: Plotted the store sizes, which clearly show 3 different store types.

Joined 3 sub-datasets into one dataframe.
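
One way the join could look, sketched with pandas. The join keys ("Store", "Date", "IsHoliday") and the column names "Dept", "Weekly_Sales" and "Size" are assumptions about the files, and the rows are made up:

```python
import pandas as pd

# Toy stand-ins for sales.csv, features.csv and stores.csv.
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-05"]),
    "Weekly_Sales": [100.0, 120.0, 80.0],
    "IsHoliday": [0, 1, 0],
})
features = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-05"]),
    "Fuel_Price": [2.57, 2.55, 2.60],
    "IsHoliday": [0, 1, 0],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": [0, 1],
    "Size": [150000, 200000],
})

# Attach weekly features to each sales row, then the per-store attributes.
merged = sales.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
merged = merged.merge(stores, on="Store", how="left")

print(merged.shape)
```

A left join keeps every sales row even if a feature or store record is missing, which makes gaps show up as NaN instead of silently dropping rows.
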
Dealt with missing values by removing the MarkDown1-5 features.
Dealt with outliers by identifying and removing the rows that contain them.
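
The two steps above might be sketched as follows. The outlier rule is not stated in this summary, so the 1.5 × IQR criterion on a "Weekly_Sales" column is an assumption used only for illustration, as are the toy values:

```python
import pandas as pd

# Toy frame; only two MarkDown columns are shown for brevity.
df = pd.DataFrame({
    "Weekly_Sales": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],
    "MarkDown1": [None, None, 5.0, None, None, None],
    "MarkDown2": [None] * 6,
})

# Drop every MarkDown* column instead of imputing the mostly missing values.
markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
df = df.drop(columns=markdown_cols)

# Assumed outlier rule: keep rows within 1.5 * IQR of the quartiles.
q1 = df["Weekly_Sales"].quantile(0.25)
q3 = df["Weekly_Sales"].quantile(0.75)
iqr = q3 - q1
within_bounds = df["Weekly_Sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[within_bounds]

print(len(df))
```
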
Identified imbalanced data, namely the type 2 (Type C) stores, but deferred the decision on whether and how to deal with it.
Identified correlated variables by plotting the correlation matrix, then removed the Fuel_Price and Type columns due to their high correlation with other variables. This step also helped deal with the imbalanced data, since the store types are no longer used.
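
A small sketch of the correlation check, assuming pandas. The data are invented, and the 0.8 threshold and the "Size" and "Weekly_Sales" columns are assumptions; the summary does not say which variables Fuel_Price and Type were correlated with:

```python
import pandas as pd

# Toy data in which Type and Size are strongly (negatively) related.
df = pd.DataFrame({
    "Fuel_Price": [2.5, 2.7, 3.0, 3.4, 3.6],
    "Type": [0, 0, 1, 2, 2],
    "Size": [200000, 210000, 120000, 40000, 42000],
    "Weekly_Sales": [15.0, 9.0, 14.0, 8.0, 12.0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# List column pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.8
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)

# Drop the columns judged redundant.
df_reduced = df.drop(columns=["Fuel_Price", "Type"])
```
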

Checked data integrity and reset index.
Added one-hot encoded columns for IsHoliday in case they are needed later.
Saved the result to a local file.
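
The index reset and one-hot encoding steps could look roughly like this. It is a sketch using `pd.get_dummies`; the generated column names and the toy values are assumptions, and the original may have used a different encoder:

```python
import pandas as pd

# Toy frame with a gap in the index, as left behind by row removal.
df = pd.DataFrame({"IsHoliday": [0, 1, 0, 0]}, index=[0, 2, 3, 5])

# Renumber rows 0..n-1 and discard the old index.
df = df.reset_index(drop=True)

# One indicator column per IsHoliday value, kept alongside the original column.
dummies = pd.get_dummies(df["IsHoliday"], prefix="IsHoliday", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df.columns.tolist())
```
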
