From ca7b53a6b5c7654c4071d742c663878dcf40d8c3 Mon Sep 17 00:00:00 2001 From: Nupur Lal Date: Tue, 23 Jul 2024 12:08:41 +0000 Subject: [PATCH 1/5] new usecase on health insurance --- .../Health_Insurance_Costs_Python.ipynb | 990 ++++++++++++++++++ 1 file changed, 990 insertions(+) create mode 100644 UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb diff --git a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb new file mode 100644 index 00000000..f15fa1fa --- /dev/null +++ b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb @@ -0,0 +1,990 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "58db127c-5729-4199-9693-ca40d036e6a4", + "metadata": {}, + "source": [ + "
\n", + "

\n", + " Predicting Medical Expenses in Healthcare \n", + "
\n", + " \"Teradata\"\n", + "

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "82681386-4777-4ffd-9280-62978e02b38f", + "metadata": {}, + "source": [ + "

Introduction

\n", + "

\n", + "In the dynamic landscape of healthcare and insurance, understanding the factors influencing medical expenses is paramount. Accurate predictions of medical expenses enable healthcare providers and insurance companies to better manage costs, improve resource allocation, and enhance customer satisfaction. By leveraging regression models, we can develop robust predictive analytics that help in understanding and forecasting medical expenses based on various factors such as age, BMI, smoking status, and more.
Healthcare providers and insurance companies face challenges in accurately predicting medical expenses for individuals. Unanticipated high medical costs can strain resources, lead to financial losses, and affect the overall quality of service. The objective is to build a regression model that accurately predicts individual medical expenses, allowing for better financial planning and risk management.
\n", + " In this demo we have taken a dataset which has information about age, sex, BMI, smoking status, number of children and region which serves as a comprehensive foundation for predictive modeling in healthcare costs. By utilizing this dataset, we can develop accurate machine learning models to forecast medical expenses for new policyholders. Each variable provides critical insights: age correlates with higher medical expenses due to chronic conditions, BMI indicates obesity-related health risks, smoking status highlights costs associated with smoking-related diseases, and the number of children reflects the medical needs of larger families. Regional differences further illustrate variations in healthcare costs and accessibility. Training models on this dataset allows insurance companies to identify key cost drivers, predict future expenses with precision, and tailor pricing strategies, thereby improving decision-making, fairness, and efficiency in insurance policy pricing.\n", + "

\n", + "\n", + "

Business Values

\n", + " \n", + "

Why Vantage?

\n", + "

One of the primary challenges faced by AI/ML projects is data collection and standardization. Data pre-processing alone can consume 70 to 80% of the time and resources before model creation can begin. With Vantage's Clearscape in-DB functions, these standardization and pre-processing steps can be performed at scale, significantly reducing the time and data movement required. Additionally, leveraging the Teradata Python Package, particularly the teradataml OpenSourceML component, allows users to utilize popular open-source machine learning packages like scikit-learn directly within the database environment. This integration eliminates the need to transfer data to the client for analysis, thereby streamlining the workflow and enhancing efficiency. The OpenSourceML package simplifies the incorporation of open-source machine learning functionalities into Vantage, providing a consistent interface for executing these algorithms. Users can employ familiar syntax and arguments, facilitating a smooth transition from traditional open-source environments to the Vantage platform.\n", + "

" + ] + }, + { + "cell_type": "markdown", + "id": "a78608a9-25d9-4369-922c-9b785c66438e", + "metadata": {}, + "source": [ + "
\n", + "

1. Connect to Vantage

" + ] + }, + { + "cell_type": "markdown", + "id": "aa82c65a-97fd-45d0-8df3-037e1d6dded6", + "metadata": {}, + "source": [ + "

Let's start by importing the libraries needed.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be5b64da-074b-4aba-83da-281a26e0ceb4", + "metadata": {}, + "outputs": [], + "source": [ + "# Standard libraries\n", + "import getpass\n", + "import warnings\n", + "\n", + "# Third-party libraries\n", + "import matplotlib.pyplot as plt \n", + "import pandas as pd\n", + "import seaborn as sns\n", + "\n", + "# Teradata libraries\n", + "from teradataml import *\n", + "display.max_rows = 5\n", + "\n", + "# Suppress warnings\n", + "warnings.filterwarnings('ignore')\n", + "warnings.simplefilter(action='ignore', category=DeprecationWarning)\n", + "warnings.simplefilter(action='ignore', category=RuntimeWarning)\n", + "warnings.simplefilter(action='ignore', category=FutureWarning)" + ] + }, + { + "cell_type": "markdown", + "id": "f42532c2-a112-4e0d-a577-dfa15c8aac2c", + "metadata": {}, + "source": [ + "

We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d109030-3654-4b09-93ac-471312ab1ce1", + "metadata": {}, + "outputs": [], + "source": [ + "%run -i ../startup.ipynb\n", + "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", + "print(eng)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "557360f4-c8cd-4c20-bf3a-9dce59bf7a4a", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "execute_sql(\"SET query_band='DEMO=PP_Health_Insurance_Costs_Python.ipynb;' UPDATE FOR SESSION;\")" + ] + }, + { + "cell_type": "markdown", + "id": "29bf243d-6576-4eb8-a12d-4f7e94adcd14", + "metadata": {}, + "source": [ + "

Getting Data for This Demo

\n", + "

We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "948ca34d-be84-43fd-9ed0-a449d2d44f07", + "metadata": {}, + "outputs": [], + "source": [ + "%run -i ../run_procedure.py \"call get_data('DEMO_Health_cloud');\"\n", + " # takes about 30 seconds, estimated space: 0 MB\n", + "#%run -i ../run_procedure.py \"call get_data('DEMO_Health_local');\" \n", + "# takes about 30 seconds, estimated space: 1 MB" + ] + }, + { + "cell_type": "markdown", + "id": "aed3dae2-a27c-4d54-b9ac-92ddbcdb220d", + "metadata": {}, + "source": [ + "

Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8c50e42-18e7-4e79-96c4-5d0d4def6e51", + "metadata": {}, + "outputs": [], + "source": [ + "%run -i ../run_procedure.py \"call space_report();\"" + ] + }, + { + "cell_type": "markdown", + "id": "8863a44e-44d4-461f-a03e-e3b2799149ea", + "metadata": {}, + "source": [ + "
\n", + "

2. Initial Data Set

\n", + "

\n", + "Let us start by analyzing the data.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be654e5b-1982-43ba-b1ce-7f53e9ea29b3", + "metadata": {}, + "outputs": [], + "source": [ + "tdf = DataFrame(in_schema(\"DEMO_Health\",\"Health_Insurance_Costs\"))\n", + "tdf.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f48d0eb-bafa-4c97-a6b7-045d513aaac0", + "metadata": {}, + "outputs": [], + "source": [ + "tdf.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "664afa97-ea40-4c39-8526-34430c2a66d6", + "metadata": {}, + "outputs": [], + "source": [ + "tdf.tdtypes" + ] + }, + { + "cell_type": "markdown", + "id": "f08f57f0-6c07-4610-9d49-335eab982e8f", + "metadata": {}, + "source": [ + "

\n", + "From the above statements we can see the columns, the size of dataset and the columntypes.
Now let us apply some Data Exploration Functions avaiable in Clearscape Analytics. First let us start by ColumnSummary.
ColumnSummary function displays Column name, datatype and other demographics like count of NULLs etc for each specified input table column

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "951bc98a-14de-4fd0-a2c9-5cc0ea4b2f50", + "metadata": {}, + "outputs": [], + "source": [ + "colsum = ColumnSummary(data=tdf,\n", + " target_columns=[':']\n", + " )\n", + "colsum.result" + ] + }, + { + "cell_type": "markdown", + "id": "63f70cda-4ac3-4c6e-bdc7-4f2918af4f55", + "metadata": {}, + "source": [ + "

The CategoricalSummary function displays the distinct values and their counts for each specified input table column.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fae39c52-c701-4317-a803-065bba5f7a12", + "metadata": {}, + "outputs": [], + "source": [ + "catsum = CategoricalSummary(data=tdf,\n", + " target_columns=['Sex','Smoker','Region']\n", + " )\n", + " \n", + "catsum.result" + ] + }, + { + "cell_type": "markdown", + "id": "a6b95d34-3e44-4605-bf0c-3e0664e9f3e0", + "metadata": {}, + "source": [ + "

In our data we have Smoker as character with values yes and no, first let us convert this to integer datatype by converting yes to 1 and 0 to no.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a7ee737-3da7-4b1f-a611-16b7ee220a84", + "metadata": {}, + "outputs": [], + "source": [ + "#converting smoker to integer\n", + "tdf = tdf.assign(Smoker = case([(tdf.Smoker == 'yes', 1)], else_ = 0 ))\n", + "tdf" + ] + }, + { + "cell_type": "markdown", + "id": "0dbafc07-61d5-49a3-a40d-e12a63bfd59a", + "metadata": {}, + "source": [ + "

Next we gather statistics using UnivariateStatistics

\n", + "

Univariate analysis is the simplest form of analyzing data. UnivariateStatistics displays descriptive statistics for each specified numeric input table column.\n", + "

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e36f83bb-b5ae-4603-9b60-1728d415a3ed", + "metadata": {}, + "outputs": [], + "source": [ + "#stats analysis\n", + "obj = UnivariateStatistics(newdata=tdf,\n", + " target_columns=['Age','BMI','Smoker','Children','Charges'],\n", + " stats=['MEAN','MEDIAN','MODE','KURTOSIS','SKEWNESS','STANDARD DEVIATION',\n", + " 'SUM','PERCENTILES','MINIMUM','MAXIMUM']\n", + " )\n", + "\n", + " # Print the result DataFrame.\n", + "obj.result" + ] + }, + { + "cell_type": "markdown", + "id": "60cea520-1f9e-4326-b650-3afba95b92b0", + "metadata": {}, + "source": [ + "

Let us pivot the result of UnivariateStatistics for better readability.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a560718-5af7-4491-8e60-69306a328c08", + "metadata": {}, + "outputs": [], + "source": [ + "#pivot the result for more readbility\n", + "p = obj.result\n", + "p2=p.pivot(columns=p.ATTRIBUTE, aggfuncs=p.StatValue.max())\n", + "p2=p2.assign(drop_columns=True,\n", + " StatName=p2.StatName,\n", + " Smoker=p2.max_statvalue_smoker,\n", + " Charges=p2.max_statvalue_charges,\n", + " BMI=p2.max_statvalue_bmi,\n", + " Age=p2.max_statvalue_age,\n", + " Children=p2.max_statvalue_children)\n", + "p2.head(20)" + ] + }, + { + "cell_type": "markdown", + "id": "d4d47432-e276-42a6-a623-0832a14ede5b", + "metadata": {}, + "source": [ + "

From above we can see the Statictics generated for the numerical columns in the dataset.

" + ] + }, + { + "cell_type": "markdown", + "id": "163b95eb-6134-48b6-9ff6-c049c012dff2", + "metadata": {}, + "source": [ + "
\n", + "

2. Exploratory Data Analysis

\n", + "

Let us plot some graphs of the data to see if we can identify some patterns in the data.
Teradata machine learning package (tdml) has some basic functionality to plot graphs, by using this we ensure that the data is not moved outside and large plots can be created with ease.
For more complex graphs we can integrate with available pyhton packages, data movement will be there but we can ensure it can be done for specific scenarios. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71af358d-e4ae-4b4b-a6cb-8f3e2781bf6d", + "metadata": {}, + "outputs": [], + "source": [ + "#creating separate dataframe for plotting\n", + "df=tdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39875bd8-1c38-4423-88d4-26dfd6e71747", + "metadata": {}, + "outputs": [], + "source": [ + "df.plot(x=df.BMI, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " xlabel='BMI', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"BMI vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2d937ca-681b-4bc7-bf65-e46744269ce6", + "metadata": {}, + "outputs": [], + "source": [ + "#converting female =1 and male = 0 for plotting\n", + "df = df.assign(sex_int = case([(df.Sex == 'female', 1)], else_ = 0 ))\n", + "df.plot(x=df.sex_int, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " xlabel='Sex', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Sex(1-female,0-male) vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30f673a0-e566-4872-a840-9419d60f6184", + "metadata": {}, + "outputs": [], + "source": [ + "#df.plot.scatter(x='age', y='charges')\n", + "df.plot(x=df.Age, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " xlabel='Age', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Age vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2371a47-c229-4d7d-813f-a1b94fa6fa47", + "metadata": {}, + "outputs": [], + "source": [ + "#df.plot.scatter(x='smoker', y='charges')\n", + "df.plot(x=df.Smoker, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " xlabel='Smoker', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Smoker(1-yes, 0-no) vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "832033c8-d8c6-4d56-a059-2b2728ffed64", + "metadata": {}, + "outputs": [], + "source": [ + "df = df.assign(region_int = case([(df.Region == 'southeast', 1),\n", + " (df.Region == 'northwest', 2),\n", + " (df.Region == 'northeast', 3)\n", + " ], else_ = 0 ))\n", + "df.plot(x=df.region_int, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " xlabel='Region', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Region(0-southwest,1-southeast,2-northwest,3-northeast) vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c9c9f7a-6c92-49fd-be37-e29678f1476e", + "metadata": {}, + "outputs": [], + "source": [ + "#df.plot.scatter(x='children', y='charges')\n", + "df.plot(x=df.Children, \n", + " y=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", 
\n", + " grid_color='grey',\n", + " xlabel='Region', \n", + " ylabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Children vs Charges\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dcf266dd-902f-4236-80eb-35e0d52b8557", + "metadata": {}, + "outputs": [], + "source": [ + "df1=df.to_pandas().reset_index()\n", + "boxplot = df1.boxplot(column=['BMI'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44269e67-251a-4a2d-8b1d-31a355a8ee30", + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(30,28))\n", + "for i, col in enumerate( ['Age','BMI', 'Children','Charges']):\n", + " plt.subplot(3, 3, i+1)\n", + " sns.histplot(data = df1,\n", + " x = col,\n", + " kde = True,\n", + " bins = 30,\n", + " color = 'green')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7441e64-cd63-4404-b183-a6d939ecdc31", + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(12,9))\n", + "for i,col in enumerate(['Sex','Smoker','Region', 'Children']):\n", + " plt.subplot(3,2,i+1)\n", + " x=df1[col].value_counts().reset_index()\n", + " plt.title(col)\n", + " plt.pie(x=x['count'],labels=x[col],autopct=\"%0.1f%%\",colors=sns.color_palette('muted'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a52bf6ca-203e-4da9-b9b5-7e46cb493e4c", + "metadata": {}, + "outputs": [], + "source": [ + "d= df1.drop(columns=['Sex', 'Region'])\n", + "data_corr = d.corr()\n", + "sns.heatmap(data=data_corr,annot=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efe053e9-9ec8-4260-a958-504eb771c6ea", + "metadata": {}, + "outputs": [], + "source": [ + "sns.pairplot(df1, vars = df1.iloc[:, [1] + list(range(3, 6))+[7]])\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8c4a284-1fea-455b-8d76-bd8c9e26af95", + "metadata": {}, + "outputs": [], + "source": [ + "sns.scatterplot(data=df1,x=df1.Charges,y=df1.Smoker,hue=df1.Age)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4078df7-aafd-4d5e-8d1a-c1215580bb9d", + "metadata": {}, + "outputs": [], + "source": [ + "sns.barplot(data=df1,x=df1.Smoker,y=df1.Charges,estimator=np.mean)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "749eabe5-ff3b-4763-a502-928e37fce107", + "metadata": {}, + "outputs": [], + "source": [ + "sns.barplot(data=df1,x=df1.Region,y=df1.Charges,estimator=np.mean)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca07a029-7a7d-417e-9fa6-d3165d729300", + "metadata": {}, + "outputs": [], + "source": [ + "df.plot(y=df.Age, \n", + " x=df.Charges, \n", + " kind=\"scatter\",\n", + " color=\"blue\", \n", + " grid_color='grey',\n", + " ylabel='Age', \n", + " xlabel='Charges',\n", + " grid_linestyle=\"-\",\n", + " grid_linewidth= 0.5, \n", + " marker=\"o\",\n", + " markersize=7,\n", + " title=\"Age vs Charges\")" + ] + }, + { + "cell_type": "markdown", + "id": "5df35edb-218d-4d02-8f5b-f67ddd930836", + "metadata": {}, + "source": [ + "

3. Feature Engineering Functions

\n", + "

OneHotEncodingFit outputs a table of attributes and categorical values to input to OneHotEncodingTransform which encodes them as one-hot numeric vectors.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57ae0619-f10f-45af-814e-810490d6d090", + "metadata": {}, + "outputs": [], + "source": [ + "tdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8931f33e-5c3a-41e0-a266-8990ae3a2a5b", + "metadata": {}, + "outputs": [], + "source": [ + "#one hot encoding for sex & region\n", + "#scaling for BMI\n", + "hot_fit = OneHotEncodingFit(data=tdf,\n", + " is_input_dense=True,\n", + " target_column=['Sex','Region'],\n", + " category_counts=[2,4],\n", + " approach=\"auto\")\n", + " \n", + "# Print the result DataFrame.\n", + "hot_fit.result" + ] + }, + { + "cell_type": "markdown", + "id": "176de1ac-57db-4e00-941f-f4ce3b6433de", + "metadata": {}, + "source": [ + "

ScaleFit and ScaleTransform scale the specified input\n",
"table columns, i.e., they apply the chosen scale method (such as range, mean, or standard deviation based scaling) to the input columns.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37c8f084-b7d0-4357-857a-639364602f47", + "metadata": {}, + "outputs": [], + "source": [ + "scale_fit = ScaleFit(data=tdf,\n", + " target_columns=\"BMI\",\n", + " scale_method=\"RANGE\",\n", + " miss_value=\"KEEP\",\n", + " global_scale=False)\n", + "scale_fit.output" + ] + }, + { + "cell_type": "markdown", + "id": "a154627c-16b7-4c7b-928c-e2b35eca055d", + "metadata": {}, + "source": [ + "

ColumnTransformer

\n", + "

The ColumnTransformer function transforms the entire dataset in a single operation. You only need\n",
"to provide the fit tables to the function, and it runs all of the transformations that you require in\n",
"one pass. Running all the fit table transformations together in one go gives approximately a 30% performance improvement over running each transformation sequentially.

\n", + "

Let us pass all the fit tables we have created to the function and transform the dataset.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b8180e5-e8d4-43c2-94d3-c00f7ad0b85a", + "metadata": {}, + "outputs": [], + "source": [ + "out1 = ColumnTransformer(input_data=tdf,\n", + " scale_fit_data=scale_fit.output,\n", + " onehotencoding_fit_data=hot_fit.result,\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5c4defa-4b5f-4c12-890f-971e2e7ac751", + "metadata": {}, + "outputs": [], + "source": [ + "out1.result" + ] + }, + { + "cell_type": "markdown", + "id": "32578e8b-1722-449f-8210-5d0c35003f0c", + "metadata": {}, + "source": [ + "

We can drop the extra columns and rename the remaining ones to create a DataFrame that can be used for model creation.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38f59803-9795-4de4-98fc-0d3afeb449b9", + "metadata": {}, + "outputs": [], + "source": [ + "t=out1.result\n", + "transformed_df = out1.result.assign(drop_columns=True\n", + " ,rID=t.rID \n", + " ,Age=t.Age \n", + " ,BMI=t.BMI\n", + " ,Children=t.Children\n", + " ,Smoker=t.Smoker\n", + " ,Charges=t.Charges\n", + " ,Sex_female=t.Sex_0 \n", + " ,Sex_male=t.Sex_1 \n", + " ,Region_SW=t.Region_3 \n", + " ,Region_SE=t.Region_2 \n", + " ,Region_NW=t.Region_1 \n", + " ,Region_NE=t.Region_0 \n", + " ) \n", + "\n", + "transformed_df" + ] + }, + { + "cell_type": "markdown", + "id": "242afda7-4fa9-43e6-be57-796d84f685dd", + "metadata": {}, + "source": [ + "
\n", + "

4. Model Creation

\n", + "

Train and Test Datasets

\n", + "

Now let us divide our data into training and test datasets for model creation. We can do this by using the ClearScape Analytics TrainTestSplit function.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "647d1ed9-f710-41c4-a4ac-d82d57ccffbc", + "metadata": {}, + "outputs": [], + "source": [ + "TrainTestSplit_out = TrainTestSplit(\n", + " data = transformed_df,\n", + " id_column = \"rID\",\n", + " train_size = 0.75,\n", + " test_size = 0.25,\n", + " seed = 25\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9ca3141-c8d8-4786-b0a2-9e9f65c77ba0", + "metadata": {}, + "outputs": [], + "source": [ + "df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)\n", + "df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)" + ] + }, + { + "cell_type": "markdown", + "id": "4064e7d1-8270-426f-aa5d-4a54a98e862c", + "metadata": {}, + "source": [ + "

teradataml Open-Source Machine Learning Functions

\n", + "

The Teradata Package for Python introduces teradataml open-source machine learning functions (teradataml OpenSourceML), which expose most of the functionality of open-source packages like scikit-learn. With teradataml open-source machine learning functions, we can run these open-source packages without needing to pull the data to the client. OpenSourceML offers a simple interface object for the open-source packages, allowing them to be used with the same syntax and arguments as the actual open-source packages' functions and classes.
Let us use the scikit-learn models for our demo, but first let us separate the features and the target in our train and test datasets.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a6cea05-fda2-42df-9267-703af1a4a571", + "metadata": {}, + "outputs": [], + "source": [ + "X_train = df_train.drop(['Charges','rID'], axis = 1)\n", + "y_train = df_train.select([\"Charges\"])\n", + "X_test = df_test.drop(['Charges','rID'], axis = 1)\n", + "y_test = df_test.select([\"Charges\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c50fa683-8d81-4382-a407-b15b62334f78", + "metadata": {}, + "outputs": [], + "source": [ + "X_train" + ] + }, + { + "cell_type": "markdown", + "id": "482fc4ad-f8e6-4e5f-89b1-306c042ff8d5", + "metadata": {}, + "source": [ + "

Linear Regression

\n", + "

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and aims to find the best-fitting straight line that minimizes the sum of squared differences between the observed values and the predicted values. The resulting equation can be used to predict the value of the dependent variable based on the values of the independent variables.
We are using the LinearRegression estimator available in the teradataml OpenSourceML library, which behaves the same way as the scikit-learn LinearRegression estimator.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e61199ae-830c-415d-a302-15e02d297875", + "metadata": {}, + "outputs": [], + "source": [ + "from teradataml import td_sklearn as osml\n", + "lr = osml.LinearRegression()\n", + "lr.fit(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37ce069c-69f3-4177-9bbc-7f7380ac155e", + "metadata": {}, + "outputs": [], + "source": [ + "lr.score(X_test,y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcbf2f9e-62bd-490f-bb3a-a689e03368db", + "metadata": {}, + "outputs": [], + "source": [ + "lr.predict(X_test)" + ] + }, + { + "cell_type": "markdown", + "id": "a8d0d4dc-9f4d-4043-86ef-bbb657785cce", + "metadata": {}, + "source": [ + "

RandomForestRegressor

\n", + "

The RandomForestRegressor is an ensemble learning method used for regression tasks that combines the predictions of multiple decision trees to improve accuracy and control overfitting. It operates by constructing a multitude of decision trees during training, where each tree is built using a random subset of the data and a random subset of features, which enhances diversity among the trees. When making predictions, the RandomForestRegressor aggregates the outputs of all individual trees, typically by averaging their predictions, to produce a final output. This approach not only increases predictive performance but also provides insights into feature importance, making it a powerful tool for handling complex datasets with non-linear relationships and interactions among variables.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7fef10bc-8bb0-479f-b159-f36090d57a81", + "metadata": {}, + "outputs": [], + "source": [ + "model_rfr = osml.RandomForestRegressor()\n", + "model_rfr.fit(X_train,y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26b70564-e284-4763-a25d-fda7aceee123", + "metadata": {}, + "outputs": [], + "source": [ + "model_rfr.score(X_test,y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf00ea5d-0cb7-4806-98db-df4af52325ab", + "metadata": {}, + "outputs": [], + "source": [ + "model_rfr.predict(X_test)" + ] + }, + { + "cell_type": "markdown", + "id": "c6b6b0e1-fd05-442f-9a0c-1754ab668b7d", + "metadata": {}, + "source": [ + "

Conclusion

\n", + "

In this notebook we have seen that by implementing a regression model to predict medical expenses, healthcare providers and insurance companies can significantly enhance their financial planning and risk management capabilities. This approach not only helps in understanding the key drivers of medical costs but also leads to improved customer satisfaction through personalized and cost-effective healthcare solutions. We have also seen the seamless integration of Teradata Vantage and ClearScape Analytics in-DB functions with OpenSourceML functions, empowering users with scalable data processing capabilities. By leveraging these functionalities within the database environment, users gain the flexibility to analyze vast amounts of data without the need to transfer it to client tools. This integration not only streamlines the workflow but also provides users with the choice of utilizing open-source machine learning functions directly within the database, further enhancing efficiency and expanding the range of analytical tools available.

" + ] + }, + { + "cell_type": "markdown", + "id": "345f1fc2-a7e7-4003-af99-db4474272e3d", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

5. Cleanup

\n" + ] + }, + { + "cell_type": "markdown", + "id": "233e7a2b-c5e0-47a2-82f4-9746198d2c01", + "metadata": {}, + "source": [ + "

Database and Tables

\n", + "

We will use the following code to clean up tables and databases created for this demonstration.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09982b60-f0fb-441f-9090-12c9b166d313", + "metadata": {}, + "outputs": [], + "source": [ + "%run -i ../run_procedure.py \"call remove_data('DEMO_Health');\" \n", + "#Takes 10 seconds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36b7f78e-e6c0-4471-a493-38af53905981", + "metadata": {}, + "outputs": [], + "source": [ + "remove_context()" + ] + }, + { + "cell_type": "markdown", + "id": "c06bcad9-5295-4d7e-b110-f0fe9a488b18", + "metadata": {}, + "source": [ + "
\n", + "Required Materials\n", + "

Let’s look at the elements we have available for reference for this notebook:

\n", + "\n", + "

Filters:

\n", + "
  • Industry: Healthcare
  • \n", + "
  • Functionality: Open-Source Machine Learning Functions
  • \n", + "
  • Use Case: Predicting Healthcare medical costs
  • \n", + "

    Related Resources:

    \n", + "
  • Saving Lives, Saving Costs: Predicting Heart Failure with Teradata
  • \n", + "
  • Texas Health:Transforming health care for 7 million patients through the power of data
  • \n", + "
  • Siemens Healthineers
  • \n" + ] + }, + { + "cell_type": "markdown", + "id": "761d574c-a9f3-4f8d-89fa-418380353e3d", + "metadata": {}, + "source": [ + "" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 836ab704d362df01cdaad352b5f08ad5cf978951 Mon Sep 17 00:00:00 2001 From: Nupur Lal Date: Tue, 30 Jul 2024 11:47:44 +0000 Subject: [PATCH 2/5] review comments added --- .../Health_Insurance_Costs_Python.ipynb | 97 +++++++++++++++++-- 1 file changed, 87 insertions(+), 10 deletions(-) diff --git a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb index f15fa1fa..b6f97b8e 100644 --- a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb +++ b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb @@ -243,7 +243,7 @@ "id": "a6b95d34-3e44-4605-bf0c-3e0664e9f3e0", "metadata": {}, "source": [ - "

    In our data we have Smoker as character with values yes and no, first let us convert this to integer datatype by converting yes to 1 and 0 to no.

    " + "

In our data, Smoker is a character column with values yes and no. Let us first convert it to an integer datatype by converting yes to 1 and no to 0.

    " ] }, { @@ -328,8 +328,8 @@ "metadata": {}, "source": [ "
    \n", - "

    2. Exploratory Data Analysis

    \n", - "

    Let us plot some graphs of the data to see if we can identify some patterns in the data.
    Teradata machine learning package (tdml) has some basic functionality to plot graphs, by using this we ensure that the data is not moved outside and large plots can be created with ease.
    For more complex graphs we can integrate with available pyhton packages, data movement will be there but we can ensure it can be done for specific scenarios. " + "

    3. Exploratory Data Analysis

    \n", + "

    Let us plot some graphs of the data to see if we can identify some patterns in the data.
    Teradata machine learning package (tdml) has some basic functionality to plot graphs, by using this we ensure that the data is not moved outside and large plots can be created with ease.
    For more complex graphs we can integrate with available pyhton packages, data movement will be there but we can ensure it can be done for specific scenarios.

    " ] }, { @@ -364,6 +364,14 @@ " title=\"BMI vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "d72a77ec-61a7-4449-8e46-c03d5f5507d6", + "metadata": {}, + "source": [ + "

In the plot above we can see how the charges vary with BMI.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -387,6 +395,14 @@ " title=\"Sex(1-female,0-male) vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "54967d09-badb-4bb7-b9b8-3a4f5786df24", + "metadata": {}, + "source": [ + "

In the plot above we can see how the charges vary with the gender of the patient.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -394,7 +410,6 @@ "metadata": {}, "outputs": [], "source": [ - "#df.plot.scatter(x='age', y='charges')\n", "df.plot(x=df.Age, \n", " y=df.Charges, \n", " kind=\"scatter\",\n", @@ -409,6 +424,14 @@ " title=\"Age vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "55b8d370-bfaf-44ac-88ad-28e7d403e1ad", + "metadata": {}, + "source": [ + "

In the plot above we can see how the charges vary with the age of the patient.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -416,7 +439,6 @@ "metadata": {}, "outputs": [], "source": [ - "#df.plot.scatter(x='smoker', y='charges')\n", "df.plot(x=df.Smoker, \n", " y=df.Charges, \n", " kind=\"scatter\",\n", @@ -431,6 +453,14 @@ " title=\"Smoker(1-yes, 0-no) vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "bb199f67-011c-40b8-808c-8370a169f569", + "metadata": {}, + "source": [ + "

In the plot above we can see how smoking affects the charges.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -456,6 +486,14 @@ " title=\"Region(0-southwest,1-southeast,2-northwest,3-northeast) vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "b6755c1a-1b01-4f25-952c-daadd6252d49", + "metadata": {}, + "source": [ + "

In the plot above we can see how the charges vary with the region of the patient.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -463,7 +501,6 @@ "metadata": {}, "outputs": [], "source": [ - "#df.plot.scatter(x='children', y='charges')\n", "df.plot(x=df.Children, \n", " y=df.Charges, \n", " kind=\"scatter\",\n", @@ -478,6 +515,14 @@ " title=\"Children vs Charges\")" ] }, + { + "cell_type": "markdown", + "id": "a2ecff0a-d254-44a1-9710-6f4498332aee", + "metadata": {}, + "source": [ + "

In the plot above we can see how the charges vary with the number of children.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -489,6 +534,14 @@ "boxplot = df1.boxplot(column=['BMI'])" ] }, + { + "cell_type": "markdown", + "id": "10d8fd62-e036-43b5-a405-b4bcae297ab4", + "metadata": {}, + "source": [ + "

The above box plot helps us check for outlier values in BMI.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -523,6 +576,14 @@ " plt.pie(x=x['count'],labels=x[col],autopct=\"%0.1f%%\",colors=sns.color_palette('muted'))" ] }, + { + "cell_type": "markdown", + "id": "cc214978-2024-4ad6-b7a9-7f52cde08d35", + "metadata": {}, + "source": [ + "

The above pie charts show the percentage distribution of values for each categorical column in the dataset.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -535,6 +596,14 @@ "sns.heatmap(data=data_corr,annot=True)" ] }, + { + "cell_type": "markdown", + "id": "811fa6cc-e47f-4d12-bc26-494faa35d3d9", + "metadata": {}, + "source": [ + "

The above heatmap shows the correlation between each pair of numeric attributes.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -546,6 +615,14 @@ "plt.show()" ] }, + { + "cell_type": "markdown", + "id": "2c526ab5-0843-418a-b93f-c83c2a5adf4b", + "metadata": {}, + "source": [ + "

The above pairplot shows the pairwise relationships between the selected attributes.

    " + ] + }, { "cell_type": "code", "execution_count": null, @@ -605,7 +682,7 @@ "id": "5df35edb-218d-4d02-8f5b-f67ddd930836", "metadata": {}, "source": [ - "

    3. Feature Engineering Functions

    \n", + "

    4. Feature Engineering Functions

    \n", "

    OneHotEncodingFit outputs a table of attributes and categorical values to input to OneHotEncodingTransform which encodes them as one-hot numeric vectors.

    " ] }, @@ -737,7 +814,7 @@ "metadata": {}, "source": [ "
    \n", - "

    4. Model Creation

    \n", + "

    5. Model Creation

    \n", "

    Train and Test Datasets

    \n", "

Now let us divide our data into training and test datasets for model creation. We can do this by using the ClearScape Analytics TrainTestSplit function.

    " ] @@ -806,7 +883,7 @@ "id": "482fc4ad-f8e6-4e5f-89b1-306c042ff8d5", "metadata": {}, "source": [ - "

    Linear Regression

    \n", + "

    5.1 Linear Regression

    \n", "

    Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and aims to find the best-fitting straight line that minimizes the sum of squared differences between the observed values and the predicted values. The resulting equation can be used to predict the value of the dependent variable based on the values of the independent variables.
We are using the LinearRegression estimator available in the teradataml OpenSourceML library, which behaves the same way as the scikit-learn LinearRegression estimator.

    " ] }, @@ -847,7 +924,7 @@ "id": "a8d0d4dc-9f4d-4043-86ef-bbb657785cce", "metadata": {}, "source": [ - "

    RandomForestRegressor

    \n", + "

    5.2 RandomForestRegressor

    \n", "

    The RandomForestRegressor is an ensemble learning method used for regression tasks that combines the predictions of multiple decision trees to improve accuracy and control overfitting. It operates by constructing a multitude of decision trees during training, where each tree is built using a random subset of the data and a random subset of features, which enhances diversity among the trees. When making predictions, the RandomForestRegressor aggregates the outputs of all individual trees, typically by averaging their predictions, to produce a final output. This approach not only increases predictive performance but also provides insights into feature importance, making it a powerful tool for handling complex datasets with non-linear relationships and interactions among variables.

    " ] }, From d5b86aeac1dee0301e922cabf0a7ab473839e956 Mon Sep 17 00:00:00 2001 From: Nupur Lal Date: Tue, 30 Jul 2024 11:49:15 +0000 Subject: [PATCH 3/5] review comments added --- UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb | 1 + 1 file changed, 1 insertion(+) diff --git a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb index b6f97b8e..4629fa1d 100644 --- a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb +++ b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb @@ -682,6 +682,7 @@ "id": "5df35edb-218d-4d02-8f5b-f67ddd930836", "metadata": {}, "source": [ + "
    \n", "

    4. Feature Engineering Functions

    \n", "

    OneHotEncodingFit outputs a table of attributes and categorical values to input to OneHotEncodingTransform which encodes them as one-hot numeric vectors.

    " ] From 385cf82c531e7148dfe4c6e6a6ead522f16337d4 Mon Sep 17 00:00:00 2001 From: Nupur Lal Date: Tue, 30 Jul 2024 12:03:50 +0000 Subject: [PATCH 4/5] updated queryband --- UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb index 4629fa1d..72b3778e 100644 --- a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb +++ b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb @@ -109,7 +109,7 @@ "outputs": [], "source": [ "%%capture\n", - "execute_sql(\"SET query_band='DEMO=PP_Health_Insurance_Costs_Python.ipynb;' UPDATE FOR SESSION;\")" + "execute_sql(\"SET query_band='DEMO=Health_Insurance_Costs_Python.ipynb;' UPDATE FOR SESSION;\")" ] }, { From df2b4e4c138c9879ba586f768c1eeeec97e94aa9 Mon Sep 17 00:00:00 2001 From: Pratik Somwanshi Date: Thu, 1 Aug 2024 06:29:17 +0000 Subject: [PATCH 5/5] Formatting changes --- .../Health_Insurance_Costs_Python.ipynb | 396 ++++++++++-------- 1 file changed, 213 insertions(+), 183 deletions(-) diff --git a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb index 72b3778e..b1f9b411 100644 --- a/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb +++ b/UseCases/Health_Insurance/Health_Insurance_Costs_Python.ipynb @@ -21,18 +21,19 @@ "source": [ "

    Introduction

    \n", "

    \n", - "In the dynamic landscape of healthcare and insurance, understanding the factors influencing medical expenses is paramount. Accurate predictions of medical expenses enable healthcare providers and insurance companies to better manage costs, improve resource allocation, and enhance customer satisfaction. By leveraging regression models, we can develop robust predictive analytics that help in understanding and forecasting medical expenses based on various factors such as age, BMI, smoking status, and more.
    Healthcare providers and insurance companies face challenges in accurately predicting medical expenses for individuals. Unanticipated high medical costs can strain resources, lead to financial losses, and affect the overall quality of service. The objective is to build a regression model that accurately predicts individual medical expenses, allowing for better financial planning and risk management.
    \n", - " In this demo we have taken a dataset which has information about age, sex, BMI, smoking status, number of children and region which serves as a comprehensive foundation for predictive modeling in healthcare costs. By utilizing this dataset, we can develop accurate machine learning models to forecast medical expenses for new policyholders. Each variable provides critical insights: age correlates with higher medical expenses due to chronic conditions, BMI indicates obesity-related health risks, smoking status highlights costs associated with smoking-related diseases, and the number of children reflects the medical needs of larger families. Regional differences further illustrate variations in healthcare costs and accessibility. Training models on this dataset allows insurance companies to identify key cost drivers, predict future expenses with precision, and tailor pricing strategies, thereby improving decision-making, fairness, and efficiency in insurance policy pricing.\n", + "In the dynamic landscape of healthcare and insurance, understanding the factors influencing medical expenses is paramount. Accurate predictions of medical expenses enable healthcare providers and insurance companies to better manage costs, improve resource allocation, and enhance customer satisfaction. By leveraging regression models, we can develop robust predictive analytics that help in understanding and forecasting medical expenses based on various factors such as age, BMI, smoking status, and more.
    Healthcare providers and insurance companies face challenges in accurately predicting medical expenses for individuals. Unanticipated high medical costs can strain resources, lead to financial losses, and affect the overall quality of service. The objective is to build a regression model that accurately predicts individual medical expenses, allowing for better financial planning and risk management.\n", + "

    \n", + "In this demo we have taken a dataset which has information about age, sex, BMI, smoking status, number of children and region which serves as a comprehensive foundation for predictive modeling in healthcare costs. By utilizing this dataset, we can develop accurate machine learning models to forecast medical expenses for new policyholders. Each variable provides critical insights: age correlates with higher medical expenses due to chronic conditions, BMI indicates obesity-related health risks, smoking status highlights costs associated with smoking-related diseases, and the number of children reflects the medical needs of larger families. Regional differences further illustrate variations in healthcare costs and accessibility. Training models on this dataset allows insurance companies to identify key cost drivers, predict future expenses with precision, and tailor pricing strategies, thereby improving decision-making, fairness, and efficiency in insurance policy pricing.\n", "

    \n", "\n", "

    Business Values

    \n", "
      \n", - "
    • Accurate Prediction of Medical Expenses: Develop a regression model to predict medical expenses based on key factors.
    • \n", - "
    • Identify Key Cost Drivers: Determine the most significant factors influencing medical expenses.
    • \n", - "
    • Enhance Resource Allocation: Enable healthcare providers and insurers to allocate resources more effectively.
    • \n", - "
    • Improve Customer Satisfaction: Offer better pricing strategies and personalized plans to customers.
    • \n", - "
    \n", - "

    Why Vantage?

    \n", + "
  • Accurate Prediction of Medical Expenses: Develop a regression model to predict medical expenses based on key factors.
  • \n", + "
  • Identify Key Cost Drivers: Determine the most significant factors influencing medical expenses.
  • \n", + "
  • Enhance Resource Allocation: Enable healthcare providers and insurers to allocate resources more effectively.
  • \n", + "
  • Improve Customer Satisfaction: Offer better pricing strategies and personalized plans to customers.
  • \n", + " \n", + "

    Why Vantage?

    \n", "

    One of the primary challenges faced by AI/ML projects is data collection and standardization. Data pre-processing alone can consume 70 to 80% of the time and resources before model creation can begin. With Vantage's Clearscape in-DB functions, these standardization and pre-processing steps can be performed at scale, significantly reducing the time and data movement required. Additionally, leveraging the Teradata Python Package, particularly the teradataml OpenSourceML component, allows users to utilize popular open-source machine learning packages like scikit-learn directly within the database environment. This integration eliminates the need to transfer data to the client for analysis, thereby streamlining the workflow and enhancing efficiency. The OpenSourceML package simplifies the incorporation of open-source machine learning functionalities into Vantage, providing a consistent interface for executing these algorithms. Users can employ familiar syntax and arguments, facilitating a smooth transition from traditional open-source environments to the Vantage platform.\n", "

    " ] @@ -130,7 +131,7 @@ "source": [ "%run -i ../run_procedure.py \"call get_data('DEMO_Health_cloud');\"\n", " # takes about 30 seconds, estimated space: 0 MB\n", - "#%run -i ../run_procedure.py \"call get_data('DEMO_Health_local');\" \n", + "# %run -i ../run_procedure.py \"call get_data('DEMO_Health_local');\" \n", "# takes about 30 seconds, estimated space: 1 MB" ] }, @@ -171,7 +172,7 @@ "outputs": [], "source": [ "tdf = DataFrame(in_schema(\"DEMO_Health\",\"Health_Insurance_Costs\"))\n", - "tdf.head()" + "tdf" ] }, { @@ -210,9 +211,10 @@ "metadata": {}, "outputs": [], "source": [ - "colsum = ColumnSummary(data=tdf,\n", - " target_columns=[':']\n", - " )\n", + "colsum = ColumnSummary(\n", + " data = tdf,\n", + " target_columns = [':']\n", + ")\n", "colsum.result" ] }, @@ -231,10 +233,10 @@ "metadata": {}, "outputs": [], "source": [ - "catsum = CategoricalSummary(data=tdf,\n", - " target_columns=['Sex','Smoker','Region']\n", - " )\n", - " \n", + "catsum = CategoricalSummary(\n", + " data = tdf,\n", + " target_columns = ['Sex','Smoker','Region']\n", + ")\n", "catsum.result" ] }, @@ -276,13 +278,14 @@ "outputs": [], "source": [ "#stats analysis\n", - "obj = UnivariateStatistics(newdata=tdf,\n", - " target_columns=['Age','BMI','Smoker','Children','Charges'],\n", - " stats=['MEAN','MEDIAN','MODE','KURTOSIS','SKEWNESS','STANDARD DEVIATION',\n", - " 'SUM','PERCENTILES','MINIMUM','MAXIMUM']\n", - " )\n", + "obj = UnivariateStatistics(\n", + " newdata = tdf,\n", + " target_columns = ['Age','BMI','Smoker','Children','Charges'],\n", + " stats = ['MEAN','MEDIAN','MODE','KURTOSIS','SKEWNESS','STANDARD DEVIATION',\n", + " 'SUM','PERCENTILES','MINIMUM','MAXIMUM']\n", + ")\n", "\n", - " # Print the result DataFrame.\n", + "# Print the result DataFrame.\n", "obj.result" ] }, @@ -303,14 +306,16 @@ "source": [ "#pivot the result for more readbility\n", "p = obj.result\n", - "p2=p.pivot(columns=p.ATTRIBUTE, aggfuncs=p.StatValue.max())\n", - "p2=p2.assign(drop_columns=True,\n", - " StatName=p2.StatName,\n", - " Smoker=p2.max_statvalue_smoker,\n", - " Charges=p2.max_statvalue_charges,\n", - " BMI=p2.max_statvalue_bmi,\n", - " Age=p2.max_statvalue_age,\n", - " Children=p2.max_statvalue_children)\n", + "p2 = p.pivot(columns = p.ATTRIBUTE, aggfuncs = p.StatValue.max())\n", + "p2 = p2.assign(\n", + " drop_columns = True,\n", + " StatName = p2.StatName,\n", + " Smoker = p2.max_statvalue_smoker,\n", + " Charges = p2.max_statvalue_charges,\n", + " BMI = p2.max_statvalue_bmi,\n", + " Age = p2.max_statvalue_age,\n", + " Children = p2.max_statvalue_children\n", + ")\n", "p2.head(20)" ] }, @@ -329,7 +334,7 @@ "source": [ "
    \n", "

    3. Exploratory Data Analysis

    \n", - "

    Let us plot some graphs of the data to see if we can identify some patterns in the data.
    Teradata machine learning package (tdml) has some basic functionality to plot graphs, by using this we ensure that the data is not moved outside and large plots can be created with ease.
    For more complex graphs we can integrate with available pyhton packages, data movement will be there but we can ensure it can be done for specific scenarios.

    " + "

    Let us plot some graphs of the data to see if we can identify some patterns in the data.
Teradata machine learning package (tdml) has some basic functionality to plot graphs; by using it, we ensure that the data is not moved out of the database and that large plots can be created with ease.
For more complex graphs we can integrate with available Python packages; this involves some data movement, but we can restrict it to specific scenarios.

    " ] }, { @@ -339,8 +344,8 @@ "metadata": {}, "outputs": [], "source": [ - "#creating separate dataframe for plotting\n", - "df=tdf" + "# Creating separate dataframe for plotting\n", + "df = tdf" ] }, { @@ -350,18 +355,20 @@ "metadata": {}, "outputs": [], "source": [ - "df.plot(x=df.BMI, \n", - " y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='BMI', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"BMI vs Charges\")" + "df.plot(\n", + " x = df.BMI,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'BMI',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"BMI vs Charges\"\n", + ")" ] }, { @@ -379,20 +386,22 @@ "metadata": {}, "outputs": [], "source": [ - "#converting female =1 and male = 0 for plotting\n", + "# Converting female = 1 and male = 0 for plotting\n", "df = df.assign(sex_int = case([(df.Sex == 'female', 1)], else_ = 0 ))\n", - "df.plot(x=df.sex_int, \n", - " y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='Sex', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Sex(1-female,0-male) vs Charges\")" + "df.plot(\n", + " x = df.sex_int,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'Sex',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Sex(1-female,0-male) vs Charges\"\n", + ")" ] }, { @@ -410,18 +419,20 @@ "metadata": {}, "outputs": [], "source": [ - "df.plot(x=df.Age, \n", - " y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='Age', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Age vs Charges\")" + "df.plot(\n", + " x = df.Age,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'Age',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Age vs Charges\"\n", + ")" ] }, { @@ -439,18 +450,20 @@ "metadata": {}, "outputs": [], "source": [ - "df.plot(x=df.Smoker, \n", - " y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='Smoker', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Smoker(1-yes, 0-no) vs Charges\")" + "df.plot(\n", + " x = df.Smoker,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'Smoker',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Smoker(1-yes, 0-no) vs Charges\"\n", + ")" ] }, { @@ -472,18 +485,20 @@ " (df.Region == 'northwest', 2),\n", " (df.Region == 'northeast', 3)\n", " ], else_ = 0 ))\n", - "df.plot(x=df.region_int, \n", - " 
y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='Region', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Region(0-southwest,1-southeast,2-northwest,3-northeast) vs Charges\")" + "df.plot(\n", + " x = df.region_int,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'Region',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Region(0-southwest,1-southeast,2-northwest,3-northeast) vs Charges\"\n", + ")" ] }, { @@ -501,18 +516,20 @@ "metadata": {}, "outputs": [], "source": [ - "df.plot(x=df.Children, \n", - " y=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " xlabel='Region', \n", - " ylabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Children vs Charges\")" + "df.plot(\n", + " x = df.Children,\n", + " y = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " xlabel = 'Region',\n", + " ylabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Children vs Charges\"\n", + ")" ] }, { @@ -530,8 +547,8 @@ "metadata": {}, "outputs": [], "source": [ - "df1=df.to_pandas().reset_index()\n", - "boxplot = df1.boxplot(column=['BMI'])" + "df1 = df.to_pandas().reset_index()\n", + "boxplot = df1.boxplot(column = ['BMI'])" ] }, { @@ -549,14 +566,10 @@ "metadata": {}, "outputs": [], "source": [ - "plt.figure(figsize=(30,28))\n", + "plt.figure(figsize = (30,28))\n", "for i, col in enumerate( ['Age','BMI', 'Children','Charges']):\n", " plt.subplot(3, 3, i+1)\n", - " sns.histplot(data = df1,\n", - " x = col,\n", - " kde = True,\n", - " bins = 30,\n", - " color = 'green')\n", + " sns.histplot(data = df1, x = col, kde = True, bins = 30, color = 'green')\n", "\n", "plt.show()" ] @@ -568,12 +581,12 @@ "metadata": {}, "outputs": [], "source": [ - "plt.figure(figsize=(12,9))\n", + "plt.figure(figsize = (12,9))\n", "for i,col in enumerate(['Sex','Smoker','Region', 'Children']):\n", - " plt.subplot(3,2,i+1)\n", - " x=df1[col].value_counts().reset_index()\n", + " plt.subplot(3, 2, i+1)\n", + " x = df1[col].value_counts().reset_index()\n", " plt.title(col)\n", - " plt.pie(x=x['count'],labels=x[col],autopct=\"%0.1f%%\",colors=sns.color_palette('muted'))" + " plt.pie(x = x['count'], labels = x[col], autopct = \"%0.1f%%\", colors = sns.color_palette('muted'))" ] }, { @@ -591,9 +604,9 @@ "metadata": {}, "outputs": [], "source": [ - "d= df1.drop(columns=['Sex', 'Region'])\n", + "d = df1.drop(columns = ['Sex', 'Region'])\n", "data_corr = d.corr()\n", - "sns.heatmap(data=data_corr,annot=True)" + "sns.heatmap(data = data_corr, annot = True)" ] }, { @@ -630,7 +643,7 @@ "metadata": {}, "outputs": [], "source": [ - "sns.scatterplot(data=df1,x=df1.Charges,y=df1.Smoker,hue=df1.Age)\n", + "sns.scatterplot(data = df1, x = df1.Charges, y = df1.Smoker, hue = df1.Age)\n", "plt.show()" ] }, @@ -641,7 +654,7 @@ "metadata": {}, "outputs": [], "source": [ - "sns.barplot(data=df1,x=df1.Smoker,y=df1.Charges,estimator=np.mean)\n", + "sns.barplot(data = df1, x = df1.Smoker, y = df1.Charges, estimator = np.mean)\n", "plt.show()" ] }, @@ -652,7 
+665,7 @@ "metadata": {}, "outputs": [], "source": [ - "sns.barplot(data=df1,x=df1.Region,y=df1.Charges,estimator=np.mean)\n", + "sns.barplot(data = df1, x = df1.Region, y = df1.Charges, estimator = np.mean)\n", "plt.show()" ] }, @@ -663,18 +676,20 @@ "metadata": {}, "outputs": [], "source": [ - "df.plot(y=df.Age, \n", - " x=df.Charges, \n", - " kind=\"scatter\",\n", - " color=\"blue\", \n", - " grid_color='grey',\n", - " ylabel='Age', \n", - " xlabel='Charges',\n", - " grid_linestyle=\"-\",\n", - " grid_linewidth= 0.5, \n", - " marker=\"o\",\n", - " markersize=7,\n", - " title=\"Age vs Charges\")" + "df.plot(\n", + " y = df.Age,\n", + " x = df.Charges,\n", + " kind = \"scatter\",\n", + " color = \"blue\",\n", + " grid_color = 'grey',\n", + " ylabel = 'Age',\n", + " xlabel = 'Charges',\n", + " grid_linestyle = \"-\",\n", + " grid_linewidth = 0.5,\n", + " marker = \"o\",\n", + " markersize = 7,\n", + " title = \"Age vs Charges\"\n", + ")" ] }, { @@ -704,14 +719,16 @@ "metadata": {}, "outputs": [], "source": [ - "#one hot encoding for sex & region\n", - "#scaling for BMI\n", - "hot_fit = OneHotEncodingFit(data=tdf,\n", - " is_input_dense=True,\n", - " target_column=['Sex','Region'],\n", - " category_counts=[2,4],\n", - " approach=\"auto\")\n", - " \n", + "# One hot encoding for sex & region\n", + "# Scaling for BMI\n", + "hot_fit = OneHotEncodingFit(\n", + " data = tdf,\n", + " is_input_dense = True,\n", + " target_column = ['Sex','Region'],\n", + " category_counts = [2,4],\n", + " approach = \"auto\"\n", + ")\n", + "\n", "# Print the result DataFrame.\n", "hot_fit.result" ] @@ -732,11 +749,14 @@ "metadata": {}, "outputs": [], "source": [ - "scale_fit = ScaleFit(data=tdf,\n", - " target_columns=\"BMI\",\n", - " scale_method=\"RANGE\",\n", - " miss_value=\"KEEP\",\n", - " global_scale=False)\n", + "scale_fit = ScaleFit(\n", + " data = tdf,\n", + " target_columns = \"BMI\",\n", + " scale_method = \"RANGE\",\n", + " miss_value = \"KEEP\",\n", + " global_scale = False\n", + ")\n", + "\n", "scale_fit.output" ] }, @@ -759,10 +779,11 @@ "metadata": {}, "outputs": [], "source": [ - "out1 = ColumnTransformer(input_data=tdf,\n", - " scale_fit_data=scale_fit.output,\n", - " onehotencoding_fit_data=hot_fit.result,\n", - " )" + "out1 = ColumnTransformer(\n", + " input_data = tdf,\n", + " scale_fit_data = scale_fit.output,\n", + " onehotencoding_fit_data = hot_fit.result,\n", + ")" ] }, { @@ -790,21 +811,22 @@ "metadata": {}, "outputs": [], "source": [ - "t=out1.result\n", - "transformed_df = out1.result.assign(drop_columns=True\n", - " ,rID=t.rID \n", - " ,Age=t.Age \n", - " ,BMI=t.BMI\n", - " ,Children=t.Children\n", - " ,Smoker=t.Smoker\n", - " ,Charges=t.Charges\n", - " ,Sex_female=t.Sex_0 \n", - " ,Sex_male=t.Sex_1 \n", - " ,Region_SW=t.Region_3 \n", - " ,Region_SE=t.Region_2 \n", - " ,Region_NW=t.Region_1 \n", - " ,Region_NE=t.Region_0 \n", - " ) \n", + "t = out1.result\n", + "transformed_df = out1.result.assign(\n", + " drop_columns = True,\n", + " rID = t.rID,\n", + " Age = t.Age,\n", + " BMI = t.BMI,\n", + " Children = t.Children,\n", + " Smoker = t.Smoker,\n", + " Charges = t.Charges,\n", + " Sex_female = t.Sex_0,\n", + " Sex_male = t.Sex_1,\n", + " Region_SW = t.Region_3,\n", + " Region_SE = t.Region_2,\n", + " Region_NW = t.Region_1,\n", + " Region_NE = t.Region_0\n", + ")\n", "\n", "transformed_df" ] @@ -828,12 +850,12 @@ "outputs": [], "source": [ "TrainTestSplit_out = TrainTestSplit(\n", - " data = transformed_df,\n", - " id_column = \"rID\",\n", - " train_size = 0.75,\n", - " test_size = 
0.25,\n", - " seed = 25\n", - " )" + " data = transformed_df,\n", + " id_column = \"rID\",\n", + " train_size = 0.75,\n", + " test_size = 0.25,\n", + " seed = 25\n", + ")" ] }, { @@ -884,6 +906,7 @@ "id": "482fc4ad-f8e6-4e5f-89b1-306c042ff8d5", "metadata": {}, "source": [ + "
    \n", "

    5.1 Linear Regression

    \n", "

    Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and aims to find the best-fitting straight line that minimizes the sum of squared differences between the observed values and the predicted values. The resulting equation can be used to predict the value of the dependent variable based on the values of the independent variables.
    We use the LinearRegression estimator from the teradataml OpenSourceML library, which mirrors the behaviour of scikit-learn's LinearRegression; a minimal usage sketch follows the scoring cell below.

    " ] @@ -907,7 +930,7 @@ "metadata": {}, "outputs": [], "source": [ - "lr.score(X_test,y_test)" + "lr.score(X_test, y_test)" ] }, { @@ -925,6 +948,7 @@ "id": "a8d0d4dc-9f4d-4043-86ef-bbb657785cce", "metadata": {}, "source": [ + "
    \n", "

    5.2 RandomForestRegressor

    \n", "

    The RandomForestRegressor is an ensemble learning method for regression that combines the predictions of many decision trees to improve accuracy and control overfitting. During training it builds a collection of decision trees, each fitted on a random subset of the rows and a random subset of the features, which increases diversity among the trees. At prediction time it aggregates the outputs of the individual trees, typically by averaging, to produce the final estimate. This approach not only improves predictive performance but also provides insight into feature importance, making it well suited to complex datasets with non-linear relationships and interactions among variables. A sketch with explicit hyperparameters follows the cells below.

    " ] @@ -937,7 +961,7 @@ "outputs": [], "source": [ "model_rfr = osml.RandomForestRegressor()\n", - "model_rfr.fit(X_train,y_train)" + "model_rfr.fit(X_train, y_train)" ] }, { @@ -947,7 +971,7 @@ "metadata": {}, "outputs": [], "source": [ - "model_rfr.score(X_test,y_test)" + "model_rfr.score(X_test, y_test)" ] }, { @@ -965,6 +989,7 @@ "id": "c6b6b0e1-fd05-442f-9a0c-1754ab668b7d", "metadata": {}, "source": [ + "
    \n", "

    Conclusion

    \n", "

    In this notebook we have seen that by implementing a regression model to predict medical expenses, healthcare providers and insurance companies can significantly enhance their financial planning and risk management capabilities. This approach not only helps in understanding the key drivers of medical costs but also improves customer satisfaction through personalized and cost-effective healthcare solutions. We have also seen the seamless integration of Teradata Vantage and ClearScape Analytics in-database functions with OpenSourceML functions, giving users scalable data processing capabilities. By leveraging these functionalities within the database environment, users can analyze vast amounts of data without moving it to client tools. This integration not only streamlines the workflow but also lets users run open-source machine learning functions directly within the database, further enhancing efficiency and expanding the range of analytical tools available.

    " ] @@ -976,7 +1001,7 @@ "source": [ "
    \n", "\n", - "

    5. Cleanup

    \n" + "

    6. Cleanup

    \n" ] }, { @@ -1019,13 +1044,18 @@ "

    Let’s look at the elements we have available for reference for this notebook:

    \n", "\n", "

    Filters:

    \n", + "\n", + "\n", + "

    Related Resources:

    \n", + "" ] }, {