I did a full analysis using the new ChatGPT Code Interpreter

Ismailouahbi
6 min read · Aug 17, 2023
Photo by Jonathan Kemper on Unsplash

Since its introduction, ChatGPT has surprised us daily with its capabilities, thanks to OpenAI’s research team. It makes sense, then, that the tool offers premium services giving you access to state-of-the-art AI features, including ChatGPT plugins and the ChatGPT code interpreter.

Code interpreter

This feature gives OpenAI’s models the ability to write Python (the only supported language for the moment), execute it, and show you the source code as well. Isn’t that a fantastic feature?

The below images show some of the ChatGPT code interpreter capabilities:

Main interface (ChatGPT plugins, openai.com)

As a first step, choose the code interpreter option from the Model dropdown list. In the images below, we explore various prompts given to this model along with their outputs:

1/sin(x) plot by the code interpreter (ChatGPT plugins, openai.com)
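Behind the scenes, the interpreter likely ran something along these lines. This is only a sketch of mine (the generated source isn’t shown in the screenshot), assuming NumPy and Matplotlib and an arbitrary cutoff of 10 to mask the asymptotes:

import numpy as np
import matplotlib.pyplot as plt

# Sample x densely; this spacing never lands exactly on sin(x) = 0
x = np.linspace(-10, 10, 10_000)
y = 1 / np.sin(x)

# Mask the extreme values near the asymptotes so the curve stays readable
y[np.abs(y) > 10] = np.nan

plt.plot(x, y)
plt.title("1/sin(x)")
plt.xlabel("x")
plt.ylabel("1/sin(x)")
plt.grid(True)
plt.show()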

Now that we have the generated plot, let’s explore it more closely by zooming in:

Zoom in on the range between 0 and 1 (ChatGPT plugins, openai.com)
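Reproducing the zoom yourself only takes a change of axis limits. Another sketch, assuming the zoom was applied to the x-axis:

# Zoom in on x between 0 and 1
plt.plot(x, y)
plt.xlim(0, 1)
plt.ylim(0, 10)  # keep the view bounded near the asymptote at x = 0
plt.title("1/sin(x) for 0 <= x <= 1")
plt.grid(True)
plt.show()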

As we can see, this functionality adds a lot of capability to the existing model, making it more interactive and responsive to users’ needs.

Let’s jump into the fun part, in which I did a mini data science project: exploring and transforming a dataset, then building a binary classification model for our business problem.

Full analysis of the Titanic dataset using the new ChatGPT Code Interpreter

Photo by Isaac Smith on Unsplash

The image below shows the prompt used to analyze the Titanic dataset:

The Prompt used to conduct this analysis

Note:

Almost all the code below (with comments) and the descriptions were generated by ChatGPT.

Some chunks of code were modified manually for readability.

# First, let's load the dataset and take a look at its first few rows.

import pandas as pd

# Load the dataset
titanic_df = pd.read_csv('./data/titanic.csv')

# Display the first few rows of the dataset
titanic_df.head()
titanic_df.head() output

The dataset contains the following columns:

  • Survived: This is the target variable we are trying to predict (0 = No; 1 = Yes).
  • Pclass: Passenger’s class (1 = 1st; 2 = 2nd; 3 = 3rd). This is a proxy for socio-economic status (1st = Upper; 2nd = Middle; 3rd = Lower).
  • Sex: Passenger’s sex (male or female).
  • Age: Passenger’s age. Some fractions are present if the age was estimated.
  • Siblings/Spouses: Number of siblings/spouses the passenger has aboard the Titanic.
  • Parents/Children: Number of parents/children the passenger has aboard the Titanic.
  • Fare: The fare the passenger paid for the ticket.
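Before moving on, a quick structural check (my own addition, not part of the generated transcript) confirms those columns and their types:

# Column names, dtypes, and non-null counts
titanic_df.info()

# Summary statistics for the numeric columns
titanic_df.describe()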

Next, we should check if the dataset has any missing values. Let’s do that.

# Check for missing values
titanic_df.isnull().sum()
Number of null values for each column

The dataset does not have any missing values in any of the columns, which is great because it means we don't have to do any imputation or removal of missing data.
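For completeness: the raw Kaggle version of this dataset does contain missing values (notably in Age), and in that case a minimal imputation sketch might look like the following. It is purely hypothetical here, since our file is already complete:

# Hypothetical: only needed for a version of the dataset with missing ages
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Verify nothing is missing anymore
titanic_df.isnull().sum()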

Before we proceed to build a classification model, we need to prepare the data.

The Sex column is categorical and needs to be converted to a numerical form. We can do this using one-hot encoding or label encoding.

Let’s proceed with label encoding for simplicity, where we will assign 0 for ‘female’ and 1 for ‘male’. We will use the LabelEncoder class from sklearn.preprocessing.
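For reference, the one-hot alternative mentioned above is a one-liner in pandas. This is a sketch of mine; the generated analysis below sticks with label encoding:

# One-hot encode 'Sex'; drop_first=True avoids a redundant column
titanic_onehot = pd.get_dummies(titanic_df, columns=['Sex'], drop_first=True)
titanic_onehot.head()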

Then, we will split the data into a training set and a test set. We will use 80% of the data for training the model and 20% for testing its performance. This is typically done to evaluate how well the model can generalize to new, unseen data.

We will use the train_test_split function from sklearn.model_selection for this purpose.

After splitting the data, we can proceed to build and train a classification model. Let’s start with a simple model such as Logistic Regression, which is a good baseline for binary classification problems. This model can be created using the LogisticRegression class from sklearn.linear_model.

We train the model using the fit method, and then we can evaluate its performance on the test set using the score method, which returns the accuracy of the model. Accuracy is simply the proportion of predictions that the model got correct.

Let’s proceed with these steps:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Convert 'Sex' column to numerical form
le = LabelEncoder()
titanic_df['Sex'] = le.fit_transform(titanic_df['Sex'])

# Split the data into features (X) and target (y)
X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Check the accuracy of the model on the test data
accuracy = model.score(X_test, y_test)

accuracy

## output: 0.7528089887640449

The logistic regression model’s accuracy on the test set is approximately 75.28%. This means that the model correctly predicted whether a passenger survived or not in about 75.28% of the cases in the test set.

Conclusion

While this is a decent start, we could improve this performance by using more sophisticated models, tuning hyperparameters, and/or engineering features.
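As one illustration of that direction (my own sketch, not part of the ChatGPT-generated analysis), a small grid search over a random forest is a common next step:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try a stronger baseline model with a small hyperparameter grid
param_grid = {'n_estimators': [100, 300], 'max_depth': [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy of the best model on the test set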

However, given the scope of this walkthrough, we have demonstrated the basic process of preparing the data, building a classification model, and evaluating its performance.

Thank you!

Don’t forget to grab the full code by following this link!

Visit my website: Home — IsmailOuahbi.com
Follow me on LinkedIn for more.

Follow me on Twitter.
