I did a full analysis using the new ChatGPT Code Interpreter

Ismailouahbi
6 min read · Aug 17, 2023
Photo by Jonathan Kemper on Unsplash

Since its introduction, ChatGPT has surprised us daily with its capabilities, thanks to OpenAI’s research team. It makes sense, then, that the tool offers premium services giving you access to state-of-the-art AI features, including ChatGPT plugins and the ChatGPT code interpreter.

Code interpreter

This feature gives OpenAI’s models the ability to write Python (the only supported language for the moment), execute it, and show you the source code as well. Isn’t that a fantastic feature?

The below images show some of the ChatGPT code interpreter capabilities:

Main interface (ChatGPT plugins, openai.com)

As a first step, choose the code interpreter option from the Model dropdown list. In the images below, we explore various prompts given to this model along with their outputs:

1/sin(x) plot by the code interpreter (ChatGPT plugins, openai.com)
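Behind the scenes, the interpreter likely ran something along these lines. This is only a sketch of mine (the generated source isn’t shown in the screenshot), assuming NumPy and Matplotlib and an arbitrary cutoff of 10 to mask the asymptotes:

import numpy as np
import matplotlib.pyplot as plt

# Sample x densely; this spacing never lands exactly on sin(x) = 0
x = np.linspace(-10, 10, 10_000)
y = 1 / np.sin(x)

# Mask the extreme values near the asymptotes so the curve stays readable
y[np.abs(y) > 10] = np.nan

plt.plot(x, y)
plt.title("1/sin(x)")
plt.xlabel("x")
plt.ylabel("1/sin(x)")
plt.grid(True)
plt.show()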

Now that we have the generated plot, let’s explore it more closely by zooming in:

Zoom in on the range between 0 and 1 (ChatGPT plugins, openai.com)
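Reproducing the zoom yourself only takes a change of axis limits. Another sketch, assuming the zoom was applied to the x-axis:

# Zoom in on x between 0 and 1
plt.plot(x, y)
plt.xlim(0, 1)
plt.ylim(0, 10)  # keep the view bounded near the asymptote at x = 0
plt.title("1/sin(x) for 0 <= x <= 1")
plt.grid(True)
plt.show()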

As we can see, this functionality adds a lot of capability to the existing model, making it more interactive and responsive to users’ needs.

Let’s jump into the fun part, in which I did a mini data science project: exploring and transforming a dataset, then building a binary classification model for our business problem.

Full analysis of the Titanic dataset using the new ChatGPT Code Interpreter

Photo by Isaac Smith on Unsplash

The image below shows the prompt used to analyze the Titanic dataset:

The Prompt used to conduct this analysis

Note:

Almost all the code below (with comments) and the descriptions were generated by ChatGPT.

Some chunks of code were modified manually for readability.

# First, let's load the dataset and take a look at its first few rows.

import pandas as pd

# Load the dataset
titanic_df = pd.read_csv('./data/titanic.csv')

# Display the first few rows of the dataset
titanic_df.head()
titanic_df.head() output

The dataset contains the following columns:

  • Survived: This is the target variable we are trying to predict (0 = No; 1 = Yes).
  • Pclass: Passenger’s class (1 = 1st; 2 = 2nd; 3 = 3rd). This is a proxy for socio-economic status (1st = Upper; 2nd = Middle; 3rd = Lower).
  • Sex: Passenger’s sex (male or female).
  • Age: Passenger’s age. Some fractions are present if the age was estimated.
  • Siblings/Spouses: Number of siblings/spouses the passenger has aboard the Titanic.
  • Parents/Children: Number of parents/children the passenger has aboard the Titanic.
  • Fare: The fare the passenger paid for the ticket.
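Before moving on, a quick structural check (my own addition, not part of the generated transcript) confirms those columns and their types:

# Column names, dtypes, and non-null counts
titanic_df.info()

# Summary statistics for the numeric columns
titanic_df.describe()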

Next, we should check if the dataset has any missing values. Let’s do that.

# Check for missing values
titanic_df.isnull().sum()
Number of null values for each column

The dataset does not have any missing values in any of the columns, which is great because it means we don't have to do any imputation or removal of missing data.
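For completeness: the raw Kaggle version of this dataset does contain missing values (notably in Age), and in that case a minimal imputation sketch might look like the following. It is purely hypothetical here, since our file is already complete:

# Hypothetical: only needed for a version of the dataset with missing ages
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Verify nothing is missing anymore
titanic_df.isnull().sum()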

Before we proceed to build a classification model, we need to prepare the data.

The Sex column is categorical and needs to be converted to a numerical form. We can do this using one-hot encoding or label encoding.

Let’s proceed with label encoding for simplicity, where we will assign 0 for ‘female’ and 1 for ‘male’. We will use the LabelEncoder class from sklearn.preprocessing.
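For reference, the one-hot alternative mentioned above is a one-liner in pandas. This is a sketch of mine; the generated analysis below sticks with label encoding:

# One-hot encode 'Sex'; drop_first=True avoids a redundant column
titanic_onehot = pd.get_dummies(titanic_df, columns=['Sex'], drop_first=True)
titanic_onehot.head()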

Then, we will split the data into a training set and a test set. We will use 80% of the data for training the model and 20% for testing its performance. This is typically done to evaluate how well the model can generalize to new, unseen data.

We will use the train_test_split function from sklearn.model_selection for this purpose.

After splitting the data, we can proceed to build and train a classification model. Let’s start with a simple model such as Logistic Regression, which is a good baseline for binary classification problems. This model can be created using the LogisticRegression class from sklearn.linear_model.

We train the model using the fit method, and then we can evaluate its performance on the test set using the score method, which returns the accuracy of the model. Accuracy is simply the proportion of predictions that the model got correct.

Let’s proceed with these steps:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Convert 'Sex' column to numerical form
le = LabelEncoder()
titanic_df['Sex'] = le.fit_transform(titanic_df['Sex'])

# Split the data into features (X) and target (y)
X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Check the accuracy of the model on the test data
accuracy = model.score(X_test, y_test)

accuracy

## output: 0.7528089887640449

The logistic regression model’s accuracy on the test set is approximately 75.28%. This means that the model correctly predicted whether a passenger survived or not in about 75.28% of the cases in the test set.

Conclusion

While this is a decent start, we could improve this performance by using more sophisticated models, tuning hyperparameters, and/or engineering features.
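As one illustration of that direction (my own sketch, not part of the ChatGPT-generated analysis), a small grid search over a random forest is a common next step:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try a stronger baseline model with a small hyperparameter grid
param_grid = {'n_estimators': [100, 300], 'max_depth': [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy of the best model on the test set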

However, given the scope of this walkthrough, we have demonstrated the basic process of preparing the data, building a classification model, and evaluating its performance.

Thank you!

Don’t forget to grab the full code by following this link!

Visit my website: Home — IsmailOuahbi.com
Follow me on LinkedIn for more.

Follow me on Twitter.
