Speed up your Data Science Process using ChatGPT - Here’s How!

10 minute read

ChatGPT has now been available to the public for several months. During this time, many useful use cases have been found. The ability to write code can be used by software engineers and data scientists. In this context, it is important to see what possibilities ChatGPT offers. For this reason, we present five ways for data scientists to save several hours per day.

Speed up your Data Science Process using ChatGPT
Photo by Levart_Photographer on Unsplash

In the future, it will be essential to integrate AI into daily processes. You could also say the following:

AI will not replace you. Someone who uses AI will replace you!

These two sentences make it clear that there is no way around AI in the future. So it’s time to look at ways to optimize workflows. One thing is for sure, you won’t beat anyone who effectively integrates AI into their workflows.

Let’s take a look at five use cases!

1. Explaining code

Just take code from your current data science project that you would like to understand. If you have the code snippet, you can ask ChatGPT to explain it. We ask ChatGPT to explain a code snippet for sequencing time series.

ChatGPT Prompt:

Explain the following code:

def get_sequence(X, y, lookback, y_shift=0):
    output_data = []
    output_y = []

    for i in range(len(X) - lookback - y_shift):
        t = []
        for j in range(0, lookback):   # get data points before y
            t.append(X[i+j, :])
        output_data.append(t)
        output_y.append(y[i+(lookback-1)+y_shift])   # get labels
    return np.array(output_data), np.array(output_y)

ChatGPT Output:

Explanation of the get_sequence() function (Image by authors)

ChatGPT first specifies the input parameters and then describes step by step how the code works. We find the explanation very detailed. What do you think? The example shows that ChatGPT can help in understanding code snippets. It can save you a lot of time.

In the next section, we want to speed up the function get_sequence().


Our Online Courses and recommendations


2. Explore faster alternatives

Next, we ask ChatGPT to speed up the runtime of the function.

ChatGPT Prompt:

Speed up the running time of the functionget_sequence().

ChatGPT Output:

ChatGPT Output (Screenshot by authors)

As a result, ChatGPT gives us a new implementation and an explanation of the changes. Impressive! But is the code working correctly? Let’s check that now! We can quickly check this with a few lines of code:

X = np.random.rand(10, 3)
y = np.random.randint(2, size=10)

X_new, y_new = get_sequence(X, y, 2, 1)
X_chatgpt, y_chatgpt = get_sequence_chatgpt(X, y, 2, 1)

print(np.array_equal(X_new, X_chatgpt))   # True

First, we generate a data set X with ten rows and three columns. Then we generate ten labels y. We pass X and y to the functions and select lookback=2 and y_shift=1. Afterward, we evaluate whether the two functions return the same result. And yes, both functions return the same result. Now we still need to check if the ChatGPT function is faster.

%timeit X_new, y_new = get_sequence(X, y, 2, 1)   # 15 µs
%timeit X_chatgpt, y_chatgpt = get_sequence_chatgpt(X, y, 2, 1).  # 2.55 µs

The results show that the function of ChatGPT is seven times faster. On large data sets, this optimization will be noticeable. So we learned that you can use ChatGPT well to get more performance out of code sections.


Get our e-book LangChain for Finance


3. Translating code

Suppose a customer wants a new dashboard application based on an existing application. The existing application is a Python Dash application, and the customer wants the new application to be an R Shiny application. That means we need to transfer parts of the old Python code to R. In the example below, we translate a Precision-Recall plot from Python to R.

ChatGPT Prompt:

Translate the following code into R:

def plot_precision_recall(y_true, y_pred, algorithm_name = 'foo'):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred)

    fig = px.area(
        x=recall, y=precision,
        title=f'Precision-Recall Curve (AUC={auc(recall, precision):.4f})',
        labels=dict(x='Recall', y='Precision'),
        width=700, height=500
    )
    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=1, y1=0
    )
    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    fig.update_xaxes(constrain='domain')
    fig.write_image(f'images/precision_recall_{algorithm_name}.png')

ChatGPT Output:

Here’s the translation of the Python code into R:

library(plotly)

plot_precision_recall <- function(y_true, y_pred, algorithm_name = 'foo') {
  prc <- PRROC::pr.curve(scores.class0 = y_pred, weights.class0 = y_true, curve = TRUE)
  precision <- prc$precision
  recall <- prc$recall
  thresholds <- prc$threshold

  fig <- plot_ly(
    x = recall, y = precision, type = "scatter", mode = "lines", fill = "tozeroy",
    width = 700, height = 500, line = list(shape = "spline")
  ) %>%
    layout(
      title = paste0("Precision-Recall Curve (AUC = ", round(PRROC::auc(x = recall, y = precision), 4), ")"),
      xaxis = list(title = "Recall"),
      yaxis = list(title = "Precision"),
      shapes = list(
        list(type = "line", x0 = 0, x1 = 1, y0 = 1, y1 = 0, line = list(dash = "dash"))
      )
    )

  fig <- fig %>% layout(yaxis = list(scaleanchor = "x", scaleratio = 1))
  fig <- fig %>% layout(xaxis = list(constrain = "domain"))

  fig %>% orca(paste0("images/precision_recall_", algorithm_name, ".png"))
}

Note that in the R version, we use the PRROC package to calculate the precision-recall curve and AUC. Also, instead of using px.area from the plotly package, we use plot_ly and set fill = "tozeroy" to fill the area under the curve. Finally, we use the orca function from the plotly package to save the plot as a PNG image.

It’s impressive how fast ChatGPT can write code. No Data Scientist in this world could have solved the task so quickly. It didn’t even take a minute, and we translated the Python code to R. But before we get excited, does the code even work? A test shows, unfortunately, no! So you, as a data scientist, have to fix the bug. Often this takes less than five minutes. The following listing shows the fixed code:

library(plotly)

plot_precision_recall <- function(y_true, y_pred, algorithm_name = 'foo') {
  prc <- PRROC::pr.curve(scores.class0 = y_pred, weights.class0 = y_true, curve = TRUE)
  precision <- prc$curve[,2] % fixed a bug here!
  recall <- prc$curve[,1] % fixed a bug here!
  thresholds <- prc$threshold
  
  fig <- plot_ly(
    x = recall, y = precision, type = "scatter", mode = "lines", fill = "tozeroy",
    width = 700, height = 500, line = list(shape = "spline")
  ) %>%
    layout(
      title = paste0("Precision-Recall Curve (AUC = ", round((prc$auc.integral), 4), ")"),
      xaxis = list(title = "Recall"),
      yaxis = list(title = "Precision"),
      shapes = list(
        list(type = "line", x0 = 0, x1 = 1, y0 = 1, y1 = 0, line = list(dash = "dash"))
      )
    )
  
  fig <- fig %>% layout(yaxis = list(scaleanchor = "x", scaleratio = 1))
  fig <- fig %>% layout(xaxis = list(constrain = "domain"))
  
  save_image(fig, paste0("images/precision_recall_", algorithm_name, ".png")) % fixed a bug here!
}

First, when you run the ChatGPT code, RStudio will show you which packages are missing. You have to install these packages first. Then, we noticed three bugs.

RStudio tells you to use the package kaleido instead of the package orca. The package orca is obsolete. And in our case, the package no longer worked either. So we replaced it with the kaleido function save_image()! That was bug number one.

Bug two and three are in the assignment of recall and precision. The assignment does not work. We get the precision and recall via prc$curve. The experiment shows that you can use ChatGPT as an assistant, but it makes mistakes. So it does not replace you but serves as a support.

4. Code documentation

We have implemented a new function and need a docstring in Google style for it. As an example, we use a function that outputs a receiver operating characteristic (roc) plot.

ChatGPT Prompt:

Create a docstring in Google style for the following code:

def plot_roc_curve(y_true, y_pred):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)

    fig = px.area(
        x=fpr, y=tpr,
        title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
        labels=dict(x='False Positive Rate', y='True Positive Rate'),
        width=700, height=500
    )
    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )

    fig.update_yaxes(scaleanchor='x', scaleratio=1)
    fig.update_xaxes(constrain='domain')
    fig.show()

ChatGPT Output:

Here’s an example of a docstring in Google style for the given Python function:

"""Plots a Receiver Operating Characteristic (ROC) curve.

    Args:
        y_true (array-like): True binary labels.
        y_pred (array-like): Predicted probabilities of the positive class.

    Returns:
        None.

    Raises:
        ValueError: If `y_true` or `y_pred` is empty or has only one unique value.
"""

The docstring follows the Google style guide and includes a brief description of the function, a list of arguments with their types and descriptions, a description of the return value, and a list of possible exceptions that the function may raise.

The result is really good. What do you think? ChatGPT wrote the docstring in Google style, just like we wanted. It takes extremely tedious work away from data scientists so you can concentrate on the important tasks.

5. Write example code

In this use case, we need an implementation to store a JSON file in an AWS S3 bucket. We ask ChatGPT if it can implement the desired function in Python. Take a look at the following example:

ChatGPT Prompt:

Write a function that reads a JSON file from disk and uploads it to an S3 bucket gives as a parameter. The filename is also a parameter to the function. The implementation should be in Python.

ChatGPT Output:

Sure, here’s an example implementation in Python using the boto3 library:

import json
import boto3

def upload_json_to_s3(bucket_name, file_name):
    """
    Uploads a JSON file to an S3 bucket.

    Args:
        bucket_name (str): The name of the S3 bucket to upload the file to.
        file_name (str): The name of the JSON file to upload.

    Returns:
        None.
    """

    # create an S3 client
    s3 = boto3.client('s3')

    # read the JSON file from disk
    with open(file_name, 'r') as f:
        json_data = json.load(f)

    # upload the file to S3
    s3.put_object(Body=json.dumps(json_data), Bucket=bucket_name, Key=file_name)
    
    print(f"{file_name} uploaded to {bucket_name} successfully!")

This function takes two parameters — bucket_name and file_name. It reads the JSON file from disk using the json library’s load() method, creates an S3 client using boto3, and uploads the file to S3 using the put_object() method. Finally, it prints a message indicating that the upload was successful.

ChatGPT implements the function in a few seconds. Data Scientists can save the work for such rudimentary tasks. You can use ChatGPT to quickly build code templates so you can take care of complicated tasks.

This approach will save you a lot of time. Use ChatGPT as your assistant, and you can 10x or more your output.

What you should have in mind!

We don’t let ChatGPT do our work, but we use it to increase our productivity. ChatGPT is buggy when it writes code. For this reason, you should always double-check the ChatGPT code. But this is why we are here.

As a data scientist or software engineer, you know what you are doing! That’s why you can use ChatGPT to your advantage. The story is not that AI replaces programmers. It’s that programmers using AI will replace programmers without AI. ChatGPT is just a tool. Nothing more! Take your chance and use AI to your advantage.

Thanks so much for reading. Have a great day!


💡 Do you enjoy our content and want to read super-detailed articles about data science topics? If so, be sure to check out our premium offer!


Leave a comment