ChatGPT Capabilities for Data Scientists: Common Duties Automated
In data science, much of the working day goes to routine tasks. ChatGPT can handle many of these, streamlining the workflow and allowing data scientists to focus more on interpretation and decision-making.
The goal of a recent project was to understand why some customers did not successfully get a car by examining key matching metrics. To achieve this, a Streamlit app was built using the Gemini CLI. This app, demonstrated using a data project from Gett, a London-based taxi app, displays each step in a different tab.
ChatGPT, when prompted, can manage the entire workflow, from cleaning and organizing the data, to performing exploratory data analysis and visualization, preparing the dataset for machine learning, applying models, and even creating a Streamlit dashboard for quick interaction with the data pipeline.
The five core tasks that ChatGPT can handle in a data project are:
- Data Cleaning and Preprocessing: Handling missing data, detecting outliers, encoding categorical variables, normalizing or standardizing data.
- Exploratory Data Analysis (EDA): Summarizing datasets with descriptive statistics, identifying patterns or trends, and generating textual explanations of key metrics.
- Data Visualization: Assisting in creating charts and graphs by providing code examples to visualize data insights.
- Model Building and Evaluation: Helping to write code for training machine learning models, tuning parameters, and evaluating performance using appropriate metrics.
- Documentation and Reporting: Producing clear, concise summaries of findings, interpreting model results, and generating sections of project reports to communicate insights effectively.
These tasks, when automated, significantly reduce the time spent on routine coding and analysis, freeing up data scientists to focus on the interpretation and decision-making aspects of their work.
The data project in question, which analyzes failed rider orders from Gett, ran into challenges such as missing values in both datasets. ChatGPT converted the date columns, dropped invalid orders, and imputed missing values in the m_order_eta column, so that the data was clean and ready for analysis.
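The cleaning steps described above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code ChatGPT produced in the project: only m_order_eta is named in the article, so the order_datetime and order_gk columns, the rule for what counts as an invalid order, and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd


def clean_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Parse dates, drop invalid orders, and impute missing ETAs."""
    cleaned = orders.copy()

    # Convert the (assumed) timestamp column; unparseable values become NaT.
    cleaned["order_datetime"] = pd.to_datetime(
        cleaned["order_datetime"], errors="coerce"
    )

    # Assumption: an order is invalid if it has no id or no usable timestamp.
    cleaned = cleaned.dropna(subset=["order_gk", "order_datetime"])

    # Assumption: fill missing ETAs with the median of the remaining rows.
    eta_median = cleaned["m_order_eta"].median()
    cleaned = cleaned.assign(m_order_eta=cleaned["m_order_eta"].fillna(eta_median))

    return cleaned.reset_index(drop=True)


# Made-up rows for illustration only; this is not the Gett data.
sample = pd.DataFrame(
    {
        "order_gk": [1001, 1002, None, 1004],
        "order_datetime": [
            "2024-01-01 18:08:07",
            "not a date",
            "2024-01-01 12:07:50",
            "2024-01-01 13:50:20",
        ],
        "m_order_eta": [60.0, 120.0, 90.0, None],
    }
)

cleaned_sample = clean_orders(sample)
print(cleaned_sample)
```

Median imputation is used here because it is less sensitive to extreme ETA values than the mean; a real project might choose differently after looking at the distribution.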
Preparing the dataset for machine learning involved encoding categorical variables, scaling numerical features, and returning a clean DataFrame ready for modeling. The steps of the machine learning process were then explained, and the model's performance was reported with evaluation metrics such as accuracy, precision, recall, and F1-score.
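To make the four metrics concrete, here is a small hand-rolled sketch for a binary classifier. The function name classification_metrics and the labels are invented for illustration and are not results from the Gett project; in practice, a library such as scikit-learn would usually compute these values.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = order failed, 0 = order succeeded.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75. On imbalanced data they typically diverge, which is why reporting accuracy alone can be misleading.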
It's worth noting that the model used only five relevant features and was a basic machine learning model predicting a target variable.
Nate Rosidi, a data scientist, adjunct professor, and founder of StrataScratch who contributed to this article, highlights the practical ability of ChatGPT to manage these five key tasks when given suitable prompts.
Moreover, the use of Gemini CLI for handling routine data science tasks, including building a Streamlit app that automates EDA, data cleaning, visualization, and modeling, was instrumental in the success of this project.
ChatGPT can also generate visualizations from content behind a provided link. Fetching external content and using it to ground the response is the idea behind Retrieval-Augmented Generation (RAG), and it further enhances ChatGPT's utility in data science projects.
The data science report by Anaconda states that data scientists spend nearly 60% of their time on cleaning and organizing data. Tools like ChatGPT and Gemini CLI can take over much of that routine work, making data science more efficient and accessible.
[1] Data Science Report by Anaconda
[2] Article by Nate Rosidi on StrataScratch
[3] ChatGPT Demonstration using Gett Dataset