https://blog.stephenturner.us/p/ai-in-data-science-education
Something a little different for this week’s recap. I’ve been thinking a lot lately about the practice of data science education in this era of widely available (and really good!) LLMs for code. I provide some commentary at the top based on my own data science teaching experience, with a deep dive into a few recent papers below. Audio generated with NotebookLM.
AI in the CS / data science classroom
I was a professor of public health at the University of Virginia School of Medicine for about 8 years, where I taught various flavors of an introductory biomedical data science graduate course and workshop series. I taught these courses using the pedagogical practices I picked up while becoming a Software Carpentry instructor — lots of live coding, interspersed with hands-on exercises, with homework requiring students to write and submit code as part of their assignments. I’ve seen firsthand what every experienced software developer deeply understands — how hands-on coding practice, with all its frustrations and breakthrough moments, builds deep understanding. This method of teaching is effective precisely because it embraces the productive struggle of learning. Students develop robust mental models through the cycle of attempting, failing, debugging, and finally succeeding.
I think we are well into a pivotal moment in data science and computer science education. As Large Language Models (LLMs) like ChatGPT, Claude, and GitHub Copilot demonstrate increasingly sophisticated code generation abilities, educators are going to face extraordinary opportunities and profound challenges in teaching the next generation of data scientists.
The transformative potential of LLMs as learning tools can’t be overstated. I recently posted about my experience using Claude to help me write code in a language I don’t write (JavaScript), for a framework I wasn’t familiar with (Chrome extensions).
In education, these AI assistants can provide instant, contextual help when students are stuck, offer clear explanations of complex code snippets, and even suggest alternative approaches to problems. They act as always-available teaching assistants, ready to engage in detailed discussions about programming concepts and implementation details. The ability to ask open-ended questions about code and receive thoughtful explanations represents an unprecedented resource for learners.
However, this easy access to AI-generated solutions raises important questions about the nature of learning itself. Will students develop the same depth of understanding if they can simply request working code rather than struggle through implementation challenges? When I used to teach, my grad students and workshop participants were practitioners — they were taking my classes because they needed to use R/Python/etc. in their daily work. If I were them now, I’d turn on Copilot or have Claude/ChatGPT pulled up alongside my VSCode/RStudio. Do we really expect students to turn Copilot off just for their homework, and leave it on for their daily work? How do we balance the undeniable benefits of AI assistance with the essential learning that comes from wrestling with difficult problems without a 90% correct completion suggestion? The risk is that we might produce data scientists who can leverage AI tools effectively but lack the foundational knowledge to reason about code and data from first principles.
As I try to follow the recent research on LLMs in data science education, including some of the papers below, these questions become increasingly pressing to me. The challenge before us is not whether to incorporate these powerful tools into our teaching, but how to do so in a way that enhances rather than short-circuits the learning process. The goal must be to harness LLMs’ capabilities while preserving the deep understanding that comes from genuine engagement with computational thinking and problem-solving.
For another perspective and great insight into this topic, I suggest following Ethan Mollick’s newsletter, One Useful Thing. His recent essay on “Post-apocalyptic education” is a worthwhile read on this topic.
Are you a CS or data science educator? How does this resonate with your thoughts? I’d love to chat more on Bluesky (@stephenturner.us).
Papers (and a talk): Deep dive
ChatGPT for Teaching and Learning: An Experience from Data Science Education
Paper: Zheng, Y. “ChatGPT for Teaching and Learning: An Experience from Data Science Education” in SIGCSE 2024: Proceedings of the 55th ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3585059.3611431.
TL;DR: A practical examination of ChatGPT's use in data science education, studying how it impacts teaching effectiveness and student learning outcomes. The study gathered perspectives from students and instructors in data science courses, revealing both opportunities and challenges specific to the field.
Summary: This study evaluates ChatGPT's integration into data science education through real-world practice and user studies with graduate students. The research investigates how effectively ChatGPT assists with tasks ranging from coding explanations to concept understanding. The findings suggest ChatGPT excels in explaining programming concepts and providing detailed API explanations but faces challenges in problem-solving scenarios. The study underscores the importance of balancing AI assistance with fundamental learning, particularly noting that ChatGPT's effectiveness varies across different aspects of data science education.
Highlights:
* User studies conducted with 28 students across multiple scenarios testing different aspects of ChatGPT's capabilities in data science education.
* Quantitative evaluation using a 1-5 scale questionnaire measuring student perceptions across various learning tasks.
* Mixed-methods approach combining student feedback with instructor observations and practical implementation tests.
Implications of ChatGPT for Data Science Education
Paper: Shen, Y. et al., “Implications of ChatGPT for Data Science Education” in SIGCSE 2024: Proceedings of the 55th ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3626252.3630874.
TL;DR: A systematic study evaluating ChatGPT's performance on data science assignments across different course levels. The research demonstrates how prompt engineering can significantly improve ChatGPT's effectiveness in solving complex data science problems.
Summary: The study evaluates ChatGPT's capabilities across three different levels of data science courses, from introductory to advanced. The researchers found that ChatGPT's performance varies significantly based on problem complexity and prompt quality, achieving high success rates with well-engineered prompts. The work provides practical insights into integrating ChatGPT into data science curricula while maintaining educational integrity.
Highlights:
* Comparative analysis across three course levels using standardized evaluation metrics.
* Development and testing of prompt engineering techniques specific to data science education.
* Cross-validation of results using multiple assignment types and difficulty levels.
* Evaluation framework for assessing LLM performance in data science education.
* Collection of example assignments and their corresponding engineered prompts provided in supplementary materials.
What Should Data Science Education Do With Large Language Models?
Paper: Tu, X., Zou, J., Su, W., & Zhang, L. “What Should Data Science Education Do With Large Language Models?” in Harvard Data Science Review, 2024. https://doi.org/10.1162/99608f92.bff007ab.
TL;DR: A comprehensive analysis of how LLMs are reshaping data science education and the necessary adaptations in teaching methods. The paper argues for a fundamental shift in how we approach data science education in the era of LLMs, focusing on developing higher-order thinking skills.
Summary: The paper examines the transformative impact of LLMs on data science education, suggesting a paradigm shift from hands-on coding to strategic planning and project management. It emphasizes the need for curriculum adaptation to balance LLM integration while maintaining core competencies. The authors propose viewing data scientists more as product managers than software engineers, focusing on strategic planning and resource coordination rather than just technical implementation.
Highlights:
* Theoretical framework development for integrating LLMs into data science education.
* Case study analysis using heart disease dataset to demonstrate LLM capabilities.
* Critical analysis of LLM limitations and educational implications.
* Prompts and responses with ChatGPT provided in supplementary materials.
Talk: Teaching and learning data science in the era of AI
Andrew Gard gave a talk at the 2024 Posit conference on teaching and learning data science in the era of AI. The talk’s premise is that everyone learning data science these days (1) has ready access to AI, and (2) is strongly incentivized to use it. It’s a short, ~5 minute lightning talk. It’s worth watching, and it provides a few points of advice for data science learners.
Other papers of note
* Cognitive Apprenticeship and Artificial Intelligence Coding Assistants https://www.igi-global.com/gateway/chapter/340133
* It's Weird That it Knows What I Want: Usability and Interactions with Copilot for Novice Programmers https://dl.acm.org/doi/10.1145/3617367
* Computing Education in the Era of Generative AI https://dl.acm.org/doi/10.1145/3624720
* The Robots Are Here: Navigating the Generative AI Revolution in Computing Education https://dl.acm.org/doi/10.1145/3623762.3633499
* From "Ban It Till We Understand It" to "Resistance is Futile": How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot https://dl.acm.org/doi/10.1145/3568813.3600138
* Prompt Problems: A New Programming Exercise for the Generative AI Era https://dl.acm.org/doi/10.1145/3626252.3630909
* Patterns of Student Help-Seeking When Using a Large Language Model-Powered Programming Assistant https://dl.acm.org/doi/10.1145/3636243.3636249
* Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models https://dl.acm.org/doi/10.1145/3501385.3543957
* GitHub Copilot AI pair programmer: Asset or Liability https://www.sciencedirect.com/science/article/abs/pii/S0164121223001292
Subscribe to Paired Ends (free) to get summaries like this delivered to your e-mail.