Course Level
CS1
Knowledge Unit
Fundamental Programming Concepts
Collection Item Type
Project
Synopsis

This final project combines key CS1 programming concepts with ethical analysis. It helps students gain experience with lists, dictionaries, for/while loops, conditional statements, file handling, and functions in Python. Through a data analysis and visualization task, students put their prior knowledge of these programming concepts into action, alongside an ethics-led discussion of open source data. Open source data (or “open data”) is data that is available and accessible to anyone, including for reuse [8]. Students will learn how to think critically about the ethical dimensions of their selected open source data (and future open source data), and provide an analysis of the data within its contemporary cultural context.

ACM Digital Library Entry

Recommendations

Our implementation benefited from collaborative development between the ethicists and the instructor of the computer science course. Over the course of some months, the computer science instructor met with ethicists and a designer to create an activity-based class on the ethics of open source data, accompanied by assignments, all of which prepared students to address an ethical issue connected to data that was analyzed in the course’s final project. Having an on-campus collaborator with a background in responsible computing and/or ethics lead the in-class session will reduce the workload for the instructor and ensure a high-quality lecture with up-to-date ethical considerations, though having a collaborator is not required. The materials provided here are the ones we developed, so small changes to the lecture and dataset options could certainly be made by the instructor alone.

An instructor who implements this project will need to curate a list of relevant datasets for the students to choose from. In order to select open datasets that will be effective for this project, it is important to consider: (1) the appropriateness of the content of the dataset for the age group, (2) whether the dataset is well-documented and well-organized from a technical standpoint, and (3) the ethical richness of the content. The instructor should also be mindful that the subject matter of some datasets might be particularly sensitive and/or difficult for students, so students should be able to choose for themselves which datasets they will work with (e.g., a student who grew up in close proximity to gun violence should not be forced to work on a project about gun violence).

Asking the students about their majors or values will enable the instructor to include options that could be of interest to the students in the class. The open source data choices should also be selected such that they could encourage discussion of ethical considerations within the class period. For example, data about congress resignations and census-collected data on college majors by gender and employment rate sparked rich ethical analysis in discussion, while data on US births and artificial data on employee attrition were more difficult for the students to analyze. Data that is well-documented includes text describing how the data was collected and what the columns/rows in the spreadsheet correspond to.

Two good sources of well-documented and well-organized open data are FiveThirtyEight’s GitHub, which contains all of the data used for data analysis and visualization on FiveThirtyEight’s website, and Kaggle’s datasets. One of the benefits of using Kaggle is that it scores the “usability” of the dataset, which indicates how credible and platform-compatible the data is. Also, some datasets in Kaggle are web scraped (e.g. the OkCupid dataset at https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles), which raises further questions around consent and the ethics of data re-use. The FiveThirtyEight data has usually been used in an existing analysis or visualization, which can be either to the benefit or to the detriment of students.

There will likely be code or analyses of the data currently existing, which can serve as an interesting comparison or starting point for the student’s analysis, but it will be important for the instructor to ensure that it does not interfere with the integrity of the student’s submission. An additional resource is the “Data is Plural” archive, which contains datasets which have already been cleaned (at https://data.world/jsvine/data-is-plural-archive).

Before assigning the datasets to students, it is important for the instructor to first download the data and verify that it is well-formed and free to access. Then, the instructor can send the students the data themselves as opposed to requiring the student to download it online. Still, in order to complete the in-class work, the students will need to explore the website of origin in order to identify the characteristics of the dataset. We have included in the materials the data options we presented to students as a starting point; this should be modified to match the interests, computational abilities, and age of the students.
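Part of the check that a dataset is well-formed could be automated before the files are handed to students. The following is a minimal sketch, assuming the dataset is a CSV file with a single header row; the function name `find_csv_problems` and the specific checks are illustrative assumptions, not part of the original materials, and they do not replace reading the dataset's documentation.

```python
import csv
import io


def find_csv_problems(file_obj):
    """Return a list of human-readable problems found in a CSV file object.

    Hypothetical helper: checks only the header and the field count per row.
    """
    problems = []
    reader = csv.reader(file_obj)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if any(name.strip() == "" for name in header):
        problems.append("header contains a blank column name")
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")
    for row in reader:
        if not row:
            continue  # ignore completely blank lines
        if len(row) != len(header):
            problems.append(
                f"line {reader.line_num}: expected {len(header)} fields, "
                f"found {len(row)}"
            )
    return problems


# Made-up example data, used only to show the function's behavior.
example = io.StringIO("year,party,count\n2001,A,3\n2002,B\n")
for problem in find_csv_problems(example):
    print(problem)
```

For a file on disk, the same function could be called as `find_csv_problems(open(path, newline="", encoding="utf-8"))`, where `path` is whatever location the instructor downloaded the data to.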

This project would be appropriate for either high school or undergraduate students in their first computer science class (i.e. CS1), as the ethics concepts do not require prior experience and are sufficiently straightforward.

Having this project be the final project allowed students to draw from many relevant technical concepts in their final implementation, including data types, file handling, functions, and data visualization/analysis. By embedding the ethical analysis and assessment into a larger body of work (rather than having, say, a one-off lecture on data ethics), students were encouraged to recognize the kinds of questions that should be asked in the course of any ethically responsible data science project.

The content and structure of the lecture can also be shifted to highlight ethical issues most relevant to the project datasets. In our implementation (in a college setting), the students really gravitated towards the topic of consent; this reflects in part the fact that the first topic covered during the in-class session was that of consent and how that relates to dataset development. In a future iteration, we intend to present the questions in a non-linear format in order to de-emphasize consent as the primary ethical concern.

The scope of the programming piece can also be scaled depending on how much time you are dedicating to the final project (we allotted a month between the assignment of the project and its due date).

Engagement Highlights

The purpose of the final project is to have students implement a larger-scale solution to a problem which integrates their knowledge from the other units in the course. The project requires the use of file handling, conditionals, loops, string manipulation, dictionaries, lists, and functions. This should also serve as a fun way for students to integrate their own interests (by way of the specific file and data analyzed) with their new Python skills. Each group chooses one open source dataset from the provided options. The data options in our implementation included: descriptions of satellites orbiting Earth, descriptions of US mass shootings, drug use by age, college majors by gender and employment rate, congress resignations by year and party, yearly greenhouse gas emissions by country, and World Corruption Index data, among others. The variety of data options incorporates student choice and promotes interdisciplinary connections between the students’ major fields and their programming skills.

Upon selecting their datasets (by voting for their preferences in an online poll), the groups embark on a series of pre-class, in-class, and post-class activities as part of the ethics module. The pre-class work had students research basic facts about the origins of their datasets. The in-class session was led by ethicists and included both a brief lecture component on open source data and discussions of the ethical considerations relevant to the specific open source datasets used in the students’ final projects. Each aspect of this project is group-oriented in order to support well-structured collaborative learning, though each of the three required analyses can be done individually (one analysis per person) or collectively (with all students contributing to multiple analyses).

Using their selected datasets, the students (1) take in a file, (2) clean and process the data (handling any missing data), and then (3) produce three analyses of the data, by funneling the relevant information into the necessary data types.
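The three steps above can be sketched using only the concepts the project requires (file handling, loops, conditionals, lists, dictionaries, and functions). This is a minimal illustration under stated assumptions, not the solution used in our implementation: the function names, the column names (`name`, `group`, `score`), and the sample rows are all made up, and the three analyses shown (a count, an average, and a maximum) are only examples of what a group might choose.

```python
import csv
import io
from collections import defaultdict


def load_rows(file_obj):
    """Step 1: read a CSV file object into a list of dictionaries."""
    return list(csv.DictReader(file_obj))


def clean_rows(rows, numeric_column):
    """Step 2: drop rows whose numeric column is missing or not a number."""
    cleaned = []
    for row in rows:
        raw = (row.get(numeric_column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # skip missing or malformed entries
        new_row = dict(row)
        new_row[numeric_column] = value
        cleaned.append(new_row)
    return cleaned


def count_by(rows, column):
    """Analysis 1: how many rows fall into each category."""
    counts = {}
    for row in rows:
        key = row[column]
        counts[key] = counts.get(key, 0) + 1
    return counts


def average_by(rows, group_column, numeric_column):
    """Analysis 2: average of a numeric column within each category."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group_column]].append(row[numeric_column])
    return {g: sum(v) / len(v) for g, v in values_by_group.items()}


def largest(rows, numeric_column):
    """Analysis 3: the row with the largest value in a numeric column."""
    best = None
    for row in rows:
        if best is None or row[numeric_column] > best[numeric_column]:
            best = row
    return best


# Made-up sample data standing in for a real dataset file.
sample = io.StringIO("name,group,score\na,x,1\nb,x,\nc,y,3\nd,x,5\n")
rows = clean_rows(load_rows(sample), "score")
print(count_by(rows, "group"))
print(average_by(rows, "group", "score"))
print(largest(rows, "score")["name"])
```

With a real dataset, `sample` would be replaced by an opened file, for example `open(path, newline="", encoding="utf-8")`. Dropping rows with missing values is only one way to handle missing data; a group might instead decide to report how many rows were dropped, since that choice can itself affect the analysis.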

The end result of this module is that, in addition to applying various technical concepts in the context of a larger-scale project, students gain a greater appreciation of the ethical dimensions of open source data, and are more prepared to engage with such data responsibly in the future.

Computer Science Details

Programming Language
Python

Material Format and Licensing Information

Creative Commons License
CC BY