Course Level
Other
Knowledge Unit
Fundamental Data Structures
Collection Item Type
Lecture Slides
Synopsis

This learning experience helps students gain experience and proficiency with the ethical collection and use of data. Students will gain an appreciation for the risks associated with record-level identification, where data attributes, however innocently collected, can be and have been used to violate privacy and to discriminate against individuals and protected classes of individuals.


Recommendations

Ultimately, there are a few important suggestions for any instructor, independent of course context. These concern both the tools used in the experience and the setting in which it is run.

The Crazy Games implementation of Guess Who? differs from the physical game in one key respect: it presents a fixed set of questions that a player can ask so that the system can evaluate them automatically. It is a more helpful exercise to let students formulate their own questions. The instructor can print out an image of the people in the game so that students have access to it (or use it as a standalone slide). Students should be free to ask their partners whatever questions they want, not just those presented by the online game.

Do not show students the table about the people in the game until after they have played. The table greatly simplifies the task of identification by suggesting the types of questions to ask about the individuals.

The activity relies on students' familiarity with situations in which personally identifiable information must be provided. Three obvious examples are the Social Security number, the driver's license, and the passport [1, 3]. If a student has never had to provide this information, be ready to mention places, services, or applications that require it. Some examples include:
● Job application
● Financial aid application
● Insurance
● Educational institutions
● Healthcare institutions
● Airlines
● Financial institutions

Students will likely be quick to point out that the information in these tables can now be connected, both across tables and to governmental records, because they share an identifier. Students may not recognize that any part of the information now has the potential to lead back to the identifier (and thus to all of the information in all of the tables). Point out that any attribute that is sufficiently identifying, however innocuous, is enough to compromise all of the information stored about that person across all of these sources. This is precisely how the voting records example in New York City was compromised: the voting registry had the name of the voter; the actual voting returns had the result of the vote; and the two were connected through a third attribute: the identity of the voting district [4].
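
To make the linkage concrete for yourself before class, the following is a minimal sketch in Python, using invented records, of how two tables that never share a name can still be joined through a common attribute such as a district identifier:

```python
# Minimal sketch of a linkage attack (all records invented for
# illustration). Neither table alone reveals how anyone voted.

# Public voter registry: name -> voting district
registry = [
    {"name": "A. Rivera", "district": "ED-17"},
    {"name": "B. Chen", "district": "ED-04"},
]

# Published returns: voting district -> ballot result (no names)
returns = [
    {"district": "ED-17", "ballot": "Yes"},
    {"district": "ED-04", "ballot": "No"},
]

# Group registered voters by district.
voters_by_district = {}
for row in registry:
    voters_by_district.setdefault(row["district"], []).append(row["name"])

# Joining the tables on "district" re-identifies any voter who is
# alone in their district -- the privacy loss described above.
for row in returns:
    names = voters_by_district.get(row["district"], [])
    if len(names) == 1:
        print(f"{names[0]} voted {row['ballot']}")
```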

A student may not be able to make the leap that the identification of an individual record in a database constitutes a violation of privacy. A loss of privacy is suffered when an action that should not have been discoverable and attributable to a person is subsequently discovered and attributed. If a non-key attribute specifically ties a result to a name, the result is a loss of privacy. This is the precise reason that the US Census Bureau aggregates personal returns into census geographies: it protects the identity of individual responses [5].
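
A small companion sketch, again with invented data, can show why aggregation protects respondents: a published statistic only exposes an individual when the underlying group has size one, which is exactly the case geographic aggregation is designed to avoid.

```python
# Sketch of why aggregating to a geography protects individual
# responses (invented data for illustration).
from collections import Counter

responses = [
    {"tract": "T-101", "income": 52000},
    {"tract": "T-101", "income": 61000},
    {"tract": "T-101", "income": 48000},
    {"tract": "T-202", "income": 95000},  # sole respondent in T-202
]

group_sizes = Counter(r["tract"] for r in responses)

for tract, size in sorted(group_sizes.items()):
    total = sum(r["income"] for r in responses if r["tract"] == tract)
    if size == 1:
        # An "aggregate" over one person is that person's exact record.
        print(f"{tract}: suppressed (a group of one would re-identify)")
    else:
        print(f"{tract}: mean income {total / size:,.0f} ({size} respondents)")
```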

Finally, the section on Proxy Discrimination [6] extends the lesson to consider the real-world impacts of decision support systems based on machine learning models trained on data that can lead to discrimination against protected classes. An example from [6] related to criminal risk assessments has been provided. This section may (and probably should) be updated with relevant, recent examples of such situations, leading to a discussion of the types of discrimination students may have unknowingly experienced.
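
A minimal, entirely hypothetical sketch of the mechanism may help frame that discussion: a decision rule that never sees the protected attribute, but relies on a correlated proxy (here, a made-up neighborhood code), still reproduces the disparity.

```python
# Sketch of proxy discrimination (hypothetical data). The decision
# rule never reads the protected attribute, yet a correlated proxy
# (neighborhood) reproduces the disparate outcome.

applicants = [
    # (protected_group, neighborhood, qualified)
    ("A", "north", True), ("A", "north", True), ("A", "north", False),
    ("B", "south", True), ("B", "south", True), ("B", "south", False),
]

def decide(neighborhood: str) -> bool:
    # A model trained on biased historical approvals can learn a rule
    # like this one: neighborhood stands in for group membership.
    return neighborhood == "north"

for group in ("A", "B"):
    members = [a for a in applicants if a[0] == group]
    approved = sum(decide(a[1]) for a in members)
    print(f"group {group}: {approved}/{len(members)} approved")
# Both groups have identical qualification rates, but approval
# rates differ sharply -- discrimination without the attribute.
```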

References:
[1] Darrow, J. & Lichtenstein, S. (2008). Do You Really Need My Social Security Number - Data Collection Practices in the Digital Age, 10 N.C. J.L. & Tech. 1. Available at: https://scholarship.law.unc.edu/ncjolt/vol10/iss1/2
[2] Hunt, D. B. (2005). Redlining. Encyclopedia of Chicago. http://www.encyclopedia.chicagohistory.org/pages/1050.html
[3] Bopp, C., Benjamin, L. M., & Voida, A. (2019). The coerciveness of the primary key: Infrastructure problems in human services work. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-26.
[4] Clark, J., Cormack, L., & Wang, S. (2021). Privacy Concerns in New York City Elections. Technical Report, Princeton University.
[5] Young, C., Martin, D., & Skinner, C. (2009). Geographically intelligent disclosure control for flexible aggregation of census data. International Journal of Geographical Information Science, 23(4), 457-482.
[6] Tschantz, M. C. (2022). What is Proxy Discrimination? In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). Association for Computing Machinery, New York, NY, USA, 1993–2003. https://doi.org/10.1145/3531146.3533242

Engagement Highlights

This learning experience is designed to apply across several different types of computing courses, demonstrating how personal privacy can be compromised within a database. The experience draws on several best practices in teaching computer science:

Interdisciplinarity: The examples discussed in the experience (in both the slides and the addendum resources) reference two vastly different arenas: voter protection and spatial cognition. These issues provide relevance for students with interests in political science, geography, and language. This is critical because most examples in introductory database courses rely on business or educational data.

Discussions and groups: The exercise asks students to play a game with a partner and use insights from the game to analyze a dataset through a particular lens. When students reconvene, they will likely find that their approaches worked in different ways, highlighting that the scope of the sample dictates the potential for identification.

Relevant content: The example in the learning experience is a current event that fits within ongoing political discussions of voting rights; the particular case intersects with voting laws in New York State. Students could easily apply the embedding example to something in their daily lives where they were the only one who did something.

Additionally, the exercise focuses on a few seminal principles in critical computing:

Privacy: If a record can be identified by a piece of data that is not itself personally identifiable, and that record also includes personally identifiable information, that personally identifiable information can then be used to ascertain other information from various sources [1].

Marginalization: Sometimes an attribute in a dataset might not identify an individual, but it does provide access to another table or information source that can yield discriminatory data. A prime example is redlining, where community membership can be used as a proxy for race or income [2].

[1] Darrow, J. & Lichtenstein, S. (2008). Do You Really Need My Social Security Number - Data Collection Practices in the Digital Age, 10 N.C. J.L. & Tech. 1. Available at: https://scholarship.law.unc.edu/ncjolt/vol10/iss1/2
[2] Hunt, D. B. (2005). Redlining. Encyclopedia of Chicago. http://www.encyclopedia.chicagohistory.org/pages/1050.html

Computer Science Details

Programming Language
None

Material Format and Licensing Information

Creative Commons License
CC BY-SA