Kaggle Survey: Student Demographics & Diversity Data

Source: kaggle_survey_demographics.md



Purpose

Identify Kaggle datasets for calibrating the demographic parameters of the college admissions simulation: share of underrepresented minority (URM) students, share of first-generation students, gender ratios, and enrollment shifts after the 2023 SFFA decision.


1. College Scorecard (US Dept of Education) -- PRIMARY RECOMMENDATION

URL: https://www.kaggle.com/datasets/kaggle/college-scorecard
Alt: https://www.kaggle.com/datasets/thedevastator/u-s-department-of-education-college-scorecard-da

| Attribute | Detail |
| --- | --- |
| Source | US Department of Education |
| Years | 1997--present (updated annually) |
| Institutions | ~7,000+ Title IV institutions |
| Size | ~589 MB |
| License | CC0 Public Domain |
| Format | CSV (Scorecard.csv) or SQLite |

Key Demographic Variables

| Variable | Description |
| --- | --- |
| UGDS | Total undergraduate enrollment |
| UGDS_WHITE | % White undergraduates |
| UGDS_BLACK | % Black undergraduates |
| UGDS_HISP | % Hispanic undergraduates |
| UGDS_ASIAN | % Asian undergraduates |
| UGDS_AIAN | % American Indian/Alaska Native |
| UGDS_NHPI | % Native Hawaiian/Pacific Islander |
| UGDS_2MOR | % Two or more races |
| UGDS_NRA | % Non-resident alien |
| UGDS_UNKN | % Race unknown |
| UGDS_MEN | % Male undergraduates |
| UGDS_WOMEN | % Female undergraduates |
| PCTPELL | % receiving Pell grants (proxy for low-income) |
| PCTFLOAN | % receiving federal student loans |
| FIRST_GEN | % first-generation students |
| PAR_ED_PCT_1STGEN | % parents with no college degree |
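As a rough sketch of how these columns could feed the simulation, the snippet below sums the four URM-related columns for each institution. It assumes the values are stored as fractions between 0 and 1 and that missing entries appear as strings such as `NULL` or `PrivacySuppressed`; check both assumptions against the actual download. The sample rows are made up, `INSTNM` is assumed to be the institution-name column, and `urm_share` is a hypothetical helper rather than part of any library.

```python
import csv
import io

# Columns that make up the URM share in this document's definition.
URM_COLUMNS = ["UGDS_BLACK", "UGDS_HISP", "UGDS_AIAN", "UGDS_NHPI"]

# Assumed markers for missing data; verify against the real file.
MISSING = {"", "NULL", "PrivacySuppressed"}


def urm_share(row):
    """Return the summed URM share, or None if any component is missing."""
    values = []
    for col in URM_COLUMNS:
        raw = (row.get(col) or "").strip()
        if raw in MISSING:
            return None
        values.append(float(raw))
    return sum(values)


# Made-up rows standing in for Scorecard.csv.
SAMPLE = """INSTNM,UGDS_BLACK,UGDS_HISP,UGDS_AIAN,UGDS_NHPI
Example College A,0.08,0.12,0.01,0.00
Example College B,NULL,0.10,0.00,0.00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = {r["INSTNM"]: urm_share(r) for r in rows}
for name, share in shares.items():
    print(name, share)
```

Returning `None` when any component is missing avoids silently understating the URM share; a simulation could instead fall back to another source for those colleges.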

Simulation Relevance: HIGH


2. College Tuition, Diversity, and Pay (TidyTuesday)

URL: https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
GitHub: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-03-10

| Attribute | Detail |
| --- | --- |
| Source | US Dept of Education, TuitionTracker, PayScale, NCES |
| Diversity Year | 2014 |
| Tuition Year | 2018-2019 |
| Historical Range | 1985-2016 (tuition trends) |
| Size | 1.9 MB |
| License | MIT |

Files Included

  1. tuition_cost.csv -- tuition and fees by school
  2. diversity_school.csv -- enrollment by race/ethnicity/gender category
  3. tuition_income.csv -- net cost by income bracket
  4. salary_potential.csv -- early/mid-career pay
  5. historical_tuition.csv -- tuition trends 1985-2016
  6. student_diversity.csv -- diversity percentages

diversity_school.csv Columns

| Column | Type | Description |
| --- | --- | --- |
| name | character | School name |
| total_enrollment | double | Total enrollment |
| state | character | State |
| category | character | Race/ethnicity/gender category |
| enrollment | double | Enrollment count for category |

Data is in long format (one row per school-category pair). Categories include racial/ethnic groups and gender.
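Because the file is in long format, per-school shares have to be derived by dividing each category count by the school's total. The sketch below shows one way to do that with the columns listed above. The sample rows and category labels are invented for illustration, and `category_shares` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the same layout as diversity_school.csv.
SAMPLE = """name,total_enrollment,state,category,enrollment
Example University,1000,Ohio,Women,550
Example University,1000,Ohio,Black,100
Example University,1000,Ohio,Hispanic,150
Other College,400,Texas,Women,220
Other College,400,Texas,Black,40
"""


def category_shares(lines):
    """Map school name -> {category: enrollment / total_enrollment}."""
    shares = defaultdict(dict)
    for row in csv.DictReader(lines):
        total = float(row["total_enrollment"])
        if total <= 0:
            continue  # skip schools with no usable denominator
        shares[row["name"]][row["category"]] = float(row["enrollment"]) / total
    return dict(shares)


wide = category_shares(io.StringIO(SAMPLE))
for school, cats in wide.items():
    print(school, cats)
```

Since gender and race/ethnicity categories overlap, the shares for one school are not expected to sum to 1.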

Simulation Relevance: MEDIUM


3. Economic Diversity and Student Outcomes (Opportunity Insights)

URL: https://www.kaggle.com/datasets/umerhaddii/economic-diversity-and-student-outcomes-data

| Attribute | Detail |
| --- | --- |
| Source | Opportunity Insights (Chetty et al.) via NYT Upshot |
| Focus | Economic mobility, income quintiles |
| Institutions | ~2,200 colleges |
| License | CC0 |

Key Variables

| Variable | Description |
| --- | --- |
| super_opeid | Institution cluster ID |
| name | College name |
| par_income_bin | Parent household income group (percentile) |
| par_income_lab | Income group label |
| attend | Test-score-reweighted attendance rate |
| rel_apply | Relative application rate |
| rel_attend | Relative attendance rate |

Simulation Relevance: MEDIUM-HIGH


4. College Enrollment Demographics 2021 (IPEDS)

URL: https://www.kaggle.com/datasets/nivedithavudayagiri/college-enrollment-demographics-2021

| Attribute | Detail |
| --- | --- |
| Source | IPEDS (NCES) |
| Period | July 2020 -- June 2021 |
| Institutions | All IPEDS-reporting institutions |
| License | CC0 Public Domain |

Demographics Tracked

Simulation Relevance: MEDIUM


5. "American University Data" IPEDS Dataset

URL: https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset

| Attribute | Detail |
| --- | --- |
| Source | IPEDS via Tableau Public |
| Size | 1.2 MB |
| Demographics | SAT/ACT, ethnicity, immigration, gender |

Simulation Relevance: LOW-MEDIUM


6. Historically Black Colleges and Universities (HBCU)

URL: https://www.kaggle.com/datasets/paultimothymooney/historically-black-colleges-and-universities

| Attribute | Detail |
| --- | --- |
| Source | NCES / TidyTuesday |
| Years | 1910-2016 (degree attainment), 1976-2015 (HBCU enrollment) |

Files

Race/Ethnicity Categories

White, Black, Hispanic, Asian/Pacific Islander, American Indian/Alaska Native, Two or more races

Simulation Relevance: LOW


7. College Admissions (Samson Qian)

URL: https://www.kaggle.com/datasets/samsonqian/college-admissions

| Attribute | Detail |
| --- | --- |
| Alt name | "Admission/Class Demographics by University" |
| Size | 223 KB |
| License | CC0 |
| Year | ~2018 |

Simulation Relevance: LOW-MEDIUM


8. Post Secondary Education Data (IPEDS)

URL: https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds

| Attribute | Detail |
| --- | --- |
| Source | IPEDS/NCES |
| Size | 19.3 MB |
| Focus | Student loan repayment, low-income students |
| Updated | January 2020 |

Simulation Relevance: LOW


Post-SFFA (Students for Fair Admissions) Enrollment Data

Background

The Supreme Court's SFFA v. Harvard/UNC decision (June 2023) ended race-conscious admissions. Two admissions cycles have occurred since: Class of 2028 (Fall 2024) and Class of 2029 (Fall 2025).

Key Data Sources (Non-Kaggle)

1. Post-SFFA Enrollment Dashboard (Class Action)

2. James Murphy's Post-SFFA Enrollment Tracker

3. College Transitions Enrollment Demographics

Class of 2028 (First Post-SFFA Cycle) Key Numbers

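| Institution | Black (pre) | Black (post) | Hispanic (pre) | Hispanic (post) | Asian (pre) | Asian (post) |
| --- | --- | --- | --- | --- | --- | --- |
| Harvard | 15.3% | 14.0% | 16.0% | 11.3% | -- | -- |
| MIT | -- | 5% | -- | 11% | -- | 47% |
| Yale | 14% | 14% | 18% | 19% | 30% | 24% |
| Amherst | -- | -8pp | -- | -4pp | -- | -- |
| UNC | 10.5% | 7.8% | 10.8% | 10.1% | -- | -- |
| WashU | -- | -5pp (all POC) | -- | -- | -- | -- |
| Tufts | -- | -3pp | -- | -- | -- | -- |
| UVA | -- | -1.4pp (Black+Asian) | -- | stable | -- | -- |
| Duke | -- | +1pp (Black+Hisp combined) | -- | -- | -- | -6pp |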
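The table mixes percentage shares with percentage-point (pp) changes, so it can help to convert the rows that report both a pre and a post value into a pp change and a relative change. The following is a minimal sketch using only figures from the table above; `change` is a hypothetical helper.

```python
def change(pre, post):
    """Return (percentage-point change, relative change in %), rounded to 1 dp."""
    pp = post - pre
    rel = pp / pre * 100.0
    return round(pp, 1), round(rel, 1)


# (institution, group) -> (pre %, post %), taken from the table above.
ROWS = {
    ("Harvard", "Black"): (15.3, 14.0),
    ("Harvard", "Hispanic"): (16.0, 11.3),
    ("Yale", "Black"): (14.0, 14.0),
    ("Yale", "Hispanic"): (18.0, 19.0),
    ("Yale", "Asian"): (30.0, 24.0),
    ("UNC", "Black"): (10.5, 7.8),
    ("UNC", "Hispanic"): (10.8, 10.1),
}

changes = {key: change(pre, post) for key, (pre, post) in ROWS.items()}
for (school, group), (pp, rel) in changes.items():
    print(f"{school:8s} {group:9s} {pp:+.1f} pp  {rel:+.1f} %")
```

The distinction matters for calibration: a drop of a few percentage points on a small base is a much larger relative change, and the simulation needs to know which of the two a given parameter expresses.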

Class of 2029 (Second Post-SFFA Cycle)


Recommendations for Simulation Calibration

Tier 1: Must-Use

  1. College Scorecard -- Race/ethnicity %, first-gen %, Pell %, gender for all 30 colleges
  2. Economic Diversity (Opportunity Insights) -- Income quintile access rates

Tier 2: Supplementary

  1. TidyTuesday Diversity -- Quick baseline (2014 data)
  2. IPEDS Enrollment 2021 -- Cross-validate Scorecard numbers

Tier 3: Post-SFFA Calibration

  1. James Murphy's Tracker -- Manual data for Class of 2028/2029 shifts
  2. College Transitions -- CDS-sourced 2024-25 enrollment

Specific Calibration Parameters

| Parameter | Data Source | Current Sim Value | Calibration Approach |
| --- | --- | --- | --- |
| % URM by college | College Scorecard (UGDS_BLACK + UGDS_HISP + UGDS_AIAN + UGDS_NHPI) | 1.2x hook | Set per-college URM % from data; adjust post-SFFA |
| % First-gen | College Scorecard (FIRST_GEN, PAR_ED_PCT_1STGEN) | 1.4x hook | Set per-college first-gen % from data |
| Gender ratio | College Scorecard (UGDS_MEN/UGDS_WOMEN) | Not modeled | Most selective schools ~50/50; LACs skew female |
| Low-income access | Opportunity Insights (par_income_bin) | Not modeled | Bottom-quintile access rate varies 3-20% by college |
| Post-SFFA shift | Murphy Tracker + institutional CDS | Not modeled | Black enrollment -16%, Hispanic -2%, Asian +5-7% at selective schools |
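One possible way to apply the post-SFFA shift to a per-college baseline is sketched below. It reads the figures in the last row as relative changes, which the source data does not state explicitly, and uses 6% for Asian enrollment as an assumed midpoint of the 5-7% range. The baseline shares are invented, and `apply_relative_shift` is a hypothetical helper, not existing simulation code.

```python
def apply_relative_shift(shares, rel_shift):
    """Scale the listed groups by (1 + shift) and rebalance the others.

    Groups named in rel_shift get exactly the requested relative change;
    the remaining groups are scaled by a common factor so that the total
    stays the same as before.
    """
    total = sum(shares.values())
    shifted = {g: shares[g] * (1.0 + rel_shift[g]) for g in rel_shift}
    rest = [g for g in shares if g not in rel_shift]
    rest_before = sum(shares[g] for g in rest)
    rest_after = total - sum(shifted.values())
    if rest_before <= 0 or rest_after < 0:
        raise ValueError("cannot rebalance the remaining groups")
    scale = rest_after / rest_before
    result = dict(shifted)
    for g in rest:
        result[g] = shares[g] * scale
    return result


# Invented baseline shares for one college (fractions summing to 1).
baseline = {"black": 0.10, "hispanic": 0.15, "asian": 0.25,
            "white": 0.40, "other": 0.10}

# Assumed relative shifts; 0.06 is a midpoint of the 5-7% range.
shifts = {"black": -0.16, "hispanic": -0.02, "asian": 0.06}

post = apply_relative_shift(baseline, shifts)
for group, share in post.items():
    print(f"{group:9s} {share:.4f}")
```

Rebalancing only the unlisted groups keeps the stated shifts intact; renormalizing all groups together would dilute them slightly.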

Data Gaps