Source: kaggle_survey_demographics.md
Identify Kaggle datasets for calibrating demographic parameters in the college admissions simulation: % URM students, % first-gen, gender ratios, and post-SFFA (2023) enrollment shifts.
URL: https://www.kaggle.com/datasets/kaggle/college-scorecard
Alt: https://www.kaggle.com/datasets/thedevastator/u-s-department-of-education-college-scorecard-da
| Attribute | Detail |
|---|---|
| Source | US Department of Education |
| Years | 1997--present (updated annually) |
| Institutions | ~7,000+ Title IV institutions |
| Size | ~589 MB |
| License | CC0 Public Domain |
| Format | CSV (Scorecard.csv) or SQLite |
| Variable | Description |
|---|---|
| `UGDS` | Total undergraduate enrollment |
| `UGDS_WHITE` | % White undergraduates |
| `UGDS_BLACK` | % Black undergraduates |
| `UGDS_HISP` | % Hispanic undergraduates |
| `UGDS_ASIAN` | % Asian undergraduates |
| `UGDS_AIAN` | % American Indian/Alaska Native |
| `UGDS_NHPI` | % Native Hawaiian/Pacific Islander |
| `UGDS_2MOR` | % Two or more races |
| `UGDS_NRA` | % Non-resident alien |
| `UGDS_UNKN` | % Race unknown |
| `UGDS_MEN` | % Male undergraduates |
| `UGDS_WOMEN` | % Female undergraduates |
| `PCTPELL` | % receiving Pell grants (proxy for low-income) |
| `PCTFLOAN` | % receiving federal student loans |
| `FIRST_GEN` | % first-generation students |
| `PAR_ED_PCT_1STGEN` | % parents with no college degree |
Provides race/ethnicity percentages for all 30 colleges in our simulation, once the full file is filtered to those institutions
PCTPELL serves as proxy for socioeconomic diversity
FIRST_GEN calibrates the first-gen hook (currently 1.4x multiplier)
Gender ratio data (UGDS_MEN/UGDS_WOMEN) for gender-balance modeling
Multi-year data allows trend analysis
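A minimal sketch of how the Scorecard columns above could be turned into per-college calibration inputs. The inline sample rows, the college names, and the helpers `to_float`, `urm_share`, and `load_calibration` are hypothetical. The sketch assumes the `UGDS_*` and `FIRST_GEN` columns are stored as fractions between 0 and 1, that the institution name column is `INSTNM`, and that suppressed cells appear as strings such as `NULL` or `PrivacySuppressed`; all of this should be checked against the data dictionary before use.

```python
import csv
import io

# Hypothetical sample rows standing in for Scorecard.csv; the values are made up.
SAMPLE = """INSTNM,UGDS_BLACK,UGDS_HISP,UGDS_AIAN,UGDS_NHPI,FIRST_GEN
Example University,0.07,0.12,0.002,0.001,0.15
Sample College,0.05,PrivacySuppressed,0.001,0.000,NULL
"""

URM_COLUMNS = ["UGDS_BLACK", "UGDS_HISP", "UGDS_AIAN", "UGDS_NHPI"]


def to_float(value):
    """Return a float, or None for blank or suppressed cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def urm_share(row):
    """Sum the URM columns; return None if any component is missing."""
    parts = [to_float(row.get(col)) for col in URM_COLUMNS]
    if any(p is None for p in parts):
        return None
    return sum(parts)


def load_calibration(handle, sim_colleges):
    """Map each simulated college to its URM and first-gen shares."""
    out = {}
    for row in csv.DictReader(handle):
        if row["INSTNM"] in sim_colleges:
            out[row["INSTNM"]] = {
                "urm": urm_share(row),
                "first_gen": to_float(row.get("FIRST_GEN")),
            }
    return out


calibration = load_calibration(
    io.StringIO(SAMPLE), {"Example University", "Sample College"}
)
```

Returning `None` rather than 0 for suppressed cells keeps missing data from silently deflating a college's URM share; with the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.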
Data dictionary: https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx
URL: https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
GitHub: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-03-10
| Attribute | Detail |
|---|---|
| Source | US Dept of Education, TuitionTracker, PayScale, NCES |
| Diversity Year | 2014 |
| Tuition Year | 2018-2019 |
| Historical Range | 1985-2016 (tuition trends) |
| Size | 1.9 MB |
| License | MIT |
tuition_cost.csv -- tuition and fees by school
diversity_school.csv -- enrollment by race/ethnicity/gender category
tuition_income.csv -- net cost by income bracket
salary_potential.csv -- early/mid-career pay
historical_tuition.csv -- tuition trends 1985-2016
student_diversity.csv -- diversity percentages

| Column | Type | Description |
|---|---|---|
| `name` | character | School name |
| `total_enrollment` | double | Total enrollment |
| `state` | character | State |
| `category` | character | Race/ethnicity/gender category |
| `enrollment` | double | Enrollment count for category |
Data is in long format (one row per school-category pair). Categories include racial/ethnic groups and gender.
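Because the data is long, each school's categories have to be pivoted into one record before they can serve as simulation parameters. The following is a rough sketch of that step; the rows, school names, and category labels are hypothetical placeholders, and `pivot_shares` is an illustrative helper, not part of the dataset.

```python
from collections import defaultdict

# Hypothetical long-format rows: (name, total_enrollment, category, enrollment).
rows = [
    ("Example University", 10000, "Women", 5400),
    ("Example University", 10000, "Black", 700),
    ("Example University", 10000, "Hispanic", 1200),
    ("Sample College", 2000, "Women", 1100),
    ("Sample College", 2000, "Black", 100),
]


def pivot_shares(long_rows):
    """Return {school: {category: enrollment / total_enrollment}}."""
    wide = defaultdict(dict)
    for name, total, category, count in long_rows:
        if total:  # skip schools with zero or missing totals
            wide[name][category] = count / total
    return dict(wide)


shares = pivot_shares(rows)
```

Each school ends up with one dictionary of category shares, which is the shape a per-college parameter table would need.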
Easy to use (clean, small, well-documented)
Good for quick prototyping of diversity parameters
Diversity data is from 2014 (stale for post-SFFA calibration)
Joinable with tuition data for economic analysis
URL: https://www.kaggle.com/datasets/umerhaddii/economic-diversity-and-student-outcomes-data
| Attribute | Detail |
|---|---|
| Source | Opportunity Insights (Chetty et al.) via NYT Upshot |
| Focus | Economic mobility, income quintiles |
| Institutions | ~2,200 colleges |
| License | CC0 |
| Variable | Description |
|---|---|
| `super_opeid` | Institution cluster ID |
| `name` | College name |
| `par_income_bin` | Parent household income group (percentile) |
| `par_income_lab` | Income group label |
| `attend` | Test-score-reweighted attendance rate |
| `rel_apply` | Relative application rate |
| `rel_attend` | Relative attendance rate |
Income quintile data is excellent for calibrating socioeconomic hooks
Mobility rate = access (% from bottom quintile) x success rate
Complements Pell Grant data from College Scorecard
Does NOT include race/ethnicity breakdowns directly
Key finding: highest-mobility colleges are mid-tier publics
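The mobility-rate formula above can be worked through in a few lines. This is an illustrative sketch only: the two colleges and their access and success values are invented to show how the product behaves, and `mobility_rate` is a hypothetical helper rather than anything shipped with the dataset.

```python
def mobility_rate(access, success_rate):
    """Mobility rate = access x success rate, both as fractions in [0, 1]."""
    for value in (access, success_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    return access * success_rate


# Hypothetical colleges: (share of students from bottom quintile, success rate).
colleges = {
    "Mid-tier public (hypothetical)": (0.15, 0.30),
    "Highly selective private (hypothetical)": (0.04, 0.60),
}
rates = {name: mobility_rate(a, s) for name, (a, s) in colleges.items()}

# Rank colleges from highest to lowest mobility rate.
ranked = sorted(rates, key=rates.get, reverse=True)
```

With these made-up inputs the mid-tier public comes out ahead (about 0.045 versus 0.024) despite its lower success rate, because its access is much higher; that is the mechanism behind the key finding noted above.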
URL: https://www.kaggle.com/datasets/nivedithavudayagiri/college-enrollment-demographics-2021
| Attribute | Detail |
|---|---|
| Source | IPEDS (NCES) |
| Period | July 2020 -- June 2021 |
| Institutions | All IPEDS-reporting institutions |
| License | CC0 Public Domain |
Gender: Male/female headcounts
Race/Ethnicity: Nine IPEDS categories (White, Black, Hispanic, Asian, AIAN, NHPI, Two or more, Non-resident alien, Unknown)
Level: Undergraduate vs. graduate
Attendance: Full-time vs. part-time
Student type: First-time, transfer, continuing
Provides institutional-level race/ethnicity enrollment counts
More granular than College Scorecard (full-time/part-time, grad/undergrad splits)
Single year (2020-21), so no trend analysis
IPEDS UNITID enables joining with other NCES datasets
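As a sketch of the UNITID join mentioned above, the snippet below merges two record sets keyed by UNITID and reports the IDs that fail to match. The IDs, field names such as `black_count`, and all values are placeholders; `inner_join` is a hypothetical helper.

```python
# Hypothetical records keyed by UNITID; IDs and values are placeholders.
ipeds_enrollment = {
    "100001": {"black_count": 350, "total_count": 5000},
    "100002": {"black_count": 120, "total_count": 2400},
}
scorecard = {
    "100001": {"PCTPELL": 0.18},
    "100003": {"PCTPELL": 0.42},
}


def inner_join(left, right):
    """Merge two UNITID-keyed dicts, keeping only IDs present in both."""
    return {uid: {**left[uid], **right[uid]} for uid in left.keys() & right.keys()}


joined = inner_join(ipeds_enrollment, scorecard)

# IDs present in only one source; worth reviewing before calibration.
unmatched = sorted(set(ipeds_enrollment) ^ set(scorecard))
```

Listing the unmatched IDs is a cheap way to catch colleges that drop out of the join because of ID mismatches between sources.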
URL: https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset
| Attribute | Detail |
|---|---|
| Source | IPEDS via Tableau Public |
| Size | 1.2 MB |
| Demographics | SAT/ACT, ethnicity, immigration, gender |
Combines test scores with demographics (useful for cross-referencing)
Documentation is sparse; exact columns unclear without download
May overlap with College Scorecard data
URL: https://www.kaggle.com/datasets/paultimothymooney/historically-black-colleges-and-universities
| Attribute | Detail |
|---|---|
| Source | NCES / TidyTuesday |
| Years | 1910-2016 (degree attainment), 1976-2015 (HBCU enrollment) |
hbcu_all.csv -- HBCU enrollment data
hs_students.csv / bach_students.csv -- completion rates by race
Gender-disaggregated files (male/female)
White, Black, Hispanic, Asian/Pacific Islander, American Indian/Alaska Native, Two or more races
Only covers HBCUs (none in our 30-college list)
Useful for understanding long-term racial trends in higher education
Degree attainment data (1910-2016) provides historical context
URL: https://www.kaggle.com/datasets/samsonqian/college-admissions
| Attribute | Detail |
|---|---|
| Alt name | "Admission/Class Demographics by University" |
| Size | 223 KB |
| License | CC0 |
| Year | ~2018 |
Name suggests class demographics by university
Small dataset, likely limited institution coverage
Column details unavailable without download
URL: https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds
| Attribute | Detail |
|---|---|
| Source | IPEDS/NCES |
| Size | 19.3 MB |
| Focus | Student loan repayment, low-income students |
| Updated | January 2020 |
Focused on financial aid/loans rather than demographics
May contain some enrollment demographics as secondary data
Useful for financial aid yield modeling
The Supreme Court's SFFA v. Harvard/UNC decision (June 2023) ended race-conscious admissions. Two admissions cycles have occurred since: Class of 2028 (Fall 2024) and Class of 2029 (Fall 2025).
1. Post-SFFA Enrollment Dashboard (Class Action)
URL: https://www.joinclassaction.us/post/the-post-sffa-enrollment-dashboard
Interactive dashboard tracking racial demographic shifts
2. James Murphy's Post-SFFA Enrollment Tracker
URL: https://jamessmurphy.com/2026/01/07/the-2025-post-sffa-enrollment-tracker/
Covers 29+ institutions, 3 data points (2022-23 average, 2024, 2025)
Categories: Hispanic, Black, Asian, White, Multiracial, Unreported
No downloadable dataset; manually compiled from university press releases
3. College Transitions Enrollment Demographics
URL: https://www.collegetransitions.com/dataverse/enrollment-demographics/
300+ institutions, Common Data Set source
Enrollment by gender, race, international status (2024-25)
| Institution | Black (pre) | Black (post) | Hispanic (pre) | Hispanic (post) | Asian (pre) | Asian (post) |
|---|---|---|---|---|---|---|
| Harvard | 15.3% | 14.0% | 16.0% | 11.3% | -- | -- |
| MIT | -- | 5% | -- | 11% | -- | 47% |
| Yale | 14% | 14% | 18% | 19% | 30% | 24% |
| Amherst | -- | -8pp | -- | -4pp | -- | -- |
| UNC | 10.5% | 7.8% | 10.8% | 10.1% | -- | -- |
| WashU | -- | -5pp (all POC) | -- | -- | -- | -- |
| Tufts | -- | -3pp | -- | -- | -- | -- |
| UVA | -- | -1.4pp (Black+Asian) | -- | stable | -- | -- |
| Duke | -- | +1pp (Black+Hisp combined) | -- | -- | -- | -6pp |
Cornell: Black 4.3% -> 4.8%, Hispanic 9.5% -> 11.1% (vs. 25-27% BHI pre-SFFA)
Most institutions show partial recovery but remain below pre-SFFA levels
Increasing % of students declining to report race, complicating analysis
At 76 formerly race-conscious schools: average Black share dropped from 6.4% (Fall 2023) to 5.3% (Fall 2024) -- lowest since 2015
Highly selective institutions overall: URM freshmen declined 7% from 2023 to 2024
Black students: -16.3%
Hispanic students: -1.8%
Asian enrollment generally increased at most selective schools
Public flagships showed minority enrollment increases (per Inside Higher Ed, Feb 2026)
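The figures above mix percentage-point differences (e.g. -8pp) with relative percent changes (e.g. -16.3%), and the two are easy to conflate when setting simulation parameters. The sketch below computes both from the pre/post shares quoted in this section; the helper names `pp_change` and `relative_change` are hypothetical.

```python
def pp_change(pre, post):
    """Change in percentage points between two shares given in percent."""
    return post - pre


def relative_change(pre, post):
    """Relative change in percent, measured against the pre-period share."""
    if pre == 0:
        raise ValueError("pre-period share must be non-zero")
    return (post - pre) / pre * 100.0


# Black enrollment shares (percent) quoted in this section: (pre, post).
observed = {
    "76 formerly race-conscious schools": (6.4, 5.3),
    "Harvard": (15.3, 14.0),
    "UNC": (10.5, 7.8),
}

summary = {
    name: (round(pp_change(pre, post), 1), round(relative_change(pre, post), 1))
    for name, (pre, post) in observed.items()
}
```

For the 76-school average this gives about -1.1 pp, or roughly -17% in relative terms, which shows why a shift should be labeled as one or the other before it is applied to a baseline.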
| Parameter | Data Source | Current Sim Value | Calibration Approach |
|---|---|---|---|
| % URM by college | College Scorecard (UGDS_BLACK + UGDS_HISP + UGDS_AIAN + UGDS_NHPI) | 1.2x hook | Set per-college URM % from data; adjust post-SFFA |
| % First-gen | College Scorecard (FIRST_GEN, PAR_ED_PCT_1STGEN) | 1.4x hook | Set per-college first-gen % from data |
| Gender ratio | College Scorecard (UGDS_MEN/UGDS_WOMEN) | Not modeled | Most selective schools ~50/50; LACs skew female |
| Low-income access | Opportunity Insights (par_income_bin) | Not modeled | Bottom-quintile access rate varies 3-20% by college |
| Post-SFFA shift | Murphy Tracker + institutional CDS | Not modeled | Black enrollment -16%, Hispanic -2%, Asian +5-7% at selective schools |
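One possible way to apply the post-SFFA shift row is to treat the listed figures as relative multipliers on a pre-SFFA composition and then renormalize. This is a sketch under stated assumptions, not the simulation's actual method: the baseline composition is invented, the Asian shift uses the midpoint of the +5-7% range, and `apply_shift` is a hypothetical helper.

```python
# Relative shifts from the calibration table; Asian uses the midpoint of +5-7%.
POST_SFFA_SHIFT = {"black": -0.16, "hispanic": -0.02, "asian": 0.06}


def apply_shift(baseline, shift):
    """Scale each group's share by (1 + shift), then renormalize to sum to 1."""
    scaled = {g: s * (1.0 + shift.get(g, 0.0)) for g, s in baseline.items()}
    total = sum(scaled.values())
    return {g: s / total for g, s in scaled.items()}


# Hypothetical pre-SFFA class composition (fractions summing to 1).
baseline = {
    "black": 0.10,
    "hispanic": 0.15,
    "asian": 0.25,
    "white": 0.40,
    "other": 0.10,
}
adjusted = apply_shift(baseline, POST_SFFA_SHIFT)
```

Renormalizing keeps the composition summing to 1, at the cost of nudging the unshifted groups (here `white` and `other`) slightly upward; whether that is acceptable depends on how the simulation consumes these shares.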
No single Kaggle dataset has post-SFFA (2024+) demographic data
First-gen data is inconsistently reported across institutions
URM definition varies across sources (some include Asian subgroups, some do not)
"Decline to state" race category is growing post-SFFA, which complicates percentage calculations
No Kaggle dataset directly maps to the 30 specific colleges in our simulation -- manual filtering required from College Scorecard
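The growing "decline to state" category means a reported share is only a lower bound on a group's true share. A minimal sketch of how that uncertainty could be bracketed, assuming shares are fractions and that the unknown share comes from a column such as `UGDS_UNKN`; the input values are invented and `share_bounds` is a hypothetical helper.

```python
def share_bounds(reported_share, unknown_share):
    """Bound a group's true share when some students decline to report race."""
    if not 0.0 <= unknown_share < 1.0:
        raise ValueError("unknown_share must be in [0, 1)")
    return {
        # No non-reporting students belong to the group.
        "lower": reported_share,
        # Non-reporters are distributed like reporters.
        "among_known": reported_share / (1.0 - unknown_share),
        # Every non-reporting student belongs to the group.
        "upper": reported_share + unknown_share,
    }


# Hypothetical college: 6% reported Black enrollment, 10% race unknown.
bounds = share_bounds(reported_share=0.06, unknown_share=0.10)
```

Comparing pre- and post-SFFA shares on the `among_known` basis is one way to reduce the distortion from rising non-response, though it assumes non-reporters resemble reporters, which may not hold.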