Kaggle College & Admissions Data: Research Summary

Source: kaggle_research_summary.md


Synthesis of findings from 8 parallel research agents that surveyed Kaggle and related sources for data relevant to the college-sim project. Total: 60+ datasets evaluated.


Crown Jewel Dataset

Elite College Admissions (Opportunity Insights / Chetty)


Master Dataset Rankings

| Rank | Dataset | URL | Size | Key Fields | Sim Relevance |
|---|---|---|---|---|---|
| 1 | Elite College Admissions (Chetty) | kaggle.com/mexwell/elite-college-admissions | 371 KB | Yield×income×tier×SAT, 2.4M students | 10/10 |
| 2 | College Scorecard (DOE official) | kaggle.com/kaggle/college-scorecard | 590 MB | ADM_RATE, SAT_AVG, demographics, net price, earnings; 6,500+ institutions | 9/10 |
| 3 | College Scorecard API (live) | api.data.gov/ed/collegescorecard/v1/schools | | v3.5.3 Nov 2025; current ADM_RATE for all 30 sim colleges | 9/10 |
| 4 | College Common Data Sets (theriley106) | kaggle.com/theriley106/college-common-data-sets | 220 MB | CDS Section C (admit/yield), Section H (financial aid); 173 schools | 8/10 |
| 5 | US College Data (ISLR 1995) | kaggle.com/yashgpt/us-college-data | 32 KB | Apps/Accept/Enroll, Top10%, GradRate, S/F, Alumni%, Expend; already integrated | 7/10 |
| 6 | US School Scores (mexwell) | kaggle.com/mexwell/us-school-scores | 120 KB | SAT Math/Verbal × income bracket, state-level, 2005-2023 | 7/10 |
| 7 | NYC SAT by School (nycopendata) | kaggle.com/nycopendata/high-schools | 25 KB | School-level SAT + demographics for 400+ NYC public schools, 2014-15 | 7/10 |
| 8 | US Private Schools (NCES PSS) | kaggle.com/joebeachcapital/us-private-schools | | 20K+ private schools with type (boarding/day/religious), 2017-18 | 7/10 |
| 9 | Economic Diversity (Opportunity Insights) | kaggle.com/umerhaddii/economic-diversity-and-student-outcomes-data | | Income quintile access rates for 2,200 colleges | 7/10 |
| 10 | College Tuition, Diversity & Pay (TidyTuesday) | kaggle.com/jessemostipak/college-tuition-diversity-and-pay | 2 MB | Net cost × income bracket, diversity, earnings, tuition trends 1985-2016 | 6/10 |
| 11 | College Enrollment Demographics 2021 (IPEDS) | kaggle.com/nivedithavudayagiri/college-enrollment-demographics-2021 | 2 MB | Race/gender enrollment counts, UNITID-joinable | 6/10 |
| 12 | US News Rankings (joebeachcapital) | kaggle.com/joebeachcapital/university-and-college-rankings-us-news | 102 KB | 40-year ranking history, IPEDS IDs for cross-referencing | 6/10 |
| 13 | NCAA Academic Scores (official) | kaggle.com/ncaa/academic-scores | 364 KB | APR, sport, institution; identifies athletic program strength | 5/10 |

Simulation Parameter Validation

What's Well Calibrated (No Changes Needed)

| Parameter | Sim Value | Real Data | Status |
|---|---|---|---|
| Avg apps/student | 6.8 | 6.80 (CommonApp 2024-25) | ✅ EXACT |
| ED fills % of class | 40-60% | 40-55% Ivy+, up to 60% WashU/Vandy | ✅ GOOD |
| Hook: first-gen | 1.4× | Growing priority; modest boost confirmed | ✅ OK |
| Application volume growth | | +47% since 2013; 10.19M total apps 2024-25 | ✅ Consistent |

What Needs Recalibration

1. Financial Aid Elasticity (HIGH PRIORITY)

2. ED Acceptance Multipliers (MEDIUM PRIORITY)

Per-school data now available (Class of 2029):

| School | Early Rate (ED/EA) | Overall Rate | Multiplier |
|---|---|---|---|
| Columbia | 13.2% | 3.9% | 3.4× |
| Northwestern | 23% | 7.7% | 3.0× |
| Duke | 19.7% | 6.7% | 2.9× |
| UChicago | ~20% | ~5% | ~4× |
| Dartmouth | 19.1% | 5.4% | 3.5× |
| Brown | 14.4% | 5.4% | 2.7× |
| UPenn | 14.2% | 5.4% | 2.6× |
| Harvard REA | ~9% | 3.6% | 2.5× |
| Yale SCEA | 10.8% | 4.5% | 2.4× |
| MIT EA | 5.2% | 4.5% | 1.2× (minimal) |
| Notre Dame REA | 12.9% | 11.2% | 1.2× (minimal) |
| Amherst ED | 29.3% | 9% | 3.3× |
| Williams ED | 23.3% | 8.3% | 2.8× |
| Middlebury ED | 30.5% | 10.7% | 2.9× |
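
The multiplier column is the early-round admit rate divided by the overall admit rate, rounded to one decimal. A minimal Python sketch of that arithmetic, using three rows from the table above; the function name `early_multiplier` is illustrative and not part of the project:

```python
def early_multiplier(early_rate_pct: float, overall_rate_pct: float) -> float:
    """Return the early-round admit rate as a multiple of the overall rate."""
    if overall_rate_pct <= 0:
        raise ValueError("overall rate must be positive")
    return round(early_rate_pct / overall_rate_pct, 1)


# (early rate %, overall rate %) copied from the table above
rates = {
    "Columbia": (13.2, 3.9),
    "Dartmouth": (19.1, 5.4),
    "MIT EA": (5.2, 4.5),
}

multipliers = {
    school: early_multiplier(early, overall)
    for school, (early, overall) in rates.items()
}

for school, mult in multipliers.items():
    print(f"{school}: {mult}x")
```

Recomputing the column this way makes it easy to refresh the multipliers when a new admissions cycle's rates are published.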

3. Yield by Income Bracket (MEDIUM PRIORITY)

The Elite College Admissions dataset shows that yield varies substantially with parental income:

4. Application Count Distribution (LOW PRIORITY)


Post-SFFA Demographic Data

No Kaggle dataset has post-2024 enrollment. Best non-Kaggle sources:

Key numbers (Class of 2028, first post-SFFA cycle):

Simulation impact: the URM hook (1.2×) is still valid as a preference signal. The reduction in outcomes comes from the removal of explicit race consideration, but schools still value diversity through holistic means (geography, first-gen status, etc.).


College Scorecard API — 30 Colleges Data

API endpoint: https://api.data.gov/ed/collegescorecard/v1/schools (free key, 1K req/hr)

Latest release: v3.5.3, November 2025 (the Kaggle mirror is stale; use the API directly)

Key insight from k7: Class of 2029 actual rates are significantly lower than Scorecard values. Always use actual rates (from CDS/CollegeVine) for the 30 simulation colleges, not Scorecard ADM_RATE.

Example discrepancies:

Most useful Scorecard fields for simulation enrichment:


High School Data

What Exists on Kaggle

What Doesn't Exist Anywhere


Data Gaps That Won't Be Filled by Kaggle

  1. ED/EA/RD round-specific acceptance rates — Not on Kaggle; must source from CDS or CollegeVine
  2. Legacy/donor hook quantification — No dataset; only from litigation (Harvard SFFA trial)
  3. Individual student-level application data — FERPA; closest is r/collegeresults (not structured)
  4. Waitlist conversion rates — Individual CDS reports only
  5. Feeder school pipeline — Journalism only, not systematic

Priority 1: College Scorecard API (free, current)

Fetch the UGDS_* demographic percentages and FIRST_GEN for all 30 colleges, and display them in the college cards alongside the ISLR stats already integrated.
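
A request for these fields might be assembled as in the sketch below. It only builds the URL and does not call the API. The dotted field names and the comma-separated `id` filter are assumptions about the Scorecard API's developer-friendly naming and should be verified against the official data dictionary. The function name, the placeholder key, and the two example UNITIDs are illustrative only.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

# Assumed developer-friendly names for the Scorecard variables mentioned above;
# check each one against the Scorecard data dictionary before relying on it.
FIELDS = [
    "id",
    "school.name",
    "latest.admissions.admission_rate.overall",          # ADM_RATE
    "latest.student.demographics.race_ethnicity.white",  # UGDS_WHITE (one of UGDS_*)
    "latest.student.share_firstgeneration",              # FIRST_GEN
]


def build_scorecard_url(unit_ids, api_key):
    """Build a query URL for the given IPEDS UNITIDs (no network call is made)."""
    params = {
        "api_key": api_key,
        "id": ",".join(str(u) for u in unit_ids),  # assumed: comma-separated ID filter
        "fields": ",".join(FIELDS),
        "per_page": 100,
    }
    # safe="," keeps the comma-separated lists readable in the query string
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# UNITIDs shown for illustration; confirm each against IPEDS before use.
url = build_scorecard_url([166027, 166683], api_key="YOUR_API_KEY")
print(url)
```

Requesting only the needed fields keeps responses small, which helps stay under the 1K requests/hour limit noted above.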

Priority 2: ED Multiplier Tuning

Update the rateE multipliers in research_colleges.json for the 14 schools that now have per-school Class of 2029 ED/EA rates (see the table under ED Acceptance Multipliers). The biggest changes are MIT (1.2×), Notre Dame (1.2×), Columbia (3.4×), UChicago (~4×), and Dartmouth (3.5×).

Priority 3: Income-Differentiated Yield

Add incomeYieldMod to student generation: low-income students get a +10–15pp yield bonus at HYPSM (full-need schools), and high-income students get -5pp. This follows from the Chetty dataset findings.
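
One possible shape for this modifier is sketched below in Python. Several details are assumptions rather than project facts: the bracket labels (`"low"`, `"middle"`, `"high"`), the choice to apply the high-income penalty only at HYPSM, the default bonus of 10pp (the low end of the stated range), and the function names.

```python
# HYPSM as used above: Harvard, Yale, Princeton, Stanford, MIT.
HYPSM = {"Harvard", "Yale", "Princeton", "Stanford", "MIT"}


def income_yield_mod(income_bracket: str, school: str,
                     low_income_bonus: float = 0.10) -> float:
    """Return an additive yield adjustment as a fraction (0.10 means +10pp)."""
    if not 0.10 <= low_income_bonus <= 0.15:
        raise ValueError("low_income_bonus should stay within the 10-15pp range")
    if school not in HYPSM:
        return 0.0  # assumption: no adjustment outside HYPSM
    if income_bracket == "low":
        return low_income_bonus
    if income_bracket == "high":
        return -0.05
    return 0.0  # middle-income students are left unchanged


def adjusted_yield(base_yield: float, income_bracket: str, school: str) -> float:
    """Apply the modifier and clamp the result to a valid probability."""
    raw = base_yield + income_yield_mod(income_bracket, school)
    return min(1.0, max(0.0, raw))


# Hypothetical base yield of 0.80, used only to show the adjustment.
print(adjusted_yield(0.80, "low", "Harvard"))
print(adjusted_yield(0.80, "high", "Harvard"))
```

Clamping matters because a high base yield plus the low-income bonus could otherwise exceed 100%.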

Priority 4: Application Count Distribution

Replace flat 6.8 with archetype-based distribution:


Source Files

File Agent Datasets
kaggle_survey_admissions.md k1 20 datasets, overall survey
kaggle_survey_test_scores.md k2 14 datasets, SAT/ACT distributions
kaggle_survey_college_stats.md k3 13 datasets, rankings/institutional
kaggle_survey_demographics.md k4 8 datasets + post-SFFA data
kaggle_survey_financial_aid.md k5 10 datasets + Avery/Hoxby research
kaggle_survey_high_schools.md k6 13 datasets, HS profiles
kaggle_scorecard_deepdive.md k7 Full field reference + 30-college ADM/SAT table
kaggle_survey_applications.md k8 CommonApp trend data + ED multiplier table