Source: kaggle_research_summary.md
Comprehensive synthesis of findings from 8 parallel research agents that surveyed Kaggle and related sources for data relevant to the college-sim project. Total: 60+ datasets evaluated.
URL: https://www.kaggle.com/datasets/mexwell/elite-college-admissions
Size: 371 KB | License: Not specified
Coverage: 2.4M domestic students × 139 selective colleges, entering classes 2010-2015
Fields: Attendance rates, application rates, matriculation/yield rates — all broken down by 13 parental income percentile bins (via IRS tax records) × college tier × SAT 50-point bands
Tiers: Ivy Plus, Other Elite, Highly Selective, Selective (maps exactly to our HYPSM/Ivy+/Near-Ivy/Selective)
Why it's #1: Only dataset that cross-tabs SAT × income × college tier × yield at elite schools. Children from top 1% are 2× more likely to attend Ivy+ than middle-class students with equal test scores. Directly calibrates donor/legacy hooks as income proxies, and yield by income bracket.
| Rank | Dataset | URL | Size | Key Fields | Sim Relevance |
|---|---|---|---|---|---|
| 1 | Elite College Admissions (Chetty) | kaggle.com/mexwell/elite-college-admissions | 371 KB | Yield×income×tier×SAT, 2.4M students | 10/10 |
| 2 | College Scorecard (DOE official) | kaggle.com/kaggle/college-scorecard | 590 MB | ADM_RATE, SAT_AVG, demographics, net price, earnings — 6,500+ institutions | 9/10 |
| 3 | College Scorecard API (live) | api.data.gov/ed/collegescorecard/v1/schools | — | v3.5.3 Nov 2025; current ADM_RATE for all 30 sim colleges | 9/10 |
| 4 | College Common Data Sets (theriley106) | kaggle.com/theriley106/college-common-data-sets | 220 MB | CDS Section C (admit/yield), Section H (financial aid) — 173 schools | 8/10 |
| 5 | US College Data (ISLR 1995) | kaggle.com/yashgpt/us-college-data | 32 KB | Apps/Accept/Enroll, Top10%, GradRate, S/F, Alumni%, Expend — already integrated | 7/10 |
| 6 | US School Scores (mexwell) | kaggle.com/mexwell/us-school-scores | 120 KB | SAT Math/Verbal × income bracket, state-level, 2005-2023 | 7/10 |
| 7 | NYC SAT by School (nycopendata) | kaggle.com/nycopendata/high-schools | 25 KB | School-level SAT + demographics for 400+ NYC public schools, 2014-15 | 7/10 |
| 8 | US Private Schools (NCES PSS) | kaggle.com/joebeachcapital/us-private-schools | — | 20K+ private schools with type (boarding/day/religious), 2017-18 | 7/10 |
| 9 | Economic Diversity (Opportunity Insights) | kaggle.com/umerhaddii/economic-diversity-and-student-outcomes-data | — | Income quintile access rates for 2,200 colleges | 7/10 |
| 10 | College Tuition, Diversity & Pay (TidyTuesday) | kaggle.com/jessemostipak/college-tuition-diversity-and-pay | 2 MB | Net cost × income bracket, diversity, earnings, tuition trends 1985-2016 | 6/10 |
| 11 | College Enrollment Demographics 2021 (IPEDS) | kaggle.com/nivedithavudayagiri/college-enrollment-demographics-2021 | 2 MB | Race/gender enrollment counts, UNITID-joinable | 6/10 |
| 12 | US News Rankings (joebeachcapital) | kaggle.com/joebeachcapital/university-and-college-rankings-us-news | 102 KB | 40-year ranking history, IPEDS IDs for cross-referencing | 6/10 |
| 13 | NCAA Academic Scores (official) | kaggle.com/ncaa/academic-scores | 364 KB | APR, sport, institution — identifies athletic program strength | 5/10 |
| Parameter | Sim Value | Real Data | Status |
|---|---|---|---|
| Avg apps/student | 6.8 | 6.80 (CommonApp 2024-25) | ✅ EXACT |
| ED fills % of class | 40-60% | 40-55% Ivy+, up to 60% WashU/Vandy | ✅ GOOD |
| Hook: first-gen | 1.4× | Growing priority; modest boost confirmed | ✅ OK |
| Application volume growth | — | +47% since 2013; 10.19M total apps 2024-25 | ✅ Consistent |
Current: $1K aid = 2–4pp yield increase
Avery & Hoxby (NBER): $1K grant = ~11pp enrollment probability increase
Resolution: The 11pp figure applies to high-aptitude students choosing among all their options; at elite schools with high base yield, the effect is smaller. Recommended values:
Full-pay families: $1K = 1–2pp
Aid-receiving families: $1K = 3–5pp
Low-income at non-full-need schools: $1K = 5–8pp
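
The tiers above could be wired into the yield model roughly as follows. This is a minimal sketch, not the sim's actual code: the segment keys, the function name `aidYieldBoostPp`, the use of each range's midpoint, the linear scaling with grant size, and the `capPp` ceiling are all assumptions made for illustration.

```javascript
// Yield gain in percentage points per $1,000 of grant aid, as [low, high].
// Ranges come from the recommended tiers; the segment keys are hypothetical.
const YIELD_PP_PER_1K = {
  fullPay: [1, 2],
  aidReceiving: [3, 5],
  lowIncomeNonFullNeed: [5, 8],
};

// Returns the yield boost (percentage points) for a grant, using the range
// midpoint and assuming the effect scales linearly with grant size.
// capPp is an assumed ceiling so very large grants do not dominate the model.
function aidYieldBoostPp(segment, grantDollars, capPp = 30) {
  const range = YIELD_PP_PER_1K[segment];
  if (!range) throw new Error(`Unknown segment: ${segment}`);
  const midpoint = (range[0] + range[1]) / 2;
  const boost = midpoint * (grantDollars / 1000);
  return Math.min(boost, capPp);
}

console.log(aidYieldBoostPp("aidReceiving", 5000)); // midpoint 4pp × 5 = 20
```

Sampling uniformly within each range, rather than taking the midpoint, would be an equally reasonable choice if per-student variation is wanted.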
Per-school data now available (Class of 2029):
| School | Early Rate (ED unless noted) | Overall Rate | Multiplier (Early ÷ Overall) |
|---|---|---|---|
| Columbia | 13.2% | 3.9% | 3.4× |
| Northwestern | 23% | 7.7% | 3.0× |
| Duke | 19.7% | 6.7% | 2.9× |
| UChicago | ~20% | ~5% | ~4× |
| Dartmouth | 19.1% | 5.4% | 3.5× |
| Brown | 14.4% | 5.4% | 2.7× |
| UPenn | 14.2% | 5.4% | 2.6× |
| Harvard REA | ~9% | 3.6% | 2.5× |
| Yale SCEA | 10.8% | 4.5% | 2.4× |
| MIT EA | 5.2% | 4.5% | 1.2× (minimal) |
| Notre Dame REA | 12.9% | 11.2% | 1.2× (minimal) |
| Amherst ED | 29.3% | 9% | 3.3× |
| Williams ED | 23.3% | 8.3% | 2.8× |
| Middlebury ED | 30.5% | 10.7% | 2.9× |
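
The Multiplier column is the early rate divided by the overall rate, rounded to one decimal. The sketch below recomputes it for a few rows of the table as a sanity check; the function name `earlyMultiplier` and the record shape are illustrative only, not the sim's schema.

```javascript
// Early and overall admit rates (percent), copied from rows of the table.
const rates = [
  { school: "Columbia", early: 13.2, overall: 3.9 },
  { school: "Dartmouth", early: 19.1, overall: 5.4 },
  { school: "MIT", early: 5.2, overall: 4.5 },
  { school: "Notre Dame", early: 12.9, overall: 11.2 },
];

// Multiplier = early rate / overall rate, rounded to one decimal place.
function earlyMultiplier(earlyPct, overallPct) {
  if (overallPct <= 0) throw new RangeError("overall rate must be positive");
  return Math.round((earlyPct / overallPct) * 10) / 10;
}

for (const r of rates) {
  console.log(`${r.school}: ${earlyMultiplier(r.early, r.overall)}×`);
}
```
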
Elite College Admissions dataset shows yield varies dramatically with parental income:
Low-income students with full financial aid → yield ~90%+ at HYPSM
High-income students (many options) → yield ~65–75% at HYPSM
Current sim doesn't differentiate by income — all students use same yield curve
Current: flat 6.8 average
Reality: distribution around that mean varies by archetype
Elite boarding/feeder students: 8–12 apps (reach-heavy)
Average students: 5–8 apps (balanced)
Low-income/first-gen: 4–6 apps (fee waiver limits)
No Kaggle dataset has post-2024 enrollment. Best non-Kaggle sources:
James Murphy's Tracker (jamessmurphy.com) — 29 schools, 3 data points
College Transitions Dataverse (collegetransitions.com/dataverse) — 300+ schools
Key numbers (Class of 2028, first post-SFFA cycle):
Highly selective schools overall: URM freshmen -7% from 2023 to 2024
Black students: -16.3%
Hispanic students: -1.8%
Asian students: generally up 5–7%
At 76 formerly race-conscious schools: average Black share 6.4% → 5.3%
Simulation impact: URM hook (1.2×) is still valid as a preference signal — the reduction in outcomes is from the removal of explicit race consideration, but schools still value diversity through holistic means (geography, first-gen, etc.).
API endpoint: https://api.data.gov/ed/collegescorecard/v1/schools (free key, 1K req/hr)
Latest release: v3.5.3, November 2025 (Kaggle mirror is stale — use API directly)
Key insight from k7: Class of 2029 actual rates are significantly lower than Scorecard values. Always use actual rates (from CDS/CollegeVine) for the 30 simulation colleges, not Scorecard ADM_RATE.
Example discrepancies:
Vanderbilt: Scorecard ~12% → Actual 2029: 4.6%
UChicago: Scorecard ~6% → Actual 2029: ~5%
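
One way to enforce the "actual rates first" rule is a lookup that falls back to the Scorecard value only when no hand-collected rate exists. This is a sketch under assumptions: the map name `ACTUAL_2029` and the function `admitRate` are hypothetical, and the UChicago entry is the approximate ~5% figure quoted above.

```javascript
// Hand-collected Class of 2029 admit rates as fractions (from the examples).
// The UChicago value is approximate.
const ACTUAL_2029 = new Map([
  ["Vanderbilt", 0.046],
  ["UChicago", 0.05],
]);

// Prefer the hand-collected rate; fall back to Scorecard ADM_RATE otherwise.
function admitRate(school, scorecardAdmRate) {
  return ACTUAL_2029.get(school) ?? scorecardAdmRate;
}

console.log(admitRate("Vanderbilt", 0.12)); // uses 0.046, not the Scorecard value
```
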
Most useful Scorecard fields for simulation enrichment:
UGDS_BLACK, UGDS_HISP, UGDS_ASIAN — demographic % per college
FIRST_GEN / PAR_ED_PCT_1STGEN — first-gen % (calibrate 1.4× hook)
PCTPELL — Pell grant % (low-income proxy)
NPT41–NPT45 (each with a _PUB or _PRIV suffix) — net price by family income bracket, lowest to highest (yield model)
C150_4 — 6-year graduation rate
RET_FT4 — first-year retention rate
MD_EARN_WNE_P10 — median earnings 10yr post-entry
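
A request for one college could be assembled as below. Treat this as a hedged sketch: the API uses dotted field paths rather than the CSV column names listed above, and the three paths shown are believed to correspond to ADM_RATE, UGDS_BLACK and PCTPELL but should be verified against the Scorecard data dictionary. The function names and the `YOUR_API_KEY` placeholder are hypothetical.

```javascript
const SCORECARD_ENDPOINT = "https://api.data.gov/ed/collegescorecard/v1/schools";

// Dotted API field paths (assumed mappings; verify in the data dictionary).
const FIELDS = [
  "id",
  "school.name",
  "latest.admissions.admission_rate.overall",          // ADM_RATE
  "latest.student.demographics.race_ethnicity.black",  // UGDS_BLACK
  "latest.aid.pell_grant_rate",                        // PCTPELL
];

// Builds the query URL for a school-name search limited to selected fields.
function buildScorecardUrl(schoolName, apiKey, fields = FIELDS) {
  const url = new URL(SCORECARD_ENDPOINT);
  url.searchParams.set("api_key", apiKey);
  url.searchParams.set("school.name", schoolName);
  url.searchParams.set("fields", fields.join(","));
  return url.toString();
}

// Fetches matching schools; assumes a runtime with a global fetch (Node 18+).
async function fetchSchool(schoolName, apiKey) {
  const res = await fetch(buildScorecardUrl(schoolName, apiKey));
  if (!res.ok) throw new Error(`Scorecard request failed: ${res.status}`);
  const body = await res.json();
  return body.results; // array of matching institutions
}

console.log(buildScorecardUrl("Vanderbilt University", "YOUR_API_KEY"));
```

With 30 colleges and a 1K req/hr limit, one request per school stays well inside the quota, so no batching is needed for this step.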
NYC SAT by School — Stuyvesant ~1450, Bronx Science ~1390 vs. NYC average ~900. Can calibrate the "public magnet" (1380) vs. "average public" (1130) archetypes
NCES Private Schools (PSS) — 20K+ schools with boarding/day/religious type, can find Exeter/Andover/Choate by name
US Schools Dataset — 130K+ schools with geo, for extended 300-school mode
Feeder school → college enrollment pipelines — FERPA prevents this; only exists in journalism (Harvard Crimson's 21-school analysis) and defunct sites (IvyLeagueFeeders.com)
School-level SAT distributions — College Board doesn't release these publicly
GPA distributions by school type — no public dataset; must estimate
Fetch UGDS_* demographic % and FIRST_GEN for all 30 colleges.
Display in college cards alongside ISLR stats already integrated.
Update rateE multipliers in research_colleges.json for the 14 schools where we now
have precise Class of 2029 ED/EA rates (table above). MIT = 1.2×, Notre Dame = 1.2×,
Columbia = 3.4×, UChicago = 4×, Dartmouth = 3.5× are the biggest changes.
Add incomeYieldMod to student generation: low-income students get +10–15pp yield bonus
at HYPSM (full-need schools), high-income get -5pp. Feeds from Chetty dataset findings.
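
A minimal sketch of how `incomeYieldMod` might look. Several details are assumptions rather than anything specified above: the `"low"` / `"high"` tier labels, the use of the midpoint of the +10–15pp range, applying the high-income penalty only at full-need schools, and clamping the result to 0–100.

```javascript
// Yield adjustment in percentage points by income tier.
// Tier labels are hypothetical; the sim's student schema may differ.
function incomeYieldMod(incomeTier, isFullNeedSchool) {
  if (!isFullNeedSchool) return 0;        // recommendation targets full-need schools
  if (incomeTier === "low") return 12.5;  // midpoint of the +10–15pp range
  if (incomeTier === "high") return -5;
  return 0;                               // no adjustment stated for other tiers
}

// Applies the modifier to a base yield (percent) and clamps to [0, 100].
function adjustedYield(baseYieldPct, incomeTier, isFullNeedSchool) {
  const y = baseYieldPct + incomeYieldMod(incomeTier, isFullNeedSchool);
  return Math.max(0, Math.min(100, y));
}

console.log(adjustedYield(80, "low", true));  // 92.5
console.log(adjustedYield(80, "high", true)); // 75
```
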
Replace flat 6.8 with archetype-based distribution:
Elite boarding: Math.round(normalRandom(10, 2))
Public magnet: Math.round(normalRandom(8, 2))
Average: Math.round(normalRandom(6, 1.5))
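
The expressions above rely on a `normalRandom(mean, sd)` helper. If the sim does not already provide one, a Box-Muller version is a common stand-in. The sketch below is illustrative: the archetype keys, the injectable `rng` parameter, and the `minApps` / `maxApps` clamp are assumptions added here so that rounding never yields zero, negative, or implausibly large counts.

```javascript
// Box-Muller transform: draws one sample from Normal(mean, sd).
// rng is injectable so tests can supply a deterministic source.
function normalRandom(mean, sd, rng = Math.random) {
  const u1 = 1 - rng(); // shift [0, 1) to (0, 1] so Math.log never sees 0
  const u2 = rng();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + sd * z;
}

// Means and standard deviations from the archetype list; keys are hypothetical.
const APP_COUNT_PARAMS = {
  eliteBoarding: { mean: 10, sd: 2 },
  publicMagnet: { mean: 8, sd: 2 },
  average: { mean: 6, sd: 1.5 },
};

// Samples a whole number of applications, clamped to assumed bounds.
function sampleAppCount(archetype, rng = Math.random, minApps = 1, maxApps = 20) {
  const p = APP_COUNT_PARAMS[archetype];
  if (!p) throw new Error(`Unknown archetype: ${archetype}`);
  const n = Math.round(normalRandom(p.mean, p.sd, rng));
  return Math.max(minApps, Math.min(maxApps, n));
}

console.log(sampleAppCount("eliteBoarding"));
```

Note that the population-wide average then depends on the archetype mix, so the 6.8 figure is worth re-checking against generated students after this change.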
| File | Agent | Datasets |
|---|---|---|
| kaggle_survey_admissions.md | k1 | 20 datasets, overall survey |
| kaggle_survey_test_scores.md | k2 | 14 datasets, SAT/ACT distributions |
| kaggle_survey_college_stats.md | k3 | 13 datasets, rankings/institutional |
| kaggle_survey_demographics.md | k4 | 8 datasets + post-SFFA data |
| kaggle_survey_financial_aid.md | k5 | 10 datasets + Avery/Hoxby research |
| kaggle_survey_high_schools.md | k6 | 13 datasets, HS profiles |
| kaggle_scorecard_deepdive.md | k7 | Full field reference + 30-college ADM/SAT table |
| kaggle_survey_applications.md | k8 | CommonApp trend data + ED multiplier table |