Kaggle Dataset Survey: College Admissions

Source: kaggle_survey_admissions.md


A survey of Kaggle datasets relevant to the college admissions simulation, an agent-based model of 30 elite US colleges.


Tier 1: Highly Relevant Datasets

1. Elite College Admissions

2. US Dept of Education: College Scorecard

3. College Scorecard (Devastator version)

4. College Admissions (Samson Qian)

5. US College Data

6. US University & College Rankings (US News Historical)


Tier 2: Moderately Relevant Datasets

7. U.S. News and World Report's College Data (1995)

8. Post Secondary Education Data (IPEDS)

9. College Enrollment Demographics 2021

10. National Universities Rankings

11. Admission (USA Colleges - International Students)

12. Academic Scores for NCAA Athletic Programs

13. University Student Enrollment Data


Tier 3: Low Relevance (Graduate/International Focus)

14. Graduate Admission 2

15. Data for Admission in the University

16. US Graduate School Admission Parameters

17. Student Admission Dataset (Synthetic)

18. College Admissions Dataset (Fictitious)

19. University Admission Dataset

20. College Admission Data Set


Must-Use (directly calibrate the simulation)

| Dataset | Primary Use Case |
|---|---|
| Elite College Admissions | Ivy+ tier attendance/application/yield rates by income bracket and SAT band. Calibrate hook effects (income as proxy for legacy/donor). |
| College Scorecard (official) | Acceptance rates, SAT/ACT distributions, enrollment for all 30 colleges across years. Primary source for college profiles. |
| US College Data | Application/acceptance/enrollment counts for yield rate calculation. Alumni donation rates. |
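
The yield calculation mentioned for US College Data reduces to two ratios: acceptance rate = accepted / applications, and yield = enrolled / accepted. The following is a minimal sketch of that arithmetic. The field names (`apps`, `accept`, `enroll`) and the counts are assumptions for illustration; the actual column names in the Kaggle file should be checked before use.

```python
from dataclasses import dataclass


@dataclass
class AdmissionCounts:
    """Counts for one college and one cycle (field names are hypothetical)."""
    apps: int     # applications received
    accept: int   # applicants admitted
    enroll: int   # admitted students who enrolled


def acceptance_rate(c: AdmissionCounts) -> float:
    """Share of applicants who were admitted."""
    if c.apps <= 0:
        raise ValueError("apps must be positive")
    return c.accept / c.apps


def yield_rate(c: AdmissionCounts) -> float:
    """Share of admitted students who enrolled."""
    if c.accept <= 0:
        raise ValueError("accept must be positive")
    return c.enroll / c.accept


# Illustrative numbers only, not taken from any dataset.
example = AdmissionCounts(apps=40_000, accept=2_000, enroll=1_600)
print(f"acceptance rate: {acceptance_rate(example):.3f}")
print(f"yield rate: {yield_rate(example):.3f}")
```

Guarding against zero denominators matters here because aggregated files sometimes contain rows with missing or zero counts.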

Should-Use (enrich and validate)

| Dataset | Primary Use Case |
|---|---|
| US News Rankings (historical) | Validate tier classifications. IPEDS IDs for cross-dataset merging. |
| College Admissions (Qian) | Demographic breakdowns by university for diversity calibration. |
| College Enrollment Demographics 2021 | Race/ethnicity composition at each college. First-time student breakdowns. |
| NCAA Academic Scores | Identify athletic programs at target colleges for athlete hook modeling. |
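
Cross-dataset merging on IPEDS IDs amounts to an inner join on a shared institution identifier. The sketch below shows one way to do this with plain Python dictionaries. The column name `ipeds_id`, the helper `merge_on_ipeds_id`, and all row values are hypothetical; each dataset may name and type its ID column differently, which is why the IDs are normalized to strings before matching.

```python
from typing import Dict, List


def merge_on_ipeds_id(
    rankings: List[dict],
    profiles: List[dict],
    key: str = "ipeds_id",  # hypothetical column name
) -> List[dict]:
    """Inner-join two lists of row dicts on a shared IPEDS identifier."""
    # Normalize IDs to strings so that 100001 and "100001" match.
    profiles_by_id: Dict[str, dict] = {str(row[key]): row for row in profiles}
    merged = []
    for row in rankings:
        match = profiles_by_id.get(str(row[key]))
        if match is not None:
            # On column name clashes, values from `rankings` win.
            merged.append({**match, **row})
    return merged


# Placeholder rows; IDs and values are invented for illustration.
rankings = [
    {"ipeds_id": "100001", "rank": 1},
    {"ipeds_id": "100002", "rank": 2},
]
profiles = [
    {"ipeds_id": 100001, "admit_rate": 0.05},
    {"ipeds_id": 100003, "admit_rate": 0.10},
]
merged = merge_on_ipeds_id(rankings, profiles)
print(merged)
```

Only institutions present in both inputs survive the join, so the row count after merging is a quick check on how many of the 30 target colleges each dataset actually covers.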

Nice-to-Have (supplementary)

| Dataset | Primary Use Case |
|---|---|
| IPEDS Post-Secondary | Financial aid data for first-gen/low-income hook calibration. |
| Student Admission Dataset | Schema reference for synthetic student generation (SAT, GPA, EC, status). |
| College Scorecard (Devastator) | Cleaned institutional characteristics, diversity flags. |

Data Gaps on Kaggle

The following simulation-critical data are not well represented on Kaggle:

  1. Early Decision / Early Action rates -- No Kaggle dataset breaks down acceptance rates by Early Decision (ED), Early Action (EA), and Regular Decision (RD). Must source from Common Data Sets (CDS) directly or from college websites.

  2. Legacy/Donor/Athlete hook effects -- No dataset quantifies the admissions boost from legacy, donor, or recruited athlete status. The Elite College Admissions dataset's income stratification is the closest proxy. Primary sources: lawsuits (Harvard SFFA trial data), research papers (Arcidiacono et al.), CDS Section C7.

  3. Individual student-level application data -- Kaggle datasets are aggregated at the institutional level. None provides student-level data with application lists, outcomes by school, and demographic hooks. Closest sources: College Confidential scrapes, r/collegeresults, or HERI/CIRP surveys.

  4. Waitlist conversion rates -- Not available on Kaggle. Must source from individual college CDS reports or NACAC surveys.

  5. High school feeder data -- No Kaggle dataset maps high school to college placement. Must source from school profile reports or Naviance-style aggregated data.

  6. Financial aid / merit scholarship impact on yield -- Limited on Kaggle. College Scorecard has some debt/aid data but does not capture the effect of merit awards on enrollment decisions.