Source: kaggle_survey_admissions.md
Comprehensive survey of Kaggle datasets relevant to the college admissions simulation (30 US elite colleges, agent-based model).
URL: https://www.kaggle.com/datasets/mexwell/elite-college-admissions
Author: mexwell
Size: ~371 KB
Coverage: ~2.4 million domestic students, aggregated across 139 selective universities (the file holds aggregate records, not one row per student)
Years: Entering classes of 2010-2015
Key Fields: Attendance rates, application rates, matriculation rates (yield), SAT/ACT score bands (50-point), college tier classifications (6 tiers), parental income data (13 income bins from tax records), in-state/out-of-state metrics, post-college outcomes (grad school, earnings, occupations)
College Tiers: Ivy Plus (8 Ivies + MIT, Stanford, Duke, Chicago), Other elite, Highly selective, Selective, Flagship, Public/Private
Key Finding: Children from top 1% income families are 2x as likely to attend Ivy-Plus colleges as children from middle-class families with comparable test scores
Relevance: 10/10 -- Directly covers our 30 target colleges. Income-stratified attendance and application rates provide calibration data for hook effects (donor/legacy status correlates with income). SAT score bands enable academic index calibration. Yield/matriculation rates by tier are essential for student decision modeling.
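
The income comparison in the key finding is a ratio of attendance rates within a fixed test-score band. A minimal sketch of that calculation follows; the bin labels, the rates, and the `relative_attendance` helper are all hypothetical placeholders, not values or names taken from the dataset.

```python
# Hypothetical attendance rates for one test-score band, keyed by parental
# income bin. These numbers are placeholders, not values from the dataset.
attendance_rate = {
    "middle class": 0.10,
    "top 10%": 0.12,
    "top 1%": 0.20,
}

def relative_attendance(rates, baseline):
    """Express each bin's attendance rate as a multiple of the baseline bin."""
    base = rates[baseline]
    if base <= 0:
        raise ValueError("baseline attendance rate must be positive")
    return {bin_label: rate / base for bin_label, rate in rates.items()}

ratios = relative_attendance(attendance_rate, baseline="middle class")
for bin_label, ratio in ratios.items():
    print(f"{bin_label}: {ratio:.2f}x the baseline attendance rate")
```

In the simulation, ratios like these could serve as targets when tuning income-correlated hook strengths, provided the comparison is always made within the same score band.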
URL: https://www.kaggle.com/datasets/kaggle/college-scorecard
Author: Kaggle (official)
Size: ~590 MB
Rows: All US postsecondary institutions, multiple years
Years: 1997-2017 (multiple annual snapshots)
Key Fields: Admission rates, SAT/ACT scores, enrollment, graduation rates, earnings after graduation, debt levels, institutional characteristics, demographics
License: CC0 Public Domain
Views: 183K+ | Downloads: 27K+
Relevance: 9/10 -- Massive official dataset covering every US college. Contains acceptance rates, test score distributions, and institutional data for all 30 target colleges. Time series enables trend analysis. Missing: individual student-level data, hook effects, round-specific data.
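
One way to pull college profiles for the target set out of a Scorecard file is sketched below. It assumes the standard College Scorecard column names `UNITID`, `INSTNM`, `ADM_RATE`, and `SAT_AVG`, and that missing values appear as `NULL` or `PrivacySuppressed`; the sample rows and the IDs in `TARGET_IDS` are placeholders, not real institutions.

```python
import csv
import io

# Placeholder rows standing in for one annual Scorecard snapshot.
SAMPLE = """UNITID,INSTNM,ADM_RATE,SAT_AVG
100001,College A,0.05,1500
100002,College B,NULL,PrivacySuppressed
100003,College C,0.60,1100
"""

TARGET_IDS = {"100001", "100002"}  # hypothetical IDs for the target colleges
MISSING = {"", "NULL", "PrivacySuppressed"}  # assumed missing-value markers

def to_float(value):
    """Convert a Scorecard cell to float, or None when the value is missing."""
    value = value.strip()
    return None if value in MISSING else float(value)

profiles = {}
for row in csv.DictReader(io.StringIO(SAMPLE)):
    if row["UNITID"] in TARGET_IDS:
        profiles[row["UNITID"]] = {
            "name": row["INSTNM"],
            "adm_rate": to_float(row["ADM_RATE"]),
            "sat_avg": to_float(row["SAT_AVG"]),
        }

print(profiles)
```

Filtering while streaming rows, rather than loading the full file first, keeps memory use small, which matters for a dataset of this size.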
URL: https://www.kaggle.com/datasets/thedevastator/u-s-department-of-education-college-scorecard-da
Author: The Devastator
Size: ~1.2 MB
Columns: 50+
Key Fields: UNITID, INSTNM, CITY, STABBR, CONTROL, REGION, LOCALE, Carnegie classification, diversity designations (HBCU, HSI, TRIBAL, AANAPII), enrollment totals (UGRD, GRAD), retention rates, repayment rates, graduation rates
Relevance: 8/10 -- Cleaned/curated version of College Scorecard with institutional characteristics. Good for enriching college profiles in the simulation.
URL: https://www.kaggle.com/datasets/samsonqian/college-admissions
Author: Samson Qian
Size: ~223 KB
Years: Pre-2018
Key Fields: Admission/class demographics by university
License: CC0 Public Domain
Views: 39K+ | Downloads: 4.4K+
Relevance: 7/10 -- Contains demographic breakdowns by university. Useful for calibrating diversity in student generation and admission outcomes.
URL: https://www.kaggle.com/datasets/yashgpt/us-college-data
Author: Yash Gupta
Size: ~32 KB
Rows: 777 colleges
Columns: 18
Key Fields: Private/Public, Apps (applications received), Accept (applications accepted), Enroll (enrolled), Top10perc, Top25perc, F.Undergrad, P.Undergrad, Outstate tuition, Room.Board, PhD%, Terminal%, S.F.Ratio, perc.alumni (donor rate), Expend, Grad.Rate
Relevance: 7/10 -- Contains application/acceptance/enrollment counts enabling acceptance rate and yield rate calculation. Alumni donation rate is a proxy for legacy/donor culture. Covers 777 colleges including all our targets.
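
The acceptance-rate and yield calculation mentioned above can be sketched as follows. The column names `Apps`, `Accept`, and `Enroll` come from the field list; the `Name` column, the two sample rows, and the `rates` helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows; the real file lists 777 colleges.
SAMPLE = """Name,Apps,Accept,Enroll
College A,10000,2000,1000
College B,5000,4000,1000
"""

def rates(row):
    """Return (acceptance_rate, yield_rate) for one college row.

    acceptance rate = Accept / Apps
    yield rate      = Enroll / Accept
    """
    apps = int(row["Apps"])
    accepted = int(row["Accept"])
    enrolled = int(row["Enroll"])
    acceptance_rate = accepted / apps if apps else None
    yield_rate = enrolled / accepted if accepted else None
    return acceptance_rate, yield_rate

results = {row["Name"]: rates(row) for row in csv.DictReader(io.StringIO(SAMPLE))}
for name, (acceptance, yld) in results.items():
    print(f"{name}: acceptance={acceptance:.1%}, yield={yld:.1%}")
```

Rates derived this way are institution-wide averages; they say nothing about round-specific (ED/EA/RD) behavior.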
URL: https://www.kaggle.com/datasets/joebeachcapital/university-and-college-rankings-us-news
Author: Joakim Arvidsson (from Andrew G. Reiter data)
Size: ~102 KB
Years: 1984-present (universities), 1985-present (LACs)
Key Fields: University name, rank, year, IPEDS ID, tier
License: ODbL
Relevance: 7/10 -- Historical rankings with IPEDS IDs for merging. Covers all 30 target colleges over 40 years. Useful for tier classification validation and selectivity trends.
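
Merging on IPEDS IDs might look like the sketch below. The records, the key names `ipeds_id` and `UNITID`, and the `merge_on_ipeds` helper are hypothetical; the actual column headers in each file should be checked before use.

```python
# Placeholder records; IDs and values are not taken from any real dataset.
rankings = [
    {"ipeds_id": "100001", "name": "College A", "rank": 3},
    {"ipeds_id": "100002", "name": "College B", "rank": 12},
]
scorecard = [
    {"UNITID": "100001", "adm_rate": 0.05},
    {"UNITID": "100003", "adm_rate": 0.40},
]

def merge_on_ipeds(left, right, left_key="ipeds_id", right_key="UNITID"):
    """Inner-join two lists of dicts on their IPEDS ID columns.

    Returns the merged rows plus the left-side IDs that found no match,
    so gaps in coverage are visible rather than silently dropped.
    """
    index = {str(row[right_key]).strip(): row for row in right}
    merged, unmatched = [], []
    for row in left:
        key = str(row[left_key]).strip()
        match = index.get(key)
        if match is None:
            unmatched.append(key)
        else:
            merged.append({**row, **match})
    return merged, unmatched

merged, unmatched = merge_on_ipeds(rankings, scorecard)
print(merged)
print("no match for:", unmatched)
```

Normalizing IDs to stripped strings before joining guards against one file storing them as integers and another as text.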
URL: https://www.kaggle.com/datasets/flyingwombat/us-news-and-world-reports-college-data
Author: Jason Nguyen (from StatLib/CMU)
Size: ~32 KB
Rows: 777 | Columns: 18
Year: 1995
Key Fields: Same as US College Data above (Apps, Accept, Enroll, Top10perc, etc.)
License: GPL 2
Views: 36K+ | Downloads: 7.6K+
Relevance: 6/10 -- Identical schema to US College Data but from 1995. Useful for historical comparison but data is dated.
URL: https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds
Author: Muhammad (hark99)
Size: ~19 MB
Key Fields: IPEDS survey data -- institutional characteristics, enrollment, financial aid, graduation rates, student demographics
Focus: Low-income students (earnings up to $48K), loan approval prediction
Relevance: 6/10 -- Comprehensive IPEDS data covering all institutions. Financial aid and low-income focus is useful for first-gen hook calibration. Large and requires filtering.
URL: https://www.kaggle.com/datasets/nivedithavudayagiri/college-enrollment-demographics-2021
Author: Niveditha Vudayagiri
Size: ~1.9 MB
Years: 2020-21 (12-month period July 2020 - June 2021)
Key Fields: UNITID (IPEDS ID), enrollment level, gender, 9 race/ethnicity categories, full-time/part-time, degree-seeking status, first-time/transfer/continuing
License: CC0 Public Domain
Relevance: 6/10 -- Demographic enrollment data by institution. Useful for calibrating racial/ethnic composition at each college, and first-time vs. transfer breakdowns.
URL: https://www.kaggle.com/datasets/thedevastator/national-universities-rankings-explore-quality-t
Author: The Devastator
Key Fields: University rankings, quality metrics, tuition
Relevance: 5/10 -- Rankings data useful for tier validation.
Author: Eswar Chand
Size: ~20 KB
Key Fields: GRE, GPA, Rank (institution prestige 1-4), Admit (binary), SES (1-3), Gender, Race (Hispanic/Asian/African-American)
Relevance: 5/10 -- Small dataset focused on international students. The SES and race fields are interesting for hook modeling, but the sample is limited.
Author: NCAA (official)
Size: ~364 KB
Years: 2003-2014 (11 seasons)
Key Fields: APR scores, eligibility rates, retention rates, athlete counts, sport type, institution names
Relevance: 5/10 -- Official NCAA data. While it tracks academic performance rather than admissions, the institution-sport mapping helps identify which of our 30 colleges have strong athletic programs (relevant for athlete hook calibration).
URL: https://www.kaggle.com/datasets/thedevastator/university-student-enrollment-data
Author: The Devastator
Size: ~898 KB
Key Fields: Student demographics (age, gender, nationality), academic details, course data, faculty info
Relevance: 4/10 -- General enrollment data, not US-specific. Privacy concerns noted by creator.
URL: https://www.kaggle.com/datasets/mohansacharya/graduate-admissions
Author: Mohan S Acharya
Size: ~10 KB
Key Fields: GRE (out of 340), TOEFL (out of 120), University Rating (1-5), SOP strength, LOR strength, Undergrad GPA (out of 10), Research Experience (binary), Chance of Admit (0-1)
License: CC0 Public Domain
Views: 727K+ | Downloads: 127K+
Relevance: 2/10 -- Most popular admissions dataset on Kaggle but focuses on graduate (not undergraduate) admissions from an Indian perspective. GRE/TOEFL rather than SAT. Not applicable to US undergrad simulation.
URL: https://www.kaggle.com/datasets/akshaydattatraykhare/data-for-admission-in-the-university
Author: Akshay Dattatray Khare
Rows: 400 | Columns: 8
Key Fields: GRE, TOEFL, University Rating, SOP, LOR, CGPA, Research, Chance of Admit
Relevance: 2/10 -- Nearly identical to Graduate Admission 2 (the Mohan S Acharya dataset above): same GRE/TOEFL/CGPA fields. Graduate-focused, Indian context.
URL: https://www.kaggle.com/datasets/tanmoyie/us-graduate-schools-admission-parameters
Author: Tanmoy Das
Size: ~4.5 KB
Key Fields: GRE, SOP, CGPA
Relevance: 2/10 -- Graduate school focus, tiny dataset.
URL: https://www.kaggle.com/datasets/amanace/student-admission-dataset
Author: Aman Kumar
Rows: 250 | Columns: 4
Key Fields: GPA (2.5-4.0), SAT Score (900-1600), Extracurricular Activities, Admission Status (Accepted/Waitlisted/Rejected)
Relevance: 3/10 -- Synthetic/fictional data with relevant fields (SAT, GPA, ECs, admission status including waitlist). Too small and not based on real data, but schema is useful as a template.
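
Using this schema as a template for synthetic student generation could look like the following sketch. The GPA and SAT ranges and the three status values come from the field list; the `Applicant` class, the 0-10 activity count, and the uniform, independent draws are assumptions for illustration and not a realistic applicant distribution.

```python
import random
from dataclasses import dataclass

STATUSES = ("Accepted", "Waitlisted", "Rejected")

@dataclass
class Applicant:
    gpa: float             # 2.5-4.0, the range stated for the dataset
    sat: int               # 900-1600, the range stated for the dataset
    extracurriculars: int  # activity count; the 0-10 range is an assumption
    status: str            # one of STATUSES

def make_applicant(rng):
    """Draw one placeholder applicant; fields are sampled independently."""
    return Applicant(
        gpa=round(rng.uniform(2.5, 4.0), 2),
        sat=rng.randrange(900, 1601, 10),  # SAT scores move in 10-point steps
        extracurriculars=rng.randint(0, 10),
        status=rng.choice(STATUSES),
    )

rng = random.Random(0)  # fixed seed so runs are reproducible
applicants = [make_applicant(rng) for _ in range(5)]
for applicant in applicants:
    print(applicant)
```

In a real generator the status would be an output of the admissions model rather than a random draw, and GPA and SAT would be correlated.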
URL: https://www.kaggle.com/datasets/annienelson/college-admissions-dataset
Author: Annie Nelson
Size: ~1.3 MB
Description: "A simple dataset for tracking a fictitious college admissions pipeline"
Relevance: 1/10 -- Explicitly fictitious pipeline tracking data. Not useful for calibration.
URL: https://www.kaggle.com/datasets/farhansadeek/university-admission-dataset
Author: Farhan Sadeek
Size: 756 bytes
Description: Created by a high school student about university skill requirements
Relevance: 1/10 -- Tiny, informal dataset about admission requirements.
URL: https://www.kaggle.com/datasets/pandanup/college-admission-data-set
Author: Anup Pandey
Size: ~2.3 KB
Relevance: 1/10 -- Very small, limited metadata available.
| Dataset | Primary Use Case |
|---|---|
| Elite College Admissions | Ivy+ tier attendance/application/yield rates by income bracket and SAT band. Calibrate hook effects (income as proxy for legacy/donor). |
| College Scorecard (official) | Acceptance rates, SAT/ACT distributions, enrollment for all 30 colleges across years. Primary source for college profiles. |
| US College Data | Application/acceptance/enrollment counts for yield rate calculation. Alumni donation rates. |
| Dataset | Primary Use Case |
|---|---|
| US News Rankings (historical) | Validate tier classifications. IPEDS IDs for cross-dataset merging. |
| College Admissions (Qian) | Demographic breakdowns by university for diversity calibration. |
| College Enrollment Demographics 2021 | Race/ethnicity composition at each college. First-time student breakdowns. |
| NCAA Academic Scores | Identify athletic programs at target colleges for athlete hook modeling. |
| Dataset | Primary Use Case |
|---|---|
| IPEDS Post-Secondary | Financial aid data for first-gen/low-income hook calibration. |
| Student Admission Dataset | Schema reference for synthetic student generation (SAT, GPA, EC, status). |
| College Scorecard (Devastator) | Cleaned institutional characteristics, diversity flags. |
The following simulation-critical data is not well represented on Kaggle:

- Early Decision / Early Action rates -- No Kaggle dataset breaks down ED vs. EA vs. RD acceptance rates. These must be sourced directly from Common Data Sets (CDS) or college websites.
- Legacy/Donor/Athlete hook effects -- No dataset quantifies the admissions boost from legacy, donor, or recruited-athlete status. The Elite College Admissions dataset's income stratification is the closest proxy. Primary sources: lawsuits (SFFA v. Harvard trial data), research papers (Arcidiacono et al.), CDS Section C7.
- Individual student-level application data -- Kaggle datasets are aggregated at the institutional level. None offers student-level data with application lists, outcomes by school, and demographic hooks. Closest sources: College Confidential scrapes, r/collegeresults, or HERI/CIRP surveys.
- Waitlist conversion rates -- Not available on Kaggle. These must be sourced from individual college CDS reports or NACAC surveys.
- High school feeder data -- No Kaggle dataset maps high schools to college placement. This must be sourced from school profile reports or Naviance-style aggregated data.
- Financial aid / merit scholarship impact on yield -- Limited on Kaggle. College Scorecard has some debt/aid data but nothing on how merit awards affect enrollment decisions.