Source: kaggle_survey_high_schools.md
Kaggle has limited but useful high school data for the simulation. The strongest datasets are NYC SAT scores by school, NCES private school data, and the Elite College Admissions dataset (Opportunity Insights). There is no single Kaggle dataset that maps specific high schools to college enrollment outcomes (feeder pipelines). Feeder school data exists only in journalism (Harvard Crimson) and now-defunct scraped sites.
URL: https://www.kaggle.com/datasets/mexwell/elite-college-admissions
Records: ~2.4 million domestic students across 139 selective colleges
Years: Entering classes 2010-2015
Key Fields: Attendance rates, application rates, SAT/ACT scores (50-point bands), college classification (Ivy Plus / other elite / highly selective / selective), public/private, parental income (13 brackets via tax records), post-college outcomes (grad school, earnings, employers)
Source: Chetty et al. / Opportunity Insights (tax-linked data)
Relevance: HIGH -- provides income-stratified application and attendance rates by college tier, SAT band distributions by income. Can calibrate how student SES maps to elite college targeting. Does NOT have high-school-level data but has income-to-college-tier flows.
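A minimal sketch of how this calibration might work, assuming the downloaded file can be reduced to one attendance rate per income bracket and college tier. The field names (`income_bracket`, `tier`, `attend_rate`) and the toy numbers are hypothetical placeholders, not the dataset's confirmed schema or real values.

```python
from collections import defaultdict

# Toy rows with made-up numbers; real values must come from the downloaded CSV.
# Field names are hypothetical, not the dataset's confirmed schema.
ROWS = [
    {"income_bracket": "low", "tier": "Ivy Plus", "attend_rate": 0.002},
    {"income_bracket": "low", "tier": "Highly selective", "attend_rate": 0.006},
    {"income_bracket": "high", "tier": "Ivy Plus", "attend_rate": 0.012},
    {"income_bracket": "high", "tier": "Highly selective", "attend_rate": 0.018},
]


def tier_shares_by_bracket(rows):
    """Normalize attendance rates within each income bracket so they sum to 1."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["income_bracket"]] += row["attend_rate"]

    shares = defaultdict(dict)
    for row in rows:
        total = totals[row["income_bracket"]]
        share = row["attend_rate"] / total if total else 0.0
        shares[row["income_bracket"]][row["tier"]] = share
    return {bracket: dict(tiers) for bracket, tiers in shares.items()}


shares = tier_shares_by_bracket(ROWS)
```

The resulting per-bracket shares could serve as sampling weights when assigning simulated students of a given SES to a target tier. Whether normalizing within a bracket is the right choice depends on how the simulation treats students who attend no selective college.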
URL: https://www.kaggle.com/datasets/nycopendata/high-schools
Records: All accredited NYC high schools (400+)
Years: 2014-2015
Key Fields: School name, borough, building code, address, lat/long, enrollment, race breakdown, average SAT Math, SAT Reading/Writing scores
Source: NYC DOE + College Board
License: CC0 Public Domain
Relevance: HIGH -- individual school-level SAT distributions with demographics. Includes Stuyvesant, Bronx Science, Brooklyn Tech, and other specialized schools. Can calibrate "public magnet" and "urban public" archetypes.
URL: https://www.kaggle.com/datasets/new-york-city/new-york-city-sat-results
Records: NYC schools (2012 graduating seniors)
Years: 2012
Key Fields: School-level mean SAT scores for college-bound seniors
Source: NYC Open Data / Socrata API
License: CC0 Public Domain
Relevance: MEDIUM-HIGH -- older data but same school set as above. Useful for longitudinal comparison.
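The longitudinal comparison could be sketched as a join on a shared school identifier. This assumes both NYC files expose a common ID and report mean scores on the same scale; neither is confirmed here, and the IDs and scores below are placeholders rather than real data.

```python
def sat_change(earlier, later):
    """Return {school_id: later_mean - earlier_mean} for schools present in both years."""
    shared_ids = earlier.keys() & later.keys()
    return {sid: later[sid] - earlier[sid] for sid in shared_ids}


# Placeholder IDs and scores, not real data.
sat_2012 = {"SCHOOL_A": 1200, "SCHOOL_B": 1050, "SCHOOL_C": 980}
sat_2015 = {"SCHOOL_A": 1230, "SCHOOL_B": 1040, "SCHOOL_D": 1100}

changes = sat_change(sat_2012, sat_2015)
```

Schools that appear in only one year (`SCHOOL_C` and `SCHOOL_D` here) are dropped rather than imputed, which keeps the comparison honest if the two school sets turn out not to match exactly.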
URL: https://www.kaggle.com/datasets/joebeachcapital/us-private-schools
Records: 20,000+ private elementary and secondary schools
Years: 2017-2018 school year
Key Fields: School name, location, NCES school ID, geographic data. Full metadata is in the "Entities and Attributes" section (requires download).
Source: NCES Private School Universe Survey (PSS)
Relevance: HIGH -- the PSS includes school type (religious affiliation, boarding/day, coed/single-sex), enrollment by grade, student/teacher ratio. Can identify elite boarding (Exeter, Andover, Choate), elite day (Dalton, Brearley, Collegiate), and religious schools (Jesuit, Catholic diocesan). Needs download to confirm exact fields.
URL: https://www.kaggle.com/datasets/andrewmvd/us-schools-dataset
Records: 130,000+ schools (public AND private)
Years: ~2020 (last modified Aug 2020)
Key Fields: School name, address, geographic coordinates, public/private indicator
Source: US DHS / HIFLD
License: CC0 Public Domain
Relevance: MEDIUM -- geographic reference for all US schools but limited academic data. Useful for mapping school locations to demographics.
URL: https://www.kaggle.com/datasets/mexwell/us-school-scores
Records: State-level aggregations across multiple years
Years: 2005-2023
Key Fields: Year, state, total math/verbal scores, number of test-takers, average GPA by subject (arts, English, foreign languages, math, natural sciences, social sciences), scores by family income bracket
License: GPL 2
Relevance: MEDIUM -- state-level only (not school-level), but income-stratified SAT data is valuable for calibrating SES-to-score distributions.
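A rough sketch of turning income-stratified means into an SES-to-score distribution, assuming a normal distribution around each bracket mean. The bracket labels, the means, and the standard deviation are all hypothetical placeholders; the state-level file would supply the real means, and the spread would need to come from another source.

```python
import random

# Placeholder bracket means; real means would come from the dataset.
BRACKET_MEAN_MATH = {"under_20k": 460, "20k_to_40k": 480, "over_100k": 560}
ASSUMED_SD = 100  # hypothetical spread; not provided by the state-level file


def sample_math_score(bracket, rng):
    """Draw one SAT Math score for a bracket, clamped to 200-800, rounded to 10."""
    raw = rng.gauss(BRACKET_MEAN_MATH[bracket], ASSUMED_SD)
    clamped = min(800, max(200, raw))
    return int(round(clamped / 10) * 10)


rng = random.Random(0)  # fixed seed so runs are repeatable
scores = [sample_math_score("over_100k", rng) for _ in range(1000)]
```

Clamping piles a small amount of probability mass at 200 and 800; a truncated normal would avoid that if the tails matter for the simulation.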
URL: https://www.kaggle.com/datasets/noriuk/us-education-datasets-unification-project
Records: State-level, 50 states x multiple years
Years: 2009+
Key Fields: Enrollment by grade (K-12, grades 9-12, grade 12), NAEP math/reading scores (grades 4 & 8), revenue/expenditure data, demographics (7 race categories + gender in extended version)
Relevance: MEDIUM -- state-level enrollment and achievement trends. No school-level or graduation rate data.
URL: https://www.kaggle.com/datasets/carlosaguayo/usa-public-schools
Records: All US public K-12 schools
Years: 2014-2015
Source: NCES Common Core of Data (CCD)
Key Fields: Address, geographic data. Full field listing requires download.
Relevance: MEDIUM -- complements private school data for full school universe.
URL: https://www.kaggle.com/datasets/mysterymeatie/public-and-private-schools-in-the-us-w-geocoords
Records: Unknown (small dataset, 267 views)
Key Fields: Public/private flag, geographic coordinates
Relevance: LOW-MEDIUM -- geographic mapping only.
URL: https://www.kaggle.com/datasets/petermushemi/us-highschool-students-dataset
Key Fields: Sex, age, state, address (urban/rural), family size, parental education/job, GPA, math/reading/writing scores, attendance rate, suspensions, teacher support, counseling
Relevance: LOW -- synthetic or survey-based student-level data. No school names, no SAT, no college outcomes.
URL: https://www.kaggle.com/datasets/dillonmyrick/high-school-student-performance-and-demographics
Records: ~382 shared students across Math and Portuguese courses
Years: 2008 (Portuguese schools)
Key Fields: 33 attributes including grades (0-20 scale), parent education, study time, absences, alcohol consumption
Relevance: NONE -- Portuguese schools, not US. 0-20 grading scale. Not applicable.
URL: https://www.kaggle.com/datasets/georgetryfiates/national-center-for-education-statistics
Records: State-level, 2001-2019
Key Fields: Expenditures per pupil, revenues per student, FTE teachers, grades 9-12 enrollment
Relevance: LOW -- state-level financial data only.
URL: https://www.kaggle.com/datasets/rkiattisak/graduation-rate
Records: 1,000 rows
Key Fields: ACT, SAT, parental education, parental income, high school GPA, college GPA, years to graduate
WARNING: Synthetic/randomly generated data (not real). Intended for educational purposes only.
Relevance: NONE -- fake data.
No Kaggle dataset maps specific high schools to college enrollment outcomes. The closest resources:
Harvard Crimson Feeder Analysis (2024)
Source: https://interactives.thecrimson.com/2024/news/feeders
Methodology: Used Harvard Freshman Register (self-reported) over 15 years
Finding: 21 schools sent 2,216+ students to Harvard since 2009 ("1 in 11 students")
Named schools: Boston Latin (~18/yr), Phillips Andover (~11/yr), Phillips Exeter, Deerfield Academy, Harvard-Westlake, Thomas Jefferson HS (VA), Scarsdale HS, Lexington HS, Brookline HS, Belmont HS, Cambridge Rindge and Latin
NOT downloadable -- Harvard's Office of Institutional Research denied data requests
Opportunity Insights Data Portal
College-level mobility data (not school-level feeders)
Social capital estimates for every school, college, and ZIP code
Tax-linked data covering ~48 million records (birth cohorts 1980-1991)
Some data available on Kaggle via the Elite College Admissions dataset
IvyLeagueFeeders.com -- now defunct (redirects to an unrelated site). Previously ranked high schools by the percentage of graduates attending Ivy League schools.
Stuyvesant HS Data (from journalism): 40.9% of the 2021 graduating class was admitted to Ivy League universities.
FERPA privacy restrictions prevent schools from releasing student-level enrollment outcomes
College Board and ACT do not share school-level score distributions publicly
Feeder pipeline data is primarily held by admissions offices (proprietary)
Self-reported data (Naviance, school profiles) exists but is not aggregated into public datasets
School-level SAT/ACT distributions -- only available via individual school profiles (Naviance) or College Board data licenses
Feeder school → college enrollment mappings -- only in journalism, not systematic datasets
School-type classification (elite boarding, public magnet, etc.) -- would need to be hand-coded from NCES PSS + knowledge of specific schools
GPA distributions by school type -- no public dataset exists; must estimate from school profile data
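The hand-coding of school-type classification noted in the list above could start from a small rule function like the sketch below. The `School` fields and archetype labels are hypothetical; `selective_admission` in particular is a manually assigned flag, not a confirmed PSS field, and the exact PSS columns still need to be checked after download.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    sector: str                        # "public" or "private"
    boarding: bool = False
    religious: bool = False
    selective_admission: bool = False  # hand-coded flag, not a PSS field


def archetype(school: School) -> str:
    """Assign a coarse archetype; boarding takes precedence over religious."""
    if school.sector == "private":
        if school.boarding:
            return "private boarding"
        if school.religious:
            return "private religious"
        return "private day"
    return "public magnet" if school.selective_admission else "public"


examples = [
    School("Phillips Exeter", "private", boarding=True),
    School("Dalton", "private"),
    School("Stuyvesant", "public", selective_admission=True),
]
labels = {s.name: archetype(s) for s in examples}
```

Distinguishing "elite" schools from the rest of each archetype would still require a manually curated list, since nothing in the rules above captures selectivity or prestige for private schools.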