Kaggle Survey: US High School Datasets

Source: kaggle_survey_high_schools.md


Summary

Kaggle has limited but useful high school data for the simulation. The strongest datasets are NYC SAT scores by school, NCES private school data, and the Elite College Admissions dataset (Opportunity Insights). There is no single Kaggle dataset that maps specific high schools to college enrollment outcomes (feeder pipelines). Feeder school data exists only in journalism (Harvard Crimson) and now-defunct scraped sites.


Tier 1: Highly Relevant Datasets

1. Elite College Admissions (Opportunity Insights)

2. Average SAT Scores for NYC Public Schools

3. New York City SAT Results (2012)

4. US Private Schools (NCES PSS)

5. US Schools Dataset (130K+ schools)


Tier 2: Useful Supporting Datasets

6. US School Scores (SAT by State)

7. U.S. Education Datasets: Unification Project

8. USA Public Schools (NCES CCD)

9. Public and Private Schools with GeoCoords


Tier 3: Limited Relevance

10. US Highschool Students Dataset

11. High School Student Performance & Demographics

12. NCES US Education Data (State-level)

13. Graduation Rate Dataset


Feeder School Pipeline Data

What Exists (Not on Kaggle)

No Kaggle dataset maps specific high schools to college enrollment outcomes. The closest resources:

  1. Harvard Crimson Feeder Analysis (2024)
     - Source: https://interactives.thecrimson.com/2024/news/feeders
     - Methodology: Used Harvard Freshman Register (self-reported) over 15 years
     - Finding: 21 schools sent 2,216+ students to Harvard since 2009 ("1 in 11 students")
     - Named schools: Boston Latin (~18/yr), Phillips Andover (~11/yr), Phillips Exeter, Deerfield Academy, Harvard-Westlake, Thomas Jefferson HS (VA), Scarsdale HS, Lexington HS, Brookline HS, Belmont HS, Cambridge Rindge and Latin
     - NOT downloadable -- Harvard's Office of Institutional Research denied data requests

  2. Opportunity Insights Data Portal
     - URL: https://opportunityinsights.org/data/
     - College-level mobility data (not school-level feeders)
     - Social capital estimates for every school, college, and ZIP code
     - Tax-linked data covering ~48 million records (birth cohorts 1980-1991)
     - Some data available on Kaggle via the Elite College Admissions dataset

  3. IvyLeagueFeeders.com -- now defunct (redirects to unrelated site). Previously ranked high schools by % of graduates attending Ivy League.

  4. Stuyvesant HS Data (from journalism): 40.9% of 2021 graduating class admitted to Ivy League universities.

Why This Gap Exists


Recommendations for the Simulation

Immediately Usable

  1. NYC SAT dataset -- calibrate "public magnet" (Stuyvesant: ~1450 avg SAT) vs "urban public" (some NYC schools: ~800-900 avg SAT) archetypes
  2. US Private Schools (NCES PSS) -- download to identify elite boarding schools (Exeter, Andover, Choate, Deerfield, etc.) by name and confirm enrollment/type classification
  3. Elite College Admissions -- use income-stratified application/attendance rates to calibrate how SES affects college targeting in the simulation
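
The archetype calibration in item 1 could be sketched roughly as follows. This is a minimal illustration, not the simulation's actual code: the column names (`school_name`, `avg_sat_total`), the sample rows, and the two cutoffs are all assumptions, and the real CSV headers and score scale should be checked after download.

```python
import csv
import io
from statistics import mean

# Illustrative rows only (hypothetical schools and column names).
# Replace with the contents of the downloaded NYC SAT CSV.
SAMPLE_CSV = """school_name,avg_sat_total
Example Magnet HS,1450
Example Selective HS,1320
Example Urban HS A,880
Example Urban HS B,820
Example Comprehensive HS,1010
"""

# Hypothetical cutoffs on a 1600-point scale; tune against the real distribution.
MAGNET_MIN = 1300
URBAN_MAX = 950


def classify(avg_sat):
    """Map a school's average SAT total to a simulation archetype label."""
    if avg_sat >= MAGNET_MIN:
        return "public_magnet"
    if avg_sat <= URBAN_MAX:
        return "urban_public"
    return "other_public"


def archetype_means(csv_text):
    """Group schools by archetype and return the mean SAT total per group."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["avg_sat_total"])
        groups.setdefault(classify(score), []).append(score)
    return {name: mean(scores) for name, scores in groups.items()}


calibration = archetype_means(SAMPLE_CSV)
print(calibration)
```

The per-archetype means can then seed the simulation's school templates. The cutoffs are the main design choice and are worth revisiting once the full score distribution is visible.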

For Extended 300-School Mode

  1. US Schools Dataset (130K) -- could be filtered to build a more comprehensive school list with geographic data
  2. US School Scores -- income-bracket SAT distributions for calibrating score generation by school SES tier
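
One possible shape for the score generation in item 2 is sketched below. The tier names and the (mean, standard deviation) pairs are placeholders, not values taken from the US School Scores dataset, and the normal-distribution model is an assumption; the parameters would need to be fitted to the income-bracket data.

```python
import random

# Hypothetical (mean, standard deviation) per SES tier on a 400-1600 scale.
# Placeholders only: replace with values fitted to the income-bracket data.
TIER_PARAMS = {
    "low": (950, 150),
    "middle": (1050, 160),
    "high": (1200, 150),
}


def sample_sat(tier, rng):
    """Draw one SAT total for a student at a school in the given SES tier."""
    mu, sigma = TIER_PARAMS[tier]
    raw = rng.gauss(mu, sigma)
    # Clamp to the valid score range, then round to the nearest 10 points.
    clamped = min(1600, max(400, raw))
    return int(round(clamped / 10.0)) * 10


rng = random.Random(42)  # fixed seed so simulation runs are reproducible
scores = {
    tier: [sample_sat(tier, rng) for _ in range(1000)]
    for tier in TIER_PARAMS
}
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps runs reproducible and lets each simulated school carry its own generator if needed.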

Data Not Available on Kaggle (Needs Other Sources)