Source: kaggle_survey_test_scores.md
Identify datasets to validate and improve the simulation's SAT score distributions by school type (elite boarding ~1479, elite day ~1424, public magnet ~1380, etc.).
URL: https://www.kaggle.com/datasets/mexwell/us-school-scores
Fields: Year, State.Code, State.Name, Total.Math, Total.Verbal, Total.Test-takers, academic subject GPAs (Arts/Music, English, Foreign Languages, Mathematics, Natural Sciences, Social Sciences/History), Family Income brackets (with Math, Verbal, Test-taker counts per bracket)
Years: 2005+ (multiple years, exact range unclear from metadata)
Granularity: State-level aggregates
Sample size: ~50 states x multiple years
File size: ~120 KB (zip)
Relevance: HIGH -- Income-bracket breakdowns of SAT Math/Verbal scores are directly useful for calibrating score distributions by socioeconomic tier. Income brackets can serve as a rough proxy for school type (e.g., elite private schools draw disproportionately from higher-income families).
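As a rough illustration of how the income-bracket breakdowns could be pooled across states, the sketch below computes a test-taker-weighted mean combined score per bracket. The row keys (`bracket`, `math`, `verbal`, `test_takers`) and the sample numbers are hypothetical placeholders, not the dataset's actual column names or values.

```python
from collections import defaultdict

def weighted_bracket_means(rows):
    """Pool state-level rows into one test-taker-weighted mean per income bracket.

    Each row is a dict with hypothetical keys: 'bracket', 'math', 'verbal',
    'test_takers'. Returns {bracket: mean combined (math + verbal) score}.
    """
    score_sum = defaultdict(float)  # sum of (combined score * test takers)
    taker_sum = defaultdict(int)    # total test takers per bracket
    for row in rows:
        n = row["test_takers"]
        if n <= 0:
            continue  # skip empty cells so they never enter the totals
        score_sum[row["bracket"]] += (row["math"] + row["verbal"]) * n
        taker_sum[row["bracket"]] += n
    return {b: score_sum[b] / taker_sum[b] for b in taker_sum}

# Made-up rows (two states, two brackets); not real data.
rows = [
    {"bracket": "under_20k", "math": 450, "verbal": 440, "test_takers": 100},
    {"bracket": "under_20k", "math": 470, "verbal": 460, "test_takers": 300},
    {"bracket": "over_100k", "math": 560, "verbal": 550, "test_takers": 200},
]
means = weighted_bracket_means(rows)
print(means)  # {'under_20k': 920.0, 'over_100k': 1110.0}
```

Weighting by test takers keeps small states from counting as much as large ones; an unweighted mean of state averages would answer a different question.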
URL: https://www.kaggle.com/datasets/mexwell/elite-college-admissions
Fields: Income brackets (13 levels by percentile, e.g., 75.0 = 70th-80th; 99.5 = top 1%), attendance rates (raw and test-score-reweighted), application rates, relative attendance conditional on application, in-state/out-of-state breakdowns, test score band analysis (50-point SAT bands), college tier classification, institution type (public/private)
Years: Entering classes 2010-2015
Granularity: University x income-bracket level
Coverage: 139 selective universities, classified into tiers: Ivy Plus, other elite, highly selective, selective
Sample size: ~2.4 million domestic students
File size: 371 KB (zip)
Relevance: VERY HIGH -- Contains SAT score bands cross-tabulated with income and college tier. Helps validate how SAT distributions map to application and attendance outcomes at college tiers close to those in our simulation (HYPSM, Ivy+, Near-Ivy, Selective).
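To line simulated scores up against the 50-point SAT bands, each score first has to be assigned to a band. A minimal sketch follows, assuming bands start at multiples of 50 counted from 400; the dataset's actual band edges have not been checked and may differ.

```python
def sat_band(score, width=50, floor=400, ceiling=1600):
    """Return the (low, high) band containing a 1600-scale SAT score.

    Assumes bands begin at `floor` and are `width` points wide; this is an
    illustrative convention, not the dataset's documented banding.
    """
    if not floor <= score <= ceiling:
        raise ValueError(f"score {score} outside {floor}-{ceiling}")
    low = floor + ((score - floor) // width) * width
    low = min(low, ceiling - width)  # fold the maximum score into the top band
    return low, low + width

# Example with the model's school-type means.
print(sat_band(1479))  # (1450, 1500)
print(sat_band(1380))  # (1350, 1400)
```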
URL: https://www.kaggle.com/datasets/thedevastator/unlocking-achievement-understanding-california-s
Fields: CDS code (County/District/School), School name, District name, County, Record Type, Grade 12 enrollment, Number of test takers, Average Critical Reading (200-800), Average Mathematics (200-800), Average Writing (200-800), Combined scores, Percent achieving 1500+ combined
Years: 2015-2016
Granularity: School-level, District-level, County-level, and State-level (multi-level)
Sample size: All California public schools (schools with <15 test takers excluded)
Relevance: HIGH -- School-level SAT averages for thousands of California schools. Can compare elite public magnets (e.g., Lowell, Mission San Jose) vs. average public schools to validate our school-type distributions. Scores are on the old 2400-point scale, so they require conversion to the 1600-point scale.
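The conversion from the 2400-point scale could be handled in two steps, sketched below under stated assumptions: drop Writing and sum Critical Reading + Mathematics to get a pre-2016 composite on a 400-1600 range, then map that composite to the current scale through a concordance table. No concordance values are supplied here; the `table` argument is a placeholder the caller must fill in from an official source, and the function names are hypothetical.

```python
from bisect import bisect_left

def old_composite_1600(critical_reading, math):
    """Pre-2016 CR + Math composite (400-1600); the Writing section is dropped."""
    return critical_reading + math

def concord(old_score, table):
    """Map an old-scale composite to the new scale by linear interpolation.

    `table` is a list of (old, new) pairs sorted by `old`. It must come from
    an official concordance table; none is bundled with this sketch.
    """
    olds = [o for o, _ in table]
    if not olds[0] <= old_score <= olds[-1]:
        raise ValueError("score outside the range covered by the table")
    i = bisect_left(olds, old_score)
    if olds[i] == old_score:
        return table[i][1]  # exact table entry, no interpolation needed
    (o0, n0), (o1, n1) = table[i - 1], table[i]
    return n0 + (n1 - n0) * (old_score - o0) / (o1 - o0)

# Do-nothing placeholder table (identity mapping), NOT real concordance values.
placeholder = [(400, 400), (1600, 1600)]
school_mean = old_composite_1600(520, 540)
print(concord(school_mean, placeholder))  # 1060.0
```

Because a concordance is not a straight line, applying it to a school's mean score only approximates the mean of converted student scores.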
2024 Report: https://reports.collegeboard.org/media/pdf/2024-total-group-sat-suite-of-assessments-annual-report-ADA.pdf
2023 Report: https://reports.collegeboard.org/media/pdf/2023-total-group-sat-suite-of-assessments-annual-report%20ADA.pdf
Data Archive: https://reports.collegeboard.org/sat-suite-program-results/data-archive
State Reports: Available for each state (NY, VA, CO, TX, PA, etc.)
Fields: Score distributions by percentile, mean scores, demographic breakdowns, participation rates
Years: Annual; the classes of 2023 and 2024 are the most recent
Granularity: National + state-level
Relevance: VERY HIGH -- The authoritative source for national SAT score distributions and percentiles. Essential for validating the overall score distribution shape and percentile mappings. The 2024 report uses the current 1600-point scale.
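One way to check the percentile mappings is to compute, for a handful of scores, the percentile rank in the simulated sample and subtract the published percentile. The sketch below assumes an "at or below" percentile definition and takes the published values as a caller-supplied dict; the numbers in the example are made up, not taken from any report.

```python
from bisect import bisect_right

def percentile_rank(sorted_scores, score):
    """Percent of simulated scores at or below `score` (list must be sorted)."""
    return 100.0 * bisect_right(sorted_scores, score) / len(sorted_scores)

def percentile_gaps(simulated, reference):
    """Compare simulated percentile ranks against published ones.

    `reference` maps score -> published percentile and must be filled in by
    the caller. Returns {score: simulated percentile - published percentile}.
    """
    ordered = sorted(simulated)
    return {s: percentile_rank(ordered, s) - p for s, p in reference.items()}

# Made-up simulated scores and a made-up reference point.
simulated = [1000, 1100, 1200, 1300]
gaps = percentile_gaps(simulated, {1200: 70.0})
print(gaps)  # {1200: 5.0}
```

A positive gap means the simulation places too many students at or below that score; consistently positive gaps near the top would suggest the simulated upper tail is too thin.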
URL: https://www.kaggle.com/datasets/new-york-city/new-york-city-sat-results
Fields: School name, borough, mean SAT scores by section
Years: 2012
Granularity: School-level (NYC public high schools)
Sample size: 400+ schools
License: CC0 Public Domain
Relevance: MEDIUM -- School-level NYC data useful for comparing specialized exam schools (Stuyvesant, Bronx Science) vs. neighborhood schools. Validates the spread between "public magnet" and "average public" in our model.
URL: https://www.kaggle.com/datasets/nycopendata/high-schools
Fields: School name, borough, building code, address, lat/lng, phone, enrollment with race breakdown, SAT section averages, testing rates
Years: 2014-2015
Granularity: School-level (one row per accredited NYC high school)
Sample size: All accredited NYC public high schools
File size: ~25 KB (zip)
Relevance: MEDIUM -- Richer demographic overlay than the 2012 dataset. Race breakdown per school combined with SAT scores enables cross-validation of demographic score patterns.
URL: https://www.kaggle.com/datasets/billbasener/sat-score-data-by-state
Fields: Likely state, SAT scores (Math/Verbal), participation rates
Years: Unknown (sourced from Kruschke 2015 textbook)
Granularity: State-level
Sample size: ~50 states
File size: ~1.4 KB (very small)
Relevance: LOW-MEDIUM -- Compact reference dataset, but limited detail. Useful as a quick validation check against other state-level data.
Fields: sex, sat_v (verbal SAT percentile), sat_m (math SAT percentile), sat_sum (combined percentile), hs_gpa, fy_gpa (first-year college GPA)
Years: Not specified
Granularity: Student-level (individual observations)
Sample size: 1,000 students
Relevance: MEDIUM -- One of the few student-level datasets with both SAT percentiles and GPA. Useful for calibrating the SAT-GPA correlation in our scoring model. Note: the SAT fields are percentiles, not raw scores.
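The SAT-GPA calibration would start from a plain Pearson correlation between `sat_sum` and `hs_gpa` (or `fy_gpa`). A minimal standard-library sketch follows; the sample values are made up for illustration and are not rows from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Made-up values standing in for the sat_sum and hs_gpa columns.
sat_sum = [85, 102, 96, 110, 78]
hs_gpa = [3.1, 3.6, 3.2, 3.9, 2.8]
print(round(pearson_r(sat_sum, hs_gpa), 3))
```

Since the SAT fields are percentiles, a rank-based measure such as Spearman's correlation may be worth comparing before fixing the model's coefficient.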
URL: https://www.kaggle.com/datasets/berkcangd/college-board-sat
Fields: Unknown (only metadata visible)
Years: Unknown
Granularity: Unknown
File size: ~10 KB (zip)
Relevance: LOW -- Very small dataset, limited documentation. Likely a subset or reformatting of official data.
URL: https://www.kaggle.com/datasets/sahirmaharajj/college-exam-results-sat
Fields: School-level SAT performance data (specific columns unknown)
Years: Unknown
Granularity: School-level
File size: ~10 KB (zip)
Relevance: LOW-MEDIUM -- Documentation sparse. School-level data could be useful but dataset appears small.
URL: https://www.kaggle.com/datasets/samsonqian/college-admissions
Fields: Unknown (metadata only)
Years: Unknown
Granularity: Unknown (likely university-level)
File size: ~223 KB (zip)
Relevance: LOW-MEDIUM -- Title suggests admissions data with demographics; may include test score ranges. Requires download to evaluate.
URL: https://www.kaggle.com/datasets/pandanup/college-admission-data-set
Fields: Unknown (metadata only)
Years: Unknown
File size: ~2 KB (zip)
Relevance: LOW -- Very small dataset.
URL: https://www.act.org/content/act/en/research/services-and-resources/data-and-visualization.html
Fields: Composite scores, subject scores (English, Math, Reading, Science), college readiness measures, superscore distributions (counts, percentages, cumulative)
Years: 10 most-recent graduating classes
Granularity: National, state, and regional
Format: Interactive Tableau dashboards (download availability unclear)
Relevance: MEDIUM -- Our simulation uses SAT primarily, but ACT data provides cross-validation via concordance tables. The superscore distribution data (counts, percentages, cumulative) is particularly useful for checking distribution shape.
URL: https://nces.ed.gov/programs/digest/d17/tables/dt17_226.60.asp
Fields: Average ACT Composite, English, Mathematics, Reading, Science scores; percentage of graduates taking ACT
Years: 2013 and 2017 (comparative)
Granularity: State-level (50 states + DC)
Score range: 1-36
Relevance: MEDIUM -- Useful for ACT-to-SAT concordance validation. Participation rates by state reveal selection effects (in mandatory-ACT states, averages are lower due to full population testing).
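The selection effect can be made visible by splitting states at a participation threshold and comparing mean composites. The sketch below assumes rows of `(state, percent_tested, avg_composite)` and a 90% cutoff; both the layout and the cutoff are illustrative choices, and the sample rows are made up.

```python
from statistics import mean

def split_by_participation(states, threshold=90.0):
    """Mean ACT composite for high- vs. low-participation states.

    `states` is a list of (name, percent_tested, avg_composite) tuples.
    Returns (high_mean, low_mean); a side with no states yields None.
    """
    high = [score for _, pct, score in states if pct >= threshold]
    low = [score for _, pct, score in states if pct < threshold]
    return (mean(high) if high else None, mean(low) if low else None)

# Made-up rows; not values from the NCES table.
states = [("A", 100, 19.0), ("B", 95, 20.0), ("C", 30, 24.0)]
high_mean, low_mean = split_by_participation(states)
print(high_mean, low_mean)  # 19.5 24.0
```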
| Dataset | Granularity | SAT Data? | Years | Relevance |
|---|---|---|---|---|
| US School Scores | State + Income | Yes (Math/Verbal) | 2005+ | HIGH |
| Elite College Admissions | Univ x Income | Yes (50-pt bands) | 2010-2015 | VERY HIGH |
| California SAT Results | School/District | Yes (CR/M/W) | 2015-16 | HIGH |
| College Board Annual Reports | National/State | Yes (full dist.) | 2023-2024 | VERY HIGH |
| NYC SAT Results (2012) | School | Yes (by section) | 2012 | MEDIUM |
| NYC High Schools (2014-15) | School | Yes + demographics | 2014-15 | MEDIUM |
| SAT by State (Kruschke) | State | Yes | Unknown | LOW-MEDIUM |
| OpenIntro SAT/GPA | Student | Percentiles + GPA | Unknown | MEDIUM |
| ACT Graduating Class | State/National | ACT only | 10 years | MEDIUM |
| NCES ACT by State | State | ACT only | 2013, 2017 | MEDIUM |
Current model values: Elite boarding ~1479, Elite day ~1424, Public magnet ~1380, Competitive suburban ~1310, Average suburban ~1190, Average public ~1130, Rural ~1080, Under-resourced urban ~1010
Best datasets for school-type gaps: California school-level data (#3) to compare top public magnets vs. average schools; NYC data (#5, #6) to compare specialized exam schools vs. neighborhood schools.
Best datasets for income effects: US School Scores (#1) with income-bracket breakdowns; Elite College Admissions (#2) with 13 income percentile levels cross-tabulated with SAT bands.
Best datasets for the national distribution: College Board Annual Reports (#4) -- the reference for national SAT percentile curves. Shows whether our simulated score distribution matches the real-world shape.
Best datasets for outcomes by college tier: Elite College Admissions (#2) -- shows which SAT score bands map to application and attendance at each college tier (Ivy Plus, other elite, highly selective, selective). Closely comparable to our tier structure.