Source: kaggle_survey_college_stats.md
Date: 2026-03-01
Purpose: Survey Kaggle for datasets with college rankings, institutional stats, and metrics beyond what the simulation already has.
The project's research_colleges.json and Common Data Set (CDS) research already cover the following for all 30 target colleges:
Acceptance rates (overall, ED/EA, RD)
SAT/ACT middle-50 ranges
Average unweighted GPA
Yield rates, class sizes, total applicants
ED/EA round types and policies
Testing policy
From the ISLR College dataset (1995 vintage, already referenced in the codebase):
Top10perc, Top25perc (% from top of HS class)
Grad.Rate (graduation rate)
S.F.Ratio (student-faculty ratio)
perc.alumni (alumni donation rate)
Expend (instructional expenditure per student)
Apps, Accept, Enroll counts
Outstate tuition, Room.Board, Books, Personal costs
F.Undergrad, P.Undergrad enrollment
PhD%, Terminal% of faculty
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/kaggle/college-scorecard |
| Institutions | ~6,500 US colleges/universities |
| Years | 1996-2023 (annual files) |
| Size | ~590 MB |
| License | US Government Works |
| Coverage of 30 target schools | 30/30 (all Title IV institutions) |
Key Columns (beyond ISLR):
ADM_RATE — Acceptance rate (more current than ISLR)
SAT_AVG — Average SAT score
SATVR25/75, SATMT25/75, SATWR25/75 — SAT component percentiles
ACTCM25/75, ACTEN25/75, ACTMT25/75 — ACT percentiles
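COSTT4_A — Average annual cost of attendance (academic-year institutions)
NPT4_PUB, NPT4_PRIV — Average net price after aid (public / private institutions); NPT41_* through NPT45_* break it down by family income bracket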
MD_EARN_WNE_P10 — Median earnings 10 years post-entry
MD_EARN_WNE_P6 — Median earnings 6 years post-entry
C150_4 — 6-year completion rate (more precise than ISLR Grad.Rate)
RET_FT4 — First-year retention rate
DEBT_MDN — Median federal loan debt of borrowers leaving the institution (GRAD_DEBT_MDN is the completers-only figure)
RPY_3YR_RT — 3-year loan repayment rate
PCTPELL — % receiving Pell grants (socioeconomic indicator)
PCTFLOAN — % receiving federal loans
UGDS — Total undergraduate enrollment
UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, etc. — Racial demographics
PCIP* — Degree distribution by CIP field (50+ fields)
HBCU, MENONLY, WOMENONLY, RELAFFIL — Institutional flags
LOCALE — Urbanization level (city, suburb, town, rural)
CCBASIC — Carnegie Classification
What's NEW beyond ISLR:
Post-graduation earnings (6yr, 10yr) — enables ROI modeling
Student debt and repayment rates
Net price by family income bracket (financial accessibility)
First-year retention rate (student satisfaction proxy)
Racial/ethnic demographic breakdowns
Pell grant % (socioeconomic diversity metric)
20+ years of longitudinal data (trend analysis)
Degree field distribution
Carnegie Classification codes
Verdict: ESSENTIAL — Most comprehensive single source. Covers all 30 schools with 1,700+ variables and 20+ years of data. Far exceeds ISLR on every dimension.
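A minimal sketch of pulling a few of the columns above for the target schools from one Scorecard CSV, using only the standard library. The UNITID values and sample rows are made up for illustration, and `load_targets` and `to_float` are hypothetical helper names. The raw Scorecard CSVs mark missing cells as `NULL` or `PrivacySuppressed`, so those are mapped to `None` instead of being parsed.

```python
import csv
import io

# A subset of the columns listed above. UNITID and INSTNM are the
# Scorecard's institution ID and institution name columns.
COLUMNS = ["ADM_RATE", "MD_EARN_WNE_P10", "DEBT_MDN", "RET_FT4", "C150_4", "PCTPELL"]
MISSING = {"", "NULL", "PrivacySuppressed"}  # markers for absent values


def to_float(raw):
    """Return a float, or None when the cell holds a missing-value marker."""
    raw = raw.strip()
    return None if raw in MISSING else float(raw)


def load_targets(csv_file, target_ids):
    """Map UNITID -> {column: value} for the requested institutions only."""
    rows = {}
    for row in csv.DictReader(csv_file):
        if row["UNITID"] in target_ids:
            rows[row["UNITID"]] = {
                "INSTNM": row["INSTNM"],
                **{col: to_float(row[col]) for col in COLUMNS},
            }
    return rows


# Made-up sample standing in for one Scorecard year file.
SAMPLE = """UNITID,INSTNM,ADM_RATE,MD_EARN_WNE_P10,DEBT_MDN,RET_FT4,C150_4,PCTPELL
900001,Example University,0.07,85000,12000,0.98,0.96,0.18
900002,Sample College,0.12,PrivacySuppressed,NULL,0.95,0.93,0.20
900003,Other Institute,0.55,48000,21000,0.80,0.62,0.41
"""

targets = load_targets(io.StringIO(SAMPLE), {"900001", "900002"})
```

Filtering while reading keeps memory use small, which matters for files in the hundreds of megabytes; with a real file, pass an open file handle in place of the `io.StringIO` sample.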
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/tunguz/college-scorecard-raw-data |
| Size | ~304 MB |
| License | US Government Works |
| Coverage | 30/30 |
Same underlying data as the kaggle/college-scorecard dataset but in raw CSV format (individual year files). Useful if you need year-by-year files rather than the merged SQLite database.
Verdict: Alternative mirror of Scorecard data. Use if the kaggle/college-scorecard format is inconvenient.
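If the year-by-year files are used, a per-school trend can be assembled by reading the same column from each file. A rough sketch, assuming each year's file has been opened separately and keyed by year; the file handling, the sample values, and the `admission_trend` name are illustrative and not part of either dataset.

```python
import csv
import io

MISSING = {"", "NULL", "PrivacySuppressed"}


def admission_trend(year_files, unitid):
    """Return [(year, ADM_RATE)] sorted by year for one institution.

    year_files maps a year to an open CSV file for that year. Years in
    which the institution or the value is missing are skipped.
    """
    trend = []
    for year, handle in year_files.items():
        for row in csv.DictReader(handle):
            if row["UNITID"] != unitid:
                continue
            raw = row["ADM_RATE"].strip()
            if raw not in MISSING:
                trend.append((year, float(raw)))
            break  # at most one row per institution per year
    return sorted(trend)


# Made-up year files; real ones would be opened from disk.
files = {
    2012: io.StringIO("UNITID,ADM_RATE\n900001,0.10\n900002,0.30\n"),
    2011: io.StringIO("UNITID,ADM_RATE\n900001,0.12\n"),
    2013: io.StringIO("UNITID,ADM_RATE\n900001,NULL\n"),
}
trend = admission_trend(files, "900001")
```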
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/theriley106/university-statistics |
| Institutions | 311 US universities |
| Year | ~2017-2018 snapshot |
| Size | 34 KB |
| License | CC0 Public Domain |
| Coverage of 30 target schools | ~28-30/30 (a top-311 list should include most or all target schools) |
Columns:
Ranking (US News)
Acceptance-Rate
ACT-Avg, SAT-Avg
Average High School GPA
Tuition
Cost after Financial Aid
Percent Receiving Aid
Business Reputation Score
Engineering Reputation Score
Enrollment Size
Public/Private
Region, City, State, Zip
What's NEW beyond ISLR:
ACT/SAT average scores (ISLR has no test scores)
Business & Engineering reputation scores (unique)
Average high school GPA of enrolled students
Cost after financial aid
US News ranking position
Verdict: USEFUL — Compact, clean dataset with reputation scores not found elsewhere. The business/engineering reputation scores could inform simulation prestige modeling. However, it's a single-year snapshot.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/shaivyac/20-years-us-university-dataset |
| Years | 1999-2019 |
| Size | ~13.7 MB |
| License | Not specified |
| Coverage | Likely 30/30 (derived from College Scorecard) |
Key Features:
Refined/cleaned version of College Scorecard data
20 years of longitudinal data
Admission rates, financial aid, program info
Pre-cleaned for analysis
What's NEW beyond ISLR:
Two decades of trend data in cleaned format
Admission rate trends over time
Verdict: USEFUL if you want pre-cleaned longitudinal Scorecard data without processing ~590 MB of raw files.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/thedevastator/national-universities-rankings-explore-quality-t |
| Institutions | 1,800+ US universities |
| Year | 2017 |
| Coverage | 30/30 |
Columns:
Name, Location, Rank
Tuition and fees (out-of-state and in-state)
Undergraduate Enrollment
Six-year graduation rates
Freshman retention rates
Description text
What's NEW beyond ISLR:
In-state vs out-of-state tuition distinction
Freshman retention rates (not in ISLR)
US News rank positions
Verdict: Good single-year snapshot. Retention rates are useful but also available in Scorecard.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/joebeachcapital/university-and-college-rankings-us-news |
| Years | 1984-present (historical US News rankings) |
| Coverage | 30/30 (but varies by year — some years only top 50) |
Columns:
Historical US News ranking positions
IPEDS ID numbers (for cross-referencing)
Year/Tier information
What's NEW beyond ISLR:
40 years of ranking history
IPEDS IDs for linking datasets
Ranking trajectory over time
Verdict: NICHE but valuable — Unique historical ranking trajectories. Could model prestige drift or validate tier assignments.
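Because the rankings carry IPEDS IDs, they can be joined to Scorecard rows, which identify institutions by the same ID in the `UNITID` column. A sketch under assumptions: the ranking column names `ipeds_id` and `rank` are hypothetical (check the actual CSV header), and all rows are made up.

```python
def join_on_unitid(scorecard_rows, ranking_rows):
    """Left-join ranking rows onto Scorecard rows by IPEDS unit ID.

    IDs are compared as strings so that int and str inputs both match.
    Institutions without a ranking row get rank=None.
    """
    ranks = {str(r["ipeds_id"]): r["rank"] for r in ranking_rows}
    return [
        {**row, "rank": ranks.get(str(row["UNITID"]))}
        for row in scorecard_rows
    ]


# Made-up rows for illustration only.
scorecard = [
    {"UNITID": "900001", "INSTNM": "Example University"},
    {"UNITID": "900002", "INSTNM": "Sample College"},
]
rankings = [{"ipeds_id": 900001, "rank": 3}]

joined = join_on_unitid(scorecard, rankings)
```

A left join keeps every target school in the output, so years in which a school was unranked show up as `None` rather than silently disappearing.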
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/neelgajare/2022-usa-college-rankings-more |
| Institutions | 392 universities |
| Year | 2022 |
| Size | ~7 KB |
| Coverage | 30/30 |
Columns:
US News ranking and overall score (0-100)
Tuition
Enrollment
What's NEW beyond ISLR:
US News overall score (0-100) — quantitative prestige metric
More recent data (2022 vs 1995)
Verdict: Minimal. The overall score could be useful, but field coverage is limited.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/peterpenner445/american-university-rankings-top-150 |
| Institutions | 150 universities |
| Year | 2019 |
| Source | Niche.com |
| Coverage | ~28/30 (top 150, may miss some LACs) |
Columns:
Institution name
Acceptance Rate
SAT 25th-75th percentile range
Average Cost After Financial Aid
Location
What's NEW beyond ISLR:
SAT percentile ranges (not in ISLR)
Cost after financial aid
Niche-specific rankings (different methodology from US News)
Verdict: Small, clean dataset but largely redundant with Scorecard data.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/mylesoneill/world-university-rankings |
| Systems | Times Higher Education, Shanghai/ARWU, CWUR |
| Coverage | ~25/30 (misses smaller US LACs like Williams, Amherst, Middlebury) |
Metrics: Research output, international outlook, teaching quality, citations, industry income, education expenditure as % GDP
Verdict: Research-oriented metrics. LACs are poorly covered. Not directly useful for US admissions simulation.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/raymondtoo/the-world-university-rankings-2016-2024 |
| Years | 2016-2026 |
| Institutions | 1,500+ globally |
| Coverage | ~25/30 (misses LACs) |
Metrics: Teaching, Research Environment, Citations, Industry Income, International Outlook
Verdict: Research university focus. Useful for research reputation scores but misses LACs.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking |
| Systems | 8 ranking systems combined (CWUR, QS, THE, Shanghai, Nature Index, URAP, Webometrics, GreenMetric) |
| Years | 2011-2023 |
| Coverage | ~25/30 (misses LACs) |
Verdict: Comprehensive for research universities but LAC gap is a problem for this simulation.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset |
| Year | 2013 |
| Size | ~1.2 MB |
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds |
| Size | ~19.3 MB |
Metrics: SAT/ACT scores, ethnicity, gender, cost, graduation rates, enrollment, financial data
Verdict: IPEDS data is valuable but the College Scorecard already incorporates IPEDS data with additional earnings/outcomes data. Use Scorecard instead unless you need raw IPEDS variables.
| Field | Value |
|---|---|
| URL | https://www.kaggle.com/datasets/mexwell/us-colleges-and-universities |
| Institutions | 6,559 |
| Year | 2018-2019 |
| Source | IPEDS/NCES |
| Coverage | 30/30 |
Key Feature: Geospatial data (shapefile format) with institution locations.
Verdict: Useful only if the simulation needs geographic coordinates. Not relevant for admissions modeling.
| Dataset | HYPSM (5) | Ivy+ (9) | Near-Ivy (7) | Selective (6) | LACs (3) | Total |
|---|---|---|---|---|---|---|
| College Scorecard | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| University Statistics | 5/5 | 9/9 | 7/7 | 6/6 | ~2/3 | ~29/30 |
| US News Historical | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| National Univ Rankings | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| World Rankings (THE) | 5/5 | 9/9 | 7/7 | 6/6 | 0/3 | ~27/30 |
| ISLR (already used) | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
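The totals in the matrix are the sum of the per-tier counts, which can be checked mechanically. A small sketch using three rows copied from the table, with "~" estimates entered at their stated value; `total_coverage` is a hypothetical helper name.

```python
# Tier sizes from the column headers of the matrix.
TIER_SIZES = {"HYPSM": 5, "Ivy+": 9, "Near-Ivy": 7, "Selective": 6, "LACs": 3}

# Per-tier counts copied from three rows of the matrix.
COVERAGE = {
    "College Scorecard": {"HYPSM": 5, "Ivy+": 9, "Near-Ivy": 7, "Selective": 6, "LACs": 3},
    "University Statistics": {"HYPSM": 5, "Ivy+": 9, "Near-Ivy": 7, "Selective": 6, "LACs": 2},
    "World Rankings (THE)": {"HYPSM": 5, "Ivy+": 9, "Near-Ivy": 7, "Selective": 6, "LACs": 0},
}


def total_coverage(per_tier):
    """Sum per-tier counts, rejecting counts larger than the tier itself."""
    for tier, count in per_tier.items():
        if not 0 <= count <= TIER_SIZES[tier]:
            raise ValueError(f"{tier}: {count} exceeds tier size {TIER_SIZES[tier]}")
    return sum(per_tier.values())


totals = {name: total_coverage(counts) for name, counts in COVERAGE.items()}
```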
Already covered by the project's CDS research, so Kaggle adds nothing here:
Acceptance rates (CDS data is more current than any Kaggle dataset)
SAT/ACT ranges (CDS is more current)
GPA distributions (CDS is more current)
Yield rates (CDS is more current)
ED/EA round dynamics (CDS is more current)
| Data Point | Source | Simulation Use |
|---|---|---|
| Median earnings 10yr post-entry | Scorecard | Student utility function, ROI-based college choice |
| Student debt at graduation | Scorecard | Financial accessibility modeling |
| Net price by income bracket | Scorecard | Socioeconomic stratification |
| First-year retention rate | Scorecard | College quality indicator, student satisfaction proxy |
| 6-year completion rate | Scorecard | Outcome modeling, graduation probability |
| Pell grant % | Scorecard | Socioeconomic diversity of student body |
| Racial demographic breakdown | Scorecard | Diversity modeling |
| Degree field distribution (PCIP) | Scorecard | Major-based preference modeling |
| Business/Engineering reputation | Univ. Statistics | Discipline-specific prestige |
| Historical ranking trajectory | US News Historical | Prestige drift modeling |
| Carnegie Classification | Scorecard | Institution type categorization |
| Urbanization level (LOCALE) | Scorecard | Geographic preference modeling |
Not found in any of the surveyed Kaggle datasets:
Hook-specific acceptance rates (legacy, athlete, donor)
ED vs EA vs RD acceptance rates (only overall ADM_RATE in Scorecard)
Letter of recommendation policies
Interview requirements/policies
Demonstrated interest policies
Waitlist conversion rates
Download the Kaggle mirror of College Scorecard for:
Earnings data (MD_EARN_WNE_P6, MD_EARN_WNE_P10)
Student debt (DEBT_MDN)
Net price by income (NPT4_PRIV, NPT41_PRIV through NPT45_PRIV; the _PUB equivalents apply to public institutions)
Retention rate (RET_FT4)
Demographics (UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN)
Pell % (PCTPELL)
Completion rate (C150_4)
These enable ROI-aware student decision-making in the simulation.
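As one illustration of what "ROI-aware" could mean, the sketch below divides median annual earnings 10 years after entry (MD_EARN_WNE_P10) by an assumed four years of annual net price. The formula, the four-year assumption, and the `simple_roi` name are all hypothetical; this is not the simulation's actual utility function.

```python
def simple_roi(md_earn_p10, annual_net_price, years=4):
    """Median annual earnings divided by total net price over `years`.

    Returns None when either input is missing or the price is not
    positive, so suppressed Scorecard values do not produce a number.
    """
    if md_earn_p10 is None or annual_net_price is None:
        return None
    if annual_net_price <= 0:
        return None
    return md_earn_p10 / (annual_net_price * years)


# Made-up figures for illustration only.
roi_full_data = simple_roi(80000.0, 20000.0)
roi_suppressed = simple_roi(None, 20000.0)
```

A ratio like this only ranks schools relative to each other; it ignores debt, time to degree, and differences in major mix, all of which the Scorecard columns above could refine.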
University Statistics: business/engineering reputation scores, useful for modeling discipline-specific prestige that goes beyond overall rankings.
US News Historical: 40 years of ranking data enables prestige trajectory analysis, though the simulation currently uses static tiers.
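One simple way to quantify a ranking trajectory is the least-squares slope of rank against year. A sketch with made-up ranks; `rank_slope` is a hypothetical helper, and this is not something the simulation currently does.

```python
def rank_slope(history):
    """Least-squares slope of rank against year, in rank positions per year.

    history is a list of (year, rank) pairs. A negative slope means the
    rank number is falling, i.e. the school is moving up the list.
    Returns None when fewer than two distinct years are available.
    """
    n = len(history)
    if n < 2:
        return None
    mean_year = sum(year for year, _ in history) / n
    mean_rank = sum(rank for _, rank in history) / n
    sxx = sum((year - mean_year) ** 2 for year, _ in history)
    if sxx == 0:
        return None
    sxy = sum((year - mean_year) * (rank - mean_rank) for year, rank in history)
    return sxy / sxx


# Made-up history: a school climbing one position per year.
slope = rank_slope([(2000, 10), (2001, 9), (2002, 8)])
```

Because some years of the historical data list only the top 50, gaps in a school's history should be left out of `history` rather than filled with a placeholder rank.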
World rankings datasets: Not useful — they miss LACs and focus on research metrics irrelevant to US undergrad admissions.
IPEDS mirrors: Redundant with College Scorecard.
2022 Rankings: Too thin on fields.
Niche Rankings: Redundant with better sources.