Kaggle Survey: College Rankings & Institutional Statistics Datasets

Source: kaggle_survey_college_stats.md

Date: 2026-03-01

Purpose: Survey Kaggle for datasets with college rankings, institutional stats, and metrics beyond what the simulation already has.

What the Simulation Already Has

The project's research_colleges.json and CDS (Common Data Set) research already cover the following for all 30 target colleges:

Acceptance rates (overall, ED/EA, RD)
SAT/ACT middle-50 ranges
Average unweighted GPA
Yield rates, class sizes, total applicants
ED/EA round types and policies
Testing policy

From the ISLR College dataset (1995 vintage, already referenced in the codebase):

Top10perc, Top25perc (% from top of HS class)
Grad.Rate (graduation rate)
S.F.Ratio (student-faculty ratio)
perc.alumni (alumni donation rate)
Expend (instructional expenditure per student)
Apps, Accept, Enroll counts
Outstate tuition, Room.Board, Books, Personal costs
F.Undergrad, P.Undergrad (full-time and part-time undergraduate enrollment)
PhD, Terminal (% of faculty with a PhD or terminal degree)

Tier 1: High-Value Datasets

1. US Dept of Education: College Scorecard (Official Kaggle Mirror)

Field	Value
URL	https://www.kaggle.com/datasets/kaggle/college-scorecard
Institutions	~6,500 US colleges/universities
Years	1996-2023 (annual files)
Size	~590 MB
License	US Government Works
Coverage of 30 target schools	30/30 (Scorecard covers all Title IV institutions)

Key Columns (beyond ISLR):

ADM_RATE — Acceptance rate (more current than ISLR)
SAT_AVG — Average SAT score
SATVR25/75, SATMT25/75, SATWR25/75 — SAT component percentiles
ACTCM25/75, ACTEN25/75, ACTMT25/75 — ACT percentiles
COSTT4_A — Average cost of attendance (4-year)
NPT4_PUB, NPT4_PRIV — Average net price after aid (NPT41_* through NPT45_* break this out by family income bracket)
MD_EARN_WNE_P10 — Median earnings 10 years post-entry
MD_EARN_WNE_P6 — Median earnings 6 years post-entry
C150_4 — 6-year completion rate (more precise than ISLR Grad.Rate)
RET_FT4 — First-year retention rate
DEBT_MDN — Median student debt at graduation
RPY_3YR_RT — 3-year loan repayment rate
PCTPELL — % receiving Pell grants (socioeconomic indicator)
PCTFLOAN — % receiving federal loans
UGDS — Total undergraduate enrollment
UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, etc. — Racial demographics
PCIP* — Degree distribution by CIP field (50+ fields)
HBCU, MENONLY, WOMENONLY, RELAFFIL — Institutional flags
LOCALE — Urbanization level (city, suburb, town, rural)
CCBASIC — Carnegie Classification

What's NEW beyond ISLR:

Post-graduation earnings (6yr, 10yr) — enables ROI modeling
Student debt and repayment rates
Net price by family income bracket (financial accessibility)
First-year retention rate (student satisfaction proxy)
Racial/ethnic demographic breakdowns
Pell grant % (socioeconomic diversity metric)
20+ years of longitudinal data (trend analysis)
Degree field distribution
Carnegie Classification codes

Verdict: ESSENTIAL — Most comprehensive single source. Covers all 30 schools with 1,700+ variables and 20+ years of data. Far exceeds ISLR on every dimension.
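
As a rough illustration of how the columns listed above might be pulled into the simulation, the sketch below reads a handful of Scorecard fields with pandas and converts the Scorecard's PrivacySuppressed marker into missing values. The function name load_scorecard, the chosen column subset, and the in-memory sample rows are assumptions made for illustration. The file name inside the Kaggle mirror is not specified here, so the function accepts any path or file-like object.

```python
import io

import pandas as pd

# Column subset taken from the key-column list above. UNITID and INSTNM are
# the Scorecard's institution ID and name columns.
SCORECARD_COLS = [
    "UNITID", "INSTNM", "ADM_RATE", "C150_4", "RET_FT4",
    "MD_EARN_WNE_P10", "DEBT_MDN", "PCTPELL",
]


def load_scorecard(source, unit_ids=None):
    """Read selected Scorecard columns, coercing suppressed cells to NaN."""
    df = pd.read_csv(
        source,
        usecols=SCORECARD_COLS,
        na_values=["PrivacySuppressed", "NULL"],
        low_memory=False,
    )
    numeric = [c for c in SCORECARD_COLS if c not in ("UNITID", "INSTNM")]
    df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")
    if unit_ids is not None:
        # Keep only the target schools (IDs here are hypothetical).
        df = df[df["UNITID"].isin(unit_ids)]
    return df.reset_index(drop=True)


# Tiny in-memory stand-in for a Scorecard CSV; all values are made up.
sample_csv = io.StringIO(
    "UNITID,INSTNM,ADM_RATE,C150_4,RET_FT4,MD_EARN_WNE_P10,DEBT_MDN,PCTPELL,EXTRA\n"
    "1,Example University,0.05,0.97,0.98,PrivacySuppressed,12000,0.18,x\n"
    "2,Sample College,0.40,0.80,0.90,55000,NULL,0.30,y\n"
)
scorecard = load_scorecard(sample_csv, unit_ids=[1, 2])
print(scorecard)
```

Restricting the read with usecols keeps memory use down, which matters when only a few of the 1,700+ variables are needed.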

2. College Scorecard Raw Data (Tunguz mirror)

Field	Value
URL	https://www.kaggle.com/datasets/tunguz/college-scorecard-raw-data
Size	~304 MB
License	US Government Works
Coverage	30/30

Same underlying data as #1 but in raw CSV format (individual year files). Useful if you need year-by-year files rather than the merged SQLite database.

Verdict: Alternative mirror of Scorecard data. Use if #1 format is inconvenient.
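
If the year-by-year files are used, they would need to be stacked into one longitudinal table. The following is a minimal sketch under the assumption that each file shares the same column names. The helper name stack_yearly_files and the sample data are hypothetical, and the actual file naming in this mirror is not confirmed here, so the caller supplies a year-to-source mapping.

```python
import io

import pandas as pd


def stack_yearly_files(sources, columns):
    """Concatenate per-year CSVs into one long frame with a YEAR column.

    sources maps a year (int) to a path or file-like object.
    """
    frames = []
    for year, source in sorted(sources.items()):
        frame = pd.read_csv(
            source, usecols=columns, na_values=["PrivacySuppressed"]
        )
        frame["YEAR"] = year
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two yearly files; values are made up.
files = {
    2019: io.StringIO("UNITID,ADM_RATE\n1,0.06\n2,0.41\n"),
    2018: io.StringIO("UNITID,ADM_RATE\n1,0.07\n2,0.43\n"),
}
panel = stack_yearly_files(files, ["UNITID", "ADM_RATE"])
print(panel)
```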

3. University Statistics (theriley106)

Field	Value
URL	https://www.kaggle.com/datasets/theriley106/university-statistics
Institutions	311 US universities
Year	~2017-2018 snapshot
Size	34 KB
License	CC0 Public Domain
Coverage of 30 target schools	~28-30/30 (the 311 listed universities likely include most or all target schools; LAC coverage is uncertain)

Columns:

Ranking (US News)
Acceptance-Rate
ACT-Avg, SAT-Avg
Average High School GPA
Tuition
Cost after Financial Aid
Percent Receiving Aid
Business Reputation Score
Engineering Reputation Score
Enrollment Size
Public/Private
Region, City, State, Zip

What's NEW beyond ISLR:

ACT/SAT average scores (ISLR has no test scores)
Business & Engineering reputation scores (unique)
Average high school GPA of enrolled students
Cost after financial aid
US News ranking position

Verdict: USEFUL — Compact, clean dataset with reputation scores not found elsewhere. The business/engineering reputation scores could inform simulation prestige modeling. However, it's a single-year snapshot.

4. 20 Years US University Dataset

Field	Value
URL	https://www.kaggle.com/datasets/shaivyac/20-years-us-university-dataset
Years	1999-2019
Size	~13.7 MB
License	Not specified
Coverage	Likely 30/30 (derived from College Scorecard)

Key Features:

Refined/cleaned version of College Scorecard data
20 years of longitudinal data
Admission rates, financial aid, program info
Pre-cleaned for analysis

What's NEW beyond ISLR:

Two decades of trend data in cleaned format
Admission rate trends over time

Verdict: USEFUL — If you want pre-cleaned longitudinal Scorecard data without processing 590MB of raw files.

Tier 2: Useful Supplementary Datasets

5. National Universities Rankings (thedevastator)

Field	Value
URL	https://www.kaggle.com/datasets/thedevastator/national-universities-rankings-explore-quality-t
Institutions	1,800+ US universities
Year	2017
Coverage	30/30

Columns:

Name, Location, Rank
Tuition and fees (out-of-state and in-state)
Undergraduate Enrollment
Six-year graduation rates
Freshman retention rates
Description text

What's NEW beyond ISLR:

In-state vs out-of-state tuition distinction
Freshman retention rates (not in ISLR)
US News rank positions

Verdict: Good single-year snapshot. Retention rates are useful but also available in Scorecard.

6. US University & College Rankings (joebeachcapital)

Field	Value
URL	https://www.kaggle.com/datasets/joebeachcapital/university-and-college-rankings-us-news
Years	1984-present (historical US News rankings)
Coverage	30/30 (but varies by year — some years only top 50)

Columns:

Historical US News ranking positions
IPEDS ID numbers (for cross-referencing)
Year/Tier information

What's NEW beyond ISLR:

40 years of ranking history
IPEDS IDs for linking datasets
Ranking trajectory over time

Verdict: NICHE but valuable — Unique historical ranking trajectories. Could model prestige drift or validate tier assignments.
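
The IPEDS IDs make it possible to join this dataset to Scorecard rows, whose UNITID column holds the same identifier. A minimal sketch of such a join follows. The rankings column names (ipeds_id, year, rank) are assumptions, since the dataset's exact headers are not listed here, and the rows are made up.

```python
import pandas as pd

# Hypothetical ranking rows keyed by IPEDS ID.
rankings = pd.DataFrame(
    {"ipeds_id": [1, 2, 3], "year": [2020, 2020, 2020], "rank": [1, 15, 40]}
)
# Hypothetical Scorecard extract keyed by UNITID.
scorecard = pd.DataFrame(
    {"UNITID": [1, 2, 4], "ADM_RATE": [0.05, 0.20, 0.60]}
)

# Left join keeps every ranking row; validate guards against duplicate
# UNITIDs on the Scorecard side, and indicator records match status.
linked = rankings.merge(
    scorecard,
    left_on="ipeds_id",
    right_on="UNITID",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = linked.loc[linked["_merge"] == "left_only", "ipeds_id"].tolist()
print(unmatched)
```

Checking the unmatched list before using the joined table is a cheap way to catch schools whose IDs are missing or changed between sources.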

7. 2022 USA Undergrad College Rankings & More

Field	Value
URL	https://www.kaggle.com/datasets/neelgajare/2022-usa-college-rankings-more
Institutions	392 universities
Year	2022
Size	~7 KB
Coverage	30/30

Columns:

US News ranking and overall score (0-100)
Tuition
Enrollment

What's NEW beyond ISLR:

US News overall score (0-100) — quantitative prestige metric
More recent data (2022 vs 1995)

Verdict: Minimal. The overall score could be useful, but field coverage is limited.

8. American University Rankings - Top 150 (Niche)

Field	Value
URL	https://www.kaggle.com/datasets/peterpenner445/american-university-rankings-top-150
Institutions	150 universities
Year	2019
Source	Niche.com
Coverage	~28/30 (top 150, may miss some LACs)

Columns:

Institution name
Acceptance Rate
SAT 25th-75th percentile range
Average Cost After Financial Aid
Location

What's NEW beyond ISLR:

SAT percentile ranges (not in ISLR)
Cost after financial aid
Niche-specific rankings (different methodology from US News)

Verdict: Small, clean dataset but largely redundant with Scorecard data.

Tier 3: World/International Rankings (Limited US Admissions Value)

9. World University Rankings (mylesoneill)

Field	Value
URL	https://www.kaggle.com/datasets/mylesoneill/world-university-rankings
Systems	Times Higher Education, Shanghai/ARWU, CWUR
Coverage	~25/30 (misses smaller US LACs like Williams, Amherst, Middlebury)

Metrics: Research output, international outlook, teaching quality, citations, industry income, education expenditure as % GDP

Verdict: Research-oriented metrics. LACs are poorly covered. Not directly useful for US admissions simulation.

10. THE World University Rankings 2016-2026

Field	Value
URL	https://www.kaggle.com/datasets/raymondtoo/the-world-university-rankings-2016-2024
Years	2016-2026
Institutions	1,500+ globally
Coverage	~25/30 (misses LACs)

Metrics: Teaching, Research Environment, Citations, Industry Income, International Outlook

Verdict: Research university focus. Useful for research reputation scores but misses LACs.

11. Ultimate University Ranking

Field	Value
URL	https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking
Systems	8 ranking systems combined (CWUR, QS, THE, Shanghai, Nature Index, URAP, Webometrics, GreenMetric)
Years	2011-2023
Coverage	~25/30 (misses LACs)

Verdict: Comprehensive for research universities but LAC gap is a problem for this simulation.

12. IPEDS-Derived Datasets

American University Data (IPEDS)

Field	Value
URL	https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset
Year	2013
Size	~1.2 MB

Post Secondary Education Data (IPEDS)

Field	Value
URL	https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds
Size	~19.3 MB

Metrics: SAT/ACT scores, ethnicity, gender, cost, graduation rates, enrollment, financial data

Verdict: IPEDS data is valuable, but the College Scorecard already incorporates it and adds earnings and outcomes data. Use Scorecard instead unless you need raw IPEDS variables.

13. US Colleges and Universities (mexwell)

Field	Value
URL	https://www.kaggle.com/datasets/mexwell/us-colleges-and-universities
Institutions	6,559
Year	2018-2019
Source	IPEDS/NCES
Coverage	30/30

Key Feature: Geospatial data (shapefile format) with institution locations.

Verdict: Useful only if the simulation needs geographic coordinates. Not relevant for admissions modeling.

Coverage Matrix: 30 Target Schools

Dataset	HYPSM (5)	Ivy+ (9)	Near-Ivy (7)	Selective (6)	LACs (3)	Total
College Scorecard	5/5	9/9	7/7	6/6	3/3	30/30
University Statistics	5/5	9/9	7/7	6/6	~2/3	~29/30
US News Historical	5/5	9/9	7/7	6/6	3/3	30/30
National Univ Rankings	5/5	9/9	7/7	6/6	3/3	30/30
World Rankings (THE)	5/5	9/9	7/7	6/6	0/3	~27/30
ISLR (already used)	5/5	9/9	7/7	6/6	3/3	30/30
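
A row of this matrix could be recomputed from data rather than filled in by hand. The sketch below assumes the target schools are stored as institution IDs grouped by tier; the function name coverage_by_tier and the IDs are hypothetical.

```python
def coverage_by_tier(targets, present_ids):
    """Return {tier: 'found/total'} plus a 'Total' entry.

    targets maps a tier name to the list of target-school IDs in that tier;
    present_ids is the collection of IDs found in a candidate dataset.
    """
    present = set(present_ids)
    row = {}
    found_total = 0
    size_total = 0
    for tier, ids in targets.items():
        found = sum(1 for school_id in ids if school_id in present)
        row[tier] = f"{found}/{len(ids)}"
        found_total += found
        size_total += len(ids)
    row["Total"] = f"{found_total}/{size_total}"
    return row


# Hypothetical IDs standing in for two of the five tiers.
targets = {"HYPSM": [1, 2], "LACs": [3, 4, 5]}
row = coverage_by_tier(targets, present_ids=[1, 2, 3, 9])
print(row)
```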

Gap Analysis: What Kaggle Adds Beyond Current Data

Already Well-Covered (no Kaggle needed)

Acceptance rates (CDS data is more current than any Kaggle dataset)
SAT/ACT ranges (CDS is more current)
GPA distributions (CDS is more current)
Yield rates (CDS is more current)
ED/EA round dynamics (CDS only; Scorecard reports just the overall ADM_RATE)

High-Value Additions from Kaggle

Data Point	Source	Simulation Use
Median earnings 10yr post-entry	Scorecard	Student utility function, ROI-based college choice
Student debt at graduation	Scorecard	Financial accessibility modeling
Net price by family income bracket	Scorecard	Socioeconomic stratification
First-year retention rate	Scorecard	College quality indicator, student satisfaction proxy
6-year completion rate	Scorecard	Outcome modeling, graduation probability
Pell grant %	Scorecard	Socioeconomic diversity of student body
Racial demographic breakdown	Scorecard	Diversity modeling
Degree field distribution (PCIP)	Scorecard	Major-based preference modeling
Business/Engineering reputation	Univ. Statistics	Discipline-specific prestige
Historical ranking trajectory	US News Historical	Prestige drift modeling
Carnegie Classification	Scorecard	Institution type categorization
Urbanization level (LOCALE)	Scorecard	Geographic preference modeling

Not Available on Kaggle (still need CDS/other sources)

Hook-specific acceptance rates (legacy, athlete, donor)
ED vs EA vs RD acceptance rates (only overall ADM_RATE in Scorecard)
Letter of recommendation policies
Interview requirements/policies
Demonstrated interest policies
Waitlist conversion rates

Recommendations

Priority 1: College Scorecard (Essential)

Download the Kaggle mirror for:

Earnings data (MD_EARN_WNE_P6, MD_EARN_WNE_P10)
Student debt (DEBT_MDN)
Net price by income (NPT4_PRIV, NPT41_PRIV through NPT45_PRIV; the _PUB equivalents for any public targets)
Retention rate (RET_FT4)
Demographics (UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN)
Pell % (PCTPELL)
Completion rate (C150_4)

These enable ROI-aware student decision-making in the simulation.
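
One hypothetical way to turn these fields into an ROI signal is to divide median 10-year earnings by the total net price a student in a given income bracket would pay. The sketch below is an illustration only: the roi_score formula, the four-year horizon, and the sample values are assumptions, not something the simulation currently implements. The net-price column names follow the Scorecard pattern NPT41 through NPT45 with a _PUB or _PRIV suffix.

```python
import pandas as pd


def net_price_column(bracket, public):
    """Scorecard net-price column for income bracket 1 (lowest) to 5 (highest)."""
    if bracket not in range(1, 6):
        raise ValueError("bracket must be an integer from 1 to 5")
    suffix = "PUB" if public else "PRIV"
    return f"NPT4{bracket}_{suffix}"


def roi_score(earnings_10yr, annual_net_price, years=4):
    """Hypothetical ROI proxy: median earnings per dollar of total net price.

    Assumes a positive net price; zero or negative values need special handling.
    """
    total_cost = annual_net_price * years
    return earnings_10yr / total_cost


# Made-up values for two fictional private schools.
schools = pd.DataFrame(
    {
        "INSTNM": ["Example University", "Sample College"],
        "MD_EARN_WNE_P10": [80000.0, 60000.0],
        "NPT41_PRIV": [5000.0, 10000.0],
        "NPT45_PRIV": [40000.0, 30000.0],
    }
)
for bracket in (1, 5):
    col = net_price_column(bracket, public=False)
    schools[f"ROI_B{bracket}"] = roi_score(schools["MD_EARN_WNE_P10"], schools[col])
print(schools[["INSTNM", "ROI_B1", "ROI_B5"]])
```

Computing the score per income bracket lets the same school rank differently for low-income and high-income simulated students, which is the point of using net price rather than sticker price.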

Priority 2: University Statistics (Nice-to-have)

For business/engineering reputation scores — useful for modeling discipline-specific prestige that goes beyond overall rankings.

Priority 3: US News Historical Rankings (Nice-to-have)

Forty years of ranking data enable prestige trajectory analysis, though the simulation currently uses static tiers.
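
A simple form of trajectory analysis is a least-squares slope of rank against year for each school. The sketch below assumes a long-format table with school, year, and rank columns; those column names, the rank_trend helper, and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_trend(history):
    """Least-squares slope of rank vs. year per school.

    A negative slope means the rank number is falling, i.e. the school is
    moving up the list. Schools with fewer than two distinct years get NaN.
    """
    slopes = {}
    for school, group in history.groupby("school"):
        if group["year"].nunique() < 2:
            slopes[school] = np.nan
            continue
        # Centering the years keeps the fit well conditioned without
        # changing the slope.
        years = group["year"] - group["year"].mean()
        slopes[school] = np.polyfit(years, group["rank"], deg=1)[0]
    return pd.Series(slopes, name="rank_slope")


# Made-up ranking histories for three fictional schools.
history = pd.DataFrame(
    {
        "school": ["A"] * 5 + ["B"] + ["C"] * 3,
        "year": [2000, 2001, 2002, 2003, 2004, 2010, 2001, 2002, 2003],
        "rank": [10, 9, 8, 7, 6, 20, 5, 5, 5],
    }
)
trend = rank_trend(history)
print(trend)
```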

Lower Priority

World rankings datasets: Not useful — they miss LACs and focus on research metrics irrelevant to US undergrad admissions.
IPEDS mirrors: Redundant with College Scorecard.
2022 Rankings: Too thin on fields.
Niche Rankings: Redundant with better sources.