Kaggle Survey: College Rankings & Institutional Statistics Datasets

Source: kaggle_survey_college_stats.md


Date: 2026-03-01
Purpose: Survey Kaggle for datasets with college rankings, institutional stats, and metrics beyond what the simulation already has.


What the Simulation Already Has

The project's research_colleges.json and CDS research already include for all 30 target colleges:

From the ISLR College dataset (1995 vintage, already referenced in the codebase):


Tier 1: High-Value Datasets

1. US Dept of Education: College Scorecard (Official Kaggle Mirror)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/kaggle/college-scorecard |
| Institutions | ~6,500 US colleges/universities |
| Years | 1996-2023 (annual files) |
| Size | ~590 MB |
| License | US Government Works |
| Coverage of 30 target schools | 30/30 (all Title IV institutions) |

Key Columns (beyond ISLR):

What's NEW beyond ISLR:

Verdict: ESSENTIAL — Most comprehensive single source. Covers all 30 schools with 1,700+ variables and 20+ years of data. Far exceeds ISLR on every dimension.
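
As a minimal sketch of pulling a few outcome fields for the target schools from the SQLite form of this mirror: the table name `Scorecard`, the `Year` column, and the variable names `MD_EARN_WNE_P10`, `GRAD_DEBT_MDN`, and `PCTPELL` are assumptions based on the public Scorecard data dictionary and should be checked against the downloaded file. An in-memory database with placeholder values stands in for the real download so the example runs on its own.

```python
import sqlite3

# Assumed names: table `Scorecard`, column `Year`, and the Scorecard variable
# names below. Verify them against the file you actually download.
COLUMNS = ["INSTNM", "Year", "MD_EARN_WNE_P10", "GRAD_DEBT_MDN", "PCTPELL"]


def fetch_schools(conn, names, year):
    """Return one row per requested institution for a given data year."""
    placeholders = ",".join("?" for _ in names)
    sql = (
        f"SELECT {', '.join(COLUMNS)} FROM Scorecard "
        f"WHERE Year = ? AND INSTNM IN ({placeholders}) ORDER BY INSTNM"
    )
    return conn.execute(sql, [year, *names]).fetchall()


# Tiny in-memory stand-in so the sketch runs without the ~590 MB download.
# The numbers are placeholders, not real Scorecard figures.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Scorecard (INSTNM TEXT, Year INTEGER, "
    "MD_EARN_WNE_P10 REAL, GRAD_DEBT_MDN REAL, PCTPELL REAL)"
)
conn.executemany(
    "INSERT INTO Scorecard VALUES (?, ?, ?, ?, ?)",
    [
        ("Example University", 2013, 1.0, 2.0, 0.1),
        ("Example University", 2012, 1.5, 2.5, 0.2),
        ("Other College", 2013, 3.0, 4.0, 0.3),
    ],
)

rows = fetch_schools(conn, ["Example University"], 2013)
print(rows)
```

Against the real download, the same function would be called on a connection to the SQLite file shipped with the dataset, with the 30 target names passed in exactly as Scorecard spells them in `INSTNM`.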


2. College Scorecard Raw Data (Tunguz mirror)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/tunguz/college-scorecard-raw-data |
| Size | ~304 MB |
| License | US Government Works |
| Coverage | 30/30 |

Same underlying data as #1 but in raw CSV format (individual year files). Useful if you need year-by-year files rather than the merged SQLite database.

Verdict: Alternative mirror of Scorecard data. Use if #1 format is inconvenient.
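
If the per-year CSV files are used, they need to be stacked into one longitudinal table. A minimal sketch, assuming each file name contains its four-digit year: the file names `MERGED2012_PP.csv` and `MERGED2013_PP.csv` and the columns `UNITID` and `ADM_RATE` are illustrative assumptions, and in-memory text stands in for the real files.

```python
import csv
import io
import re


def stack_year_files(files, keep):
    """Combine per-year CSV files into one list of dicts with a `year` field.

    `files` maps a file name to an open text handle. The year is parsed from
    the first four-digit run in the name; files without one are skipped.
    """
    combined = []
    for name, handle in sorted(files.items()):
        match = re.search(r"(\d{4})", name)
        if match is None:
            continue
        year = int(match.group(1))
        for row in csv.DictReader(handle):
            record = {col: row.get(col) for col in keep}
            record["year"] = year
            combined.append(record)
    return combined


# Hypothetical file names and placeholder values, for illustration only.
files = {
    "MERGED2012_PP.csv": io.StringIO("UNITID,INSTNM,ADM_RATE\n1,Example University,0.5\n"),
    "MERGED2013_PP.csv": io.StringIO("UNITID,INSTNM,ADM_RATE\n1,Example University,0.4\n"),
}
panel = stack_year_files(files, keep=["UNITID", "ADM_RATE"])
print(panel)
```

Keeping only the needed columns while reading avoids holding every variable from every year in memory at once.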


3. University Statistics (theriley106)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/theriley106/university-statistics |
| Institutions | 311 US universities |
| Year | ~2017-2018 snapshot |
| Size | 34 KB |
| License | CC0 Public Domain |
| Coverage of 30 target schools | ~28-30/30 (the 311 listed universities include most or all target schools) |

Columns:

What's NEW beyond ISLR:

Verdict: USEFUL — Compact, clean dataset with reputation scores not found elsewhere. The business/engineering reputation scores could inform simulation prestige modeling. However, it's a single-year snapshot.


4. 20 Years US University Dataset

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/shaivyac/20-years-us-university-dataset |
| Years | 1999-2019 |
| Size | ~13.7 MB |
| License | Not specified |
| Coverage | Likely 30/30 (derived from College Scorecard) |

Key Features:

What's NEW beyond ISLR:

Verdict: USEFUL if you want pre-cleaned longitudinal Scorecard data without processing ~590 MB of raw files.


Tier 2: Useful Supplementary Datasets

5. National Universities Rankings (thedevastator)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/thedevastator/national-universities-rankings-explore-quality-t |
| Institutions | 1,800+ US universities |
| Year | 2017 |
| Coverage | 30/30 |

Columns:

What's NEW beyond ISLR:

Verdict: Good single-year snapshot. Retention rates are useful but also available in Scorecard.


6. US University & College Rankings (joebeachcapital)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/joebeachcapital/university-and-college-rankings-us-news |
| Years | 1984-present (historical US News rankings) |
| Coverage | 30/30 (but varies by year; some years only top 50) |

Columns:

What's NEW beyond ISLR:

Verdict: NICHE but valuable — Unique historical ranking trajectories. Could model prestige drift or validate tier assignments.


7. 2022 USA Undergrad College Rankings & More

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/neelgajare/2022-usa-college-rankings-more |
| Institutions | 392 universities |
| Year | 2022 |
| Size | ~7 KB |
| Coverage | 30/30 |

Columns:

What's NEW beyond ISLR:

Verdict: Minimal. The overall score could be useful, but field coverage is limited.


8. American University Rankings - Top 150 (Niche)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/peterpenner445/american-university-rankings-top-150 |
| Institutions | 150 universities |
| Year | 2019 |
| Source | Niche.com |
| Coverage | ~28/30 (top 150, may miss some LACs) |

Columns:

What's NEW beyond ISLR:

Verdict: Small, clean dataset but largely redundant with Scorecard data.


Tier 3: World/International Rankings (Limited US Admissions Value)

9. World University Rankings (mylesoneill)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/mylesoneill/world-university-rankings |
| Systems | Times Higher Education, Shanghai/ARWU, CWUR |
| Coverage | ~25/30 (misses smaller US LACs like Williams, Amherst, Middlebury) |

Metrics: Research output, international outlook, teaching quality, citations, industry income, education expenditure as % GDP

Verdict: Research-oriented metrics. LACs are poorly covered. Not directly useful for US admissions simulation.


10. THE World University Rankings 2016-2026

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/raymondtoo/the-world-university-rankings-2016-2024 |
| Years | 2016-2026 |
| Institutions | 1,500+ globally |
| Coverage | ~25/30 (misses LACs) |

Metrics: Teaching, Research Environment, Citations, Industry Income, International Outlook

Verdict: Research university focus. Useful for research reputation scores but misses LACs.


11. Ultimate University Ranking

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking |
| Systems | 8 ranking systems combined (CWUR, QS, THE, Shanghai, Nature Index, URAP, Webometrics, GreenMetric) |
| Years | 2011-2023 |
| Coverage | ~25/30 (misses LACs) |

Verdict: Comprehensive for research universities but LAC gap is a problem for this simulation.


12. IPEDS-Derived Datasets

American University Data (IPEDS)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset |
| Year | 2013 |
| Size | ~1.2 MB |

Post Secondary Education Data (IPEDS)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds |
| Size | ~19.3 MB |

Metrics: SAT/ACT scores, ethnicity, gender, cost, graduation rates, enrollment, financial data

Verdict: IPEDS data is valuable but the College Scorecard already incorporates IPEDS data with additional earnings/outcomes data. Use Scorecard instead unless you need raw IPEDS variables.


13. US Colleges and Universities (mexwell)

| Field | Value |
| --- | --- |
| URL | https://www.kaggle.com/datasets/mexwell/us-colleges-and-universities |
| Institutions | 6,559 |
| Year | 2018-2019 |
| Source | IPEDS/NCES |
| Coverage | 30/30 |

Key Feature: Geospatial data (shapefile format) with institution locations.

Verdict: Useful only if the simulation needs geographic coordinates. Not relevant for admissions modeling.


Coverage Matrix: 30 Target Schools

| Dataset | HYPSM (5) | Ivy+ (9) | Near-Ivy (7) | Selective (6) | LACs (3) | Total |
| --- | --- | --- | --- | --- | --- | --- |
| College Scorecard | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| University Statistics | 5/5 | 9/9 | 7/7 | 6/6 | ~2/3 | ~29/30 |
| US News Historical | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| National Univ Rankings | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
| World Rankings (THE) | 5/5 | 9/9 | 7/7 | 6/6 | 0/3 | ~27/30 |
| ISLR (already used) | 5/5 | 9/9 | 7/7 | 6/6 | 3/3 | 30/30 |
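
A row of this matrix can be computed mechanically once a dataset's institution names are loaded. The sketch below is one way to do it, assuming name-based matching after light normalization; the two-tier `targets` dict is an illustrative subset, not the project's actual tier lists, and real data may need an ID-based join (for example on a shared institution ID) where names differ.

```python
def normalize(name):
    """Lowercase and drop punctuation so minor naming differences still match."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def coverage_by_tier(targets, dataset_names):
    """Return {tier: (found, total)} for one dataset's institution names."""
    available = {normalize(n) for n in dataset_names}
    result = {}
    for tier, schools in targets.items():
        found = sum(1 for s in schools if normalize(s) in available)
        result[tier] = (found, len(schools))
    return result


# Illustrative subset only; the real tier lists live in the project's own data.
targets = {
    "HYPSM": ["Harvard University", "Yale University"],
    "LACs": ["Williams College", "Amherst College"],
}
dataset_names = ["Harvard University", "yale university", "Williams College"]

matrix_row = coverage_by_tier(targets, dataset_names)
for tier, (found, total) in matrix_row.items():
    print(f"{tier}: {found}/{total}")
```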

Gap Analysis: What Kaggle Adds Beyond Current Data

Already Well-Covered (no Kaggle needed)

High-Value Additions from Kaggle

| Data Point | Source | Simulation Use |
| --- | --- | --- |
| Median earnings 10yr post-entry | Scorecard | Student utility function, ROI-based college choice |
| Student debt at graduation | Scorecard | Financial accessibility modeling |
| Net price by income quintile | Scorecard | Socioeconomic stratification |
| First-year retention rate | Scorecard | College quality indicator, student satisfaction proxy |
| 6-year completion rate | Scorecard | Outcome modeling, graduation probability |
| Pell grant % | Scorecard | Socioeconomic diversity of student body |
| Racial demographic breakdown | Scorecard | Diversity modeling |
| Degree field distribution (PCIP) | Scorecard | Major-based preference modeling |
| Business/Engineering reputation | Univ. Statistics | Discipline-specific prestige |
| Historical ranking trajectory | US News Historical | Prestige drift modeling |
| Carnegie Classification | Scorecard | Institution type categorization |
| Urbanization level (LOCALE) | Scorecard | Geographic preference modeling |
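
Before wiring any of these fields into the simulation, it helps to confirm that a given Scorecard file actually carries them, since variable availability differs by year. The sketch below checks a file header against a field map. The Scorecard variable names in `SCORECARD_COLUMNS` are assumptions taken from the public data dictionary, the simulation-side keys are hypothetical, and the header is a made-up example.

```python
# Assumed Scorecard variable names for some rows of the table above; check
# each against the data dictionary for the release you download. The keys on
# the left are hypothetical simulation-side names.
SCORECARD_COLUMNS = {
    "median_earnings_10yr": "MD_EARN_WNE_P10",
    "median_debt_at_graduation": "GRAD_DEBT_MDN",
    "first_year_retention": "RET_FT4",
    "six_year_completion": "C150_4",
    "pell_share": "PCTPELL",
    "carnegie_basic": "CCBASIC",
    "locale": "LOCALE",
}


def split_available(header, wanted=SCORECARD_COLUMNS):
    """Partition wanted fields into those present in a header and those missing."""
    header_set = set(header)
    present = {field: col for field, col in wanted.items() if col in header_set}
    missing = sorted(field for field, col in wanted.items() if col not in header_set)
    return present, missing


# Hypothetical header from one year file that lacks the earnings column.
header = ["UNITID", "INSTNM", "GRAD_DEBT_MDN", "RET_FT4", "C150_4",
          "PCTPELL", "CCBASIC", "LOCALE"]
present, missing = split_available(header)
print(missing)
```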

Not Available on Kaggle (still need CDS/other sources)


Recommendations

Priority 1: College Scorecard (Essential)

Download the Kaggle mirror for:

These enable ROI-aware student decision-making in the simulation.

Priority 2: University Statistics (Nice-to-have)

For business/engineering reputation scores — useful for modeling discipline-specific prestige that goes beyond overall rankings.

Priority 3: US News Historical Rankings (Nice-to-have)

40 years of ranking data enables prestige trajectory analysis, though the simulation currently uses static tiers.
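
One simple way to summarize a ranking trajectory is the least-squares slope of rank against year. This is a sketch of that idea under stated assumptions, not a method the simulation already uses; the `history` values are made up, and years in which a school was unranked would need separate handling.

```python
def rank_trend(history):
    """Least-squares slope of rank against year.

    `history` maps year -> rank. A negative slope means the rank number is
    falling over time, i.e. the school is moving up the list.
    """
    years = sorted(history)
    if len(years) < 2:
        raise ValueError("need at least two ranked years")
    n = len(years)
    mean_year = sum(years) / n
    mean_rank = sum(history[y] for y in years) / n
    sxx = sum((y - mean_year) ** 2 for y in years)
    sxy = sum((y - mean_year) * (history[y] - mean_rank) for y in years)
    return sxy / sxx


# Made-up ranks for illustration only.
history = {2000: 10, 2005: 8, 2010: 6, 2015: 4}
trend = rank_trend(history)
print(f"{trend:.2f} rank positions per year")
```

A per-school slope like this could feed a slowly varying prestige term while the static tiers remain the default.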

Lower Priority