Source: kaggle_survey_financial_aid.md
Survey of Kaggle datasets and related research relevant to modeling student yield decisions, financial aid elasticity, and net price impacts on enrollment in the college admissions simulation.
URL: https://www.kaggle.com/datasets/kaggle/college-scorecard
Size: 589.6 MB (compressed)
Coverage: ~6,500 institutions, multiple years (1997-present)
License: CC0 Public Domain
Key Financial Aid & Net Price Fields (from College Scorecard data dictionary):
| Variable | Description |
|---|---|
| cost_tuition_in | In-state tuition and fees |
| cost_tuition_out | Out-of-state tuition and fees |
| cost_books | Estimated books and supplies |
| cost_room_board_on | On-campus room and board |
| cost_room_board_off | Off-campus room and board |
| cost_avg (NPT4) | Average net price for Title IV institutions |
| cost_avg_income_0_30k (NPT41) | Average net price, family income $0-$30K |
| cost_avg_income_30_48k (NPT42) | Average net price, family income $30K-$48K |
| cost_avg_income_48_75k (NPT43) | Average net price, family income $48K-$75K |
| cost_avg_income_75_110k (NPT44) | Average net price, family income $75K-$110K |
| cost_avg_income_110k_plus (NPT45) | Average net price, family income $110K+ |
| rate_admissions | Admission rate |
| n_undergrads | Undergraduate enrollment |
| rate_completion | Completion rate (4-year, first-time, full-time) |
| amnt_earnings_med_10y | Median earnings 10 years after entry |
Relevance to Simulation: Net price by income bracket directly maps to financial aid modeling. The NPT41-NPT45 breakdown shows how aid varies by family income tier, useful for calibrating yield differences across income groups.
Limitations: The simplified R dataset does not include per-institution yield rate, Pell grant counts, or institutional grant breakdowns. The raw College Scorecard files (589 MB) contain 3,000+ columns with much more detail.
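
A minimal sketch of how the NPT41-NPT45 breakdown could feed the simulation: it expresses each bracket's net price as a discount relative to the top bracket (NPT45). The function name `net_price_gradient` is hypothetical, the row is assumed to expose the columns as plain keys, and the dollar values are placeholders, not real Scorecard data.

```python
# Net price columns from the College Scorecard data dictionary,
# ordered from lowest to highest family income bracket.
BRACKETS = ["NPT41", "NPT42", "NPT43", "NPT44", "NPT45"]


def net_price_gradient(row):
    """Return each bracket's discount relative to the top bracket (NPT45)."""
    top = row["NPT45"]
    return {bracket: top - row[bracket] for bracket in BRACKETS}


# Hypothetical placeholder values for one institution (NOT real data).
example_row = {
    "NPT41": 5000,
    "NPT42": 8000,
    "NPT43": 14000,
    "NPT44": 24000,
    "NPT45": 40000,
}

gradient = net_price_gradient(example_row)
for bracket in BRACKETS:
    print(bracket, gradient[bracket])
```

A larger discount in the lower brackets indicates a steeper aid gradient, which is one possible input when calibrating yield differences across income groups.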
URL: https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
Size: 1.95 MB
Coverage: 2018-2019 tuition data, 2014 diversity data, 1985-2016 historical averages
License: MIT
Sources: Chronicle of Higher Education, NCES, TuitionTracker.org, PayScale.com
Key Fields:
Tuition and fees (in-state vs out-of-state)
Net cost by income bracket (from TuitionTracker.org)
School type, degree length, state
Salary potential (from PayScale)
Student diversity metrics by race/ethnicity
Graduation rates
Relevance to Simulation: Net cost by income bracket is directly useful for yield modeling. Price trends from 1985-2016 provide historical context. Salary data can inform return-on-investment calculations that affect student choice.
URL: https://www.kaggle.com/datasets/mexwell/elite-college-admissions
Size: 371 KB
Coverage: 139 selective universities, entering classes 2010-2015, ~2.4 million students
License: Not specified
Key Fields:
| Field | Description |
|---|---|
| Attendance rate | Raw and test-score-reweighted |
| Application rate | By income bracket |
| Admission rates | By income bracket |
| Matriculation rates | By income bracket (equivalent to yield rate) |
| Parental income | Via tax records, 13 income percentile bins |
| SAT/ACT scores | By income bracket |
| Earnings | Post-graduation by income bracket |
College Tiers (6 categories):
Relevance to Simulation: This is the most directly relevant dataset. It contains:
Yield (matriculation) rates broken down by parental income for elite colleges
Application and admission rates by income bracket
Test-score-reweighted attendance rates (controls for academic quality)
Data specifically for the tier of colleges modeled in our simulation
Shows how yield varies with family wealth at elite institutions
Key Insight: The 13 income bins (up to 99th-99.9th percentile and top 1%) allow modeling how wealthy families respond differently to admissions offers than low-income families -- critical for yield modeling.
URL: https://www.kaggle.com/datasets/yashgpt/us-college-data
Size: 32 KB
Coverage: 777 colleges, 18 variables
Key Fields:
| Field | Description |
|---|---|
| Apps | Number of applications received |
| Accept | Number accepted |
| Enroll | Number enrolled |
| Private | Public vs private indicator |
| Outstate | Out-of-state tuition |
| Room.Board | Room and board costs |
| Expend | Instructional expenditure per student |
| Grad.Rate | Graduation rate |
| Top10perc | % students from top 10% of HS class |
| Top25perc | % students from top 25% of HS class |
| perc.alumni | % of alumni who donate |
Yield Rate: Calculable as Enroll / Accept for each institution.
Relevance to Simulation: Simple dataset with raw admit/enroll numbers. Yield can be computed directly. Covers 777 schools including elite ones.
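
The yield calculation above can be sketched as follows. The field names `Apps`, `Accept`, and `Enroll` come from the dataset. The helper name `yield_rate` and the two example rows are hypothetical placeholders, not actual rows from the file.

```python
def yield_rate(accept, enroll):
    """Yield = enrolled / accepted; returns None if nobody was accepted."""
    if accept <= 0:
        return None
    return enroll / accept


# Hypothetical placeholder rows using the dataset's column names.
colleges = [
    {"name": "College A", "Apps": 1000, "Accept": 500, "Enroll": 200},
    {"name": "College B", "Apps": 4000, "Accept": 800, "Enroll": 480},
]

yields = {c["name"]: yield_rate(c["Accept"], c["Enroll"]) for c in colleges}
for name, value in yields.items():
    print(name, value)
```

Guarding against a zero `Accept` count avoids a division error on incomplete rows.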
URL: https://www.kaggle.com/datasets/theriley106/college-common-data-sets
Size: 220 MB
Coverage: 173 colleges, multiple years (structured CDS format)
License: CC0 Public Domain
Key CDS Sections (standard format):
Section B: Enrollment and persistence (enrolled counts)
Section C: First-time, first-year admission (applicants, admitted, enrolled, wait-listed)
Section H: Financial aid
Need-based aid (grants, scholarships, self-help, loans)
Non-need-based merit aid
Average need-based scholarship/grant award
Average need-based self-help award
Average indebtedness at graduation
% of need met
Average financial aid package
Colleges Include: Cornell, Carnegie Mellon, Pomona, Smith, Wellesley, Colby, Rensselaer, Michigan Tech, and ~165 others.
Relevance to Simulation: CDS Section H is the gold standard for financial aid data. Contains need-based vs merit aid breakdowns, percentage of need met, and average aid packages -- exactly what we need for yield modeling. Section C provides admit/enrolled for yield calculation.
URL: https://www.kaggle.com/datasets/thedevastator/unlock-college-performance-debt-and-earnings-out
Size: 16.9 MB
Coverage: U.S. colleges (count not specified)
Key Fields: Cost of attendance, average salary after graduation, loan repayment rates, gainful employment rates, student demographics, faculty diversity, campus cultural climate.
Relevance to Simulation: Student debt and earnings outcomes affect perceived value of attendance, which influences yield decisions.
URL: https://www.kaggle.com/datasets/hark99/post-secondary-education-data-ipeds
Size: 19.3 MB
Coverage: IPEDS data for institutions participating in federal aid
Relevance to Simulation: IPEDS is the comprehensive federal data source. Contains admissions, enrollment, financial aid, and institutional characteristics. The Kaggle version may not have all 250+ IPEDS variables but provides a cleaned subset.
URL: https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset
Size: 1.23 MB
Coverage: American universities from IPEDS
License: CC0 Public Domain
Focus: Enrollment rate prediction, graduation rate prediction, cost analysis
Relevance to Simulation: Designed for predicting enrollment rates, directly applicable to yield modeling.
URL: https://www.kaggle.com/datasets/nivedithavudayagiri/college-enrollment-demographics-2021
Size: 1.88 MB
Coverage: 2020-2021 academic year, unduplicated headcount
Key Fields: UNITID (IPEDS ID), enrollment by level, full-time/part-time, first-time/transfer/continuing, gender, 9 race/ethnicity categories.
Limitations: Enrollment counts only, no financial aid data.
URL: https://www.kaggle.com/datasets/thedevastator/u-s-department-of-education-college-scorecard-da
Size: 1.18 MB
Coverage: Nearly every U.S. college/university
Key Fields: UNITID, INSTNM, location fields, NPCURL (Net Price Calculator URL), enrollment totals, retention rate, degree info, Carnegie Classification, minority-serving institution flags.
Applicant counts, admission counts, acceptance rates
Admission yields (enrolled / admitted)
SAT and ACT test scores
Available for all non-open-admission institutions
Annual collection since 2001
Students receiving aid by type (federal, state, institutional)
Pell Grant recipients and amounts
Institutional grant/scholarship aid
Need-based vs non-need-based aid
Net price by family income quintile
Average aid amounts by category
Available for all Title IV institutions
7,000+ institutions, 250+ variables, CSV format
Custom data extracts available via IPEDS Data Center
Source: NBER Working Paper #9482
Key Findings:
$1,000 increase in grants raises enrollment probability by ~11%
$1,000 increase in loans raises enrollment probability by ~7%
$1,000 tuition increase lowers enrollment probability by ~2%
Front-loaded grants (more money freshman year) have significantly stronger effects
"Named scholarships" generate stronger enrollment response than equivalent dollar grants
~30% of high-aptitude students respond to aid irrationally (reducing lifetime present value)
Students respond to gross tuition, not just net price (tuition + equal aid increase still lowers demand)
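
A rough sketch of how the per-$1,000 effects listed above might be applied to a baseline enrollment probability. It assumes the percentages are relative to the prior probability and combine linearly; both are modeling assumptions. The function name `adjusted_probability` is hypothetical. This linear form does not capture the front-loading, named-scholarship, or gross-tuition findings.

```python
# Relative change in enrollment probability per $1,000,
# taken from the Key Findings list above.
EFFECT_PER_1K = {"grant": 0.11, "loan": 0.07, "tuition": -0.02}


def adjusted_probability(prior, changes_in_thousands):
    """Scale a prior enrollment probability by the per-$1K effects.

    Assumption: effects are relative to the prior and add linearly.
    The result is clamped to the [0, 1] range.
    """
    relative_change = sum(
        EFFECT_PER_1K[kind] * amount
        for kind, amount in changes_in_thousands.items()
    )
    return min(1.0, max(0.0, prior * (1.0 + relative_change)))


# Example: a 40% prior probability with $2,000 more in grants.
print(adjusted_probability(0.40, {"grant": 2}))
```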
| Study/Context | Price Elasticity | Notes |
|---|---|---|
| Aggregate (all 4-year) | -0.44 | Own-price elasticity |
| Public universities | -1.058 | More price-sensitive |
| Private institutions | -0.6414 | Less price-sensitive |
| Full-paying students (selective) | -0.76 | At selective colleges |
| Financial-aid students (selective) | -1.18 | More responsive to price changes |
| Occidental College (individual) | -0.72 | Single institution study |
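
As an illustration of how these elasticities translate into demand changes, the sketch below applies the standard point-elasticity approximation (percent change in demand = elasticity x percent change in price). The segment keys and the function name `pct_change_in_demand` are hypothetical labels for the table rows. The approximation is only reasonable for small price changes.

```python
# Own-price elasticities from the table above.
ELASTICITY = {
    "aggregate_4yr": -0.44,
    "public": -1.058,
    "private": -0.6414,
    "selective_full_pay": -0.76,
    "selective_aid": -1.18,
}


def pct_change_in_demand(segment, pct_price_change):
    """Approximate percent change in enrollment demand for a percent price change."""
    return ELASTICITY[segment] * pct_price_change


# Example: a 10% price increase for each selective-college segment.
for segment in ("selective_full_pay", "selective_aid"):
    print(segment, pct_change_in_demand(segment, 10.0))
```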
All 4-year institutions: 1.20
Public universities: 0.977
Private institutions: 1.701 (higher income = much more likely to attend private)
HYPSM yield: 70-87%
Ivy+ yield: 55-65%
$1K more aid = 2-4pp yield increase
Yield Rate Sources:
Use Elite College Admissions dataset (Opportunity Insights) for tier-specific yield by income bracket at top colleges
Use US College Data (Apps/Accept/Enroll) for broader yield calculations
Use College Common Data Sets Section C for institution-specific yield
Financial Aid Elasticity:
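Current parameter ($1K = 2-4pp) appears conservative relative to the Avery/Hoxby finding ($1K grant = ~11% higher enrollment probability); the units differ (percentage points vs. percent), so the size of the gap depends on baseline yield
However, the ~11% finding is for high-aptitude students choosing among options; at elite schools with high baseline yield, the marginal effect may be smaller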
Consider differentiating by income: aid-receiving students are more price-elastic (-1.18) than full-pay students (-0.76)
Front-loaded aid and named scholarships have outsized effects (behavioral economics)
Net Price by Income:
College Scorecard NPT41-NPT45 fields provide real net price by income bracket
For HYPSM: net price is near $0 for families under $75K (meets full need)
For selective schools: significant unmet need, merit aid as enrollment lever
Income-Differentiated Yield:
Elite College Admissions dataset shows yield varies dramatically by parental income
Wealthy families have more options -> potentially lower yield at any single school
Low-income students with full financial aid -> very high yield at elite schools
This matches real-world pattern: Harvard yield for Pell-eligible students is ~90%+
Aid elasticity should vary by student type:
Full-pay families: $1K = ~1-2pp yield change (low sensitivity)
Aid-receiving families: $1K = ~3-5pp yield change (moderate sensitivity)
Low-income families at schools not meeting full need: $1K = ~5-8pp yield change
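
One way the type-specific sensitivities above could be encoded, as a sketch rather than the simulation's actual implementation. It assumes a linear response and uses the midpoint of each stated range. The type keys and the function name `yield_after_aid` are hypothetical.

```python
# Yield change per $1,000 of aid as (low, high) fractions,
# taken from the ranges in the list above.
SENSITIVITY = {
    "full_pay": (0.01, 0.02),
    "aid_receiving": (0.03, 0.05),
    "low_income_unmet_need": (0.05, 0.08),
}


def yield_after_aid(base_yield, student_type, aid_change_thousands):
    """Shift a baseline yield by the midpoint sensitivity for the student type.

    Assumption: the response is linear in aid dollars; result clamped to [0, 1].
    """
    low, high = SENSITIVITY[student_type]
    midpoint = (low + high) / 2
    return min(1.0, max(0.0, base_yield + midpoint * aid_change_thousands))


# Example: $2,000 more aid for each student type at a 50% baseline yield.
for student_type in SENSITIVITY:
    print(student_type, yield_after_aid(0.50, student_type, 2))
```

Clamping matters at schools with high baseline yield, where a linear shift would otherwise exceed 100%.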
| Dataset | Yield Data | Financial Aid | Net Price by Income | Elite Colleges | Size |
|---|---|---|---|---|---|
| College Scorecard | No | Partial | Yes (5 brackets) | Yes | 589 MB |
| Tuition/Diversity/Pay | No | Net cost | Yes (by bracket) | Yes | 2 MB |
| Elite College Admissions | Yes (by income) | No | No | Yes (139) | 371 KB |
| US College Data | Calculable | No | No | Partial (777) | 32 KB |
| College Common Data Sets | Yes (Section C) | Yes (Section H) | Partial | Partial (173) | 220 MB |
| IPEDS (direct) | Yes (ADM) | Yes (SFA) | Yes | Yes (7,000+) | Custom |
| College Perf/Debt/Earnings | No | Partial | No | Yes | 17 MB |
| Post-Secondary (IPEDS) | Unknown | Unknown | Unknown | Yes | 19 MB |
For the simulation, the most actionable approach: