Wharton High School Data Science Competition
Presented by Google Gemini
2026 Competition Playbook
This playbook is designed to help you look back on your journey through the 2026 Wharton High School Data Science Competition, presented by Google Gemini. Here, we are highlighting the strategies student teams used, the challenges they tackled, and the expert approaches recommended by our team at the Wharton AI and Analytics Initiative (WAIAI).
Below, you’ll find insights into how student teams cleaned and explored data, built predictive models, and communicated their findings. You’ll also see how the WAIAI team approached the same problems, offering a “gold standard” for comparison and learning.
Use this playbook to reflect on what you tried, what you learned, and how you can keep growing as a data scientist. We hope it serves not only as a record of your hard work but also as a learning resource for your next steps in data science and sports analytics — and that it inspires your next great project!
Preparing and Understanding Your Data
The first phase of the competition challenged student teams to dive deep into a large dataset and extract meaningful insights to rank fictional World Hockey League (WHL) teams, predict first-round matchups, understand offensive line quality disparities, and visualize relationships between those measures of offensive line disparity and team strength. This phase required analytical horsepower, teamwork, technical thinking, and a creative approach to complex, simulated hockey data.
Reflect & Learn: How Other Student Teams & Our WAIAI Team Approached the Problem
The solutions built by the WAIAI team followed these principles. Your approach may have differed, but reflecting on these strategies can help you improve and innovate in future competitions.
- Create relevant new variables, such as point differential and possession-adjusted stats, to enrich the dataset
- Incorporate nuanced situational factors — home ice advantage, goaltending, and power play performance — to sharpen predictions
- Apply thoughtful, case-specific strategies for handling missing data
- Leverage statistical tools like regression, logistic models, and Elo ratings to rank teams and forecast outcomes
Data Cleaning
Thoughtful data cleaning can improve model performance—but filtering without clear reasons can do more harm than good, especially if it leads to missing factors that drive outcomes. Several types of observations required careful consideration:
- Empty net shifts: Many student teams removed all empty net scenarios. This is a reasonable starting point, since goals or high-quality chances on an open net are not representative of defensive performance. However, a better approach is to exclude offensive production for the team shooting on an empty net and retain offensive production for the opposing team shooting on a defended net. This preserves meaningful offensive signal while removing distorted defensive outcomes.
- Power Play situations: Some student teams excluded power play shifts from their analyses. Power plays are a critical component of hockey performance that can set strong teams apart, and removing them eliminates a substantial and meaningful portion of the data.
- Low Time On Ice (TOI) pairings: Dropping line combinations with low TOI was another exclusion reported by student teams. Removing these often reduced overall data coverage without a clear benefit – even short shifts contribute meaningful information.
In many cases, it is better to adjust or reinterpret the data rather than remove it entirely. Dropping data should be done selectively, critically, and with clear justification.
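The adjust-rather-than-drop idea for empty-net shifts can be sketched as follows. This is a minimal illustration only: the dictionary keys (`home_xg`, `away_goalie_in_net`, and so on) are hypothetical and the real competition columns may be named and coded differently.

```python
def usable_offense(shift):
    """Return (home_xg, away_xg), masking production against an empty net as None.

    `shift` is a dict with hypothetical keys; adapt them to the real dataset.
    """
    # The home team shoots at the away net, so home offense is kept
    # only when the away goalie is actually in net.
    home_xg = shift["home_xg"] if shift["away_goalie_in_net"] else None
    # The away team shoots at the home net, so away offense is kept
    # only when the home goalie is actually in net.
    away_xg = shift["away_xg"] if shift["home_goalie_in_net"] else None
    return home_xg, away_xg


# Made-up example shifts.
shifts = [
    {"home_xg": 0.30, "away_xg": 0.10,
     "home_goalie_in_net": True, "away_goalie_in_net": True},
    # The home team has pulled its goalie: away is shooting at an empty net.
    {"home_xg": 0.25, "away_xg": 0.90,
     "home_goalie_in_net": False, "away_goalie_in_net": True},
]
cleaned = [usable_offense(s) for s in shifts]
```

In the second shift, the home team's 0.25 xG against a defended net is retained, while the away team's 0.90 xG against the empty net is masked instead of the whole row being dropped.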
Enhancing the Data: Feature Engineering
Feature engineering was an important element in optimal solutions. Some of the most effective submissions took raw game statistics and engineered new, performance-driven variables to gain a strategic edge. Several of these feature engineering steps that proved most useful included:
- Overtime and Time on Ice (TOI): Because TOI for each combination of offensive and defensive lines varied from game to game, and overtime games ran longer, converting raw counts to per-60-minute rates was a key step before analysis.
- Strength of schedule: Considering how tough each team’s season was by examining their opponents’ records, particularly when considering offensive line quality, was essential. Strength of schedule adjustments could be made at team and line levels.
- Goalie quality: The provided data set did not contain an explicit goalie quality measure, but goaltending is a critical part of hockey. As a simplification, each team had a single labeled goalie. A few goalie quality measures could have been created, such as save percentage (1 − goals against / shots against), or better yet, an xG-driven metric (e.g., goals against per 60 minus xG against per 60, where lower is better).
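As a rough sketch of the per-60 and goalie-quality features described above, the helpers below assume TOI is recorded in seconds. That unit and the function names are assumptions for illustration, not part of the provided data set.

```python
def per_60(count, toi_seconds):
    """Scale a raw count to a per-60-minute rate (TOI assumed to be in seconds)."""
    return count * 3600.0 / toi_seconds


def save_percentage(goals_against, shots_against):
    """Share of shots faced that the goalie stopped."""
    return 1.0 - goals_against / shots_against


def goals_vs_expected_per_60(goals_against, xg_against, toi_seconds):
    """Goals against per 60 minus xG against per 60; negative is better than expected."""
    return per_60(goals_against, toi_seconds) - per_60(xg_against, toi_seconds)


# Made-up goalie line: 25 goals on 300 shots, 28.0 xG faced, 600 minutes played.
sv_pct = save_percentage(25, 300)
ga_60 = per_60(25, 600 * 60)
vs_expected = goals_vs_expected_per_60(25, 28.0, 600 * 60)
```

The xG-based measure credits a goalie for facing difficult shots, which plain save percentage cannot do.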
As described under “tips for success” on the high school data science competition web page, it is often useful to outline approaches and map data organization needed for planned analyses before diving into the data itself.
Thoughtful incorporation of Generative AI tools
Most student teams used Generative AI tools, specifically Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, or Anthropic’s Claude, to develop their submissions to the high school data science competition. While these tools were permitted in the competition, it was clear when they were applied effectively.
In the best cases, student teams that used methods or results developed with the help of Generative AI tools went beyond the surface, verified output, and demonstrated clear understanding. Avoiding overreliance on these tools and applying them thoughtfully, with clear oversight and in constrained stages of submission development rather than to complete an entire submission, set student teams up for success. Taking the time to understand and learn about methods suggested by LLM tools before applying them may support competition performance and learning overall.
Generative AI is often most useful for brainstorming ideas, getting unstuck while coding, and improving communication of results. It can be less reliable for making key analytical decisions or interpreting model results—especially for those still learning the concepts.
The bottom line: Generative AI tools can be helpful to support thinking, not replace it. It’s always a good idea to check their outputs, make sure they fit each specific problem, and focus on building understanding along the way.
Creating a Model
One important takeaway from this competition is that in data science, there’s rarely a single “right” answer. While sometimes unsatisfying, it’s what makes data science powerful, creative, and open-ended.
This competition used fully simulated WHL ice hockey data from models we created, with fictional teams and stats. While there was no single “right” answer or ground truth in this year’s data, that doesn’t mean all models are equally effective. There are better approaches and better predictions. Here, we’ll outline the WAIAI team’s approach for developing more effective models.
Start Simple: Wins, Losses, and Point Differentials
The natural place to start was the win–loss record. And that’s a solid beginning—hockey teams with more wins are generally stronger. But digging a little deeper provides more insight: the margin of victory or point differential can be calculated for each game. Switching the outcome variable from win-loss to margin of victory offers a richer view of what happened in the game.
A basic model that uses average point differential per game can serve as a solid benchmark for more complex methods. It’s simple, interpretable, and surprisingly effective.
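A benchmark of this kind takes only a few lines. The sketch below uses made-up teams and scores, and treats point differential as goals for minus goals against in each game.

```python
from collections import defaultdict

# Made-up results: (home_team, away_team, home_goals, away_goals).
games = [
    ("Aces", "Bolts", 4, 2),
    ("Bolts", "Comets", 3, 1),
    ("Comets", "Aces", 2, 5),
    ("Aces", "Comets", 1, 2),
]

totals = defaultdict(float)  # summed differential per team
counts = defaultdict(int)    # games played per team
for home, away, home_goals, away_goals in games:
    totals[home] += home_goals - away_goals
    totals[away] += away_goals - home_goals
    counts[home] += 1
    counts[away] += 1

# Average differential per game, then rank from strongest to weakest.
avg_diff = {team: totals[team] / counts[team] for team in totals}
ranking = sorted(avg_diff, key=avg_diff.get, reverse=True)
```

Any more complex model should be able to beat this ranking on held-out games; if it cannot, the added complexity is not paying for itself.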
Build Complexity: Opponents, Efficiency, and Context
Stronger models go further by adjusting for context and other potentially meaningful factors, because not all wins are equal. They account for who teams played, not just how often they won: a win over a better team is different from one over a lower-tier opponent. This is where strength of schedule came in, with student teams adjusting their models using opponents’ records or building Elo-style rating systems that dynamically responded to wins and losses.
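For readers new to Elo, a minimal sketch of one rating update follows. The K-factor of 20 and the 50-point home-ice bonus are illustrative assumptions, not values used by the WAIAI team or any submission.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(home, away, home_won, k=20.0, home_adv=50.0):
    """Return updated (home, away) ratings after one game.

    k and home_adv are assumed tuning values; choose them by validation.
    """
    exp_home = expected_score(home + home_adv, away)
    delta = k * ((1.0 if home_won else 0.0) - exp_home)
    # Elo is zero-sum: what one side gains, the other loses.
    return home + delta, away - delta


ratings = {"Aces": 1500.0, "Bolts": 1500.0}
ratings["Aces"], ratings["Bolts"] = elo_update(
    ratings["Aces"], ratings["Bolts"], home_won=True
)
```

Because the update is proportional to the surprise (actual result minus expected result), beating a higher-rated opponent moves a rating more than beating a lower-rated one, which is how Elo builds in strength of schedule.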
In addition to strength of schedule, some submissions also broke down team performance by calculating:
- Goals scored and allowed per 60 minutes – helping normalize stats for TOI
- Offensive and defensive efficiency – separating scoring ability from overall game outcomes and incorporating goalie quality
- Contextual factors like home vs. away games
Incorporating these elements into regression models allowed submissions to estimate team strength while controlling for real-world variability. Contextual features such as home ice can be included as controls and, together with efficiency-based measures for points scored and allowed, let a regression model estimate each team’s offensive and defensive strength separately.
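One way to picture such a regression is an additive model in which goals scored equal a league average, plus the scoring team's offense, minus the opponent's defense, plus a home-ice term. The sketch below fits that model with a ridge penalty using plain gradient descent on made-up scores. It is a stand-in for a library regression, not the WAIAI team's actual model, and the penalty, learning rate, and step count are assumptions.

```python
# Made-up results: (home_team, away_team, home_goals, away_goals).
games = [
    ("Aces", "Bolts", 4, 2), ("Aces", "Comets", 5, 1),
    ("Bolts", "Aces", 2, 3), ("Bolts", "Comets", 3, 2),
    ("Comets", "Aces", 1, 4), ("Comets", "Bolts", 1, 2),
]
teams = sorted({t for g in games for t in g[:2]})

# One row per team per game: (team, opponent, is_home, goals_scored).
rows = []
for home, away, hg, ag in games:
    rows.append((home, away, 1.0, hg))
    rows.append((away, home, 0.0, ag))


def fit_ratings(rows, teams, ridge=1.0, lr=0.01, steps=5000):
    """Fit goals ~ mu + off[team] - dfn[opp] + home_adv * is_home by gradient descent."""
    mu = sum(r[3] for r in rows) / len(rows)  # league average, held fixed
    off = {t: 0.0 for t in teams}
    dfn = {t: 0.0 for t in teams}
    home_adv = 0.0
    for _ in range(steps):
        # Gradients start from the ridge penalty, which shrinks ratings toward 0.
        g_off = {t: ridge * off[t] for t in teams}
        g_dfn = {t: ridge * dfn[t] for t in teams}
        g_home = 0.0
        for team, opp, is_home, goals in rows:
            err = mu + off[team] - dfn[opp] + home_adv * is_home - goals
            g_off[team] += err
            g_dfn[opp] -= err
            g_home += err * is_home
        for t in teams:
            off[t] -= lr * g_off[t]
            dfn[t] -= lr * g_dfn[t]
        home_adv -= lr * g_home
    return off, dfn, home_adv


off, dfn, home_adv = fit_ratings(rows, teams)
```

Here a higher `off` means more goals scored than average after accounting for the defenses faced, and a higher `dfn` means fewer goals allowed after accounting for the offenses faced.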
Assessing Predictions: Run Multiple Models
In data science, no single model presents the whole picture. Each one makes assumptions, highlights certain patterns, and has its own strengths and weaknesses. An ensemble model combines the predictions of multiple models. Instead of relying on one method, this approach brings together the strengths of several. Within the final five submissions in the competition, many employed multiple modeling approaches.
The WAIAI team took the same approach, using an ensemble model to benchmark performance. Taking performance metrics such as expected goals, saves, and goals scored into account, while also considering context such as time on ice and home ice advantage, allowed student teams to estimate team strength while controlling for variability. Interestingly, the WAIAI team’s ensemble model matched very well with the ensemble of all submitted models.
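At its simplest, an ensemble of win-probability models is a weighted average of their predictions. The sketch below uses made-up probabilities and assumed weights; in practice, weights are usually chosen by comparing performance on held-out games.

```python
def ensemble_probability(probabilities, weights):
    """Weighted average of several models' home-win probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


# Made-up predictions for one matchup from three hypothetical models.
model_probs = [0.62, 0.55, 0.58]  # e.g. logistic regression, Elo, Poisson

equal_blend = ensemble_probability(model_probs, [1, 1, 1])
# A 60/40 blend of the first two models only.
weighted_blend = ensemble_probability(model_probs[:2], [0.6, 0.4])
```

Averaging tends to help most when the component models make different kinds of errors, which is why ensembles often combine structurally different methods.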
What Statistical Models Did Student Teams Use in this Competition?
Among statistical models, here’s what stood out when we analyzed the 458 Phase 1 submissions. The best student teams often compared or utilized several models – taking an ensemble approach to the data:
- Logistic regression
- Ridge regression
- Bradley-Terry models
- Poisson regression
- Elo model
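Of the models listed above, Bradley-Terry may be the least familiar. It gives each team a positive strength and sets the probability that team A beats team B to strength A divided by the sum of both strengths. The sketch below fits it with a standard iterative update on made-up head-to-head records. It assumes every team has at least one win and one loss, and it is an illustration rather than any submission's code.

```python
def fit_bradley_terry(wins, teams, iters=200):
    """Estimate Bradley-Terry strengths from wins[(a, b)] = times a beat b."""
    strength = {t: 1.0 for t in teams}
    for _ in range(iters):
        updated = {}
        for i in teams:
            total_wins = sum(wins.get((i, j), 0) for j in teams if j != i)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in teams if j != i
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {t: v / norm for t, v in updated.items()}  # strengths sum to 1
    return strength


def win_probability(strength, a, b):
    """Probability that team a beats team b under the fitted model."""
    return strength[a] / (strength[a] + strength[b])


# Made-up head-to-head records.
wins = {
    ("Aces", "Bolts"): 3, ("Bolts", "Aces"): 1,
    ("Aces", "Comets"): 3, ("Comets", "Aces"): 1,
    ("Bolts", "Comets"): 2, ("Comets", "Bolts"): 2,
}
teams = ["Aces", "Bolts", "Comets"]
strength = fit_bradley_terry(wins, teams)
p_aces_over_bolts = win_probability(strength, "Aces", "Bolts")
```

Because strengths are estimated jointly from all pairings, a team's rating reflects who it beat as well as how often it won.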
Student Team Spotlights
Selecting the semi-finalists and finalists from many thoughtful and rigorous submissions was no easy task.
In Phase 1, the most successful student teams built accurate models, told clear, convincing stories about how their models worked and why their decisions made sense within the limits of the submission form, and produced data visualizations that clearly communicated their key findings about team performance and offensive line quality disparity.
Phases 2 and 3 student team presentations blended statistical rigor with thoughtful design, communicating complex ideas in accessible ways. Teams paired their analysis with a clear narrative structure, walking the judges through their methodology step by step and describing their insights for the WHL Commissioner.
WANT TO SEE THEIR WORK?
Explore the top teams’ presentations and learn more about their methods.
Methods Submitted in Phase 1
Data Prep
First, we converted time-on-ice to minutes and removed empty-net data. Second, we aggregated events into game-level totals and reshaped the data so each game produced one row per team with opponent, home or away status, goals, shots, expected goals, and penalties. Finally, we bucketed events into even-strength, power-play, and penalty-kill splits for modeling. We created home/away indicators, opponent identifiers, even-strength unit labels, per-minute and per-60 xG rates, time-on-ice shares, and O1–O2 disparity ratios.
Software
We used R to clean and organize the data, combine game events into team summaries, and calculate key stats like xG-per-minute. We also used it to build statistical models and simulate game outcomes. AI tools helped us debug and write code, make modeling decisions, and better understand the statistical methods we were applying.
Statistical Methods
We used Bayesian mixed-effects regression models to estimate each team’s scoring rate per minute, adjusting for opponent strength by including both team and opponent effects. We modeled scoring on a log scale so differences represented proportional changes in scoring rates. We calculated home and away performance separately and then combined those results using time-on-ice weights. For matchup prediction, we calculated each game’s phase minutes using team-specific penalty rates to project power-play and penalty-kill time, with the remaining minutes assigned to even strength. We then applied Poisson goal distributions within each phase and ran 30,000 Monte Carlo simulations per game to estimate win probabilities.
Creating team power rankings and matchup win probabilities: We built rankings using opponent-adjusted xG rates, separating home and away performance across even-strength, power-play, and penalty-kill, then combining those results using time-on-ice weights. For matchup projections, we simulated penalties to allocate phase minutes, converted phase-specific xG rates into Poisson goal expectations, and ran 30,000 Monte Carlo simulations per game to estimate win probabilities.
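The general technique described here, Poisson goal counts fed into a Monte Carlo simulation, can be sketched as below. This is not the team's code: the phase rates and minutes are made up, the Poisson sampler is a simple textbook method, and splitting tied games evenly is an assumption standing in for overtime.

```python
import math
import random


def expected_goals(rates_per_min, minutes):
    """Sum phase-specific scoring rates (per minute) over projected phase minutes."""
    return sum(rates_per_min[phase] * minutes[phase] for phase in minutes)


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def simulate_home_win_prob(home_lam, away_lam, n_sims=30000, seed=0):
    """Estimate the home win probability; tied games count as half a win (assumption)."""
    rng = random.Random(seed)
    home_wins = 0.0
    for _ in range(n_sims):
        home_goals = sample_poisson(rng, home_lam)
        away_goals = sample_poisson(rng, away_lam)
        if home_goals > away_goals:
            home_wins += 1.0
        elif home_goals == away_goals:
            home_wins += 0.5
    return home_wins / n_sims


# Made-up projection: minutes per phase and xG rates per minute for each side.
minutes = {"even": 50.0, "pp": 5.0, "pk": 5.0}
home_rates = {"even": 0.045, "pp": 0.110, "pk": 0.015}
away_rates = {"even": 0.040, "pp": 0.100, "pk": 0.020}

home_lam = expected_goals(home_rates, minutes)
away_lam = expected_goals(away_rates, minutes)
home_win_prob = simulate_home_win_prob(home_lam, away_lam)
```

Fixing the random seed makes the estimate reproducible, and with 30,000 simulations the sampling noise in the win probability is well under one percentage point.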
Examining offensive line quality disparity: We isolated even-strength plays and estimated how many xG-per-minute each offensive line produced, while adjusting for the defensive pairings and opponents they faced. We calculated separate home and away scoring rates, combined them using time-on-ice weights, and measured offensive line disparity as the ratio of O1 scoring rate to O2 scoring rate for each team.
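The final arithmetic in this step, a time-on-ice weighted average followed by a ratio, can be illustrated with made-up numbers. The function names and values are hypothetical, and the opponent adjustment that precedes this step is not shown.

```python
def toi_weighted_rate(home_rate, home_toi, away_rate, away_toi):
    """Combine home and away xG-per-minute rates, weighting each by time on ice."""
    return (home_rate * home_toi + away_rate * away_toi) / (home_toi + away_toi)


def line_disparity(o1_rate, o2_rate):
    """Ratio of first-line to second-line scoring rate; 1.0 means balanced lines."""
    return o1_rate / o2_rate


# Made-up adjusted rates (xG per minute) and minutes for one team.
o1_rate = toi_weighted_rate(0.060, 400.0, 0.050, 400.0)
o2_rate = toi_weighted_rate(0.046, 300.0, 0.042, 300.0)
disparity = line_disparity(o1_rate, o2_rate)
```

A ratio above 1 indicates the first line out-produces the second; the further above 1, the more a team depends on its top line.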
Visualizing offensive line quality disparity and team strength data: We plotted each team’s offensive line disparity against its overall team strength Z-score to compare lineup balance with total performance. We added league average reference lines and labeled quadrants to clearly show each team’s positioning. We also calculated the correlation, R², and p-value to quantify the strength and statistical significance of the relationship.
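For a two-variable relationship like this one, the correlation and R² are closely linked: with a single predictor, R² is the square of the Pearson correlation. A minimal sketch on made-up values follows. The p-value is omitted because it needs a t-distribution, which a statistics library would normally supply.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for five teams.
disparity = [1.05, 1.20, 1.35, 1.10, 1.50]  # O1/O2 scoring-rate ratio
strength_z = [0.8, -0.2, 0.1, 0.5, -1.2]    # team strength Z-score

r = pearson_r(disparity, strength_z)
r_squared = r ** 2  # share of variance explained by a one-predictor linear fit

# League-average reference lines for a quadrant plot.
avg_disparity = sum(disparity) / len(disparity)
avg_strength = sum(strength_z) / len(strength_z)
```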
Model Evaluation
We assessed model performance by checking R-hat values and effective sample sizes to ensure the Bayesian models were stable. We confirmed that xG rates aligned with league averages. For simulations, we verified consistency by confirming that stronger teams consistently generated higher xG and higher win probabilities across repeated matchups.
Methods Submitted in Phase 1
Data Prep
We checked the 42,000 matchup rows for line names containing “PP” or goalie fields indicating an empty net, and filtered to 5-on-5 situations only, since power-play and empty-net situations could inflate team quality. We then aggregated these into 1,312 game-level summaries of goals, expected goals, shots, and penalties. Using Ridge regression, we estimated defense-adjusted expected goal ratings for each offensive line and defensive pairing. From the game-level data, we derived save percentage, penalty differential, and shot volume.
Software
We used Python as our primary programming language. Pandas handled data cleaning, scikit-learn implemented our Ridge regression, logistic regression, and Bradley-Terry models, scipy handled statistical analysis and optimization, and matplotlib created visualizations for feature importance and model performance. Claude assisted with code implementation and debugging. To share code between teammates, Google Colab was used throughout.
Statistical Methods
We employed Ridge regression on even strength rows to estimate offensive line quality while controlling for opponent defensive strength and TOI. Spearman’s correlation assessed feature relationships with season win percentage, where save percentage (ρ = 0.659) and xG differential (ρ = 0.533) showed high association. Logistic regression confirmed this, with its save percentage coefficient (+1.293) exceeding all other features, meaning goaltending carried more predictive weight than xG, Ridge ratings, penalties, and shot volume. We built a Bradley-Terry model to isolate individual goalie effects from team performance. We tested an ensemble model of Logistic Regression and Bradley-Terry, and found a 60/40 split produced the best results.
Creating team power rankings and matchup win probabilities: We ensembled a feature logistic regression (60%) with a Bradley-Terry goalie model (40%). For matchup probabilities, we input each home and away team’s season features into the ensemble to predict home win probability. Power rankings come from each team’s average predicted win probability against all 31 opponents on neutral ice, averaging home and away perspectives.
Examining offensive line quality disparity: Using Ridge regression, we modeled expected goals per matchup as a function of which offensive line and defensive pairing were on ice, controlling for opponent quality and weighting by TOI. This produced a defense adjusted xG coefficient for every line. We computed the ratio of first to second line coefficients to get that disparity.
Visualizing offensive line quality disparity and team strength data: We chose a scatter plot to show the relationship between offensive line disparity and team power rating. We used distinct shapes to separate the three tiers of teams. Key outlier teams are labeled directly on the plot. A trend line shows the overall relationship is weak, communicating that line balance alone does not guarantee success.
Model Evaluation
We split games 75/25 (984 training, 328 testing) and evaluated using Brier score, log loss, accuracy, and AUC. Our ensemble achieved Brier 0.235, log loss 0.663, accuracy 61.3%, and AUC 0.604, outperforming both component models individually. We also built Elo, Poisson, and linear regression models, but none surpassed the ensemble.
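The four metrics named here are straightforward to compute by hand, which helps in interpreting them. The sketch below uses made-up outcomes and predictions and is not the team's evaluation code.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of games where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def auc(y_true, p_pred):
    """Probability that a random home win is scored above a random home loss."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return score / (len(pos) * len(neg))


# Made-up test set: 1 = home win, with predicted home-win probabilities.
y_true = [1, 0, 1, 0]
p_pred = [0.8, 0.3, 0.6, 0.7]

metrics = {
    "brier": brier_score(y_true, p_pred),
    "log_loss": log_loss(y_true, p_pred),
    "accuracy": accuracy(y_true, p_pred),
    "auc": auc(y_true, p_pred),
}

# Reference point: always predicting 0.5 gives Brier 0.25 and log loss ln(2).
coin_flip_brier = brier_score(y_true, [0.5] * len(y_true))
coin_flip_log_loss = log_loss(y_true, [0.5] * len(y_true))
```

The coin-flip reference values (0.25 and roughly 0.693) are a useful yardstick: a model has predictive value only to the extent that its Brier score and log loss fall below them.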
Check out the 2026 Semifinalists Here
Check out the 2026 Finalists Here
- 2016 Chino Hills, The Pingry School, Basking Ridge, NJ, USA
- Cougars, Central Jersey College Prep Charter, Somerset, NJ, USA
- Reptalian Regressors, Jesuit High School, Portland, OR, USA
- Stem Wolves 2, Downingtown STEM Academy, Downingtown, PA, USA
- Version 1.0, Liberty High School, Frisco, TX, USA
