LLM Executive Functions - Planning (Kaggle)

Benchmarking planning and optimization capabilities of large language models on Kaggle

Introduction

This benchmark was built for the Google DeepMind × Kaggle Measuring Progress Toward AGI — Cognitive Abilities hackathon, which asks participants to design evaluations that isolate specific cognitive abilities rather than lumping them together under “general reasoning.” The hackathon is organized around the five-faculty framework in Burnell et al. (2026), Measuring progress toward AGI: A cognitive framework. Our submission targets the Executive Functions track — specifically planning, the ability to formulate sequences of future actions to achieve a goal.

Planning is notoriously hard to isolate. As Ryan Burnell put it during the hackathon livestream, many existing benchmarks “give a complex problem and just require the model to solve it but [don’t] really tap into any specific component of executive functions.” A model can score well on a reasoning benchmark while still being bad at the things planning actually asks of it: respecting multiple interacting constraints at once, searching a combinatorial space of assignments, and trading off competing objectives when no option satisfies everything.

We use employee shift scheduling as the probe. It’s a natural planning task — a solver has to honour hard feasibility constraints (every shift covered, no one working two shifts in a day) while maximising a lexicographic objective over worker preferences. Unlike open-ended reasoning problems, scheduling can be formulated as a constraint program with a verifiably optimal solution, computed offline with Google’s OR-Tools CP-SAT solver. That ground truth turns evaluation into a continuous distance-to-optimum rather than a binary pass/fail.

Two things fall out of that setup. First, the SAT-verified score is genuinely discriminating — it separates models that produce infeasible plans, models that satisfy some but not all worker requests, and models that land on (or right beside) the unique optimal schedule. Second, by dialling up the number of workers and days, we can push each model until its planning budget runs out and watch exactly where it breaks. The sections below describe the task and scoring in detail, then work through the results — including a surprising metacognition signal from GPT-5.4 when the problem size gets large enough.

Methodology

We frame planning as a constrained optimization problem: employee shift scheduling. Each problem instance is defined by W workers, D days, and S shifts per day, along with a binary matrix of shift requests (which shifts each worker would prefer) and a matrix of tiebreaker weights over every worker-day-shift slot.
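To make the input format concrete, here is a toy instance in the JSON schema described by the system prompt (field names are from the prompt shown later in this post; the tiny values are illustrative, not from the benchmark):

```python
# Illustrative toy instance (W=2, D=1, S=2) in the benchmark's JSON schema.
toy_instance = {
    "num_workers": 2,   # W, workers indexed 0 to W-1
    "num_days": 1,      # D, days indexed 0 to D-1
    "num_shifts": 2,    # S, shifts per day indexed 0 to S-1
    # shift_requests[w][d][s] == 1 iff worker w requests shift s on day d
    "shift_requests": [[[1, 0]], [[0, 1]]],
    # tiebreaker_weights[w][d][s]: secondary-objective weight for that slot
    "tiebreaker_weights": [[[3, 1]], [[0, 2]]],
}

num_slots = (toy_instance["num_workers"]
             * toy_instance["num_days"]
             * toy_instance["num_shifts"])  # W x D x S assignment slots
```

The Small tier's 5 workers × 7 days × 3 shifts gives 105 such slots by the same count.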

The model must produce a complete schedule satisfying two hard constraints:

  1. Each shift on each day is assigned to exactly one worker.
  2. Each worker works at most one shift per day.

The objective is lexicographic — first maximise the number of fulfilled shift requests, and then, among solutions that tie on that count, maximise the sum of tiebreaker weights. The tiebreaker serves a dual purpose: it creates a secondary optimisation challenge that rewards deeper planning, and it guarantees a unique optimal solution for every instance (verified algorithmically by enumerating solutions at the optimal objective value in OR-Tools’ CP-SAT solver). A unique ground truth means we can score not just whether a plan is valid but exactly how close it is to optimal.
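A minimal sketch of that lexicographic objective (our own illustrative code, not the benchmark repository's): scoring a schedule as a Python tuple makes the comparison lexicographic for free, since tuples compare element by element.

```python
def objective(schedule, shift_requests, tiebreaker_weights):
    """Objective value of a feasible schedule as a lexicographic tuple.

    `schedule` is a list of (worker, day, shift) triples. Because Python
    compares tuples lexicographically, taking the max of these tuples
    implements "maximise fulfilled requests first, then tiebreaker weight".
    """
    fulfilled = sum(shift_requests[w][d][s] for w, d, s in schedule)
    weight = sum(tiebreaker_weights[w][d][s] for w, d, s in schedule)
    return (fulfilled, weight)
```

Any schedule that fulfils more requests beats any schedule that fulfils fewer, regardless of tiebreaker weight, which is exactly the ordering the benchmark optimises.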

Problem tiers

We define three tiers that scale the combinatorial load, plus a finer-grained gradient set used later to probe where refusal sets in:

Tier     Workers   Days   Shifts/Day   Assignment Slots
Small    5         7      3            105
Medium   15        14     3            630
Large    40        28     3            3,360

Each tier contains 10 problem instances (seeds 1-10), for 30 problems in the core benchmark.

Dataset generation

All instances are generated programmatically from fixed random seeds — there is no contamination risk from public training data.

  1. For each worker-day-shift slot, a shift request is sampled with probability 0.2.
  2. Tiebreaker weights are sampled as a random permutation over all slots.
  3. The CP-SAT solver verifies that the optimal solution is unique; if it is not, new weights are drawn until it is.
  4. Both the problem instance and its canonical solution are saved as JSON.
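The steps above can be sketched as follows (our reconstruction of the recipe, with illustrative names; the CP-SAT uniqueness check in step 3 is omitted here):

```python
import json
import random

def generate_instance(num_workers, num_days, num_shifts, seed):
    """Sketch of the generation recipe: requests ~ Bernoulli(0.2),
    tiebreaker weights a random permutation over all W*D*S slots."""
    rng = random.Random(seed)
    # Step 1: sample each slot's request with probability 0.2.
    requests = [[[1 if rng.random() < 0.2 else 0
                  for _ in range(num_shifts)]
                 for _ in range(num_days)]
                for _ in range(num_workers)]
    # Step 2: tiebreaker weights as a random permutation of 0..slots-1.
    slots = num_workers * num_days * num_shifts
    perm = list(range(slots))
    rng.shuffle(perm)
    it = iter(perm)
    weights = [[[next(it) for _ in range(num_shifts)]
                for _ in range(num_days)]
               for _ in range(num_workers)]
    return {
        "num_workers": num_workers,
        "num_days": num_days,
        "num_shifts": num_shifts,
        "shift_requests": requests,
        "tiebreaker_weights": weights,
    }

# Step 4: the instance serialises directly to JSON.
inst = generate_instance(5, 7, 3, seed=1)
as_json = json.dumps(inst)
```

Because the weights form a permutation, no two slots tie on the secondary objective's raw weights, which is what makes a unique optimum achievable (step 3 then verifies it).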

LLM interaction protocol

Evaluation uses a single-turn protocol with a structured tool call, plus a retry-only fallback:

  1. Primary turn — reason and submit. The model receives the problem as JSON alongside a system prompt instructing it to solve step-by-step using logical reasoning (explicitly prohibited from using code or named algorithms) and, in the same response, call the submit_solve_result tool with assignment triples [worker_index, day_index, shift_index].
  2. Fallback turn — forced tool call. If (and only if) the model fails to emit a tool call on the primary turn, a follow-up user message reiterates that the tool call is required, and the request is re-sent with tool_choice forced to submit_solve_result. This salvages runs where the model reasoned correctly but forgot the structured submission.

Having reasoning and structured output share a turn keeps the model’s chain-of-thought anchored to the schedule it actually submits. Every conversation — system prompt, problem JSON, assistant replies, and the final tool call — is logged per-question so results can be audited after the fact (click any cell in the heatmaps below).
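The two-turn protocol can be sketched as a small harness loop (illustrative only: `call_model` stands in for whichever provider API is used and is assumed to return the response text plus the parsed tool call, or None):

```python
SYSTEM_PROMPT = "<system prompt text>"  # placeholder; the real prompt is shown later in this post

def run_protocol(call_model, problem_json):
    """Sketch of the single-turn protocol with retry-only fallback."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem_json},
    ]
    # Primary turn: the model reasons and should call the tool in the same response.
    text, tool_call = call_model(messages, force_tool=False)
    if tool_call is None:
        # Fallback turn: reiterate the requirement and force the tool choice.
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content":
                         "You must call submit_solve_result with your assignment triples."})
        text, tool_call = call_model(messages, force_tool=True)
    return tool_call  # still None => scored 0 ("no tool call was made")
```

Note the fallback only fires when no tool call was emitted; a model that calls the tool with a bad schedule gets no second chance.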

Scoring (continuous, 0-100)

Because we have a unique optimal solution, we score on a continuous scale rather than pass/fail:

  • 0 — the solution violates a hard constraint (infeasible) or no tool call was made.
  • 1-99 — feasible but suboptimal; score = (requests fulfilled by LLM / requests fulfilled by optimal) × 100, capped at 99.
  • 99-100 — the optimal number of requests is fulfilled; the remaining point is awarded proportionally to tiebreaker-weight accuracy.
  • 100 — exact match with the unique optimal solution.

Feasibility is checked by confirming every index is in range, every (day, shift) has exactly one assigned worker, and no worker is assigned more than one shift on the same day. The ground truth itself is computed with Google’s OR-Tools CP-SAT solver, using boolean decision variables for each worker-day-shift assignment and encoding the lexicographic objective as a weighted sum where request fulfillment dominates the tiebreaker by a large constant factor.
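A sketch of the feasibility check and the 0-100 scoring rule (our reconstruction with illustrative names; in particular, the exact formula for the final tiebreaker point is an assumption, and the optimal objective values are assumed to have been precomputed with CP-SAT):

```python
def check_feasible(assignments, num_workers, num_days, num_shifts):
    """Indices in range, each (day, shift) covered exactly once,
    and no worker assigned two shifts on the same day."""
    covered, worker_days = set(), set()
    for w, d, s in assignments:
        if not (0 <= w < num_workers and 0 <= d < num_days and 0 <= s < num_shifts):
            return False
        if (d, s) in covered or (w, d) in worker_days:
            return False  # double-covered shift, or worker double-booked
        covered.add((d, s))
        worker_days.add((w, d))
    return len(covered) == num_days * num_shifts  # every shift covered

def score(assignments, inst, optimal_requests, optimal_weight):
    """Continuous 0-100 score against the precomputed unique optimum."""
    W, D, S = inst["num_workers"], inst["num_days"], inst["num_shifts"]
    if not assignments or not check_feasible(assignments, W, D, S):
        return 0.0  # infeasible, or no tool call
    req = sum(inst["shift_requests"][w][d][s] for w, d, s in assignments)
    if req < optimal_requests:
        return min(99.0, 100.0 * req / optimal_requests)  # capped at 99
    wt = sum(inst["tiebreaker_weights"][w][d][s] for w, d, s in assignments)
    # Assumed proportional rule for the last point; 100.0 iff the unique optimum.
    return 99.0 + wt / optimal_weight
```

Among schedules that fulfil the optimal number of requests, the unique optimum has the maximum tiebreaker weight, so the score is bounded at 100.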

This continuous, SAT-verified score gives us strong discriminatory power: it separates models that produce infeasible plans, models that fulfill some but not all requests, and models that achieve near-optimal or optimal schedules — a resolution that a binary benchmark would collapse.

Results

The results support our hypothesis: as planning complexity increases, failure rates rise and average scores fall sharply.

Complexity   Average score
Small        81.5
Medium       58.3
Large        28.6

Looking at model performance, we see that in general the larger frontier models perform better (especially as complexity increases), with a few notable exceptions.


At the Small tier the best models almost match the SAT solver. Gemini 3.1 Pro Preview lost points only through a sub-optimal secondary tiebreaker score on a single easy question, and Gemma 4 31B likewise dropped points only on secondary tiebreaker scores.

GPT-5.4 had two questions where it could not fulfill as many worker requests as the optimal SAT solver solution that Gemini and Gemma found.

Surprisingly, Opus 4.6 was one of the few models to generate an infeasible schedule - it allocated one worker to multiple shifts on the same day.

Opus 4.6 Small Q4

For the medium tasks, Gemini 3.1 Pro still performs impressively with only 3 questions where it failed to fulfill as many requests as the canonical solution.

GPT-5.4 nearly matches that performance, although with a few more missed worker requests.

Once again Opus 4.6 underperforms - this time failing to perform tool calls on 4 out of 10 questions.

In the Large tier even Gemini 3.1 Pro struggles, scoring only roughly half as well as the canonical solution. Impressively, though, it still finishes the benchmark without a single constraint violation.

Be sure to click the cells in the tables below to view the solution comparisons and conversation traces.

Small (10 models, 10 questions)
Model                       Avg     001    002    003    004    005    006    007    008    009    010
Question Avg                84.0    89.0   74.7   88.4   78.0   74.4   80.7   89.6   83.1   90.2   92.1
gemini-3.1-pro-preview      100.0   100.0  100.0  100.0  100.0  100.0  100.0  99.8   100.0  100.0  100.0
gemma-4-31b-it              99.9    99.7   99.9   100.0  99.8   100.0  100.0  99.8   99.9   99.9   99.9
gpt-5.4-2026-03-05          97.7    99.8   100.0  84.6   100.0  93.8   99.9   99.7   99.8   99.9   99.8
gpt-oss-20b                 93.8    100.0  90.9   92.3   86.7   100.0  100.0  99.9   87.5   87.5   92.9
claude-opus-4-6-default     88.1    81.8   100.0  99.9   0.0    100.0  100.0  99.9   99.9   100.0  99.9
deepseek-v3.2               83.2    90.9   72.7   100.0  93.3   43.8   86.7   92.9   93.8   93.8   64.3
gemini-2.5-flash            69.5    81.8   45.5   99.7   60.0   56.3   33.3   57.1   87.5   81.3   92.9
gpt-oss-120b                68.6    7 of 10 questions scored, in order: 99.8, 92.3, 100.0, 93.8, 100.0, 99.8, 99.9
glm-5                       65.6    90.9   0.0    92.3   93.3   0.0    86.7   99.8   0.0    99.8   92.9
claude-haiku-4-5-20251001   48.3    45.5   63.6   23.1   46.7   56.3   0.0    57.1   62.5   50.0   78.6
Medium (10 models, 10 questions)
Model                       Avg     001    002    003    004    005    006    007    008    009    010
Question Avg                64.3    66.0   67.7   53.6   69.6   65.1   63.0   68.8   68.8   49.7   70.8
gemini-3.1-pro-preview      98.9    99.8   99.9   100.0  97.6   99.9   99.7   99.7   95.0   97.6   99.7
gpt-5.4-2026-03-05          85.4    99.8   85.4   64.1   88.1   78.0   77.5   90.5   92.5   85.7   92.5
gemma-4-31b-it              73.9    86.8   68.3   94.9   78.6   61.0   92.5   42.9   62.5   64.3   87.5
gemini-2.5-flash            65.9    68.4   78.0   0.0    73.8   73.2   82.5   71.4   65.0   59.5   87.5
deepseek-v3.2               56.8    36.8   75.6   41.0   40.5   19.5   75.0   64.3   95.0   45.2   75.0
claude-opus-4-6-default     53.6    6 of 10 questions scored, in order: 99.7, 92.9, 87.8, 85.0, 92.9, 77.5
gpt-oss-20b                 50.8    65.8   53.7   25.6   45.2   70.7   15.0   57.1   52.5   54.8   67.5
glm-5                       48.9    0.0    53.7   64.1   81.0   51.2   52.5   73.8   62.5   0.0    50.0
claude-haiku-4-5-20251001   26.6    71.1   26.8   0.0    28.6   53.7   35.0   26.2   25.0   0.0    0.0
gpt-oss-120b                22.4    5 of 10 questions scored, in order: 65.8, 46.2, 56.1, 15.0, 40.5
Large (10 models, 10 questions)
Model                       Avg     001    002    003    004    005    006    007    008    009    010
Question Avg                28.6    24.9   29.6   30.1   29.2   22.7   37.0   23.3   23.9   28.1   36.9
gemini-3.1-pro-preview      53.8    60.7   46.4   61.9   78.6   45.2   48.8   32.1   47.6   20.2   96.4
claude-opus-4-6-default     47.6    52.4   53.6   0.0    0.0    42.9   59.5   61.9   69.0   69.0   67.9
gemma-4-31b-it              39.4    45.2   17.9   56.0   76.2   29.8   66.7   46.4   15.5   25.0   15.5
gpt-5.4-2026-03-05          31.2    42.9   17.9   45.2   32.1   0.0    60.7   0.0    44.0   54.8   14.3
gemini-2.5-flash            29.9    0.0    51.2   34.5   25.0   38.1   40.5   19.0   0.0    27.4   63.1
deepseek-v3.2               26.1    26.2   17.9   25.0   21.4   25.0   52.4   27.4   21.4   22.6   21.4
gpt-oss-20b                 19.0    0.0    27.4   20.2   19.0   21.4   19.0   28.6   20.2   20.2   14.3
gpt-oss-120b                18.7    21.4   17.9   33.3   19.0   25.0   22.6   17.9   0.0    15.5   14.3
glm-5                       14.3    0.0    19.0   25.0   20.2   0.0    0.0    0.0    21.4   26.2   31.0
claude-haiku-4-5-20251001   5.8     0.0    27.4   0.0    0.0    0.0    0.0    0.0    0.0    0.0    31.0

Signs of Metacognition: GPT declining to answer

An early version of this benchmark separated the tool call from solving the problem and used a slightly different prompt. In that setup GPT-5.4 scored 0 on the Large problem set, while other frontier models (Gemini 3.1 Pro, Opus 4.6) scored 40-60%.

System Prompt Diff
Unchanged lines are shown once; lines marked "-" appear only in the old prompt, lines marked "+" only in the new prompt.

  You are given a shift assignment problem defined by a JSON input file with the following fields:

  * `num_workers`: number of workers (W), indexed 0 to W-1
  * `num_days`: number of days (D), indexed 0 to D-1
  * `num_shifts`: number of shifts per day (S), indexed 0 to S-1
  * `shift_requests`: a nested list where `shift_requests[w][d][s]` is:
    * 1 if worker w requests shift s on day d
    * 0 otherwise
  * `tiebreaker_weights`: a nested list where `tiebreaker_weights[w][d][s]` is the weight for assigning worker w to shift s on day d

  Constraints:

  * Each shift must be assigned to exactly one worker
  * Each worker can work at most one shift per day
  * Total shifts should be distributed as evenly as possible across workers

  Objective (lexicographic optimisation):

  1. First, maximise the total number of requested shifts assigned
  2. Among all such solutions, maximise the total sum of tiebreaker weights

  Instructions:

  * Do NOT use code or refer to algorithms like Hungarian method or solvers
  * Solve step-by-step using logical reasoning directly from the JSON input
+ * If an optimal solution is not possible then you MUST generate the best effort. You are running in an automated harness that REQUIRES a solution.

  Method to follow:

  1. Interpret the JSON structure and map workers, days, and shifts clearly
  2. For each day and shift, list all workers who requested it
  3. Determine the maximum number of requests that can be satisfied per day (upper bound reasoning)
  4. Build a schedule that achieves the global maximum number of satisfied requests
  5. If multiple assignments are possible, use tiebreaker weights to decide
  6. Ensure no worker is assigned more than one shift per day
  7. Ensure workload is balanced across workers
  8. Clearly justify each assignment decision

  Output:

- * A full schedule table (Day x Shift -> Worker)
- * Total number of requests satisfied
- * Total tiebreaker score
- * Explanation of why the solution is optimal and cannot be improved
+ * Reason through the problem step-by-step in your response: include the schedule table (Day x Shift -> Worker), the total number of requests satisfied, the total tiebreaker score, and an explanation of why the solution is optimal and cannot be improved.
+ * Then, in the SAME response, submit your final solution by calling the `submit_solve_result` tool. Include all assignment triples [worker_index, day_index, shift_index] from your schedule. All indices are zero-based (the first worker is 0, not 1).
+ * You MUST call `submit_solve_result`. Failure to do so is an error.

  Be precise, structured, and show reasoning clearly.

Investigating the conversation traces, the model was declining to answer the question with refusal messages similar to this:

I'm sorry, but I can't reliably solve this instance exactly by hand from the raw JSON as provided.

It did, however, answer easier questions quite well. I ran a follow-up investigation to find how much complexity it takes before the model starts declining to answer.

The table below shows the results. I scaled the number of workers and days from the Medium tier's 15 workers and 14 days up to the Large tier's 40 workers and 28 days.

Interestingly, GPT-5.4 started refusing earlier than the smaller GPT-5.4-mini. Possibly GPT-5.4 can estimate that its solution will be far from optimal earlier than the smaller model can, although this would require further investigation.

Click the cells to see the conversation traces.

Question   Workers   Days   Assignment Slots   gpt-5.4   gpt-5.4-mini
001        15        14     630                94.7      0.0
002        18        16     864                91.7      37.5
003        21        17     1,071              80.0      0.0
004        23        19     1,311              75.0      50.0
005        26        20     1,560              68.3      30.0
006        29        22     1,914              refusal   31.8
007        32        23     2,208              refusal   34.8
008        34        25     2,550              refusal   refusal
009        37        26     2,886              refusal   refusal
010        40        28     3,360              refusal   refusal