Fully Doxxed

Evaluating LLMs

To follow along with this tutorial, check out Sudoku-First Evals.ipynb in the Sudoku Repo.

In the last article, we introduced 4x4 Sudoku as an interesting task for better understanding LLMs and showed how to prompt a model on the task. In this article, let's get a baseline of OpenAI models on the task before we turn to fine-tuning.

Baselines of GPT-3.5 and GPT-4 on 4x4 Sudoku

Before we begin training GPT-3.5 on Sudoku, it is useful to first get a baseline. So, in this article, we will focus on evaluating GPT-3.5 and GPT-4-turbo on 4x4 Sudoku puzzles without any special training. This challenge not only tests their logical reasoning and spatial understanding but may also give us insight into their problem-solving methodologies.

Evaluation Set Construction

First, we need to build an evaluation set. An evaluation set, or test set, is data that we hold out for assessing the performance of our models. We won’t use this data for training.

Our evaluation set will comprise 100 samples of 4x4 Sudoku puzzles at varying levels of completion. Puzzles are created by beginning with a solved puzzle and gradually removing cells. Puzzles with fewer solved cells, indicating a higher difficulty, are given more weight in our selection process. This approach aims to create a challenging benchmark for the models. We will save our evaluation set for future use.

Here's the code:

from copy import deepcopy

# construct_puzzle_solution, pluck, and weighted_random_choice are helpers from the notebook
eval_set = []
for i in range(100):
    solution = construct_puzzle_solution()        # a fully solved 4x4 grid
    _, _, history = pluck(deepcopy(solution))     # boards produced as cells are removed
    puzzle = weighted_random_choice(history[1:])  # favor harder (emptier) boards
    eval_set.append((puzzle, solution))
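weighted_random_choice is what biases selection toward harder boards. A minimal sketch of how such a function could work, assuming each candidate board is weighted by its number of empty cells (an assumption, not necessarily the notebook's exact implementation):

import random

def weighted_random_choice(boards):
    # Sketch only: weight each board by how many cells are still empty (0),
    # so that harder, emptier boards are picked more often.
    weights = [sum(cell == 0 for row in board for cell in row) for board in boards]
    return random.choices(boards, weights=weights, k=1)[0]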

After we construct the evaluation set, we serialize it to JSON so we can save it to a file for later use:

import json

json_string = json.dumps({'eval': eval_set})
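The line above only builds the JSON string; to persist the evaluation set we still need to write it to disk. The path below matches the one the notebook loads from later:

with open('./data/evalset', 'w') as file:
    file.write(json_string)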

Testing Methodology

For testing and training, I'd like to introduce a slightly more complex prompt that I call the brief_analysis prompt. Like our previous prompt, it requires the model to analyze a given puzzle, solve a single cell, and articulate its reasoning process in natural language. The brief_analysis prompt additionally asks the model to act like a Sudoku tutor, is more detailed about the desired output, and provides an example. I've found this prompt to perform slightly better, so I prefer it.

We are working on the following sudoku puzzle (each sub-list represents a row):

You are a sudoku tutor. Create a brief analysis that finds an unsolved cell and solves it. Do not repeat the puzzle (which the student has seen). Just solve one cell that currently has a zero. I suggest you start by examining which rows, columns, or regions have the most cells already solved. You can use this to identify one or more cells that are not currently solved but may be solvable from the available information. Then identify the solution to that cell.

Your analysis must then solve ONLY one cell by replacing 0 with the correct number. Please don't include the puzzle in your analysis, we will provide that to the student separately.

Example puzzle: [[0, 0, 0, 0], [0, 0, 3, 2], [1, 0, 0, 0], [2, 0, 1, 4]]

Your analysis could look like this: The row with the most solved cells is row 4 with numbers: 1 2 and 4. Because each row must contain the digits 1-4, the unsolved cell must be 3. Therefore row 4, column 2 is the number 3.
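In code, this prompt can be kept as an ordinary Python format string with a placeholder where the puzzle is inserted. A sketch of that layout (only the opening of the template is repeated here; the variable name matches the brief_analysis used below):

brief_analysis = (
    "We are working on the following sudoku puzzle "
    "(each sub-list represents a row):\n\n{}\n\n"
    "You are a sudoku tutor. Create a brief analysis that finds an unsolved "
    "cell and solves it. "
    # ... remainder of the instructions and the worked example shown above ...
)

# Filling in a concrete puzzle:
prompt = brief_analysis.format([[0, 0, 0, 0], [0, 0, 3, 2], [1, 0, 0, 0], [2, 0, 1, 4]])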

Let's start by loading the evaluation set:

with open('./data/evalset', 'r') as file:
    content = file.read()
eval_set_loaded = json.loads(content)['eval']

Following that, we load the model to test and an extraction model:

# Model to test
model_name = "gpt-3.5-turbo-1106"
model = ChatOpenAI(model=model_name)
model.temperature = 0.0

# Extraction model
extraction_model = ChatOpenAI(model="gpt-4-0613")
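The ChatOpenAI class above, together with the HumanMessage and get_openai_callback helpers used in the evaluation loop below and IPython's clear_output, can be imported along these lines. Module paths vary across LangChain versions, so treat this as one plausible layout rather than the notebook's exact imports:

# Adjust module paths to your installed LangChain version.
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from langchain_community.callbacks import get_openai_callback
from IPython.display import clear_output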

We use a model temperature of 0 to minimize randomness, which would otherwise make it harder to compare evaluations on the same evaluation set. Temperature is a parameter that determines the randomness or "creativity" of a model.1

Please note that even at temperature zero, OpenAI models are not deterministic; they don't always give the same answer.2 The extraction model is used to normalize the answer provided by gpt-3.5-turbo into a form that can be read by a computer.
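That normalization happens in the notebook's analysis_to_puzzle_solution helper, which we call in the loop below. As a hedged sketch of how such a helper could work (an assumption, not necessarily the notebook's implementation), it could ask the extraction model to apply the analysis to the grid and return the result as JSON:

import json

def analysis_to_puzzle_solution(extraction_model, puzzle, reasoning):
    # Sketch only: ask the extraction model to apply the tutor's analysis to the
    # puzzle and hand back the updated grid as a JSON list of lists.
    prompt = (
        f"Here is a 4x4 sudoku puzzle (rows as lists): {puzzle}\n"
        f"Here is an analysis that solves exactly one cell: {reasoning}\n"
        "Return only the updated puzzle as a JSON list of lists, nothing else."
    )
    message = extraction_model.invoke([HumanMessage(content=prompt)])
    return json.loads(message.content)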

We run a loop to evaluate the model on each sample in the evaluation set. Since we will be making several hundred calls to the models, we use get_openai_callback() to make sure we see the total cost at the end of the run. Here's the code:

outcomes = []
with get_openai_callback() as cb:
    for sample in eval_set_loaded:
        # Get puzzle and solution
        puzzle = sample[0]
        solution = sample[1]
        prompt = brief_analysis.format(puzzle)
        message = model.invoke([
            HumanMessage(content=prompt)
        ])
        reasoning = message.content
        try:
            proposed = analysis_to_puzzle_solution(extraction_model, puzzle, reasoning)
            result = "correct" if is_proposed_solution_valid(puzzle, solution, proposed) else "incorrect"
        except:
            proposed = puzzle
            result = 'incorrect'
            reasoning = "The model failed to produce a parseable response."
        outcomes.append({'result': result, 'proposed': proposed, 'puzzle': puzzle, 'reasoning': reasoning})
        clear_output(wait=True)
        display_outcomes(outcomes)
print(cb)

As shown, we create an outcomes list to hold the results. We iterate over the samples, loading the puzzle and solution for each one. We form the prompt using the brief_analysis template and send it off to the model. After receiving the model's response, we parse the reasoning into a proposed solution.

If the extraction model is unable to parse the reasoning (for example, if the response was off topic), the sample's result is marked "incorrect" and the reasoning is replaced with a note that the model failed to produce a parseable response. If the extraction model succeeds in parsing the reasoning, we check the proposed solution against the known solution to determine the result.
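The check itself is done by the notebook's is_proposed_solution_valid helper. A minimal sketch of what such a check could look like, assuming a valid proposal fills exactly one previously empty cell and matches the known solution (an assumption, not the notebook's exact code):

def is_proposed_solution_valid(puzzle, solution, proposed):
    # Sketch only: accept the proposal if it changes exactly one cell,
    # that cell was previously empty (0), and the new value matches the solution.
    changed = [
        (r, c)
        for r in range(4)
        for c in range(4)
        if proposed[r][c] != puzzle[r][c]
    ]
    if len(changed) != 1:
        return False
    r, c = changed[0]
    return puzzle[r][c] == 0 and proposed[r][c] == solution[r][c]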

We then write the sample's result, along with the proposed solution, the puzzle, and the reasoning, to the outcomes list. So that the results are easy to inspect, we clear the output and display the outcomes in a user-friendly interface that lets the user browse the result of each sample.

GPT-3.5-turbo-1106: 50 out of 100 correct


Run on GPT-4-turbo

Using an identical process, we can also run GPT-4-turbo on the evaluation set. GPT-4-turbo does a fair bit better, scoring 65 out of 100 compared to GPT-3.5's 50 out of 100.

GPT-4-turbo: 65 out of 100 correct


Can we solve a full puzzle?

Now that we've shown the models can solve an individual cell at least half of the time, can they solve a full puzzle? Probably not with only 50% and 65% per-cell accuracy: if each move were independent, stringing together the roughly 11 correct moves a puzzle needs would succeed only about 0.65^11 ≈ 1% of the time for GPT-4-turbo, and far less often for GPT-3.5. But let's try anyway.

First, let’s create a function to check if we’ve got a fully solved puzzle. This will allow us to know when we can stop trying to solve additional cells.

def is_fully_solved(puzzle):
    # A puzzle is complete when no cell still holds 0 (the empty marker)
    for x in puzzle:
        for y in x:
            if y == 0:
                return False
    return True
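As a quick sanity check on two illustrative grids:

print(is_fully_solved([[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]))  # True
print(is_fully_solved([[0, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]))  # False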

Let's now try to solve a batch of 4x4 Sudoku puzzles with GPT-4-turbo. We will try 20 puzzles. For each game we construct a puzzle and then begin iterating over moves. Since each puzzle should be solvable in 11 moves, we only iterate for 12 moves.

For each attempt, we format the brief_analysis template and invoke the language model. We extract the proposed solution from the reasoning using the extraction_model. If the proposed solution is valid according to is_proposed_solution_valid(), we mark the move "correct". If the model fails to produce a parseable response, or is wrong, we set the result to "incorrect" and append an error message to the reasoning. If the puzzle is fully solved, we set the result to "solved". We append the current puzzle state, reasoning, and result to the moves list. If the result is "solved" or "incorrect", we break out of the inner loop and are finished with that game. After the inner loop ends, we append the game dictionary to the games list.

games = []
for i in range(20):
    moves = []
    solution = construct_puzzle_solution()
    puzzle, _, _ = pluck(deepcopy(solution))
    moves.append((puzzle, "", "correct"))
    for j in range(12):
        prompt = brief_analysis.format(puzzle)
        message = model.invoke([
            HumanMessage(content=prompt)
        ])
        reasoning = message.content
        try:
            proposed = analysis_to_puzzle_solution(extraction_model, puzzle, reasoning)
            clear_output(wait=True)
            display_sudoku_comparison(proposed, puzzle)
            result = "correct" if is_proposed_solution_valid(puzzle, solution, proposed) else "incorrect"
            if proposed:
                puzzle = proposed
        except:
            result = 'incorrect'
            reasoning += " The model failed to produce a parseable response."
            proposed = puzzle
        if is_fully_solved(puzzle):
            result = "solved"
        moves.append((puzzle, reasoning, result))
        if result == "solved" or result == "incorrect":
            break
        puzzle = proposed
    games.append({'move': moves})
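Once the games have been collected, a short summary over the moves structure built above tells us whether any game reached a full solve:

# Each game's last recorded move carries its final result
solved_games = sum(1 for game in games if game['move'][-1][2] == "solved")
print(f"{solved_games} of {len(games)} games fully solved")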

When run, GPT-4-turbo is not able to fully solve any puzzle:

GPT-4 Turbo - Full Solve Attempts

[Interactive results viewer: numbered steps for each of the 20 attempted games; none of them reaches a full solve.]

This is not great, so there is real room for improvement from fine-tuning. We will turn to that next.

Conclusion

We've looked at how to evaluate models on 4x4 sudoku and now have a baseline of performance for the models. Even GPT-4-turbo lacks sufficient accuracy to finish a single 4x4 sudoku puzzle, motivating us to try fine-tuning. We will turn to that in our next article. Stay tuned!

Footnotes

  1. Here's an interesting discussion: Is Temperature the Creativity Parameter of Large Language Models?

  2. There is speculation that OpenAI models are non-deterministic at zero temperature due to their mixture-of-experts design: https://twitter.com/morgymcg/status/1749841027349827989

© 2024 by Fully Doxxed. All rights reserved.