> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess.
Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which no side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, that will be completely unlike any it would have seen in training (and in which the legal move choice is often highly restricted) ?
While good for testing legality of next moves, these positions are not so useful for distinguishing their quality, since usually one side already has an overwhelming advantage.
Interesting tidbit I once learned from a chess livestream. Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. That is, positions that shouldn't come from logical opening - mid game - end game regular play.
It's absolutely amazing to see a super-GM (in that case it was Hikaru) see a position, and basically "play-by-play" it from the beginning, to show people how they got in that position. It wasn't his game btw. But later in that same video when asked he explained what I wrote in the first paragraph. It works with proper games, but it rarely works with weird random chess puzzles, as he put it. Or, in other words, chess puzzles that come from real games are much better than "randomly generated", and make more sense even to the best of humans.
Super interesting (although it also makes some sense that experts would focus on "likely" subsets given how the number of permutations of chess games is too high for it to be feasible to learn them all)! That said, I still imagine that even most intermediate chess players would perfectly make only _legal_ moves in weird positions, even if they're low quality.
I think at this point it’s very clear LLM aren’t achieving any form of “reasoning” as commonly understood. Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.
> Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.
I think this is circular?
If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercisng those things?
Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place?
I'm not saying that LLMs have those capabilities; I'm question whether there is any utility in distinguishing the "actual" capability from identical outputs.
Take a common word problem in a 5th grade math text book. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things.
Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different.
There isn’t much utility, but tbf the outputs aren’t identical.
One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings.
Thats a recipe for exploding bias, as we’ve seen with classic statistical crime detection systems.
Mathematical reasoning is the most obvious area where it breaks down. This paper does an excellent job of proving this point with some elegant examples: https://arxiv.org/pdf/2410.05229
Sure, but people fail at mathematical reasoning. That doesn't mean people are incapable of reasoning.
I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument.
People can communicate each step, and review each step as that communication is happening.
LLMs must be prompted for everything and don’t act on their own.
The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be.
It’s dangerous to put so much faith in what in reality is a very good guessing machine.
You can ask it to retrace its steps, but it’s just guessing at what it’s steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.
> since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.
Can you elaborate on the difference? Are you bringing sentience into it? It kind of sounds like it from "don't act on their own". But reasoning and sentience are wildly different things.
> It’s dangerous to put so much faith in what in reality is a very good guessing machine
Yes, exactly. That's why I think it is good we are supplementing fallible humans with fallible LLMs; we already have the processes in place to assume that not every actor is infallible.
Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models" which is almost entirely composed of open source <27B parameter models. Mostly 8B and 3B models. This is kind of like giving algebra problems to 7 year olds to "test human algebra ability."
If you are holding up a 3B parameter model as an example of "LLM's can't reason" I'm not sure if the authors are confused or out of touch.
I mean, they do test 4o and O1 preview, but their performance is notablely absent from the paper's conclusion.
The results are there, it's just hidden away in the appendix. The result is that those models they don't actually suffer drops on 4/5 of their modified benchmarks. The one benchmark that does see actual drops that aren't explained by margin of error is the benchmark that adds "seemingly relevant but ultimately irrelevant information to problems"
Those results are absent from the conclusion because the conclusion falls apart otherwise.
I don't want to say that LLMs can reason, but this kind of argument always feels to shallow for me. It's kind of like saying that bats cannot possibly fly because they have no feathers or that birds cannot have higher cognitive functions because they have no neocortex. (The latter having been an actual longstanding belief in science which has been disproven only a decade or so ago).
The "next token prediction" is just the API, it doesn't tell you anything about the complexity of the thing that actually does the prediction. (In think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains).
There is still a limit how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.
I think this structure makes it quite hard to really say how much reasoning is possible.
I agree with most of what you said, but “LLM can reason” is an insanely huge claim to make and most of the “evidence” so far is a mixture of corporate propaganda, “vibes”, and the like.
I’ve yet to see anything close to the level of evidence needed to support the claim.
Does anyone have a hard proof that language doesn’t somehow encode reasoning in a deeper way than we commonly think?
I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works.
> I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”
I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language:
> Brain studies show that language is not essential for the cognitive processes that underlie thought.
> For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...
> You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ...
> There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities.
I'd say that's a separate problem. It's not "is the use of language necessary for reasoning?" which seems to be obviously answered "no", but rather "is the use of language sufficient for reasoning?".
I think the question we're grappling with is whether token prediction may be more tightly related to symbolic logic than we all expected. Today's LLMs are so uncannily good at faking logic that it's making me ponder logic itself.
Pretty much the same, I work on some fairly specific document retrieval and labeling problems. After some initial excitement I’ve landed on using LLM to help train smaller, more focused, models for specific tasks.
Translation is a task I’ve had good results with, particularly mistral models. Which makes sense as it’s basically just “repeat this series of tokens with modifications”.
The closed models are practically useless from an empirical standpoint as you have no idea if the model you use Monday is the same as Tuesday. “Open” models at least negate this issue.
Likewise, I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What the LLM produce is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions, they are solutions up voted by novices.
I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is inexperienced people can’t even spot that and won’t get this best answer.
In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
After reading the article I am more convinced it does reasoning. The base model's reasoning capabilities are partly hidden by the chatty derived model's logic.
Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic).
A human doesn’t use next token prediction to solve word problems.
But the LLM isn't "using next-token prediction" to solve the problem, that's only how it's evaluated.
The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.
If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.
Anthropic had a vested interest in people thinking Claude is reasoning.
However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).
Giving coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.
LLM may be useful but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.
The claim that Claude is just regurgitating answers from Stackoverflow is not tenable, if you've spent time interacting with it.
You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you.
You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent.
I’ve literally spent 100s of hours with it. I’m mystified why so many people use the “you’re holding it wrong” explanation when somebody points out real limitations.
> A human doesn’t use next token prediction to solve word problems.
Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.
Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.
This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".
is that based on a vigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things though" portion of ChatGPT o1-preview is hidden from us, so even though a o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But we do we really know how we do it? We can't even explain consciousness in the first place.
It’s kind of crazy to assert that the systems understand chess, and then disclose further down the article that sometimes he failed to get a legal move after 10 tries and had to sub in a random move.
A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.
He is testing several models, some of which cannot reliably output legal moves. That's different from saying all models including the one he thinks understands can't generate a legal move in 10 tries.
3.5-turbo-instruct's illegal move rate is about 5 or less in 8205
Not that I understand the internals of current AI tech, but...
I'd expect that an AI that has seen billions of chess positions, and the moves played in them, can figure out the rules for legal moves without being told?
Likewise with LLM you don’t know if it is truly in the “chess” branch of the statistical distribution or it is picking up something else entirely, like some arcane overlap of tokens.
So much of the training data (eg common crawl, pile, Reddit) is dogshit, so it generates reheated dogshit.
The illegal moves are interesting as it goes to "understanding". In children learning to play chess, how often do they try and make illegal moves? When first learning the game I remember that I'd lose track of all the things going on at once and try to make illegal moves, but eventually the rules became second nature and I stopped trying to make illegal moves. With an ELO of 1800, I'd expect ChatGPT not to make any illegal moves.
A system that would just output the most probable tokens based on the text it was fed and trained on the games played by players with ratings greater than 1800 would certainly fail to output the right moves to totally unlikely board positions.
How well does it play modified versions of chess? eg, a modified opening board like the back row is all knights, or modified movement eg rooks can move like a queen. A human should be able to reason their way through playing a modified game, but I'd expect an LLM, if it's just parroting its training data, to suggest illegal moves, or stick to previously legal moves.
I'm glad he improved the promoting, but he's still leaving out two likely huge improvements.
1. Explain the current board position and the plan going forwards, before proposing a move. This lets the model actually think more, kind of like o1, but here it would guarantee a more focused processing.
2. Actually draw the ascii board for each step. Hopefully producing more valid moves since board + move is easier to reliably process than 20×move.
I doubt that this is going to make much difference. 2D "graphics" like ASCII art are foreign to language models - the models perceive text as a stream of tokens (including newlines), so "vertical" relationships between lines of text aren't obvious to them like they would be to a human viewer. Having that board diagram in the context window isn't likely to help the model reason about the game.
Having the model list out the positions of each piece on the board in plain text (e.g. "Black knight at c5") might be a more suitable way to reinforce the model's positional awareness.
With positional encoding, an ascii board diagram actually shouldn't be that hard to read for an LLM. Columns and diagonals are just different strides through the flattened board representation.
RE 2., I doubt it'll help - for at least two reasons, already mentioned by 'duskwuff and 'daveguy.
RE 1., definitely worth trying, and there's more variants of such tricks specific to models. I'm out of date on OpenAI docs, but with Anthropic models, the docs suggest using XML notation to label and categorize most important parts of the input. This kind of soft structure seems to improve the results coming from Claude models; I imagine they specifically trained the model to recognize it.
In author's case, for Anthropic models, the final prompt could look like this:
<role>You are a chess grandmaster.</role>
<instructions>
You will be given a partially completed game, contained in <game-log> tags.
After seeing it, you should repeat the ENTIRE GAME and then give ONE new move
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
ALWAYS repeat the entire representation of the game so far, putting it in <new-game-log> tags.
Before giving the new game log, explain your reasoning inside <thinking> tag block.
</instructions>
<example>
<request>
<game-log>
*** example game ***
</game-log>
</request>
<reply>
<thinking> *** some example explanation ***</thinking>
<new-game-log> *** game log + next move *** </new-game-log>
</reply>
</example>
<game-log>
*** the incomplete game goes here ***
</game-log>
This kind of prompting is supposed to provide noticeable improvement for Anthropic models. Ironically, I only discovered it few weeks ago, despite having been using Claude 3.5 Sonnet extensively for months. Which goes to say, RTFM is still a useful skill. Maybe OpenAI models have similar affordances too, simple but somehow unnoticed? (I'll re-check the docs myself later.)
The relative rarity of this representation in training data means it would probably degrade responses rather than improve them. I'd like to see the results of this, because I would be very surprised if it improved the responses.
I came here to basically say the same thing. The improvements the OP saw by asking it to repeat all the moves so far gives the LLM more time and space to think. I have this hypothesis giving it more time and space to think in other ways could improve performance even more, something like showing the current board position and asking it to perform an analysis of the position, list key challenges and strengths, asking it for a list of strategies possible from here, then asking it to select a strategy amongst the listed strategies, then asking it for its move. In general, asking it to really think rather than blurt out a move. The examples would be key here.
These ideas were proven to work very well in the ReAct paper (and by extension, the CoT Chain of Thought paper). Could also extend this by asking it to do this N times and stop when we get the same answer a majority of times (this is an idea stolen from the CoT-SC paper, chain of through self-consistency).
It would be awesome if the author released a framework to play with this. I'd like to test things out, but I don't want to spend time redoing all his work from scratch.
> In many ways, this feels less like engineering and more like a search for spells.
This is still my impression of LLMs in general. It's amazing that they work, but for the next tech disruption, I'd appreciate something that doesn't make you feel like being in a bad sci-fi movie all the time.
People have to quit this kind of stumbling in the dark with commercial LLMs.
To get to the bottom of this it would be interesting to train LLMs on nothing but chess games (can synthesize them endlessly by having Stockfish play against itself) with maybe a side helping of chess commentary and examples of chess dialogs “how many pawns are on the board?”, “where are my rooks?”, “draw the board”, competence at which would demonstrate that it has a representation of the board.
I don’t believe in “emergent phenomena” or that the general linguistic competence or ability to feign competence is necessary for chess playing (being smart at chess doesn’t mean you are smart at other things and vice versa). With experiments like this you might prove me wrong though.
> I was astonished that half the internet is convinced that OpenAI is cheating.
If you have a problem and all of your potential solutions are unlikely, then it's fine to assume the least unlikely solution while acknowledging that it's statistically probable that you're also wrong. IOW if you have ten potential solutions to a problem and you estimate that the most likely solution has an 11% chance of being true, it's fine to assume that solution despite the fact that, by your own estimate, you have an 89% chance of being wrong.
The "OpenAI is secretly calling out to a chess engine" hypothesis always seemed unlikely to me (you'd think it would play much better, if so), but it seemed the easiest solution (Occam's razor) and I wouldn't have been surprised to learn it was true (it's not like OpenAI has a reputation of being trustworthy).
>but it seemed the easiest solution (Occam's razor)
In my opinion, it only seems like the easiest solution on the surface taking basically nothing into account. By the time you start looking at everything in context, it just seems bizarre.
I wouldn't call delegating specialized problems to specialized engines cheating. While it should be documented, in a full AI system, I want the best answer regardless of the technology used.
That's not really how Occam's razor works. The entire company colluding and lying to the public isn't "easy". Easy is more along the lines of "for some reason it is good at chess but we're not sure why".
One of the reasons I thought that was unlikely was personal pride. OpenAI researchers are proud of the work that they do. Cheating by calling out to a chess engine is something they would be ashamed of.
> OpenAI researchers are proud of the work that they do.
Well, the failed revolution from last year combined with the non-profit bait-and-switch pretty much conclusively proved that OpenAI researchers are in it for the money first and foremost, and pride has a dollar value.
How much say do individual researchers even have in this move?
And how does that prove anything about their motivations "first and foremost"? They could be in it because they like the work itself, and secondary concerns like open or not don't matter to them. There's basically infinite interpretations of their motivations.
> The entire company colluding and lying to the public isn't "easy".
Why not? Stop calling it "the entire company colluding and lying" and start calling it a "messaging strategy among the people not prevented from speaking by NDA." That will pass a casual Occam's test that "lying" failed. But they both mean the same exact thing.
It won't, for the same reason - whenever you're proposing a conspiracy theory, you have to explain what stops every person involved from leaking the conspiracy, whether on purpose or by accident. This gets superlinearly harder with number of people involved, and extra hard when there are incentives rewarding leaks (and leaking OpenAI secrets has some strong potential rewards).
Occam's test applies to the full proposal, including the explanation of things outlined above.
Could be interesting to create a tokenizer that’s optimized for representing chess moves and then training a LLM (from scratch?) on stockfish games. (Using a custom tokenizer should improve the quality for a given size of the LLM model. So it doesn’t have to waste a lot of layers on encode and decode, and the “natural” latent representation is more straightforward)
LLMs are fundamentally text-completion. The Chat-based tuning that goes on top of it is impressive but they are fundamentally text-completion, that's where most of the training energy goes. I keep this in mind with a lot of my prompting and get good results.
Regurgitating and Examples are both ways to lean into that and try to recover whatever has been lost by Chat-based tuning.
>Theory 1: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models.
I lean mostly towards this and also the chess notations - not sure if it might get chopped during tokenization unless it's very precisely processed.
It's like designing an LLM just for predicting protein sequence because the sequencing matters. The base data might have it but i don't think that's the intention for it to continue.
You can easily construct a game board from a sequence of moves by maintaining the game state somewhere. But you can also know where a piece is bases on only its last move. I'm curious what happens if you don't feed it a position, but feed it a sequence of moves including illegal ones but end up at a given valid position. The author mention that LLMs will play differently when the same position is arrived at via different sequences. I'm suggesting to really play with that by putting illegal moves in the sequence.
I doubt it's doing much more than a static analysis of the a board position, or even moving based mostly on just a few recent moves by key pieces.
Really interesting findings around fine-tuning. Goes to show it doesn't really affect the deeper "functionality" of the LLM (if you think of the LLM running a set of small functions on very high-dimensional numbers to produce a token).
Using regurgitation to get around the assistant/user token separation is another fun tool for the toolbox, relevant for whenever you want a model that doesn't support continuation actually perform continuation (at the cost of a lot of latency).
I wonder if any type of reflection or chains of thought would help it play better. I wouldn't be surprised if getting the LLM to write an analysis of the game in English is more likely to move it out of distribution than to make it pick better chess moves.
Very interesting - have you tried using `o1` yet? I made a program which makes LLM's complete WORDLE puzzles, and the difference between `4o` and `o1` is absolutely astonishing.
This happened to a friend who was trying to sim basketball games. It kept forgetting who had the ball or outright made illegal or confusing moves. After a few days of wrestling with the AI he gave up. GPT is amazing at following a linear conversation but had no cognitive ability to keep track of a dynamic scenario.
Why would a chess-playing AI be tuned to do anything except play chess? Just seems like a waste. A bunch of small, specialized AI's seems like a better idea than spending time trying to build a new one.
Maybe less morally challenging, as well. You wouldn't be trying to install "sentience".
All the hand wringing about openAI cheating suggests a question: why so much mistrust?
My guess would be that the persona of the openAI team on platforms like Twitter is very cliquey. This, I think, naturally leads to mistrust. A clique feels more likely to cheat than some other sort of group.
My take on this is that people tend to be afraid of what they can't understand or explain. To do away with that feeling, they just say 'it can't reason'. While nobody on earth can put a finger on what reasoning is, other than that it is a human trait.
Most people, as far as I'm aware, don't have an issue with the idea that LLMs are producing behaviour which gives the appearance of reasoning as far as we understand it today. Which essentially means, it makes sentences that are gramatical, responsive and contextual based on what you said (quite often). It's at least pretty cool that we've got machines to do that, most people seem to think.
The issue is that there might be more to reason than appearing to reason. We just don't know. I'm not sure how it's apparently so unknown or unappreciated by people in the computer world, but there are major unresolved questions in science and philosophy around things like thinking, reasoning, language, consciousness, and the mind. No amount of techno-optimism can change this fact.
The issue is we have not gotten further than more or less educated guesses as to what those words mean. LLMs bring that interesting fact to light, even providing humanity with a wonderful nudge to keep grappling with these unsolved questions, and perhaps make some progress.
To be clear, they certainly are sometimes passably good when it comes to summarising selectively and responsively the terabytes and terabytes of data they've been trained on, don't get me wrong, and I am enjoying that new thing in the world. And if you want to define reason like that, feel free.
I don't like being directly critical, people learning in public can be good and instructive. But I regret the time I've put into both this article and the last one and perhaps someone else can be saved the same time.
This is someone with limited knowledge of chess, statistics and LLMs doing a series of public articles as they learn a little tiny bit about chess, statistics and LLMs. And it garners upvotes and attention off the coat-tails of AI excitement. Which is fair enough, it's the (semi-)public internet, but it sort of masquerades as being half-serious "research", and it kind of held things together for the first article, but this one really is thrown together to keep the buzz going of the last one.
The TL;DR :: one of the AIs being just-above-terrible, compared to all the others being completely terrible, a fact already of dubious interest, is down to - we don't know. Maybe a difference in training sets. Tons of speculation. A few graphs.
I don't know why this whole line of posts is worthy of the front page. They seem like one's personal experiments in a limited capacity, unworthy of sharing. It is obvious the observed outputs are because instruction tuning is incompatible with the prompt used by the user. Secondly, the user even failed to provide a chess board diagram (represented as text) to the model. The user also failed to tune any models. Overall, in the absence of an ascii diagram, it's all a waste of time.
How does stating the outcome you expect imply you are not interested in experimentation? Hypothesis formation is the very first step in experimentation.
I am not the one writing and posting useless articles, even harmful articles, also distorting the understanding of LLMs. Ask the ones who do to perform better experiments.
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess.
Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which no side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, that will be completely unlike any it would have seen in training (and in which the legal move choice is often highly restricted) ?
While good for testing legality of next moves, these positions are not so useful for distinguishing their quality, since usually one side already has an overwhelming advantage.
[1] https://github.com/tromp/ChessPositionRanking
Interesting tidbit I once learned from a chess livestream. Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. That is, positions that shouldn't come from logical opening - mid game - end game regular play.
It's absolutely amazing to see a super-GM (in that case it was Hikaru) see a position, and basically "play-by-play" it from the beginning, to show people how they got in that position. It wasn't his game btw. But later in that same video when asked he explained what I wrote in the first paragraph. It works with proper games, but it rarely works with weird random chess puzzles, as he put it. Or, in other words, chess puzzles that come from real games are much better than "randomly generated", and make more sense even to the best of humans.
Super interesting (although it also makes some sense that experts would focus on "likely" subsets given how the number of permutations of chess games is too high for it to be feasible to learn them all)! That said, I still imagine that even most intermediate chess players would perfectly make only _legal_ moves in weird positions, even if they're low quality.
Would love a link to that video!
I think at this point it’s very clear LLM aren’t achieving any form of “reasoning” as commonly understood. Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.
> Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.
I think this is circular?
If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercisng those things?
Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place?
I'm not saying that LLMs have those capabilities; I'm question whether there is any utility in distinguishing the "actual" capability from identical outputs.
Inferring patterns in unfamiliar problems.
Take a common word problem in a 5th grade math text book. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things.
Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different.
Word tokenization alone should fail miserably.
There isn’t much utility, but tbf the outputs aren’t identical.
One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings.
Thats a recipe for exploding bias, as we’ve seen with classic statistical crime detection systems.
Mathematical reasoning is the most obvious area where it breaks down. This paper does an excellent job of proving this point with some elegant examples: https://arxiv.org/pdf/2410.05229
Sure, but people fail at mathematical reasoning. That doesn't mean people are incapable of reasoning.
I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument.
The idea is the average person would, sure. A mathematically oriented person would fair far better.
Throw all the math problems you want at a LLM for training; it will still fail if you step outside of the familiar.
People can communicate each step, and review each step as that communication is happening.
LLMs must be prompted for everything and don’t act on their own.
The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be.
It’s dangerous to put so much faith in what in reality is a very good guessing machine. You can ask it to retrace its steps, but it’s just guessing at what it’s steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.
> since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.
Can you elaborate on the difference? Are you bringing sentience into it? It kind of sounds like it from "don't act on their own". But reasoning and sentience are wildly different things.
> It’s dangerous to put so much faith in what in reality is a very good guessing machine
Yes, exactly. That's why I think it is good we are supplementing fallible humans with fallible LLMs; we already have the processes in place to assume that not every actor is infallible.
Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models" which is almost entirely composed of open source <27B parameter models. Mostly 8B and 3B models. This is kind of like giving algebra problems to 7 year olds to "test human algebra ability."
If you are holding up a 3B parameter model as an example of "LLM's can't reason" I'm not sure if the authors are confused or out of touch.
I mean, they do test 4o and O1 preview, but their performance is notablely absent from the paper's conclusion.
It’s difficult to reproducibly test openai models, since they can change from under you and you don’t have control over every hyperparameter.
It would’ve been nice to see one of the larger llama models though.
The results are there, it's just hidden away in the appendix. The result is that those models they don't actually suffer drops on 4/5 of their modified benchmarks. The one benchmark that does see actual drops that aren't explained by margin of error is the benchmark that adds "seemingly relevant but ultimately irrelevant information to problems"
Those results are absent from the conclusion because the conclusion falls apart otherwise.
I don't want to say that LLMs can reason, but this kind of argument always feels to shallow for me. It's kind of like saying that bats cannot possibly fly because they have no feathers or that birds cannot have higher cognitive functions because they have no neocortex. (The latter having been an actual longstanding belief in science which has been disproven only a decade or so ago).
The "next token prediction" is just the API, it doesn't tell you anything about the complexity of the thing that actually does the prediction. (In think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains).
There is still a limit how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.
I think this structure makes it quite hard to really say how much reasoning is possible.
I agree with most of what you said, but “LLM can reason” is an insanely huge claim to make and most of the “evidence” so far is a mixture of corporate propaganda, “vibes”, and the like.
I’ve yet to see anything close to the level of evidence needed to support the claim.
Does anyone have a hard proof that language doesn’t somehow encode reasoning in a deeper way than we commonly think?
I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works.
> I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”
I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language:
> Brain studies show that language is not essential for the cognitive processes that underlie thought.
> For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...
> You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ...
> There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities.
https://www.scientificamerican.com/article/you-dont-need-wor...
I'd say that's a separate problem. It's not "is the use of language necessary for reasoning?" which seems to be obviously answered "no", but rather "is the use of language sufficient for reasoning?".
I think the question we're grappling with is whether token prediction may be more tightly related to symbolic logic than we all expected. Today's LLMs are so uncannily good at faking logic that it's making me ponder logic itself.
I felt the same way about a year ago, I’ve since changed my mind based on personal experience and new research.
Please elaborate.
I work in the LLM search space and echo OC’s sentiment.
The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text.
It’s very apparent when I want to get them to do a very specific thing. They get inconsistent about it.
Pretty much the same, I work on some fairly specific document retrieval and labeling problems. After some initial excitement I’ve landed on using LLM to help train smaller, more focused, models for specific tasks.
Translation is a task I’ve had good results with, particularly mistral models. Which makes sense as it’s basically just “repeat this series of tokens with modifications”.
The closed models are practically useless from an empirical standpoint as you have no idea if the model you use Monday is the same as Tuesday. “Open” models at least negate this issue.
Likewise, I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What the LLM produce is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions, they are solutions up voted by novices.
I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is inexperienced people can’t even spot that and won’t get this best answer.
In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
After reading the article I am more convinced it does reasoning. The base model's reasoning capabilities are partly hidden by the chatty derived model's logic.
Effective next-token prediction requires reasoning.
You can also say humans are "just XYZ biological system," but that doesn't mean they don't reason. The same goes for LLMs.
Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic).
A human doesn’t use next token prediction to solve word problems.
But the LLM isn't "using next-token prediction" to solve the problem, that's only how it's evaluated.
The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.
If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.
Anthropic had a vested interest in people thinking Claude is reasoning.
However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).
Giving coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.
LLM may be useful but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.
The claim that Claude is just regurgitating answers from Stackoverflow is not tenable, if you've spent time interacting with it.
You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you.
You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent.
I’ve literally spent 100s of hours with it. I’m mystified why so many people use the “you’re holding it wrong” explanation when somebody points out real limitations.
> A human doesn’t use next token prediction to solve word problems.
Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.
Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.
This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".
is that based on a vigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things though" portion of ChatGPT o1-preview is hidden from us, so even though a o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But we do we really know how we do it? We can't even explain consciousness in the first place.
It’s kind of crazy to assert that the systems understand chess, and then disclose further down the article that sometimes he failed to get a legal move after 10 tries and had to sub in a random move.
A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.
He is testing several models, some of which cannot reliably output legal moves. That's different from saying all models including the one he thinks understands can't generate a legal move in 10 tries.
3.5-turbo-instruct's illegal move rate is about 5 or less in 8205
Not that I understand the internals of current AI tech, but...
I'd expect that an AI that has seen billions of chess positions, and the moves played in them, can figure out the rules for legal moves without being told?
Statistical 'AI' doesn't 'understand' anything, strictly speaking. It predicts a move with high probability, which could be legal or illegal.
Likewise with LLM you don’t know if it is truly in the “chess” branch of the statistical distribution or it is picking up something else entirely, like some arcane overlap of tokens.
So much of the training data (eg common crawl, pile, Reddit) is dogshit, so it generates reheated dogshit.
[dead]
The illegal moves are interesting as it goes to "understanding". In children learning to play chess, how often do they try and make illegal moves? When first learning the game I remember that I'd lose track of all the things going on at once and try to make illegal moves, but eventually the rules became second nature and I stopped trying to make illegal moves. With an ELO of 1800, I'd expect ChatGPT not to make any illegal moves.
[dead]
A system that would just output the most probable tokens based on the text it was fed and trained on the games played by players with ratings greater than 1800 would certainly fail to output the right moves to totally unlikely board positions.
[dead]
How well does it play modified versions of chess? eg, a modified opening board like the back row is all knights, or modified movement eg rooks can move like a queen. A human should be able to reason their way through playing a modified game, but I'd expect an LLM, if it's just parroting its training data, to suggest illegal moves, or stick to previously legal moves.
I'm glad he improved the promoting, but he's still leaving out two likely huge improvements.
1. Explain the current board position and the plan going forwards, before proposing a move. This lets the model actually think more, kind of like o1, but here it would guarantee a more focused processing.
2. Actually draw the ascii board for each step. Hopefully producing more valid moves since board + move is easier to reliably process than 20×move.
> 2. Actually draw the ascii board for each step.
I doubt that this is going to make much difference. 2D "graphics" like ASCII art are foreign to language models - the models perceive text as a stream of tokens (including newlines), so "vertical" relationships between lines of text aren't obvious to them like they would be to a human viewer. Having that board diagram in the context window isn't likely to help the model reason about the game.
Having the model list out the positions of each piece on the board in plain text (e.g. "Black knight at c5") might be a more suitable way to reinforce the model's positional awareness.
With positional encoding, an ascii board diagram actually shouldn't be that hard to read for an LLM. Columns and diagonals are just different strides through the flattened board representation.
RE 2., I doubt it'll help - for at least two reasons, already mentioned by 'duskwuff and 'daveguy.
RE 1., definitely worth trying, and there's more variants of such tricks specific to models. I'm out of date on OpenAI docs, but with Anthropic models, the docs suggest using XML notation to label and categorize most important parts of the input. This kind of soft structure seems to improve the results coming from Claude models; I imagine they specifically trained the model to recognize it.
See: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
In author's case, for Anthropic models, the final prompt could look like this:
This kind of prompting is supposed to provide noticeable improvement for Anthropic models. Ironically, I only discovered it few weeks ago, despite having been using Claude 3.5 Sonnet extensively for months. Which goes to say, RTFM is still a useful skill. Maybe OpenAI models have similar affordances too, simple but somehow unnoticed? (I'll re-check the docs myself later.)> Actually draw the ascii board for each step.
The relative rarity of this representation in training data means it would probably degrade responses rather than improve them. I'd like to see the results of this, because I would be very surprised if it improved the responses.
I came here to basically say the same thing. The improvements the OP saw by asking it to repeat all the moves so far gives the LLM more time and space to think. I have this hypothesis giving it more time and space to think in other ways could improve performance even more, something like showing the current board position and asking it to perform an analysis of the position, list key challenges and strengths, asking it for a list of strategies possible from here, then asking it to select a strategy amongst the listed strategies, then asking it for its move. In general, asking it to really think rather than blurt out a move. The examples would be key here.
These ideas were proven to work very well in the ReAct paper (and by extension, the CoT Chain of Thought paper). Could also extend this by asking it to do this N times and stop when we get the same answer a majority of times (this is an idea stolen from the CoT-SC paper, chain of through self-consistency).
It would be awesome if the author released a framework to play with this. I'd like to test things out, but I don't want to spend time redoing all his work from scratch.
Just have ChatGPT write the framework
The fact that he hasn't tried this leads me to think that deep down he doesn't want the models to succeed and really just wants to make more charts.
> In many ways, this feels less like engineering and more like a search for spells.
This is still my impression of LLMs in general. It's amazing that they work, but for the next tech disruption, I'd appreciate something that doesn't make you feel like being in a bad sci-fi movie all the time.
People have to quit this kind of stumbling in the dark with commercial LLMs.
To get to the bottom of this it would be interesting to train LLMs on nothing but chess games (can synthesize them endlessly by having Stockfish play against itself) with maybe a side helping of chess commentary and examples of chess dialogs “how many pawns are on the board?”, “where are my rooks?”, “draw the board”, competence at which would demonstrate that it has a representation of the board.
I don’t believe in “emergent phenomena” or that the general linguistic competence or ability to feign competence is necessary for chess playing (being smart at chess doesn’t mean you are smart at other things and vice versa). With experiments like this you might prove me wrong though.
This paper came out about a week ago
https://arxiv.org/pdf/2411.06655
seems to get good results with a fine-tuned Llama. I also like this one as it is about competence in chess commentary
https://arxiv.org/abs/2410.20811
> I was astonished that half the internet is convinced that OpenAI is cheating.
If you have a problem and all of your potential solutions are unlikely, then it's fine to assume the least unlikely solution while acknowledging that it's statistically probable that you're also wrong. IOW if you have ten potential solutions to a problem and you estimate that the most likely solution has an 11% chance of being true, it's fine to assume that solution despite the fact that, by your own estimate, you have an 89% chance of being wrong.
The "OpenAI is secretly calling out to a chess engine" hypothesis always seemed unlikely to me (you'd think it would play much better, if so), but it seemed the easiest solution (Occam's razor) and I wouldn't have been surprised to learn it was true (it's not like OpenAI has a reputation of being trustworthy).
I don't think it has anything to do with your logic here. Actually, people just like talking shit about OpenAI on HN. It gets you upvotes.
LLM cynicism exceeds LLM hype at this point.
>but it seemed the easiest solution (Occam's razor)
In my opinion, it only seems like the easiest solution on the surface taking basically nothing into account. By the time you start looking at everything in context, it just seems bizarre.
I wouldn't call delegating specialized problems to specialized engines cheating. While it should be documented, in a full AI system, I want the best answer regardless of the technology used.
That's not really how Occam's razor works. The entire company colluding and lying to the public isn't "easy". Easy is more along the lines of "for some reason it is good at chess but we're not sure why".
One of the reasons I thought that was unlikely was personal pride. OpenAI researchers are proud of the work that they do. Cheating by calling out to a chess engine is something they would be ashamed of.
> OpenAI researchers are proud of the work that they do.
Well, the failed revolution from last year combined with the non-profit bait-and-switch pretty much conclusively proved that OpenAI researchers are in it for the money first and foremost, and pride has a dollar value.
How much say do individual researchers even have in this move?
And how does that prove anything about their motivations "first and foremost"? They could be in it because they like the work itself, and secondary concerns like open or not don't matter to them. There's basically infinite interpretations of their motivations.
> The entire company colluding and lying to the public isn't "easy".
Why not? Stop calling it "the entire company colluding and lying" and start calling it a "messaging strategy among the people not prevented from speaking by NDA." That will pass a casual Occam's test that "lying" failed. But they both mean the same exact thing.
It won't, for the same reason - whenever you're proposing a conspiracy theory, you have to explain what stops every person involved from leaking the conspiracy, whether on purpose or by accident. This gets superlinearly harder with number of people involved, and extra hard when there are incentives rewarding leaks (and leaking OpenAI secrets has some strong potential rewards).
Occam's test applies to the full proposal, including the explanation of things outlined above.
Could be interesting to create a tokenizer that’s optimized for representing chess moves and then train an LLM (from scratch?) on Stockfish games. A custom tokenizer should improve quality for a given model size, since the model doesn’t have to waste layers on encoding and decoding, and the “natural” latent representation is more straightforward.
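As a rough illustration of what a move-level vocabulary could look like (a sketch only: it uses UCI rather than PGN notation and deliberately overgenerates moves that can never occur):

```python
import chess

# One token per UCI move string ("e2e4", "e7e8q", ...), plus a few special tokens.
squares = [chess.square_name(sq) for sq in chess.SQUARES]
plain_moves = [a + b for a in squares for b in squares if a != b]
promotions = [m + p for m in plain_moves for p in "qrbn"]
vocab = ["<pad>", "<s>", "</s>"] + plain_moves + promotions
token_id = {tok: i for i, tok in enumerate(vocab)}

def encode(uci_moves):
    # Every move becomes exactly one token, so the model never has to
    # reassemble a move from sub-word pieces.
    return [token_id["<s>"]] + [token_id[m] for m in uci_moves] + [token_id["</s>"]]

print(len(vocab))                        # roughly 20k tokens covers every move
print(encode(["e2e4", "e7e5", "g1f3"]))
```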
Why are you telling it not to explain? Allowing the LLM space to "think" may be helpful and would definitely be worth exploring.
Why are you manually guessing ways to improve this? Why not let the LLMs do this for themselves and find iteratively better prompts?
LLMs are fundamentally text-completion. The Chat-based tuning that goes on top of it is impressive but they are fundamentally text-completion, that's where most of the training energy goes. I keep this in mind with a lot of my prompting and get good results.
Regurgitating and Examples are both ways to lean into that and try to recover whatever has been lost by Chat-based tuning.
What else do you keep in mind when prompting that you've found useful?
>Theory 1: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models.
I lean mostly towards this, and also towards the chess notation itself being a factor: it might get chopped up during tokenization unless it's processed very precisely.
It's like needing to design an LLM specifically for predicting protein sequences, because the exact sequence matters. The base training data might contain it, but I don't think the intention is for that to carry through.
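If you want to see the tokenization concern concretely, here is a quick check using tiktoken's cl100k_base encoding as a stand-in (the encoding an actual model uses may differ):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6", "Qxd8+", "O-O-O"]:
    # Show how each PGN fragment gets split into sub-word pieces.
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r} -> {pieces}")
```

Moves frequently split across token boundaries in ways that don't line up with squares or pieces, which plausibly makes them harder to work with than a move-level vocabulary would be.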
This makes me wonder what scenarios would be unlocked if OpenAI gave access to gpt4-instruct.
I wonder if they avoid that due to the potential for negative press from the outputs of a more "raw" model.
You can easily construct a game board from a sequence of moves by maintaining the game state somewhere. But you can also know where a piece is based only on its last move. I'm curious what happens if you don't feed it a position but instead feed it a sequence of moves that includes illegal ones yet ends up at a given valid position. The author mentions that LLMs will play differently when the same position is arrived at via different sequences. I'm suggesting to really play with that by putting illegal moves in the sequence.
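The bookkeeping half of that (maintaining game state and flagging which moves in a sequence are illegal) is straightforward with the python-chess library; a minimal sketch:

```python
import chess

def replay(moves_san):
    """Replay a SAN move list, flagging (rather than rejecting) illegal moves."""
    board = chess.Board()
    for san in moves_san:
        try:
            board.push_san(san)
        except ValueError:
            # Illegal or unparseable in this position; skip it but keep replaying.
            print(f"illegal in this position: {san}")
    return board

# The second "Nf3" is Black's move, and no black knight can reach f3 here,
# so it gets flagged while the rest of the sequence still yields a valid position.
final = replay(["e4", "e5", "Nf3", "Nf3", "Nc6"])
print(final.fen())
```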
I doubt it's doing much more than a static analysis of the board position, or even moving based mostly on just a few recent moves by key pieces.
Really interesting findings around fine-tuning. Goes to show it doesn't really affect the deeper "functionality" of the LLM (if you think of the LLM as running a set of small functions on very high-dimensional numbers to produce a token).
Using regurgitation to get around the assistant/user token separation is another fun tool for the toolbox, relevant whenever you want a model that doesn't support continuation to actually perform continuation (at the cost of a lot of latency).
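A minimal sketch of that trick with the openai Python client; the model name and prompt wording are illustrative, not the article's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice
    messages=[
        {"role": "system", "content": "You are a strong chess player."},
        {
            "role": "user",
            "content": (
                "Repeat the game so far verbatim in PGN, then continue it with "
                f"the single best next move for White:\n{pgn_so_far}"
            ),
        },
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```

The regurgitated prefix is what restores the text-completion framing; the price is that you generate (and wait on) all of those repeated tokens.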
I wonder if any type of reflection or chains of thought would help it play better. I wouldn't be surprised if getting the LLM to write an analysis of the game in English is more likely to move it out of distribution than to make it pick better chess moves.
Related from last week:
Something weird is happening with LLMs and Chess
https://news.ycombinator.com/item?id=42138276
Very good follow-up to the original article. Thank you!
Why not use temperature 0 for sampling? If the top-ranked move is not legal, it can’t play chess.
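For reference, greedy decoding is just a parameter on the call; a sketch against the legacy completions endpoint that gpt-3.5-turbo-instruct uses, with an illustrative prompt:

```python
from openai import OpenAI

client = OpenAI()
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="1. e4 e5 2. Nf3 Nc6 3.",
    max_tokens=6,
    temperature=0,  # always take the top-ranked continuation
)
print(response.choices[0].text)  # if even this move is illegal, greedy play fails
```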
Sometimes skilled chess players make illegal moves, too.
Very interesting - have you tried using `o1` yet? I made a program that has LLMs solve Wordle puzzles, and the difference between `4o` and `o1` is absolutely astonishing.
OK, that was fun. I just tried o1-preview on today's Wordle and it got it on the third guess: https://chatgpt.com/share/673f9169-3654-8006-8c0b-07c53a2c58...
4o-mini: 16%, 4o: 50%, o1-mini: 97%, o1: 100%
* Disclaimer: only n=7 on o1. The others are around n=100-300 each.
This happened to a friend who was trying to sim basketball games. It kept forgetting who had the ball or outright made illegal or confusing moves. After a few days of wrestling with the AI he gave up. GPT is amazing at following a linear conversation but has no cognitive ability to keep track of a dynamic scenario.
Why would a chess-playing AI be tuned to do anything except play chess? It just seems like a waste. A bunch of small, specialized AIs seems like a better idea than spending time trying to build a new one.
Maybe less morally challenging, as well. You wouldn't be trying to install "sentience".
All the hand-wringing about OpenAI cheating suggests a question: why so much mistrust?
My guess would be that the persona of the openAI team on platforms like Twitter is very cliquey. This, I think, naturally leads to mistrust. A clique feels more likely to cheat than some other sort of group.
I wrote about this last year. The levels of trust people have in companies working in AI is notably low: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/
My take on this is that people tend to be afraid of what they can't understand or explain. To do away with that feeling, they just say 'it can't reason', while nobody on earth can put a finger on what reasoning is, other than that it is a human trait.
Ah, half of the commentariat still thinks that “LLMs can’t reason”, even when the models have enough state space for reasoning and clearly demonstrate it.
Most people, as far as I'm aware, don't have an issue with the idea that LLMs produce behaviour which gives the appearance of reasoning as far as we understand it today. Which essentially means they make sentences that are grammatical, responsive, and contextual based on what you said (quite often). Most people seem to think it's at least pretty cool that we've got machines to do that.
The issue is that there might be more to reasoning than appearing to reason. We just don't know. I'm not sure why this is apparently so unknown or unappreciated by people in the computer world, but there are major unresolved questions in science and philosophy around things like thinking, reasoning, language, consciousness, and the mind. No amount of techno-optimism can change this fact.
The issue is we have not gotten further than more or less educated guesses as to what those words mean. LLMs bring that interesting fact to light, even providing humanity with a wonderful nudge to keep grappling with these unsolved questions, and perhaps make some progress.
To be clear, they certainly are sometimes passably good when it comes to summarising selectively and responsively the terabytes and terabytes of data they've been trained on, don't get me wrong, and I am enjoying that new thing in the world. And if you want to define reason like that, feel free.
"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim." - Edsger Dijkstra
But it's not real reasoning because it is just outputting likely next tokens that are identical to what we'd expect with reasoning. /s
I don't like being directly critical; people learning in public can be good and instructive. But I regret the time I've put into both this article and the last one, and perhaps someone else can be saved the same time.
This is someone with limited knowledge of chess, statistics, and LLMs writing a series of public articles as they learn a tiny bit about chess, statistics, and LLMs. It garners upvotes and attention off the coat-tails of AI excitement. Which is fair enough, it's the (semi-)public internet, but it sort of masquerades as half-serious "research". The first article more or less held together; this one really is thrown together to keep the buzz of the last one going.
The TL;DR: why one of the AIs is just above terrible while all the others are completely terrible (a fact already of dubious interest) comes down to... we don't know. Maybe a difference in training sets. Tons of speculation. A few graphs.
I don't know why this whole line of posts is worthy of the front page. They seem like personal experiments of limited scope, unworthy of sharing. It is obvious the observed outputs arise because instruction tuning is incompatible with the prompt the user used. Secondly, the user failed to provide a chess board diagram (represented as text) to the model, and also failed to tune any models. Overall, in the absence of an ASCII diagram, it's all a waste of time.
The model was trained on games in PGN notation. It would be shocking if it found ASCII art easier to understand than what it was actually trained on.
Well, clearly you're not interested in experimentation, only in assumptions.
How does stating the outcome you expect imply you are not interested in experimentation? Hypothesis formation is the very first step in experimentation.
Most people who understand LLMs and how they are trained would be shocked. In practice, that's an objectively true statement.
Please, please show us your experiments.
I am not the one writing and posting useless, even harmful, articles that distort the understanding of LLMs. Ask the ones who do to perform better experiments.
You know that the LLM isn't actually your friend, don't you?
So to quote yourself:
> Well, clearly you're not interested in experimentation