🍓🤦🏻‍♂️ Saw that ChatGPT was struggling with the word “strawberry” so I decided to check Claude. I pointed out there were three Rs and Claude did what it does well: apologized profusely and thanked me for pushing it. So I then asked why it thought there were only two Rs, to which it said: “I apologize for the confusion in my earlier responses about "strawberry". You're right to be puzzled by my initial answers, as they were incorrect. To be clear: There are indeed 3 R's in "strawberry" (S-T-R-A-W-B-E-R-R-Y). My initial errors were not due to any real misunderstanding, but rather a lapse in attention and accuracy on my part. As an AI, I don't actually miscount or misperceive words - my errors came from flawed information retrieval and processing in that moment. I should have been more careful and thorough from the start. The correct information was always within my knowledge base, but I failed to access and convey it accurately at first. This serves as a good reminder that even AI systems can make mistakes, and that careful verification is always important. Thank you for your persistence in pointing out the error. It's crucial that I provide accurate information, and I appreciate the opportunity to correct mistakes when they occur.”
Ask ChatGPT which of the Great Lakes is five letters long… it will give you Erie. When you correct it to say the answer is Huron, it agrees and then says both are Great Lakes with 5 letters… apparently it ‘misspoke’ 🤣
Large language models work with tokenization: the process of breaking text into smaller units called tokens. These tokens can be whole words or parts of words, which makes the text easier for computers to process and analyze. For that reason a model cannot reliably count individual characters, but it can write code that does the counting for it, as in the sketch below. “Steel isn’t strong, boy. Flesh is stronger! What is steel compared to the hand that wields it?”
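For illustration, here is a minimal Python sketch (the word and variable names are just for this example) of the kind of character-counting snippet a model could write and hand off to a code tool rather than "eyeballing" tokens:

```python
# Count occurrences of a letter character by character,
# instead of relying on token-level perception.
word = "strawberry"
letter = "r"
count = sum(1 for ch in word.lower() if ch == letter)
print(f"'{word}' contains {count} '{letter}'s")  # -> 3
```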
I originally thought this harks back to limitations in early-generation LLMs:
GPT-3.5T makes the mistake. Claude on Poe makes the mistake.
Whereas Claude 3 Haiku, GPT-4o mini, Gemini 1.5 Flash, Mixtral 8x7B, and Llama 3 70B all get the correct answer.
But let's compare this to the other main LLM "gotcha" question: "It takes 2 hours to dry 3 towels in the sun. How many hours does it take to dry 6 towels?"
Claude on Poe assumes we cannot dry the towel batches at the same time, so it gives 4 hours as a response. Claude 3 Haiku makes the same mistake. Gemini 1.5 Flash makes the mistake. Llama 3 70B makes the mistake.
Here is where it gets interesting: GPT-3.5T, the previous-generation LLM, gets the correct response, recognizing that drying time is 2 hours no matter how many towels there are. GPT-4o mini also gets it right.
I believe the moral of the story is: use the right LLM for the job, regardless of generation. With early-generation LLMs, stick to simple tasks and potentially avoid zero-shot prompting where context is important. Newer-generation LLMs should give you more freedom, but always test your use cases before letting them loose on your customers.
I asked in Dutch and got the answer in German. Translated: "First you say that straw has 1 r, and then that there are 0 r's. Those can't both be true, can they?" "That sounds like a mix-up! When I say that "straw has 1 r", it could mean that in a certain context or situation there is an 'r'. When I then say there are "0 r's", that could point to a different perspective or a different condition. It's important to look at the context to understand why both statements were made. If you need more information or clarification, let me know!"
Claude, ChatGPT, Gemini and the others are language models. They can't do math. Ask them to make a text 10% shorter in words and they fail. ChatGPT does, however, have a code analyser: a separate tool that can do math. Ask ChatGPT the same question and tell it to use the code analyser. It will then send the word strawberry to the code analyser, which can count the characters exactly. The only thing the LLM does is translate your prompt into a prompt for the code analyser, which executes the instruction (see the sketch below). In the future you will see more of these multimodal models: basically a language model that incorporates other specialized models. Copilot in Excel does the same thing: Copilot translates the prompt into an Excel operation, and Excel can do the math.
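For a sense of what such a hand-off might look like, here is a rough Python sketch (hypothetical, not ChatGPT's actual tool code) that counts words exactly and trims a text to roughly 90% of its word count:

```python
# Hypothetical example of the kind of code a tool-using model might run:
# exact word counting and a naive "10% shorter" trim. A real assistant
# would edit more intelligently, but the counting itself is exact.
def word_count(text: str) -> int:
    return len(text.split())

def shorten_by_ten_percent(text: str) -> str:
    words = text.split()
    keep = max(1, int(len(words) * 0.9))  # keep ~90% of the words
    return " ".join(words[:keep])

sample = "Language models are good at prose but unreliable at exact counting tasks."
print(word_count(sample), "->", word_count(shorten_by_ten_percent(sample)))  # 12 -> 10
```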
The strawberry issue isn’t merely about tokenization. It’s about speed. All current LLMs are programmed to be fast. In some cases, too fast. That’s why many of the upcoming advances will be in making them slow down and “think”, just like people do. If you encourage an LLM to ruminate before answering, you’ll get better results.
Grok 2 mini (beta) answered with the following…
Here's the count:
1. S - No 'R'
2. T - No 'R'
3. R - Here's the first 'R'
4. A - No 'R'
5. W - No 'R'
6. B - No 'R'
7. E - No 'R'
8. R - Here's the second 'R'
9. R - Here's the third 'R'
10. Y - No 'R'
So, in the word "Strawberry", there are **3 'R's**.
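That same letter-by-letter tally can be reproduced mechanically; here is a small, purely illustrative Python sketch of the walk Grok described:

```python
# Walk the word one character at a time and tally the R's,
# mirroring the enumeration in the answer above.
word = "Strawberry"
count = 0
for i, ch in enumerate(word, start=1):
    if ch.upper() == "R":
        count += 1
        print(f"{i}. {ch} - Here's R number {count}")
    else:
        print(f"{i}. {ch} - No 'R'")
print(f'So, in the word "{word}", there are {count} R\'s.')  # -> 3
```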
Fun to see what folks comment. I especially enjoy the cute ‘splaining and performative putdowns that have become the reply du jour of the InterWebs. Btw, this post wasn't actually about how or why generative AI is able or unable to count the Rs. It's about exploring the language and constructs of how it responded when the miscount was pointed out and questioned, especially as it concerns Anthropic’s Constitutional AI. But, hey, to each their own.