Evaluating LLMs from a handful of online demos is unreliable because a few isolated examples cannot capture how a model’s behavior varies across tasks.