I was thinking about LLM tokenization (as one does) and had a thought: We select the next output token for an LLM based on its likelihood, but (some) shorter tokens are more likely.
Why? Longer tokens can only complete one word, but some shorter tokens can complete many words. Those shorter, common tokens are (correctly) learned to be higher-probability because they carry the combined probability of every word they could complete. However, standard generation techniques only consider a subset of the probabilities (top-K) and rescale them (temperature). Top-K, and any temperature below 1, take the highest probabilities and increase them further, meaning short/common tokens become significantly more likely to be generated just because they’re shorter.
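Here’s a minimal sketch of that amplification effect. The token strings and probabilities are made up for illustration (they don’t come from any real model), and the temperature of 0.7 and K of 2 are just assumed settings. The helper names `softmax` and `top_k_filter` are my own, not from a particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; zero out the rest."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution: "re" is a short token that could
# start many words, so it has absorbed the most probability mass.
tokens = ["re", "the", "return", "recursion", "remarkable"]
base = [0.40, 0.30, 0.15, 0.10, 0.05]
logits = [math.log(p) for p in base]

sharpened = softmax(logits, temperature=0.7)  # assumed temperature < 1
truncated = top_k_filter(base, k=2)           # assumed K = 2

for tok, b, s, t in zip(tokens, base, sharpened, truncated):
    print(f"{tok:>12}  base={b:.3f}  temp0.7={s:.3f}  top2={t:.3f}")
```

In this toy setup, the short token’s share goes up under both settings, while the long tokens lose probability or drop out entirely. A temperature above 1 would do the opposite and flatten the distribution.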
exfatloss recently wrote about the difference between being satiated and being full, and not experiencing satiety until their 30s. Thinking about this made me realize that there are at least four axes of hunger (pangs, appetite, fullness and emotional state), and some interesting edge cases. These hunger feelings are correlated, but don’t always occur together, and sometimes they even point in opposite directions.
I’ve gone snowboarding about 30 times since I started learning a few years ago, but every time I’m on a lift, most of the other riders have been out 90 days just this season. In fact, almost everyone I see has been skiing or snowboarding for decades, and comes out almost every day.
It’s hard to stay motivated when I’m the worst snowboarder on the mountain.
This might seem like a big coincidence, but I’m also one of the worst runners I know AND one of the worst writers that I’m aware of.
I used to get carpal tunnel symptoms while working on a computer all day, and the thing that finally solved it was a vertical mouse. Unfortunately, there are only a couple of options, and the one I like best has an annoying issue where the wheel wears out after a year or so. It’s cheap enough that this wasn’t a huge deal, but I finally got around to trying to fix it and realized it’s stupidly easy.
Claude has trouble playing Pokémon, partly because it can’t see the screen very well. This made me wonder if Claude would be better at an ASCII game like Dwarf Fortress, where it doesn’t need to rely on image recognition.
To check this, I built an MCP server to let Claude control an interactive terminal, and installed a text version of Dwarf Fortress.