When we first started interacting with LLMs, it felt unpredictable. Type a random series of words, hit enter, and something new appeared. After a little while, though, you realize that a well-crafted prompt can completely change the outcome. Sometimes you land on one you really like. But as in any chat app, it gets swept away in a stream of conversations, nearly impossible to find again later.

We got better at this as we began testing different models in these newsletter experiments. But still, we were swapping between dashboards, trying to measure exactly how fast one prompt ran compared to another on totally separate platforms. Once you start scaling this into larger applications, accepting that the prompts you spent the last hour crafting are lost inside the black box doesn’t feel great, especially three weeks later when you’re desperately trying to recreate whatever phrasing actually worked.

It’s like baking without writing down the recipe. You mix and tweak until something comes out just right, but if you don’t record it, you’re back to guessing next time.

Good prompts are like good recipes: they’re harder to recreate than you think, and the best ones deserve to be remembered. (Okay, to be fair, the oven never hallucinates and turns your cake into lasagna.)

So we built a thing.

I lost my voice while building this, so I asked a nice British man to help narrate.

For People Who Test Things Properly

It’s a prompt testing platform for people tired of manually copying the same prompt into four different chat windows to see which AI does it best.

Drop in your prompt, pick from 14+ models, hit generate, and get results plus the metrics that matter: speed, cost, token usage.

Select any two runs and our comparison engine analyzes the differences with a thoroughness that would make your high school English teacher proud.

Notes from Building

Seeing models side by side ruins you for single-model testing. Watching GPT and Claude tackle the same prompt reveals differences no spec sheet mentions. Sometimes subtle, sometimes dramatic, but almost always interesting.

Cost visibility should influence prompting habits. Seeing real-time estimates tied to specific prompts makes you care about token efficiency. The cost slider was our favorite feature — drag it to see what happens when you run your prompt 1,000 times. Some prompts are expensive at scale.
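For the curious, the arithmetic behind a slider like this is simple. Here is a rough sketch, not our actual implementation: the Pricing class, the function names, and the per-million-token prices are all made up for illustration, so plug in your provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-model prices in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


def cost_per_run(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimated dollar cost of a single prompt run."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def cost_at_scale(input_tokens: int, output_tokens: int,
                  pricing: Pricing, runs: int) -> float:
    """Estimated dollar cost of repeating the same run `runs` times."""
    return cost_per_run(input_tokens, output_tokens, pricing) * runs


if __name__ == "__main__":
    # Illustrative numbers only; these are not real model prices.
    example = Pricing(input_per_million=2.0, output_per_million=10.0)
    for runs in (1, 1_000, 100_000):
        estimate = cost_at_scale(1_500, 500, example, runs)
        print(f"{runs:>7} runs: ${estimate:,.2f}")
```

A fraction of a cent per run looks harmless until you multiply it by the number of times the prompt will actually run.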

Consistency in AI is a moving target. Same prompt, same model, different outputs. Our comparison engine makes this variability visible in ways that change how you think about what "consistent" means for AI.
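If you want a quick feel for that variability without any tooling, here is a minimal sketch. It is not how our comparison engine works; it just uses Python's standard difflib to score how alike repeated outputs are. The function name and the sample strings are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def average_similarity(outputs: list[str]) -> float:
    """Mean pairwise similarity of outputs, from 0.0 (nothing shared) to 1.0 (identical).

    Uses difflib's character-level ratio, a crude stand-in for a real
    comparison. With fewer than two outputs there is nothing to compare,
    so the function returns 1.0.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Pretend these came from running one prompt three times on one model.
    runs = [
        "Preheat the oven to 180C and grease the tin.",
        "Preheat your oven to 180C, then grease the tin.",
        "Start by greasing the tin while the oven heats to 180C.",
    ]
    print(f"average similarity: {average_similarity(runs):.2f}")
```

A score near 1.0 means the runs are nearly word-for-word the same; lower scores mean the wording drifted, even if the meaning did not.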

Good prompts deserve better organization than chat history. We built an icon system: ⚡ for fast prompts, 🌴 for creative ones. When a 1,500-character prompt differs from the previous one by only a few words, it's hard to tell them apart at a glance. It's a slightly silly solution, but we found it works.

Want a thing?

Start with a Township sprint and get your own Thing in as little as 2 weeks.

I want a thing!

Built with ❤️ by Township

You’re receiving this because you signed up for Township’s “We Built a Thing” newsletter. We'll only send these when we, well, build things.

All the Prompts You’ve Loved Before Are Waiting Here
