For People Who Test Things Properly
It’s a prompt testing platform for people tired of manually copying the same prompt into four different chat windows to see which AI does it best.
Drop in your prompt, pick from 14+ models, hit generate, and get results plus the metrics that matter: speed, cost, token usage.
Select any two runs and our comparison engine analyzes the differences with a thoroughness that would make your high school English teacher proud.
Notes from Building
Seeing models side by side ruins you for single-model testing. Watching GPT and Claude tackle the same prompt reveals differences no spec sheet mentions. They are sometimes subtle, sometimes dramatic, but almost always interesting.
Cost visibility should influence prompting habits. Seeing real-time estimates tied to specific prompts makes you care about token efficiency. The cost slider was our favorite feature: drag it to see the estimated cost of running your prompt 1,000 times. A prompt that looks cheap on a single run can be expensive at scale.
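The arithmetic behind that kind of estimate is simple enough to sketch. The snippet below is an illustration, not the platform's code: the `Pricing` class, the function names, and the per-million-token prices are all hypothetical, and it assumes cost scales linearly with the number of runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD. Values used below are made up."""
    input_per_million: float
    output_per_million: float


def cost_per_run(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one run: tokens times price, scaled from per-million."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def cost_at_scale(pricing: Pricing, input_tokens: int,
                  output_tokens: int, runs: int) -> float:
    """Estimated cost of repeating the same prompt `runs` times (linear scaling)."""
    return cost_per_run(pricing, input_tokens, output_tokens) * runs


if __name__ == "__main__":
    # Hypothetical model priced at $3 / $15 per million input / output tokens.
    example = Pricing(input_per_million=3.0, output_per_million=15.0)
    for runs in (1, 100, 1_000):
        print(f"{runs:>5} runs: ${cost_at_scale(example, 400, 600, runs):.4f}")
```

With these placeholder prices, output tokens dominate the total, which is why trimming a verbose response format can matter more than trimming the prompt itself.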
Consistency in AI is a moving target. Same input, same prompt, same model, different outputs. Our comparison engine makes this variability visible in ways that change how you think about what "consistent" means for a model.
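One rough way to put a number on that variability is to run the same prompt several times and average the pairwise similarity of the outputs. The sketch below is not how our comparison engine works; it uses Python's standard `difflib` on whitespace-separated words, and the function name is made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average word-level similarity (0.0 to 1.0) across all pairs of outputs.

    1.0 means every run produced the same words in the same order.
    Fewer than two outputs is treated as perfectly consistent.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    scores = [
        SequenceMatcher(None, a.split(), b.split()).ratio()
        for a, b in pairs
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Stand-ins for three runs of the same prompt on the same model.
    runs = [
        "The report is due on Friday.",
        "The report is due Friday.",
        "Your report is due on Friday.",
    ]
    print(f"consistency score: {mean_pairwise_similarity(runs):.2f}")
```

Word overlap is a blunt measure: two outputs can say the same thing in different words and still score low, so treat the number as a prompt to go read the diffs, not as a verdict.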
Good prompts deserve better organization than chat history. We built an icon system: ⚡ for fast prompts, 🌴 for creative ones. It’s hard to tell at a glance how a 1,500-character prompt differs from the previous version when only a few words have changed. The icons are a slightly silly solution, but we found they work.