Resolves subjectively, based on my analysis of official, third-party, and my own benchmarks.
Some examples of benchmarks I consider are MMLU, ZebraLogic, SWE-bench, SimpleBench, ARC, and LiveBench.
Some of my own evals are game-playing (tic-tac-toe and Connect 4) and creative writing (giving a model 3 random nouns and asking it to write a story involving them).
| Indicator | Value |
|---|---|
| Stars | ★★☆☆☆ |
| Platform | Manifold Markets |
| Forecasters | 17 |
| Volume | M1.8k |