Will high-impact research on reducing the sample complexity of Large Language Model pretraining be forthcoming before 2026?

Platform: Metaculus
Stars: ★★★☆☆
Forecast: 65% chance of Yes (Likely)

Question description #

Large Language Models (LLMs) are famously data hungry, with the largest among today's models requiring >1T tokens for optimal training. This high sample complexity has several important implications. For one thing, as reported by Epoch recently, current LLMs may already be leveraging almost all available high-quality text data, and the stock is not growing anywhere near fast enough to sustain the current rate of progress. For another, high data requirements lead to high compute requirements, meaning that only well-resourced actors are able to train LLMs. If techniques for making better use of available data during LLM pretraining were to be invented, this might remove data as a bottleneck to progress, and could increase the dispersion of powerful models among actors.
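The question text above is quoted from Metaculus. To make the scale concrete, the following is a minimal back-of-the-envelope sketch, assuming the widely cited Chinchilla heuristic of roughly 20 training tokens per parameter and the approximation of about 6 FLOPs per parameter per training token; both constants are illustrative assumptions, not figures taken from the question.

```python
# Back-of-the-envelope illustration of why LLM pretraining is data- and
# compute-hungry. The constants below are assumptions drawn from widely cited
# heuristics, not from the question text above.

TOKENS_PER_PARAM = 20      # Chinchilla-style compute-optimal heuristic (~20 tokens per parameter)
FLOPS_PER_PARAM_TOKEN = 6  # ~6 FLOPs per parameter per training token

def pretraining_estimate(n_params: float) -> tuple[float, float]:
    """Return (approximate compute-optimal training tokens, training FLOPs)."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = FLOPS_PER_PARAM_TOKEN * n_params * tokens
    return tokens, flops

for n_params in (7e9, 70e9, 400e9):
    tokens, flops = pretraining_estimate(n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> ~{tokens / 1e12:.1f}T tokens, ~{flops:.1e} training FLOPs")
```

Under these assumptions, a 70B-parameter model already sits in the ~1.4T-token range, consistent with the ">1T tokens" figure in the question, and its training compute grows with the token count, which is the link between the data and compute bottlenecks described above.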

Indicators #

| Indicator | Value |
| --- | --- |
| Stars | ★★★☆☆ |
| Platform | Metaculus |
| Number of forecasts | 88 |

Capture #

Resizable preview: a question card showing the title, the 65% (Likely) forecast for Yes, the ★★★☆☆ star rating, the Metaculus platform label, the forecast count (88), and a truncated question description. Last updated: 2024-10-07.

Embed #

<iframe src="https://metaforecast.org/questions/embed/metaculus-14415" height="600" width="600" frameborder="0"></iframe>
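The height and width attributes can be adjusted to fit the host page; the identifier in the embed URL (metaculus-14415) is what ties the iframe to this particular question, so embedding a different question means swapping in its own URL.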
