Will high-impact research on reducing the sample complexity of Large Language Model pretraining be forthcoming before 2026?

Platform: Metaculus
Stars: ★★★☆☆
Forecast: 65% chance of Yes (Likely)

Question description #

Large Language Models (LLMs) are famously data hungry, with the largest among today's models requiring >1T tokens for optimal training. This high sample complexity has several important implications. For one thing, as reported by Epoch recently, current LLMs may already be leveraging almost all available high-quality text data, and the stock is not growing anywhere near fast enough to sustain the current rate of progress. For another, high data requirements lead to high compute requirements, meaning that only well-resourced actors are able to train LLMs. If techniques for making better use of available data during LLM pretraining were to be invented, this might remove data as a bottleneck to progress, and could increase the dispersion of powerful models among actors.
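The question text above is quoted from Metaculus. To make the scale concrete, the following is a minimal back-of-the-envelope sketch, assuming the widely cited Chinchilla heuristic of roughly 20 training tokens per parameter and the approximation of about 6 FLOPs per parameter per training token; both constants are illustrative assumptions, not figures taken from the question.

```python
# Back-of-the-envelope illustration of why LLM pretraining is data- and
# compute-hungry. The constants below are assumptions drawn from widely cited
# heuristics, not from the question text above.

TOKENS_PER_PARAM = 20      # Chinchilla-style compute-optimal heuristic (~20 tokens per parameter)
FLOPS_PER_PARAM_TOKEN = 6  # ~6 FLOPs per parameter per training token

def pretraining_estimate(n_params: float) -> tuple[float, float]:
    """Return (approximate compute-optimal training tokens, training FLOPs)."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = FLOPS_PER_PARAM_TOKEN * n_params * tokens
    return tokens, flops

for n_params in (7e9, 70e9, 400e9):
    tokens, flops = pretraining_estimate(n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> ~{tokens / 1e12:.1f}T tokens, ~{flops:.1e} training FLOPs")
```

Under these assumptions, a 70B-parameter model already sits in the ~1.4T-token range, consistent with the ">1T tokens" figure in the question, and its training compute grows with the token count, which is the link between the data and compute bottlenecks described above.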

Indicators #

| Indicator | Value |
| --- | --- |
| Stars | ★★★☆☆ |
| Platform | Metaculus |
| Number of forecasts | 88 |

Capture #

Resizable preview: a question card showing the title, the 65% (Likely) forecast for Yes, the ★★★☆☆ star rating, the Metaculus platform label, the forecast count (88), and a truncated question description. Last updated: 2024-10-07.

Embed #

<iframe src="https://metaforecast.org/questions/embed/metaculus-14415" height="600" width="600" frameborder="0"></iframe>
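The height and width attributes can be adjusted to fit the host page; the identifier in the embed URL (metaculus-14415) is what ties the iframe to this particular question, so embedding a different question means swapping in its own URL.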
