What's the strongest AI model you can train on a laptop in five minutes? (seangoedecke.com)
202 points by ingve 2 days ago | 57 comments
jebarker 53 minutes ago [-]
Optimized small-model training is not only important for availability but also for the scientific study of LLMs. It's like the use of simple organisms such as yeast in biology: we need to study the simplest possible transformers that exhibit the behaviors of interest seen in larger models if we hope to ever understand LLMs and gain more control over their behavior.
biophysboy 35 minutes ago [-]
It's a fun analogy, because the data "environment" of the model being trained matters a great deal.
aniijbod 51 minutes ago [-]
Let the AI efficiency olympics begin!

On a laptop, on a desktop, on a phone?

Train for 5 minutes, an hour, a day, a week?

On a boat? With a goat?

Nevermark 14 minutes ago [-]
On a maxxxed out Mac Studio M3 Ultra 512GB.

That boat will float your goat!

rPlayer6554 19 minutes ago [-]
I’d pay for GoatLM
visarga 21 minutes ago [-]
Goats have too many parameters; they're like GPT-4.
lifestyleguru 12 minutes ago [-]
Honestly, AI is a trick to make us buy expensive new computers. I'm writing this from one that's over 10 years old, and the computers offered in the leaflet from the nearby electronics store aren't much better.
zarzavat 3 hours ago [-]
Instead of time, it should be energy: what is the best model you can train with a given budget in Joules? Then the MBP and the H100 are on a more even footing.
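Back-of-the-envelope, an equal-Joules budget gives the H100 well under a minute of runtime (both wattage figures below are rough assumptions):

  # Assumed sustained power draws: an M-series MacBook Pro vs. an H100 SXM.
  LAPTOP_WATTS = 60   # assumption for a MacBook Pro under full load
  H100_WATTS = 700    # H100 SXM board power limit

  budget_joules = LAPTOP_WATTS * 5 * 60   # a 5-minute laptop run: 18,000 J

  # The same Joule budget spent on an H100 buys only about 26 seconds.
  h100_seconds = budget_joules / H100_WATTS
  print(f"{budget_joules} J buys about {h100_seconds:.0f} s of H100 time")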
NooneAtAll3 3 hours ago [-]
It's not about efficiency - it's about availability.

An H100 is not an everyday product. A laptop is.

Der_Einzige 54 minutes ago [-]
At this point, given how many H100s there are in existence, it’s basically an everyday product.
logicchains 47 minutes ago [-]
I envy you if $25k is an everyday product cost.
falcor84 36 minutes ago [-]
Maybe not to buy one, but to rent one. Like how barista-made coffee is an everyday product even though most people can't afford a fancy professional coffee machine.
jeroenhd 40 minutes ago [-]
For what it's worth, most of the world can't afford an M4 Macbook either.
wongarsu 32 minutes ago [-]
And renting an H100 for an hour is a lot easier than renting an M4 MacBook for an hour.
KeplerBoy 2 hours ago [-]
Still, I don't think the M4 is going to be far off from the H100 in terms of energy efficiency.

edit: fixed typo

menaerus 2 hours ago [-]
What efficiency did you have in mind? Bandwidth-wise M4 is ~10x to ~30x lower.
KeplerBoy 2 hours ago [-]
Ah, I mistyped. I meant energy efficiency, not memory efficiency.
giancarlostoro 2 hours ago [-]
The Mac is more competitive on power consumption, though, since it's never pulling as much as an Nvidia GPU, as I understand it.

On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model you can train in under an hour?

dtnewman 2 hours ago [-]
> you can rent an H100 for an hour for under $10

Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.

bigyabai 18 minutes ago [-]
It depends. If you're bottlenecked by memory speed, the Mac typically comes out on top.

In terms of compute efficiency, though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple were putting up a real fight.

netcan 46 minutes ago [-]
They're all good. Being somewhat arbitrary isn't a bad thing.
jvanderbot 25 minutes ago [-]
Bro, why not both?

We can / should benchmark and optimize this to death on all axes

LorenDB 2 hours ago [-]
> Paris, France is a city in North Carolina. It is the capital of North Carolina, which is officially major people in Bhugh and Pennhy. The American Council Mastlandan, is the city of Retrea. There are different islands, and the city of Hawkeler: Law is the most famous city in The Confederate. The country is Guate.

I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?

api 1 hours ago [-]
[flagged]
emeril 57 minutes ago [-]
Well, don't forget the Secretary of Education referred to AI as "A1", like the steak sauce, so it all tracks.
chias 39 minutes ago [-]
This is not true. I watched the clip. She referred to AI as AI. When she said A1 she was very clearly referring to America First.
quaristice 25 minutes ago [-]
Snopes confirmed that McMahon began by referring correctly to “AI development,” but in the same response, twice said “A1 teaching,” clearly meaning artificial intelligence. Not steak sauce. Multiple outlets including Gizmodo, Newser, Yahoo News, Mediaite, and Cybernews all reported the slip-up as genuine: she erroneously said “A1” when she meant “AI”.
tootyskooty 2 hours ago [-]
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, a better init, and careful learning-rate tuning.

[0]: https://github.com/KellerJordan/modded-nanogpt
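For the learning-rate part, a minimal warmup-plus-cosine schedule in PyTorch looks roughly like this (an illustrative sketch, not the modded-nanogpt code; the model and step counts are placeholders):

  import math
  import torch

  model = torch.nn.Linear(256, 256)   # stand-in for the tiny transformer
  opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

  warmup_steps, total_steps = 100, 2000

  def lr_lambda(step):
      if step < warmup_steps:
          return step / max(1, warmup_steps)             # linear warmup
      progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
      return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to ~0

  sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

  for step in range(total_steps):
      # loss.backward() and opt.zero_grad() would go here in a real loop
      opt.step()
      sched.step()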

Aperocky 58 minutes ago [-]
At what point is a simple Markov chain just as good or better?
Nevermark 34 seconds ago [-]
It is the other way around.

Neural-type models long ago passed the point where Markov chains made any sense, by many orders of magnitude.

Markov models fail by being too opinionated about the style of compute.

In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. And given a large enough tensor, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.

All other neural architectures, then, are simply sparser arrangements that bring compute demands down, with the sparseness fit to the problem.
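A toy sketch of the static case: two layers with a non-linearity fitting XOR, which no single linear map can represent (the sizes, seed, and learning rate here are arbitrary):

  import numpy as np

  # Toy version of "linear tensor + non-linearity, two layers": a tiny MLP
  # fit to XOR, which no single linear layer can represent.
  rng = np.random.default_rng(0)
  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
  y = np.array([[0], [1], [1], [0]], dtype=float)

  W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
  W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

  for _ in range(10000):
      h = np.tanh(X @ W1 + b1)                  # non-linear hidden layer
      p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
      dz = (p - y) * p * (1 - p)                # squared-error grad w.r.t. logits
      dh = (dz @ W2.T) * (1 - h ** 2)           # backprop through tanh
      W2 -= 0.5 * (h.T @ dz);  b2 -= 0.5 * dz.sum(0)
      W1 -= 0.5 * (X.T @ dh);  b1 -= 0.5 * dh.sum(0)

  print(np.round(p, 2))  # typically ends up close to [[0], [1], [1], [0]]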

visarga 20 minutes ago [-]
The output text turns into word salad every few words. You can't scale n-gram counting enough to make it work.
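For reference, this is the entirety of "training" a word-level bigram chain; the corpus line is a placeholder for whatever text you count over:

  import random
  from collections import defaultdict

  # A word-level bigram Markov chain: training is just counting successors.
  corpus = "paris is the capital of france . london is the capital of england .".split()

  chain = defaultdict(list)
  for prev, nxt in zip(corpus, corpus[1:]):
      chain[prev].append(nxt)

  word, out = "paris", ["paris"]
  for _ in range(12):
      word = random.choice(chain.get(word, corpus))
      out.append(word)
  print(" ".join(out))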
sadiq 4 minutes ago [-]
You might find https://arxiv.org/abs/2401.17377v3 interesting.
yalogin 24 minutes ago [-]
The bigger question, or maybe even realization, is that with this architecture there is no way to build a capable model that runs on a laptop or phone, which means there will never be local compute and servers become ever more important. Thinking about how ML itself works, reducing model size while retaining capability will just never happen.
simonw 19 minutes ago [-]
This post is about training, not inference.

The lesson here is that you can't use a laptop to train a useful model - at least not without running that training for probably decades.

That doesn't mean you can't run a useful model on a laptop that was trained on larger hardware. I do that all the time - local models got really good this year.

> reducing model size while retaining capability will just never happen.

Tell that to Qwen3-4B! Those models are remarkably capable.

grim_io 13 minutes ago [-]
It's always a question of "compared to what?"

Local models are nowhere near as capable as the big frontier models.

While a small model might be fine for your use case, it cannot replace Sonnet-4 for me.

bbarnett 3 hours ago [-]
Perhaps Grimlock level:

https://m.youtube.com/shorts/4qN17uCN2Pg

treetalker 3 hours ago [-]
"Hadn't thought of that …"

"You're absolutely right!"

nottorp 2 hours ago [-]
But supposing you have a real specific need to train, is the training speed still relevant? Or do the resources spent on gathering and validating the data set dwarf the actual CPU/GPU usage?
wongarsu 23 minutes ago [-]
If training is trivially fast, that allows you to iterate on architecture choices, hyperparameters, choices of which data to include, etc.

Of course, that only works if the trial runs are representative of what your full-scale model will look like. But within those constraints, optimising training time seems very valuable.
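Concretely, the iteration loop ends up looking something like this (train_and_eval stands in for the five-minute run; the grid values are arbitrary):

  import itertools, random

  # Stand-in for the 5-minute training job; returns a fake validation loss
  # so the sweep loop itself runs as-is.
  def train_and_eval(n_layer, n_embd, lr):
      return random.random()

  grid = {"n_layer": [2, 4, 8], "n_embd": [128, 256], "lr": [3e-4, 1e-3]}

  best = None
  for values in itertools.product(*grid.values()):
      cfg = dict(zip(grid, values))
      loss = train_and_eval(**cfg)
      if best is None or loss < best[0]:
          best = (loss, cfg)

  print("best config:", best)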

highfrequency 1 hours ago [-]
This is awesome - thanks for sharing. Appreciate the small-scale but comprehensive studies testing out different architectures, model sizes and datasets.

Would be curious to see a version of your model-size comparison chart but letting the training continue until perplexity plateaus / begins to overfit. For example: are your larger models performing worse because they are overfitting to a small dataset, or because you are comparing model sizes at a fixed 5-minute compute budget, so that the large models just don't get to learn very much in that time?

(Also interesting would be learning curve comparisons between architecture/param count)

pilooch 26 minutes ago [-]
I'd be interested in what implementation of D3PM was used (and failed). Diffusion models are more data-efficient than their AR LLM counterparts but less compute-efficient at training time, so it'd be interesting to know whether, given more time to converge, the diffusion approach does succeed. I guess I'll try :)
l5870uoo9y 2 hours ago [-]
The most powerful MacBook Pro currently has 16 CPU cores, 40 GPU cores, and 128 GB of RAM (plus a 16-core "neural engine" specifically designed to accelerate machine learning). Technically it is a laptop, but it could just as well be a computer optimized for AI.
alberth 2 hours ago [-]
The Mac Studio has:

  32-core CPU
  80-core GPU
  512 GB RAM
https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...
lukan 1 hours ago [-]
That's a well made page, describing nice hardware, but doesn't seem to be a laptop.
Joel_Mckay 1 hours ago [-]
From https://opendata.blender.org/ :

Apple M3 Ultra (GPU - 80 cores) scores 7235.31

NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31

Note that Nvidia's memory constraints are not like Apple silicon's, which also tends to be less I/O-constrained. YMMV

https://www.youtube.com/watch?v=d8yS-2OyJhw

https://www.youtube.com/watch?v=Ju0ndy2kwlw

Apple M3/M4 silicon is certainly good in some ways, but the bottleneck is often the lack of CUDA software support and the price (you could buy >4x the raw GPU performance with a dual RTX 5090 desktop). =3

hodgehog11 2 hours ago [-]
I love seeing explorations like this, which highlight that easily accessible hardware can do better than most people think with modern architectures. For many novel scientific tasks, you really don't need an H100 to make progress using deep learning over classical methods.
wowczarek 2 hours ago [-]
Not the point of the exercise obviously, but at five minutes' training I wonder how this would compare to a Markov chain bot.
mhogers 2 hours ago [-]
Any reason to upgrade an M2 16GB MacBook to an M4 ..GB (or 2026 M5) for local LLMs? I'm due an upgrade soon, and perhaps it's educational to run these things more easily locally?
sandreas 1 hours ago [-]
For LLMs, VRAM is requirement number one. Since MacBooks have unified RAM, you can use up to ~75% of it for the LLM, so a higher-RAM model would open more possibilities, but those are much more expensive (of course).

As an alternative, you might consider a Ryzen AI Max+ 395, as in the Framework Desktop or HP ZBook G1a, but the 128GB versions are still extremely expensive. The Asus Flow Z13 is a tablet with the same chip but is hardly available with 128GB.
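For a rough sense of what that ~75% buys you (the bytes-per-parameter figure is an assumption for a typical 4-bit quant):

  ram_gb = 128
  usable_gb = 0.75 * ram_gb   # about 96 GB of unified memory for the model
  bytes_per_param = 0.55      # assumption: ~4.4 bits/param for a 4-bit quant
  max_params_b = usable_gb / bytes_per_param
  print(f"~{usable_gb:.0f} GB usable -> roughly a {max_params_b:.0f}B-parameter 4-bit model, before KV cache")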

ionwake 2 hours ago [-]
I did just that: got the 32GB RAM one so I could run Qwen.

Might still be early days. I'm trying to use the model to sort my local notes, but I don't know, man, it seems only a little faster yet still unusable, and I downloaded the lighter Qwen model as recommended.

Again, it's early days and maybe I'm being an idiot. I did manage to get it to parse one note after about 15 mins, though.

schaefer 1 hours ago [-]
You could train an unbeatable tic-tac-toe AI on your laptop in five minutes. It doesn't get any stronger than that.

—

I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…

But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:

These days the meaning of the phrase "AI" has changed from the classical definition (all algorithms welcome); now AI usually means LLMs and their derivatives.
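For what it's worth, the classical unbeatable player is plain minimax over the full game tree, with no training at all (a minimal sketch):

  # Exhaustive minimax for tic-tac-toe: no training data, always optimal.
  WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

  def winner(b):
      for i, j, k in WINS:
          if b[i] and b[i] == b[j] == b[k]:
              return b[i]
      return None

  def minimax(b, player):
      w = winner(b)
      if w:
          return (1 if w == "X" else -1), None
      if all(b):
          return 0, None                      # draw
      moves = []
      for i in range(9):
          if not b[i]:
              b[i] = player
              score, _ = minimax(b, "O" if player == "X" else "X")
              b[i] = ""
              moves.append((score, i))
      return max(moves) if player == "X" else min(moves)

  board = [""] * 9
  print("value and move for X on an empty board:", minimax(board, "X"))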

silverlake 20 minutes ago [-]
I’m actually working on just this. What’s the smallest training data set required to learn tic-tac-toe? A 5yo doesn’t need much training to learn a new game, but a transformer needs millions of samples.
Daltonagray 6 minutes ago [-]
This sounds super interesting. Will you be sharing your work anywhere? :)
rkomorn 17 minutes ago [-]
> A 5yo doesn’t need much training to learn a new game

A 5yo also has... 5 years of cumulative real world training. I'm a bit of an AI naysayer but I'd say the comparison doesn't seem quite accurate.

silverlake 8 minutes ago [-]
It’s a glib analogy, but the goal remains the same. Today’s training sets are immense. Is there an architecture that can learn something with tiny training sets?
pjmlp 55 minutes ago [-]
Which laptop, though?
yunusabd 41 minutes ago [-]
Now imagine what you could do in 6 minutes!

But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.

evrennetwork 2 hours ago [-]
[dead]
lamuswawir 3 hours ago [-]
Thanks.