
This is awesome. If you try the demo they provide [0], the inference is handled purely in the client using an ONNX model that weighs only around 8 MB [1] [2].

Really impressive stuff! Congrats to the team that achieved it.

[0] https://segment-anything.com/demo

[1] https://segment-anything.com/model/interactive_module_quanti...

[2] https://segment-anything.com/model/interactive_module_quanti...



It isn't purely client-side. The embeddings are generated on the server so your image is still sent to Meta for processing. https://github.com/facebookresearch/segment-anything/issues/...


The linked issue says nothing of the sort. Are you sure?


I wrote this part of the code. The features are computed on the server.


SAM uses CLIP-H as its image encoder, a very large encoder that obviously does not fit in 8 MB.

Also read the "Efficient & flexible model design" section on the page.

Also in the FAQ:

> How big is the model?
> The image encoder has 632M parameters. The prompt encoder and mask decoder have 4M parameters.
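
A quick back-of-envelope sketch (assuming plain fp32 weights at 4 bytes per parameter) shows why the encoder can't be the ~8 MB file served to the demo:

    # rough sizes, assuming fp32 weights (4 bytes per parameter)
    encoder_gb = 632e6 * 4 / 1e9   # ~2.5 GB for the ViT-H image encoder
    decoder_mb = 4e6 * 4 / 1e6     # ~16 MB for prompt encoder + mask decoder,
                                   # which quantizes down to a file of a few MB
    print(f"encoder ~{encoder_gb:.1f} GB, decoder ~{decoder_mb:.0f} MB")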


I cannot edit the comment anymore, but I should have written MAE ViT-H and not CLIP-H (same size, but CLIP models are trained in a different way).


MAE ViT-H is available freely; in fact, the entirety of the SAM pipeline is open source, and they even released the dataset.


I didn't say otherwise.


Is MAE ViT-H available freely? My system could certainly handle it locally if it can be obtained and dropped in.


Yup


Yes, the FAQ says:

> What platforms does the model use?
> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
> The prompt encoder and mask decoder can run directly with PyTorch or converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX runtime.


Can you clarify how this quoted text means the image is sent to the server for processing?


You can download the model yourself from GitHub and run it locally. The biggest one is about 2.5 GB and certainly took some time on my M1 CPU. I couldn't get MPS to run, as the tensor dtypes are incompatible (could be a quick fix).

The small ONNX model just decodes the output of the larger model into the masks etc. But the bulk of the "computation" is done by a much larger vision transformer somewhere else. It really needs a GPU with a fair amount of memory to run anywhere close to real-time.
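
For reference, here's a minimal sketch of running the full pipeline locally with the official segment_anything package (the checkpoint filename and click coordinates below are illustrative; swap "cpu" for "cuda" if you have a GPU):

    import cv2
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # load the ~2.5 GB ViT-H checkpoint downloaded from the repo
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to(device="cpu")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)  # the slow part: the heavy image encoder runs here

    # a single foreground click at (x, y); label 1 = foreground, 0 = background
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

The set_image call is where the big encoder does its work; everything after that (prompt encoder + mask decoder) is the cheap part that the ~8 MB ONNX model covers in the web demo.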


> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.

WebGPU only just shipped in Chrome today, and nobody is reporting that the demo breaks in browsers that are more than a few days old, so it doesn't use WebGPU.

While it's possible, without WebGPU it's really tedious to run a neural network in the browser.

Also, the model is implemented in PyTorch and wasn't converted to another format for a different runtime. While you could technically compile CPython and PyTorch to WASM and run the pair in the browser, there would definitely be no GPU access.

Given that they explicitly mention the decoder was converted to ONNX, it's obvious this wasn't done for the encoder, and that they really mean PyTorch, running under Python, on a server.

Okay, so your browser can't run the encoder, yet the web demo works; it's quite obvious whose server the encoder runs on.


There might be a slight miscommunication here.

I downloaded the code from their repo, exported their PyTorch model to ONNX, and ran a prediction against it. Everything ran locally on my system (CPU only, no CUDA cores), and a prediction for the item to be annotated was produced.
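
Roughly the steps, for anyone wanting to reproduce this (the export script is in the repo, but double-check the flag names against the README; "sam_decoder.onnx" is just an illustrative output name):

    # export the prompt encoder + mask decoder (run from the repo root):
    #   python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth \
    #       --model-type vit_h --output sam_decoder.onnx

    import onnxruntime as ort

    # plain CPU execution is enough for this part -- no CUDA needed
    sess = ort.InferenceSession("sam_decoder.onnx", providers=["CPUExecutionProvider"])

    # the exported graph expects the precomputed image embedding plus the prompt
    # (point coords/labels, optional low-res mask input, original image size)
    print([i.name for i in sess.get_inputs()])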


What is the difference between those two model links?



