This is awesome.
If you try the demo they provide [0], the inference is handled purely in the client using an ONNX model that weighs only around 8 MB [1] [2].
Really impressive stuff! Congrats to the team that achieved it
> What platforms does the model use?
> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
> The prompt encoder and mask decoder can run directly with PyTorch or converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX runtime.
You can download the model yourself on GitHub and run it locally. The biggest one is about 2.5GB and certainly took some time on my M1 CPU. I couldn't get mps to run as the tensor dtypes are incompatible (could be a quick fix).
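For reference, this is roughly the getting-started flow from the repo's README that I ran (a sketch rather than their exact script; the checkpoint filename is the ViT-H one from their download links, adjust device to "cuda" or "mps" as appropriate):

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load the big ViT-H checkpoint (~2.5GB download) and pick a device.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to(device="cpu")  # "cuda" if you have one; "mps" currently chokes on dtypes for me

    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...
    print(len(masks), "masks")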
The small ONNX model just decodes the output of the larger model into the masks etc. But the bulk of the "computation" is done by a much larger vision transformer somewhere else. It really needs a GPU with a fair amount of memory to run anywhere close to real-time.
> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
WebGPU just shipped today in Chrome; since nobody is reporting that the demo breaks in their days-old browser, it can't be using WebGPU.
While it's possible, without WebGPU it's really tedious to run neural networks in the browser.
Also, the model is implemented in PyTorch and wasn't converted to another format for a different runtime. While you could technically compile CPython and PyTorch to WASM and run the duo in the browser, there would definitely be no GPU access.
Given that they explicitly mentioned the decoder was converted to ONNX, it's obvious this isn't done for the encoder and they really mean PyTorch, running with Python, on a server.
Okay, so your browser can't run the encoder, yet the web demo works; it's quite obvious which server the encoder runs on.
I downloaded the code from their repo, exported their pytorch model to onnx, and ran a prediction against it. Everything ran locally on my system (cpu, no cuda cores) and a prediction for the item to be annotated was made.
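Roughly this, from memory (the export script and the decoder's input names are from the repo's ONNX example notebook, so treat the exact flags and shapes as assumptions and double-check there):

    # One-off export of the mask decoder:
    #   python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth \
    #       --model-type vit_h --output sam_decoder.onnx
    import cv2
    import numpy as np
    import onnxruntime
    from segment_anything import sam_model_registry, SamPredictor

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

    # The heavy image encoder still runs in PyTorch to produce the embedding...
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    embedding = predictor.get_image_embedding().cpu().numpy()

    # ...while the small decoder runs under onnxruntime on the CPU.
    session = onnxruntime.InferenceSession("sam_decoder.onnx")
    point = np.array([[500.0, 375.0]])                   # one foreground click
    coords = np.concatenate([point, np.zeros((1, 2))])[None].astype(np.float32)
    labels = np.array([[1, -1]], dtype=np.float32)       # -1 pads the prompt
    coords = predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)
    masks, scores, low_res = session.run(None, {
        "image_embeddings": embedding,
        "point_coords": coords,
        "point_labels": labels,
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
        "has_mask_input": np.zeros(1, dtype=np.float32),
        "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
    })
    masks = masks > 0.0  # threshold the logits to get boolean masks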
Wow, this is pretty epic. I put it through its paces on a pretty wide variety of images that have tripped up recent zero-shot models[1] and am thoroughly impressed.
We have a similar "Smart Polygon" tool[2] built into Roboflow but this is next level. Having the model running in the browser makes it so much more fun to use. Stoked it's open source; we're going to work on adding it to our annotation tool ASAP.
The SAM model is small (~4M params), but it requires an image embedding computed by what is, I think, a ~600M-param model. Right now the demo uploads the image to get the embeddings, then runs the actual segmentation locally.
That paper link is a CDN URL that is dynamically generated to point to your closest POP when you load the abstract. It will be different for many people and will break eventually.
If I'm remembering correctly, people were accessing the original versions of images uploaded to Instagram (removing filters, and even masks) using parameter engineering (we love putting "engineering" at the end of everything that requires more than 5 seconds of thought these days, so why not). That could be why.
> Don't you run the risk of only one person knowing the password to the password manager?
No, because systems don't necessarily work that way. For us, the boundaries between our members aren't totally uncrossable. Information has gotten through in the past when it would be especially important or needed.
Though I guess it's funny that this topic comes up under a post called "segment anything". I guess our brain did that~
Multiple block diagrams and the paper note that one of the inputs is supposed to be "text", but none of the example Jupyter notebooks or the live demo page show how to use those. I'm assuming just run the text into CLIP, take the resulting embedding, and throw it directly in as a prompt, which then gets re-encoded by the SAM prompt encoder?
> "Prompt encoder. We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding."
The network architecture and scale don't seem to be a big departure from recent SOTA, but a pretty massive amount of labeled data went into it. And it seems to work pretty well! The browser demo is great. This will probably see a lot of use, especially considering the liberal licensing.
I apologize if this is obvious, but are both the model and checkpoint (as referenced in getting started section in readme) Apache 2.0? Can it be used for commercial applications?
As far as I can tell, it can. The code itself has a `LICENSE` file with the Apache license, and the readme says "The model is licensed under the Apache 2.0 license.". Strangely, the FAQ in the blog post doesn't address this question, which I expect will be frequent.
Apache 2 is just about as business friendly as you can get. It's:
* Do what you want
* Don't sue us
* You license any patents you control that are used in this work. If you sue someone for patent violation for using this, then other entities can counter-sue you for violating any of their patents used in this work.
There is no viral nature, and it is older than GPLv3.
It's most similar to BSD of the licenses you list.
LGPL is not business friendly at all; it's among the least business-friendly licenses out there. Apache 2.0 is slightly more business friendly than BSD.
With some caveats, software licenses from most to least business friendly roughly go:
LGPL is more business friendly than GPL; it's literally "lesser" GPL.
You can use LGPL in commercial, closed-source projects as long as you keep the LGPL code in a separate dynamically linked library, e.g. a DLL, and provide a way for users to swap it out for their own patched DLL if they wish. (Plus some other license terms.)
Also, you can always use LGPL code under the terms of the GPL, so there's no way LGPL is more restrictive than GPL.
Beware of using LGPL code in a browser, though: JavaScript is source code, not object code; arguing WASM is a DLL wouldn't help; most JavaScript minifiers effectively perform static linking; and sending LGPL code to the browser could be considered distribution. I always avoided LGPL-licensed libraries when doing commercial front-end work.
They seem to avoid using their own brand a lot. They have a zillion domain names, they register a new one, and they don't use the logo except in the favicon and footer. I've seen similar moves before, including divesting OSS projects like PyTorch and GraphQL, which Google wouldn't do. To me that's a tacit admission that the Facebook and Meta names are tarnished. And they are: by the content they showed users in Myanmar with the algorithmic feed, and by Cambridge Analytica. Maybe the whole "Meta" name is no different from the rebranding of Philip Morris.
Welcome to the wild world of corporate IT. Their VP has the authority to make a new website if she wants, but has to go through a three-month vetting process to put it on a subdomain.
As someone who used to work on Facebook open source, that makes sense! After all, an insecure subdomain could lead to all sorts of problems on facebook.com. Phishing, stealing cookies, there's a lot of ways it could go wrong.
Whereas, if one engineer spins up some random static open source documentation website on AWS, it really can't go wrong in a way that causes trouble for the rest of the company.
And you would learn that, as long as you don't use wildcard cookies (which I generally wouldn't recommend anyway), subdomains are isolated from each other. But if Meta's brand weren't tarnished, a separate domain for sites like this, along the lines of Google's withgoogle.com and web.dev, would be a better home than subdomain.facebook.com.
Meta isn't a typical corporation, though. Ordinary big company red tape could have stopped them from indirectly displacing thousands based on their religion. (That isn't an outlandish claim but is something they actually got sued for, though it was dismissed without absolving them of it)
It very much is a typical big corp, and OP is correct. It's easier to ship something on a new domain, using AWS and a bunch of contractors, than to add a subdomain to facebook.com or some other top-level domain
Not to mention, the "ordinary big company red tape" didn't stop Coca-Cola from hiring Colombian death squads, Nestle from draining the Great Lakes and selling the water back to its residents, nor Hershey's from making chocolate from cacao farmed with child slave labor.
Relative to the rest of FAANG (or even Fortune 500), Facebook might have the least blood on their hands when everything is said and done.
Um... did you sleep through the last 8+ years of handwringing about election interference, Russian/state propaganda, live-streamed massacres, and the addiction/mental-health effects of social media, particularly for kids? I can't imagine the other FAANGs come close.
If platforming disinformation and enabling internet addiction is equivalent to criminal complicity, then Microsoft, Apple, Amazon and Google all have crimes to answer for. Facebook has shit the bed more times than they can count on two hands, but unfortunately that's kinda the table stakes in big tech.
I actually have a much more positive impression of Meta because of this work. It's hard to describe, but they feel very competent. My instant reaction to something being by Meta Research is actually to think it's probably going to be interesting and good.
What are you talking about?
There is a Meta logo favicon, "Meta AI" appears in the header, and "Meta AI" is purposefully centered in the ABF text. Registering a new domain costs $10, compared to the massive pain of involving legal to get permission to put it under an existing domain. It's a new project, so why not make a clean start and just get a new website instead of going through the full FB/Meta approval process on branding.
I mentioned the logo. I didn't mention the text because perhaps they still want to score points for Meta, so hiding it entirely wouldn't make sense. But they avoid the larger immediate hangups of the big logo and the domain name.
On the one hand, sure. Facebook's brand is about as hip as a bag of Werther's Originals.
On the other hand, this is one of those things (like VR) that is a distinctly non-Facebook project. It makes no sense to position or market this as "Facebook" research. The Homepod isn't called the iPod Home for obvious reasons, so it stands to reason that Facebook execs realized selling someone a "Facebook Quest" sounds like a metaphor for ayahuasca. It's not entirely stupid to rebrand, especially considering how diverse (and undeniably advanced) they've become in fields like AI and VR.
Ever used React or PyTorch? Well, this is the same. Developers make good stuff regardless of where they work, and good on FB for contributing.
But yeah, if you do open source, adding an element of corporate branding is a sure way to kill the project. That's why it's not called "Apple Swift" or "Microsoft TypeScript".
Yeah, me too. I also avoid everything Apple and Google makes, but I'm not going to pretend like the Alphabet rebranding is their attempt at hiding who they are.
Alphabet wasn't a rebranding: the founding billionaires got bored of Google and wanted to take a few billion dollars per year out of it to create new toys without sharing them with Google.
See my other comment. Of course they needed to have it somewhere to score points. These probably weren't people who were about to quit, just people with a lowered perception of the brand compared to a company people are mostly proud to work at, like Google... https://news.ycombinator.com/edit?id=35458445
Beginning in August 2017, the Myanmar security forces undertook a brutal campaign of ethnic cleansing against Rohingya Muslims. This report is based on an in-depth investigation into Meta (formerly Facebook)’s role in the serious human rights violations perpetrated against the Rohingya. Meta’s algorithms proactively amplified and promoted content which incited violence, hatred, and discrimination against the Rohingya – pouring fuel on the fire of long-standing discrimination and substantially increasing the risk of an outbreak of mass violence. The report concludes that Meta substantially contributed to adverse human rights impacts suffered by the Rohingya and has a responsibility to provide survivors with an effective remedy.
The Magic Wand tool mostly just selects similarly colored pixels based on a simple algorithm. The Object Selection, Select Subject and Sky Replacement tools use AI detection and can be configured to run locally or on Adobe's GPUs. Having played with this demo, they seem in a similar league to me.
I think you haven't played around with it enough; you can prompt it to segment literally anything in an image. Not just regions of similar texture: it understands humans, dogs, cats, etc.
I have a small number of naive questions. I already have a fine-tuned tardigrade detection model that gives me tardigrade bounding boxes (data comes from labelled images on my microscope). I want to do masks as well.
Right now my home server with an RTX 2080 is able to do mask prediction in about 4 seconds per image (640x480); I'm running the sample script in "directory" mode.
I'd love to be able to get the first mask back in about 0.1 s, so I can do 10 FPS on the scope. Is there a practical way to speed things up (my guess would be buying the absolute fastest GPU I can afford)? I can run the object detection to get a reasonable seed location; is that what the paper means by prompting?
A 4s run time for object segmentation at 640x480 sounds like it's not using the GPU at all. Something like that should run on a VGA image in at most a few hundred ms.
For the second part of the question, a 2080 should get you close to 10FPS operation. For a ballpark estimate, using an off-the-shelf repo like Ultralytics's YOLOv5 lets you run object detection (not masking) at something like 100FPS. Masking should not add that much overhead.
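(By "off-the-shelf" I mean something like the torch.hub one-liner below; yolov5s is the small variant, and the exact FPS obviously depends on your hardware and image size.)

    import torch

    # Off-the-shelf YOLOv5 via torch.hub (downloads weights on first use).
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    results = model("frame.jpg")   # accepts paths, PIL images, or numpy arrays
    print(results.xyxy[0])         # detections as [x0, y0, x1, y1, conf, class]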
w.r.t. GPUs yes, these days more money equals more speed for GPU NN inference, though there are diminishing returns. A 3090 might get you the best bang for your buck these days while still having enough VRAM to run fancier models which may need more than the 12 GiB many other GPUs have.
Finally, I haven't read the paper too carefully but I believe that by prompting they mean that you have the option of describing in human language what you want the model to select, rather than the model being "hardwired" to do this. In other words, you could prompt the model to "segment the red car only" and it would do it, rather than just having the model blindly segment every object in the image, and then relying on custom scripting to potentially post-process these segments.
I'm using the first model on the SAM website (ViT-H SAM).
It's definitely using the GPU: I'm running nvidia-smi and I see near 100% utilization on the GPU while the CPU is using 1 core. If I run the script with --device=cpu then I see my server using 4 CPU cores and no GPU, and it takes tens of seconds per image.
I'm trying to check with people who have experience with this specific model.
I've checked on the repo and other folks report the same numbers as me; in fact my 2080 is just as fast as the 3090 (5 seconds per mask generation/image).
I have bounding box already, so I could prompt the model with that, but all of this runs counter to the published performance numbers.
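In case it helps anyone reproduce: this is roughly what I'm planning to try for box prompting, and it also splits out where the time goes, since set_image (the image encoder) should be the expensive one-off step and each box-prompted decode after it should be cheap. A sketch; the checkpoint path and the dummy frame/box are placeholders:

    import time
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
    predictor = SamPredictor(sam)

    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a microscope frame
    box = np.array([100, 100, 300, 300])             # stand-in for a detector bbox, XYXY

    t0 = time.time()
    predictor.set_image(frame)                       # runs the heavy image encoder once
    t1 = time.time()
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    t2 = time.time()
    print(f"encoder: {t1 - t0:.3f}s, box-prompted decode: {t2 - t1:.3f}s")

I suspect most of my 4-5 s comes from the automatic mask generator, which prompts with a dense grid of points per image by default, rather than from a single prompt.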
If you want to try it on one reach out to me (email in profile). We rent those out in the cloud. Would allow you to confirm performance before buying one for local use.
The RTX 2080 Ti 11GB model is a little more than 2 times slower than the flagship RTX (12.5 it/s vs 28 it/s) for torch/diffusion. Extrapolate from that what you will.
However, CLIP/BLIP and boxes should be much faster than 4 seconds, even on a 2080. I had a Python CLIP CSV-tagging script running in a directory with thousands of images and it was taking <=2 seconds per image on a GeForce GTX 1070 Ti with 8GB of memory, an old card without any tensor cores. CLIP is much slower than some other mechanisms; for instance, a deepbooru classifier is about 4x faster than CLIP/BLIP on my RTX 3060 12GB. CLIP of a random image around your dimensions takes ~4-5 seconds, and deepbooru takes about 1.5 seconds. Edit: the additional time is the overhead of the web UI, I am guessing.
What will probably have to happen is some sort of auto-crop that forces the model to view only a very tiny section of the image. You mentioned you already had a model; was it trained from scratch, or built on an existing model?
What model are you using? 4s/image seems extremely slow. I've been experimenting with Detectron2 and most of their models give me less than 1s on the CPU for instance segmentation on images 4x the size
It seems like the output of this model is masks, but for cropping you really need to be able to pull partial color out of certain pixels (for example, pulling a translucent object out from a colored background). I tried the demo, and it fails pretty miserably on a vase. Anyone know of a model that can do this well?
This is a segmentation model; it 'just' creates segmentation masks. You then take the mask and cut out whatever you need from the image itself. If you need to recover transparency, that is a whole other problem, probably a generative one (generate a new texture for the vase).
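Concretely, something like this, using the mask as an alpha channel (a minimal numpy/PIL sketch, no matting or feathering, so translucent objects like that vase will still look wrong):

    import numpy as np
    from PIL import Image

    def cutout(image_path: str, mask: np.ndarray) -> Image.Image:
        """Turn a boolean HxW mask into an RGBA cut-out of the original image."""
        rgb = np.array(Image.open(image_path).convert("RGB"))
        alpha = (mask.astype(np.uint8) * 255)[..., None]
        return Image.fromarray(np.concatenate([rgb, alpha], axis=-1), mode="RGBA")

    # e.g. with a mask dict from SamAutomaticMaskGenerator:
    # cutout("photo.jpg", sam_masks[0]["segmentation"]).save("cutout.png")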
The main issue I have with DIS is that creating the labels of my own dataset is super expensive (I think it might be easier to generate the training data using stable diffusion rather than human labelling)
It is related to subpixel labelling. When a line/curve in the foreground is smaller than a pixel you end up having to edit the mask one pixel at a time. The authors of DIS are working on a new dataset and model which should work for my use case.
BTW, I used DIS to create the labels of a batch of 20 images, I manually corrected the labels and used them to fine tune a new model. That worked well but still it took me several hours to edit labels.
I tried using stable diffusion generated labels several weeks ago but I think with controlnet and other advances I should try again.
(My dataset is about 100k images. I probably only need to label about 10k to fine tune DIS).
It would still be nice if iOS had some kind of interface like this where you can nudge it in the right direction if it's confusing something like a jacket and the background. iOS gives its best attempt which is usually pretty good, but if it didn't get it right you're basically SOL.
Computer vision seems to be gravitating heavily towards self-attention. While the results here are impressive, I'm not quite convinced that vision transformer encoders are the right way forward. I just can't wrap my head around how discretizing images, which are continuous in two dimensions, into patches is the optimal way to do visual recognition.
What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder stack? I feel like the results would be similar if not better.
EDIT: Clarifying that encoder/decoder refers to the transformer stack, not an autoencoder.
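To be concrete about what I mean by discretizing into patches: in most ViT implementations it's literally just a strided convolution that chops the image into non-overlapping 16x16 tokens before self-attention ever sees it. Minimal sketch:

    import torch
    import torch.nn as nn

    # ViT-style patch embedding: non-overlapping 16x16 patches -> token sequence.
    patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

    image = torch.randn(1, 3, 224, 224)
    tokens = patch_embed(image)                 # [1, 768, 14, 14]
    tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768], fed to the transformer
    print(tokens.shape)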
Google seems to be doing it all with transformers. It's not open source, though:
> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.
> What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder? I feel like the results would be similar if not better.
IMO optimal visual recognition should be sensorimotor-based and video-first. In the real world, action and perception are intertwined. Supervised training on static pixel arrays seems backward and primitive.
Yikes. I went to film school in the early 2000s and spent hours and hours on levels/HDR-based masking. I've used the Adobe tools recently and they're good... this is... yikes. I wonder how people in their mid-20s learning Photoshop today are going to deal with the jobs they graduate into.
Pretty cool, Runway has a similar green screening feature that can 1-click segment a subject from the background across an entire video: https://runwayml.com/ai-magic-tools/
That's amazing! This model is a huge opportunity to create annotated data (with decent quality) for just a few dollars. People will iterate more quickly with this kind of foundation model.
Impressive, but not really perfect, is it? In sa_10016721, it misses the obvious, and in sa_10020386, it misses most of the train in the center, and a bunch of the parked cars (pretty random). In sa_10179757, it labels 3 out of 4 letters of the shipping company's name (?), and a handful of windows, and while perhaps it sees the ship as one piece, the people in the foreground are split in many parts.
Kind of off topic, but: I've never seen such crappy issues filed in a repo [1].
I don't read issues for major repositories, so perhaps this is standard? There are a ton of one-line "issues" with no clear example, test case, or attempt to debug, and not even a pull request for the one that points out a typo in the README.
Gross. It seems like none of these issues will ever be read, because they are going to drown in garbage.
It’s interesting that (clearly visible) text parts that cannot be handled properly by most OCR approaches also get left out by SAM in auto-predictions.
Finally, I'll be able to fill line art with flat colors without fussing around with thresholds and painting in boundaries.
(It does have difficulty finding the smallest possible area, but it's a significant advance over most existing options since in my brief test, it can usually spot the entire silhouette of figures, which is where painting a boundary is most tedious).
What do you think Facebook's gameplan is here? Are they trying to commoditize AI by releasing this and Llama as a move against OpenAI, Microsoft, and Google? They had to have known the Llama weights would be leaked, and now they are releasing this.
I think cranking out open source projects like this raises Meta AI’s profile and helps them attract attention and people, and I don’t think selling AI qua AI is their business plan, selling services built on top is. And commoditized AI means that the AI vendors don’t get to rent-seek on people doing that, whereas narrowly controlled monopoly/oligopoly AI would mean that the AI vendors extract the value produced by downstream applications.
I've always half-believed that the relatively open approach to industry research in ML was a result of the inherent compute-based barrier to entry for productizing a lot of the insights. Collaborating on improving the architectural SoTA gets the handful of well-capitalized incumbents further ahead more quickly, and solidifies their ML moat before new entrants can compete.
Probably too cynical, but you can potentially view it as a weak form of collusion under the guise of open research.
This particular model has a very low barrier; it is smaller than Stable Diffusion, which already runs easily on consumer hardware for inference. Training is more resource-intensive (but not out of reach of consumers, whether through high-end consumer hardware or affordable cloud resources).
For competitive LLMs targeting text generation, especially for training, a compute-based barrier is more significant.
Yeah that’s fair. I intended my comment to be more of a reflection on the culture in general, but the motivations in this instance are probably different.
> Probably too cynical, but you can potentially view it as a weak form of collusion under the guise of open research.
I think that argument falters when the weights are released, which lowers the barrier by a lot, as training large models is much more expensive than inference. A weak form of collusion would be publishing papers that explain just enough for practitioners to fill in the gaps (so casuals are left out) while not publishing the weights, so that only other large companies can afford to implement and train their own versions of the models.
My own view is that open-publishing in AI is mostly bottom-up, and the executives tolerate open publishing for the reasons you gave.
Incidentally, most companies won't publish their crown jewels. The camera apps on Google and Apple phones have had great segmentation of the usual photography subjects, and they'd rather not publish those models. I'm not holding my breath for the video recommendation models from TikTok or Facebook either.
I think Meta's gameplan is complex. Inspiration as well as adoption, and not stepping on regulators' toes is probably another intention. Have a look at PyTorch, for example: a massively popular ML framework with lots of interesting projects running on it.
If Meta frequently shares their "algorithms", they take the blame out of using them. After all, who is to blame when everybody does "it" and you are very open about it?
Use cases, talent visibility, and talent attraction also play a role. After all, Google was so fancied in part due to its many open source projects. "Show, don't tell."
Well, there's some patent offense and defense in making and releasing research papers. There are some recruiting aspects to it. It's also a way to commoditize your complement, if you assume this sort of stuff brings AR and the metaverse closer to reach.
Their main use case for these models seems to be AR. Throwing it out in the open might help get external entities to build for them, attract talent, etc. Not sure they're that strategic, but it's my guess.
n=1 (as a mid-profile AI researcher), but for me it's working in terms of Meta gaining my respect by open sourcing (despite the licensing disasters). They clearly seem to be more committed to open source and getting things done now in general.
The demo is pretty cool but it looks like you can just select things and have it highlight them in blue - is there a way to remove objects from the image and have the background filled in behind them?
For some reason this tool makes a slightly smaller mask than is found in the original image. So when you copy the masked area back into Photoshop, it doesn't match. Almost there, but not quite.
The demo is running slow. Cutting out is an impressive ability. Am I to assume it also fills in the background? If so, that's next level. Maybe that Photoshop monthly subscription will be worth it (provided this sort of ability is going to be baked into Adobe's AI version soon).
It doesn't fill in the background and has nothing to do with Adobe.
You could bring the cut-outs back into Photoshop. I tried that, but this SAM tool reduces the size of the cut-outs slightly, so the cut-out won't match the original image dimensions.
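If the only problem is the dimensions, resizing the mask back up to the original size before compositing mostly works as a stopgap (quick PIL sketch; the filenames are placeholders, and nearest-neighbour keeps the mask hard-edged):

    from PIL import Image

    original = Image.open("original.jpg")
    mask = Image.open("sam_cutout_mask.png").convert("L")

    # Scale the mask back to the original dimensions before using it in Photoshop.
    mask.resize(original.size, resample=Image.NEAREST).save("mask_full_size.png")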
I know. Masking and selecting is something you do in Adobe products, and Adobe will be coming out with their own version of this (if I were a betting man).
[0] https://segment-anything.com/demo
[1] https://segment-anything.com/model/interactive_module_quanti...
[2] https://segment-anything.com/model/interactive_module_quanti...