This is awesome.
If you try the demo they provide [0], the inference is handled purely in the client using an ONNX model that weighs only around 8 MB [1] [2].
Really impressive stuff! Congrats to the team that achieved it
> What platforms does the model use?
> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
> The prompt encoder and mask decoder can run directly with PyTorch or converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX runtime.
You can download the model yourself on GitHub and run it locally. The biggest one is about 2.5GB and certainly took some time on my M1 CPU. I couldn't get mps to run as the tensor dtypes are incompatible (could be a quick fix).
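For reference, this is roughly the getting-started flow from the repo's README that I ran (a sketch rather than their exact script; the checkpoint filename is the ViT-H one from their download links, adjust device to "cuda" or "mps" as appropriate):

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load the big ViT-H checkpoint (~2.5GB download) and pick a device.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to(device="cpu")  # "cuda" if you have one; "mps" currently chokes on dtypes for me

    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...
    print(len(masks), "masks")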
The small ONNX model just decodes the output of the larger model into the masks etc. But the bulk of the "computation" is done by a much larger vision transformer somewhere else. It really needs a GPU with a fair amount of memory to run anywhere close to real-time.
> The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
WebGPU just shipped today in Chrome; since nobody is reporting that the demo breaks in their days-old browser, it can't be using WebGPU.
While it's possible, without WebGPU it's really tedious to run neural networks in the browser.
Also, the model is implemented in PyTorch and wasn't converted to another format for a different runtime. While you could technically compile CPython and PyTorch to WASM and run the duo in the browser, there would definitely be no GPU access.
Given that they explicitly mentioned the decoder was converted to ONNX, it's obvious this isn't done for the encoder and they really mean PyTorch, running with Python, on a server.
Okay, so your browser can't run the encoder, yet the web demo works; it's quite obvious which server the encoder runs on.
I downloaded the code from their repo, exported their pytorch model to onnx, and ran a prediction against it. Everything ran locally on my system (cpu, no cuda cores) and a prediction for the item to be annotated was made.
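Roughly this, from memory (the export script and the decoder's input names are from the repo's ONNX example notebook, so treat the exact flags and shapes as assumptions and double-check there):

    # One-off export of the mask decoder:
    #   python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth \
    #       --model-type vit_h --output sam_decoder.onnx
    import cv2
    import numpy as np
    import onnxruntime
    from segment_anything import sam_model_registry, SamPredictor

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

    # The heavy image encoder still runs in PyTorch to produce the embedding...
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    embedding = predictor.get_image_embedding().cpu().numpy()

    # ...while the small decoder runs under onnxruntime on the CPU.
    session = onnxruntime.InferenceSession("sam_decoder.onnx")
    point = np.array([[500.0, 375.0]])                   # one foreground click
    coords = np.concatenate([point, np.zeros((1, 2))])[None].astype(np.float32)
    labels = np.array([[1, -1]], dtype=np.float32)       # -1 pads the prompt
    coords = predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)
    masks, scores, low_res = session.run(None, {
        "image_embeddings": embedding,
        "point_coords": coords,
        "point_labels": labels,
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
        "has_mask_input": np.zeros(1, dtype=np.float32),
        "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
    })
    masks = masks > 0.0  # threshold the logits to get boolean masks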
Wow, this is pretty epic. I put it through its paces on a pretty wide variety of images that have tripped up recent zero-shot models[1] and am thoroughly impressed.
We have a similar "Smart Polygon" tool[2] built into Roboflow but this is next level. Having the model running in the browser makes it so much more fun to use. Stoked it's open source; we're going to work on adding it to our annotation tool ASAP.
The SAM model is small (~4M params), but it requires an image embedding computed by what is, I think, a ~600M-param model. Right now the demo uploads the image to get the embeddings, then runs the actual segmentation locally.
That paper link is a CDN URL that is dynamically generated to point to your closest POP when you load the abstract. It will be different for many people and will break eventually.
If I'm remembering correctly, people were accessing the original versions of images uploaded to Instagram (removing filters, and even masks) using parameter engineering (we love putting "engineering" at the end of everything that requires more than 5 seconds of thought these days, so why not). That could be why.
> Don't you run the risk of only one person knowing the password to the password manager?
No, because systems don't necessarily work that way. For us, the boundaries between our members aren't totally uncrossable. Information has gotten through in the past when it would be especially important or needed.
Though I guess it's funny that this topic comes up under a post called "segment anything". I guess our brain did that~
Multiple block diagrams and the paper note that one of the inputs is supposed to be "text", but none of the example Jupyter notebooks or the live demo page show how to use those. I'm assuming just run the text into CLIP, take the resulting embedding, and throw it directly in as a prompt, which then gets re-encoded by the SAM prompt encoder?
> "Prompt encoder. We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding."
The network architecture and scale don't seem to be a big departure from recent SOTA, but a pretty massive amount of labeled data went into it. And it seems to work pretty well! The browser demo is great. This will probably see a lot of use, especially considering the liberal licensing.
I apologize if this is obvious, but are both the model and checkpoint (as referenced in getting started section in readme) Apache 2.0? Can it be used for commercial applications?
As far as I can tell, it can. The code itself has a `LICENSE` file with the Apache license, and the readme says "The model is licensed under the Apache 2.0 license.". Strangely, the FAQ in the blog post doesn't address this question, which I expect will be frequent.
Apache 2 is just about as business friendly as you can get. It's:
* Do what you want
* Don't sue us
* You license any patents you control that are used in this work. If you sue someone for patent violation for using this, then other entities can counter-sue you for violating any of their patents used in this work.
There is no viral nature, and it is older than GPLv3.
It's most similar to BSD of the licenses you list.
LGPL is not business friendly at all; it's among the least business-friendly licenses out there. Apache 2.0 is slightly more business friendly than BSD.
With some caveats, software licenses from most to least business friendly roughly go:
LGPL is more business friendly than GPL; it's literally "lesser" GPL.
You can use LGPL in commercial, closed-source projects as long as you keep the LGPL code in a separate dynamically linked library, e.g. a DLL, and provide a way for users to swap it out for their own patched DLL if they wish. (Plus some other license terms.)
Also, you can always use LGPL code under the terms of the GPL, so there's no way LGPL is more restrictive than GPL.
Beware of using LGPL code in a browser, though: JavaScript is source code, not object code; arguing WASM is a DLL wouldn't help; most JavaScript minifiers effectively perform static linking; and sending LGPL code to the browser could be considered distribution. I always avoided LGPL-licensed libraries when doing commercial front-end work.
They seem to avoid using their own brand a lot. They have a zillion domain names, they register a new one, and they don't use the logo except in the favicon and footer. I've seen similar moves before, including divesting OSS projects like PyTorch and GraphQL, which Google wouldn't do. To me that's a tacit admission that the Facebook and Meta names are tarnished. And they are: by the content they showed users in Myanmar with the algorithmic feed, and by Cambridge Analytica. Maybe the whole "Meta" name is no different from the rebranding of Philip Morris.
Welcome to the wild world of corporate IT. Their VP has the authority to make a new website if she wants, but has to go through a three-month vetting process to put it on a subdomain.
As someone who used to work on Facebook open source, that makes sense! After all, an insecure subdomain could lead to all sorts of problems on facebook.com. Phishing, stealing cookies, there's a lot of ways it could go wrong.
Whereas, if one engineer spins up some random static open source documentation website on AWS, it really can't go wrong in a way that causes trouble for the rest of the company.
And you would learn that, as long as you don't use wildcard cookies (which I generally wouldn't recommend anyway), subdomains are isolated from each other. But if Meta's brand weren't tarnished, a separate domain for sites like this, along the lines of Google's withgoogle.com and web.dev, would be a better home than subdomain.facebook.com.
Meta isn't a typical corporation, though. Ordinary big company red tape could have stopped them from indirectly displacing thousands based on their religion. (That isn't an outlandish claim but is something they actually got sued for, though it was dismissed without absolving them of it)
It very much is a typical big corp, and OP is correct. It's easier to ship something on a new domain, using AWS and a bunch of contractors, than to add a subdomain to facebook.com or some other top-level domain
Not to mention, the "ordinary big company red tape" didn't stop Coca-Cola from hiring Colombian death squads, Nestle from draining the Great Lakes and selling the water back to its residents, nor Hershey's from making chocolate from cacao farmed with child slave labor.
Relative to the rest of FAANG (or even Fortune 500), Facebook might have the least blood on their hands when everything is said and done.
Um... did you sleep through the last 8+ years of handwringing about election interference, Russian/state propaganda, live-streamed massacres, and the addiction/mental-health effects of social media, particularly for kids? I can't imagine the other FAANGs come close.
If platforming disinformation and enabling internet addiction is equivalent to criminal complicity, then Microsoft, Apple, Amazon and Google all have crimes to answer for. Facebook has shit the bed more times than they can count on two hands, but unfortunately that's kinda the table stakes in big tech.
I actually have a much more positive impression of Meta because of this work. It's hard to describe, but they feel very competent. My instant reaction to something being by Meta Research is actually to think it's probably going to be interesting and good.
What are you talking about?
There is a Meta logo favicon, "Meta AI" appears in the header, and "Meta AI" is purposefully centered in the ABF text. Registering a new domain costs $10, compared to the massive pain of involving legal to get permission to put it under an existing domain. It's a new project, so why not make a clean start and just get a new website instead of going through the full FB/Meta approval process on branding.
I mentioned the logo. I didn't mention the text because perhaps they still want to score points for Meta, so hiding it entirely wouldn't make sense. But they avoid the larger immediate hangups of the big logo and the domain name.
On the one hand, sure. Facebook's brand is about as hip as a bag of Werther's Originals.
On the other hand, this is one of those things (like VR) that is a distinctly non-Facebook project. It makes no sense to position or market this as "Facebook" research. The Homepod isn't called the iPod Home for obvious reasons, so it stands to reason that Facebook execs realized selling someone a "Facebook Quest" sounds like a metaphor for ayahuasca. It's not entirely stupid to rebrand, especially considering how diverse (and undeniably advanced) they've become in fields like AI and VR.
Ever used React or PyTorch? Well, this is the same. Developers make good stuff regardless of where they work, and good on FB for contributing.
But yeah, if you do open source, adding an element of corporate branding is a sure way to kill the project. That's why it's not called "Apple Swift" or "Microsoft TypeScript".
Yeah, me too. I also avoid everything Apple and Google makes, but I'm not going to pretend like the Alphabet rebranding is their attempt at hiding who they are.
Alphabet wasn't a rebranding: the founding billionaires got bored of Google and wanted to take a few billion dollars per year out of it to create new toys without sharing them with Google.
See my other comment. Of course they needed to have it somewhere to score points. These probably weren't people who were about to quit, just people with a lowered perception of the brand compared to a company people are mostly proud to work at, like Google... https://news.ycombinator.com/edit?id=35458445
Beginning in August 2017, the Myanmar security forces undertook a brutal campaign of ethnic cleansing against Rohingya Muslims. This report is based on an in-depth investigation into Meta (formerly Facebook)’s role in the serious human rights violations perpetrated against the Rohingya. Meta’s algorithms proactively amplified and promoted content which incited violence, hatred, and discrimination against the Rohingya – pouring fuel on the fire of long-standing discrimination and substantially increasing the risk of an outbreak of mass violence. The report concludes that Meta substantially contributed to adverse human rights impacts suffered by the Rohingya and has a responsibility to provide survivors with an effective remedy.
The Magic Wand tool mostly just selects similarly colored pixels based on a simple algorithm. The Object Selection, Select Subject and Sky Replacement tools use AI detection and can be configured to run locally or on Adobe's GPUs. Having played with this demo, they seem in a similar league to me.
I think you haven't played around with it enough; you can prompt it to segment literally anything in an image. Not just regions of similar texture: it understands humans, dogs, cats, etc.
I have a small number of naive questions. I already have a fine-tuned tardigrade detection model that gives me tardigrade bounding boxes (data comes from labelled images on my microscope). I want to do masks as well.
Right now my home server with an RTX 2080 is able to do mask prediction in about 4 seconds per image (640x480); I'm running the sample script in "directory" mode.
I'd love to be able to get the first mask back in about 0.1 s, so I can do 10 FPS on the scope. Is there a practical way to speed things up (my guess would be buying the absolute fastest GPU I can afford)? I can run the object detection to get a reasonable seed location; is that what the paper means by prompting?
A 4s run time for object segmentation at 640x480 sounds like it's not using the GPU at all. Something like that should run on a VGA image in at most a few hundred ms.
For the second part of the question, a 2080 should get you close to 10FPS operation. For a ballpark estimate, using an off-the-shelf repo like Ultralytics's YOLOv5 lets you run object detection (not masking) at something like 100FPS. Masking should not add that much overhead.
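(By "off-the-shelf" I mean something like the torch.hub one-liner below; yolov5s is the small variant, and the exact FPS obviously depends on your hardware and image size.)

    import torch

    # Off-the-shelf YOLOv5 via torch.hub (downloads weights on first use).
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    results = model("frame.jpg")   # accepts paths, PIL images, or numpy arrays
    print(results.xyxy[0])         # detections as [x0, y0, x1, y1, conf, class]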
w.r.t. GPUs yes, these days more money equals more speed for GPU NN inference, though there are diminishing returns. A 3090 might get you the best bang for your buck these days while still having enough VRAM to run fancier models which may need more than the 12 GiB many other GPUs have.
Finally, I haven't read the paper too carefully but I believe that by prompting they mean that you have the option of describing in human language what you want the model to select, rather than the model being "hardwired" to do this. In other words, you could prompt the model to "segment the red car only" and it would do it, rather than just having the model blindly segment every object in the image, and then relying on custom scripting to potentially post-process these segments.
I'm using the first model on the SAM website (ViT-H SAM).
It's definitely using the GPU: I'm running nvidia-smi and I see near 100% utilization on the GPU while the CPU is using 1 core. If I run the script with --device=cpu then I see my server using 4 CPU cores and no GPU, and it takes tens of seconds per image.
I'm trying to check with people who have experience with this specific model.
I've checked on the repo and other folks report the same numbers as me; in fact my 2080 is just as fast as the 3090 (5 seconds per mask generation/image).
I have bounding box already, so I could prompt the model with that, but all of this runs counter to the published performance numbers.
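In case it helps anyone reproduce: this is roughly what I'm planning to try for box prompting, and it also splits out where the time goes, since set_image (the image encoder) should be the expensive one-off step and each box-prompted decode after it should be cheap. A sketch; the checkpoint path and the dummy frame/box are placeholders:

    import time
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
    predictor = SamPredictor(sam)

    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a microscope frame
    box = np.array([100, 100, 300, 300])             # stand-in for a detector bbox, XYXY

    t0 = time.time()
    predictor.set_image(frame)                       # runs the heavy image encoder once
    t1 = time.time()
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    t2 = time.time()
    print(f"encoder: {t1 - t0:.3f}s, box-prompted decode: {t2 - t1:.3f}s")

I suspect most of my 4-5 s comes from the automatic mask generator, which prompts with a dense grid of points per image by default, rather than from a single prompt.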
If you want to try it on one reach out to me (email in profile). We rent those out in the cloud. Would allow you to confirm performance before buying one for local use.
The RTX 2080 Ti 11GB model is a little more than 2 times slower than the flagship RTX (12.5 it/s vs 28 it/s) for torch/diffusion. Extrapolate from that what you will.
However, CLIP/BLIP and boxes should be much faster than 4 seconds, even on a 2080. I had a Python CLIP CSV-tagging script running in a directory with thousands of images and it was taking <=2 seconds per image on a GeForce GTX 1070 Ti with 8GB of memory, an old card without any tensor cores. CLIP is much slower than some other mechanisms; for instance, a deepbooru classifier is about 4x faster than CLIP/BLIP on my RTX 3060 12GB. CLIP of a random image around your dimensions takes ~4-5 seconds, and deepbooru takes about 1.5 seconds. Edit: the additional time is the overhead of the web UI, I am guessing.
What will probably have to happen is some sort of auto-crop that forces the model to view only a very tiny section of the image. You mentioned you already had a model; was it trained from scratch, or built on an existing model?
What model are you using? 4s/image seems extremely slow. I've been experimenting with Detectron2 and most of their models give me less than 1s on the CPU for instance segmentation on images 4x the size
It seems like the output of this model is masks, but for cropping you really need to be able to pull partial color out of certain pixels (for example, pulling a translucent object out from a colored background). I tried the demo, and it fails pretty miserably on a vase. Anyone know of a model that can do this well?
This is a segmentation model; it 'just' creates segmentation masks. You then take the mask and cut out whatever you need from the image itself. If you need to recover transparency, that is a whole other problem, probably a generative one (generate a new texture for the vase).
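Concretely, something like this, using the mask as an alpha channel (a minimal numpy/PIL sketch, no matting or feathering, so translucent objects like that vase will still look wrong):

    import numpy as np
    from PIL import Image

    def cutout(image_path: str, mask: np.ndarray) -> Image.Image:
        """Turn a boolean HxW mask into an RGBA cut-out of the original image."""
        rgb = np.array(Image.open(image_path).convert("RGB"))
        alpha = (mask.astype(np.uint8) * 255)[..., None]
        return Image.fromarray(np.concatenate([rgb, alpha], axis=-1), mode="RGBA")

    # e.g. with a mask dict from SamAutomaticMaskGenerator:
    # cutout("photo.jpg", sam_masks[0]["segmentation"]).save("cutout.png")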
The main issue I have with DIS is that creating the labels of my own dataset is super expensive (I think it might be easier to generate the training data using stable diffusion rather than human labelling)
It is related to subpixel labelling. When a line/curve in the foreground is smaller than a pixel you end up having to edit the mask one pixel at a time. The authors of DIS are working on a new dataset and model which should work for my use case.
BTW, I used DIS to create the labels of a batch of 20 images, I manually corrected the labels and used them to fine tune a new model. That worked well but still it took me several hours to edit labels.
I tried using stable diffusion generated labels several weeks ago but I think with controlnet and other advances I should try again.
(My dataset is about 100k images. I probably only need to label about 10k to fine tune DIS).
It would still be nice if iOS had some kind of interface like this where you can nudge it in the right direction if it's confusing something like a jacket and the background. iOS gives its best attempt which is usually pretty good, but if it didn't get it right you're basically SOL.
Computer vision seems to be gravitating heavily towards self-attention. While the results here are impressive, I'm not quite convinced that vision transformer encoders are the right way forward. I just can't wrap my head around how discretizing images, which are continuous in two dimensions, into patches is the optimal way to do visual recognition.
What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder stack? I feel like the results would be similar if not better.
EDIT: Clarifying that encoder/decoder refers to the transformer stack, not an autoencoder.
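To be concrete about what I mean by discretizing into patches: in most ViT implementations it's literally just a strided convolution that chops the image into non-overlapping 16x16 tokens before self-attention ever sees it. Minimal sketch:

    import torch
    import torch.nn as nn

    # ViT-style patch embedding: non-overlapping 16x16 patches -> token sequence.
    patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

    image = torch.randn(1, 3, 224, 224)
    tokens = patch_embed(image)                 # [1, 768, 14, 14]
    tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768], fed to the transformer
    print(tokens.shape)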
Google seems to be doing it all with transformers. It's not open source, though:
> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.
> What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder? I feel like the results would be similar if not better.
IMO optimal visual recognition should be sensorimotor-based and video-first. In the real world, action and perception are intertwined. Supervised training on static pixel arrays seems backward and primitive.
Yikes. I went to film school in the early 2000s and spent hours and hours on levels/HDR-based masking. I've used the Adobe tools recently and they're good... this is... yikes. I wonder how people in their mid-20s learning Photoshop today are going to deal with the jobs they graduate into.
Pretty cool, Runway has a similar green screening feature that can 1-click segment a subject from the background across an entire video: https://runwayml.com/ai-magic-tools/
That's amazing! This model is a huge opportunity to create annotated data (with decent quality) for just a few dollars. People will iterate more quickly with this kind of foundation model.
Impressive, but not really perfect, is it? In sa_10016721, it misses the obvious, and in sa_10020386, it misses most of the train in the center, and a bunch of the parked cars (pretty random). In sa_10179757, it labels 3 out of 4 letters of the shipping company's name (?), and a handful of windows, and while perhaps it sees the ship as one piece, the people in the foreground are split in many parts.
Kind of off topic, but: I've never seen such crappy issues filed in a repo [1].
I don't read issues for major repositories, so perhaps this is standard? There are a ton of one-line "issues" with no clear example, test case, or attempt to debug, and not even a pull request for the one that points out a typo in the README.
Gross. It seems like none of these issues will ever be read, because they are going to drown in garbage.
It’s interesting that (clearly visible) text parts that cannot be handled properly by most OCR approaches also get left out by SAM in auto-predictions.
Finally, I'll be able to fill line art with flat colors without fussing around with thresholds and painting in boundaries.
(It does have difficulty finding the smallest possible area, but it's a significant advance over most existing options since in my brief test, it can usually spot the entire silhouette of figures, which is where painting a boundary is most tedious).
What do you think Facebook's gameplan is here? Are they trying to commoditize AI by releasing this and Llama as a move against OpenAI, Microsoft, and Google? They had to have known the Llama weights would be leaked, and now they are releasing this.
I think cranking out open source projects like this raises Meta AI’s profile and helps them attract attention and people, and I don’t think selling AI qua AI is their business plan, selling services built on top is. And commoditized AI means that the AI vendors don’t get to rent-seek on people doing that, whereas narrowly controlled monopoly/oligopoly AI would mean that the AI vendors extract the value produced by downstream applications.
I've always half-believed that the relatively open approach to industry research in ML was a result of the inherent compute-based barrier to entry for productizing a lot of the insights. Collaborating on improving the architectural SoTA gets the handful of well-capitalized incumbents further ahead more quickly, and solidifies their ML moat before new entrants can compete.
Probably too cynical, but you can potentially view it as a weak form of collusion under the guise of open research.
This particular model has a very low barrier; it is smaller than Stable Diffusion, which already runs easily on consumer hardware for inference. Training is more resource-intensive (but not out of reach of consumers, whether through high-end consumer hardware or affordable cloud resources).
For competitive LLMs targeting text generation, especially for training, a compute-based barrier is more significant.
Yeah that’s fair. I intended my comment to be more of a reflection on the culture in general, but the motivations in this instance are probably different.
> Probably too cynical, but you can potentially view it as a weak form of collusion under the guise of open research.
I think that argument falters when the weights are released, which lowers the barrier by a lot, as training large models is much more expensive than inference. A weak form of collusion would be publishing papers that explain just enough for practitioners to fill in the gaps (so casuals are left out) while not publishing the weights, so that only other large companies can afford to implement and train their own versions of the models.
My own view is that open-publishing in AI is mostly bottom-up, and the executives tolerate open publishing for the reasons you gave.
Incidentally, most companies won't publish their crown jewels. The camera apps on Google and Apple phones have had great segmentation of the usual photography subjects, and they'd rather not publish those models. I'm not holding my breath for the video recommendation models from TikTok or Facebook either.
I think Meta's gameplan is complex. Inspiration as well as adoption, and not stepping on regulators' toes is probably another intention. Have a look at PyTorch, for example: a massively popular ML framework with lots of interesting projects running on it.
If Meta frequently shares their "algorithms", they take the blame out of using them. After all, who is to blame when everybody does "it" and you are very open about it?
Use cases, talent visibility, and talent attraction also play a role. After all, Google was so fancied in part due to its many open source projects. "Show, don't tell."
Well, there's some patent offense and defense in making and releasing research papers. There are some recruiting aspects to it. It's also a way to commoditize your complement, if you assume this sort of stuff brings AR and the metaverse closer to reach.
Their main use case for these models seems to be AR. Throwing it out in the open might help get external entities to build for them, attract talent, etc. Not sure they're that strategic, but it's my guess.
n=1 (as a mid-profile AI researcher), but for me it's working in terms of Meta gaining my respect by open sourcing (despite the licensing disasters). They clearly seem to be more committed to open source and getting things done now in general.
The demo is pretty cool but it looks like you can just select things and have it highlight them in blue - is there a way to remove objects from the image and have the background filled in behind them?
For some reason this tool makes a slightly smaller mask than is found in the original image. So when you copy the masked area back into Photoshop, it doesn't match. Almost there, but not quite.
The demo is running slow. Cutting out is an impressive ability. Am I to assume it also fills in the background? If so, that's next level. Maybe that Photoshop monthly subscription will be worth it (provided this sort of ability is going to be baked into Adobe's AI version soon).
It doesn't fill in the background and has nothing to do with Adobe.
You could bring the cut-outs back into Photoshop. I tried that, but this SAM tool reduces the size of the cut-outs slightly, so the cut-out won't match the original image dimensions.
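If the only problem is the dimensions, resizing the mask back up to the original size before compositing mostly works as a stopgap (quick PIL sketch; the filenames are placeholders, and nearest-neighbour keeps the mask hard-edged):

    from PIL import Image

    original = Image.open("original.jpg")
    mask = Image.open("sam_cutout_mask.png").convert("L")

    # Scale the mask back to the original dimensions before using it in Photoshop.
    mask.resize(original.size, resample=Image.NEAREST).save("mask_full_size.png")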
I know. Masking and selecting is something you do in Adobe products, and Adobe will be coming out with their own version of this (if I were a betting man).
[0] https://segment-anything.com/demo
[1] https://segment-anything.com/model/interactive_module_quanti...
[2] https://segment-anything.com/model/interactive_module_quanti...