
Multiple block diagrams, and the paper itself, note that one of the inputs is supposed to be "text", but none of the example Jupyter notebooks or the live demo page shows how to use it. I'm assuming you just run the text through CLIP, take the resulting embedding, and feed it directly in as a prompt, which then gets re-encoded by the SAM prompt encoder?

> "Prompt encoder. We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding."
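The sparse path in that quote (positional encoding summed with a learned per-type embedding) can be sketched in a few lines. This is a toy illustration, not SAM's actual code: the 256-d width, the random-Fourier stand-in for the positional encoding [95], and the randomly initialized "learned" embeddings are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 256  # assumed prompt-token width

# Random Fourier features as a stand-in for the positional encoding [95]
W = rng.normal(size=(2, EMBED_DIM // 2))

def positional_encoding(xy):
    """Map normalized (x, y) coordinates in [0, 1]^2 to a 256-d vector."""
    proj = 2 * np.pi * xy @ W
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Placeholder "learned embeddings for each prompt type":
# foreground point, background point, box top-left corner, box bottom-right
type_embed = rng.normal(size=(4, EMBED_DIM))

def encode_point(xy, is_foreground=True):
    """Sparse prompt token = positional encoding + prompt-type embedding."""
    return positional_encoding(xy) + type_embed[0 if is_foreground else 1]

token = encode_point(np.array([0.25, 0.5]))
print(token.shape)  # (256,)
```

A box would just contribute two such tokens, one per corner, each with its own type embedding.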

Edit: Found the answer myself: https://github.com/facebookresearch/segment-anything/issues/...
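For what it's worth, the guess above amounts to something like the following shape-level sketch. Everything here is hypothetical: the released SAM checkpoints don't ship a text head, so the projection matrix is made up, and the CLIP output is mocked rather than produced by an actual `encode_text` call.

```python
import numpy as np

rng = np.random.default_rng(1)
CLIP_DIM = 512    # CLIP ViT-B/32 text-embedding width
PROMPT_DIM = 256  # assumed SAM prompt-token width

# Stand-in for the CLIP text encoder's output (in practice you'd call
# clip.tokenize(...) + model.encode_text(...) from the openai/CLIP package).
text_embedding = rng.normal(size=(CLIP_DIM,))

# Hypothetical linear projection into the prompt-token space; no such
# weights are included in the public SAM release.
proj = rng.normal(size=(CLIP_DIM, PROMPT_DIM)) / np.sqrt(CLIP_DIM)

text_token = text_embedding @ proj
print(text_token.shape)  # (256,)
```

The resulting token would then be handed to the mask decoder alongside point/box tokens, which matches the "sparse (points, boxes, text)" grouping in the paper.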



