We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., “grass” or “building”) together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space.
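The core mechanism described above — scoring each pixel embedding against the text embedding of every candidate label, then training with a contrastive (per-pixel softmax cross-entropy) objective — can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes, not the authors' implementation; the function names (`lseg_logits`, `predict`, `contrastive_loss`) and the temperature value are hypothetical.

```python
import numpy as np

def lseg_logits(pixel_emb, text_emb, temperature=0.07):
    """Cosine similarity between each pixel and each label embedding.

    pixel_emb: (H, W, D) dense per-pixel image embeddings.
    text_emb:  (K, D) one embedding per descriptive label.
    Returns (H, W, K) similarity logits, scaled by a temperature
    (the 0.07 default is an assumption, not from the paper excerpt).
    """
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return np.einsum("hwd,kd->hwk", p, t) / temperature

def predict(pixel_emb, text_emb):
    """Assign each pixel the label whose text embedding it matches best."""
    return lseg_logits(pixel_emb, text_emb).argmax(axis=-1)

def contrastive_loss(pixel_emb, text_emb, labels, temperature=0.07):
    """Per-pixel softmax cross-entropy against the ground-truth class,
    pulling each pixel embedding toward its label's text embedding."""
    logits = lseg_logits(pixel_emb, text_emb, temperature)  # (H, W, K)
    logits = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w, _ = logits.shape
    # Pick the log-probability of the correct class at every pixel.
    return -log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels].mean()
```

Because labels enter only through their text embeddings, the label set can be changed at test time without retraining the image encoder — semantically similar words land near each other in the shared embedding space, so nearby labels select similar regions.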
2022: Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen Koltun, René Ranftl
Ranked #2 on Few-Shot Semantic Segmentation on FSS-1000
https://arxiv.org/pdf/2201.03546v1.pdf