{"id":1071,"date":"2023-07-15T14:43:41","date_gmt":"2023-07-15T06:43:41","guid":{"rendered":"http:\/\/SmokeyDays.top\/wordpress\/?p=1071"},"modified":"2023-07-15T14:46:44","modified_gmt":"2023-07-15T06:46:44","slug":"segment-anything-model-%e9%98%85%e8%af%bb%e7%ac%94%e8%ae%b0","status":"publish","type":"post","link":"http:\/\/SmokeyDays.top\/wordpress\/2023\/07\/15\/segment-anything-model-%e9%98%85%e8%af%bb%e7%ac%94%e8%ae%b0\/","title":{"rendered":"Segment Anything Model (SAM) Reading Notes"},"content":{"rendered":"\n<h2
id=\"\u6a21\u578b\u67b6\u6784\">Model Architecture<\/h2>\n<p>The model consists of three parts: a heavyweight image encoder that embeds the input image, a lightweight prompt encoder, and a lightweight mask decoder.<\/p>\n<h3 id=\"image-encoder\">Image Encoder<\/h3>\n<h4 id=\"vision-transformervit\">Vision Transformer (ViT)<\/h4>\n<p>This part adopts the Transformer design that is standard in NLP.<\/p>\n<p>The input image first passes through the PatchEmbed module, which cuts it into 16 * 16 pixel patches, each embedded as a 768-dimensional vector.<\/p>\n<p>Then, if absolute positional embeddings (i.e. explicit position information) are enabled, they are simply summed onto the patch embeddings.<\/p>\n<p>Next the result passes through a stack of <code>depth<\/code> transformer blocks.
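The patch-embedding step just described can be sketched as follows (a minimal NumPy toy, not SAM's actual code: the random projection stands in for the learned patch-projection weights, and the positional table for the learned absolute positional embedding):

```python
import numpy as np

def patch_embed(img, patch=16, dim=768, seed=0):
    """Cut an (H, W, C) image into patch x patch tiles, project each tile
    to a dim-dimensional token, then add an absolute positional embedding."""
    rng = np.random.default_rng(seed)
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    # One flattened row per patch: shape (rows*cols, patch*patch*C).
    tiles = (img[:rows * patch, :cols * patch]
             .reshape(rows, patch, cols, patch, C)
             .swapaxes(1, 2)
             .reshape(rows * cols, patch * patch * C))
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02   # stand-in for learned weights
    pos = rng.standard_normal((rows * cols, dim)) * 0.02          # absolute positional embedding
    return tiles @ proj + pos

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

A strided convolution with kernel size and stride equal to the patch size computes the same thing in the real model; the matrix form above just makes the "cut, flatten, project, add position" pipeline explicit.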
Each block uses multi-head attention followed by an MLP with GeLU activation.\nWindow attention is used, meaning each self-attention step attends only to a local region; the paper uses 14 * 14 windows. Relative positional embeddings are also added.<\/p>\n<h4 id=\"reduce-channel-dimension\">Reduce Channel Dimension<\/h4>\n<p>Following <a href=\"https:\/\/arxiv.org\/abs\/2203.16527\">Exploring Plain Vision Transformer Backbones for Object Detection<\/a>, the output is compressed by a 1 * 1, 256-channel convolution followed by a 3 * 3, 256-channel convolution, with a LayerNorm after each.<\/p>\n<h3 id=\"prompt-encoder\">Prompt Encoder<\/h3>\n<p>The inputs fall into four kinds:<\/p>\n<ul>\n<li>A point: the positional encoding is summed with an embedding indicating whether the point is foreground or background.<\/li>\n<li>A box: two embeddings represent the top-left and bottom-right corners.<\/li>\n<li>A mask: likely used during training; it is fed directly into <code>dense_prompt_embedding<\/code>, with the sparse term zeroed.<\/li>\n<li>No input: a dedicated embedding meaning no prompt.<\/li>\n<\/ul>\n<p>Two outputs are produced: <code>sparse_prompt_embedding<\/code> and <code>dense_prompt_embedding<\/code>.<\/p>\n<h3 id=\"lightweight-mask-decoder\">Lightweight Mask Decoder<\/h3>\n<p>The decoder receives two parts:<\/p>\n<ul>\n<li>token: two embeddings, <code>iou_token<\/code>
and <code>mask_token<\/code>, concatenated with the <code>sparse_prompt_embedding<\/code> produced by the prompt encoder.<\/li>\n<li>src: the <code>image_embedding<\/code> and <code>dense_prompt_embedding<\/code> summed together.<\/li>\n<\/ul>\n<p>Next, src and pos_src (the fused image features and the image positional encoding, respectively) are fed into the <code>TwoWayTransformer<\/code>, which uses two cross-attention layers to model the interaction between the tokens and the image in both directions.<\/p>\n<p>The output is then upscaled and 4 mask tokens are produced; an IoU score is predicted at the same time (used to rank the results by quality).<\/p>\n<h3 id=\"ambiguity-aware\">Ambiguity-aware<\/h3>\n<p>The main problem here is that several valid masks could get averaged into one. The paper's solution is to predict and output three masks at once (representing the whole, a part, and a sub-part), predict an IoU score to rank them, and backpropagate only the loss of the best-quality mask.<\/p>\n<p>When multiple prompts are given, only a single mask is returned (several prompts are enough to pin down one valid output); to avoid clashing with the three ambiguity masks, 4 mask tokens are generated in total.<\/p>\n<h2
id=\"\u6570\u636e\">Data<\/h2>\n<p>The paper devotes considerable space to the diversity of the data sampling across ethnicity, country, and living environment.<\/p>\n<h3 id=\"\u6570\u636e\u96c6\u751f\u6210\">Dataset Generation<\/h3>\n<p>The raw images were obtained from a photography company.<\/p>\n<p>Collection proceeds in three stages:<\/p>\n<ul>\n<li>Assisted-manual stage: annotators label objects in a browser-based tool.<\/li>\n<li>Semi-auto stage: confident masks are labeled automatically, and annotators are asked to label the remaining objects.<\/li>\n<li>Fully-auto stage: a 32 * 32 grid of point prompts generates a set of masks, which are then filtered by predicted IoU.<\/li>\n<\/ul>\n<h3 id=\"\u4e00\u4e9b-trick\">Some Tricks<\/h3>\n<ul>\n<li>Only confident masks are kept, i.e. those with predicted IoU &gt; 88.0.<\/li>\n<li>Masks covering more than 95% of the image are removed to improve mask quality, and spurious holes and components smaller than 100 pixels are cleaned up.<\/li>\n<\/ul>\n<h2 id=\"\u8bad\u7ec3\">Training<\/h2>\n<h3 id=\"losses\">Losses<\/h3>\n<p>The mask is supervised with a 20:1 linear combination of focal loss and dice loss.\nThe IoU prediction is supervised with a mean-square-error loss, with a factor of 1.0.<\/p>\n<h3 id=\"training-algorithm\">Training Algorithm<\/h3>\n<p>With equal probability, pick a foreground point or a bounding
box, then apply a small perturbation.<\/p>\n<p>After that, new points sampled from the error region are added as prompts.<\/p>\n<p>The mask from the previous iteration is fed to the next iteration as a prompt (this is likely the role of the mask input in the prompt encoder above).<\/p>\n<p>Because the prompt encoder and mask decoder are cheap (less than 1% of the cost of the image encoder), multiple iterations are affordable; here the paper uses 1 initial prompt, 8 iterations with newly sampled points, and then 2 iterations with no extra information.<\/p>\n<h3 id=\"zero-shot-text-to-mask\">Zero-shot Text-to-Mask<\/h3>\n<blockquote>\n<p>The key observation here is that because CLIP\u2019s image embeddings are trained to align with its text embeddings, we can train with image embeddings, but use text embeddings for inference.
That is, at inference time we run text through CLIP\u2019s text encoder and then give the resulting text embedding as a prompt to SAM.<\/p>\n<\/blockquote>\n<p>Train with CLIP\u2019s image embeddings, run inference with its text embeddings. A profound idea.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Segment Anything Model (SAM) Reading Notes. Model Architecture: the model consists of three parts, a heavyweight &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/SmokeyDays.top\/wordpress\/2023\/07\/15\/segment-anything-model-%e9%98%85%e8%af%bb%e7%ac%94%e8%ae%b0\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\">\u201cSegment Anything Model Reading Notes\u201d<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[131,130,135],"tags":[],"_links":{"self":[{"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/posts\/1071"}],"collection":[{"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/comments?post=1071"}],"version-history":[{"count":3,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/posts\/1071\/revisions"}],"predecessor-version":[{"id":1076,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/posts\/1071\/revisions\/1076"}],"wp:attachment":[{"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/media?parent=1071"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json
\/wp\/v2\/categories?post=1071"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/SmokeyDays.top\/wordpress\/wp-json\/wp\/v2\/tags?post=1071"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}