{"id":3392,"date":"2019-08-30T00:51:40","date_gmt":"2019-08-29T20:21:40","guid":{"rendered":"https:\/\/shahaab-co.ir\/mag\/?p=3392"},"modified":"2020-03-26T21:10:01","modified_gmt":"2020-03-26T16:40:01","slug":"instance-segmentation-with-mask-r-cnn-and-tensorflow","status":"publish","type":"post","link":"https:\/\/shahaab-co.com\/mag\/en-articles\/instance-segmentation-with-mask-r-cnn-and-tensorflow\/","title":{"rendered":"Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow"},"content":{"rendered":"<div id=\"fcd0\" class=\"kr kh az av au ep ks kt ku kv kw kx ky\">\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_76 counter-hierarchy ez-toc-counter-rtl ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">\u0622\u0646\u0686\u0647 \u062f\u0631 \u0627\u06cc\u0646 \u0645\u0637\u0644\u0628 \u062e\u0648\u0627\u0647\u06cc\u0645 \u062e\u0648\u0627\u0646\u062f :<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #0044bf;color:#0044bf\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #0044bf;color:#0044bf\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 eztoc-toggle-hide-by-default' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/shahaab-co.com\/mag\/en-articles\/instance-segmentation-with-mask-r-cnn-and-tensorflow\/#Explained_by_building_a_color_splash_filter\" >Explained by building a color splash filter<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/shahaab-co.com\/mag\/en-articles\/instance-segmentation-with-mask-r-cnn-and-tensorflow\/#Feature_Pyramid_Network\" >Feature Pyramid Network<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/shahaab-co.com\/mag\/en-articles\/instance-segmentation-with-mask-r-cnn-and-tensorflow\/#ROI_Pooling\" >ROI Pooling<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/shahaab-co.com\/mag\/en-articles\/instance-segmentation-with-mask-r-cnn-and-tensorflow\/#Verify_the_Dataset\" >Verify the Dataset<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"au ep kz la az\" dir=\"ltr\"><span class=\"ez-toc-section\" id=\"Explained_by_building_a_color_splash_filter\"><\/span>Explained by building a color splash filter<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<\/div>\n<div class=\"lb\" dir=\"ltr\">\n<div class=\"ag 
af\">\n<div><span class=\"au ep bm aw bh lh bg er li et cn\"><a class=\"br bs bt bu bv bw bx by bz ca jv cd ce cf cg\" href=\"https:\/\/engineering.matterport.com\/@waleedka?source=post_page-----7c761e238b46----------------------\" target=\"_blank\" rel=\"noopener\">Waleed Abdulla<\/a><\/span><\/div>\n<div>\n<p id=\"04b2\" class=\"lm ln cn av lo b lp lq lr ls lt lu lv lw lx ly lz\" data-selectable-paragraph=\"\">Back in November, we open-sourced our&nbsp;<a class=\"br dd ma mb mc md\" href=\"https:\/\/github.com\/matterport\/Mask_RCNN\" target=\"_blank\" rel=\"noopener noreferrer\">implementation of Mask R-CNN<\/a>, and since then it\u2019s been forked 1400 times, used in a lot of projects, and improved upon by many generous contributors. We received a lot of questions as well, so in this post I\u2019ll explain how the model works and show how to use it in a real application.<\/p>\n<p id=\"64b6\" class=\"lm ln cn av lo b lp lq lr ls lt lu lv lw lx ly lz\" data-selectable-paragraph=\"\">I\u2019ll cover two things: First, an overview of Mask RCNN. And, second, how to train a model from scratch and use it to build a smart color splash filter.<\/p>\n<blockquote class=\"me mf mg\">\n<p id=\"3f65\" class=\"lm ln cn mh lo b lp lq lr ls lt lu lv lw lx ly lz\" data-selectable-paragraph=\"\"><strong class=\"lo mi\">Code Tip:<\/strong><br>We\u2019re sharing the code&nbsp;<a class=\"br dd ma mb mc md\" href=\"https:\/\/github.com\/matterport\/Mask_RCNN\/tree\/master\/samples\/balloon\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. Including the dataset I built and the trained model. Follow along!<\/p>\n<\/blockquote>\n<h1 id=\"1d4c\" class=\"mj mk cn av au el ml mm mn mo mp mq mr ms mt mu mv\" data-selectable-paragraph=\"\">What is Instance Segmentation?<\/h1>\n<p id=\"4e7c\" class=\"lm ln cn av lo b lp mw lr mx lt my lv mz lx na lz\" data-selectable-paragraph=\"\">Instance segmentation is the task of identifying object outlines at the pixel level. Compared to similar computer vision tasks, it\u2019s one of the hardest possible vision tasks. 
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_-zw_Mh1e-8YncnokbAFWxg.png" alt="Classification, semantic segmentation, object detection, and instance segmentation compared"></figure>
<ul>
<li><strong>Classification:</strong> There is a balloon in this image.</li>
<li><strong>Semantic Segmentation:</strong> These are all the balloon pixels.</li>
<li><strong>Object Detection:</strong> There are 7 balloons in this image at these locations. We're starting to account for objects that overlap.</li>
<li><strong>Instance Segmentation:</strong> There are 7 balloons at these locations, and these are the pixels that belong to each one.</li>
</ul>
<h1>Mask R-CNN</h1>
<p>Mask R-CNN (regional convolutional neural network) is a two-stage framework: the first stage scans the image and generates <em>proposals</em> (areas likely to contain an object), and the second stage classifies the proposals and generates bounding boxes and masks.</p>
<p>It was introduced last year via the <a href="https://arxiv.org/abs/1703.06870">Mask R-CNN paper</a> to extend its predecessor, <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a>, by the same authors. 
Faster R-CNN is a popular framework for object detection, and Mask R-CNN extends it with instance segmentation, among other things.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_IWWOPIYLqqF9i_gXPmBk3g-1024x461.png" alt="Mask R-CNN framework"><figcaption>Mask R-CNN framework. Source: <a href="https://arxiv.org/abs/1703.06870">https://arxiv.org/abs/1703.06870</a></figcaption></figure>
<h1>1. Backbone</h1>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_IDjLXsSw5QMFWDudayIBfw.png" alt="Simplified illustration of the backbone network"><figcaption>Simplified illustration of the backbone network</figcaption></figure>
<p>This is a standard convolutional neural network (typically ResNet50 or ResNet101) that serves as a feature extractor. The early layers detect low-level features (edges and corners), and later layers successively detect higher-level features (car, person, sky).</p>
<p>Passing through the backbone network, the image is converted from 1024x1024px x 3 (RGB) to a feature map of shape 32x32x2048. This feature map becomes the input for the following stages.</p>
<blockquote><p><strong>Code Tip:</strong><br>The backbone is built in the function <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L171">resnet_graph()</a>. The code supports ResNet50 and ResNet101.</p></blockquote>
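<p>As a sanity check on those shapes, here is a minimal sketch (my illustration using the stock Keras ResNet50, not the repo's resnet_graph()) of a backbone collapsing a 1024&#215;1024 RGB image into a 32&#215;32&#215;2048 feature map:</p>
<pre>
# A shape sanity check, not the repo's resnet_graph(): the stock Keras
# ResNet50 as a feature extractor, downsampling 1024x1024x3 by 32x.
import numpy as np
import tensorflow as tf

backbone = tf.keras.applications.ResNet50(
    include_top=False,          # keep only the convolutional feature extractor
    weights=None,               # random weights suffice for a shape demo
    input_shape=(1024, 1024, 3))

image = np.zeros((1, 1024, 1024, 3), dtype=np.float32)
features = backbone.predict(image)
print(features.shape)           # (1, 32, 32, 2048)
</pre>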
<h2>Feature Pyramid Network</h2>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_1sCveJrqfthOQsGGZRs2tQ.png" alt="Feature Pyramid Network"><figcaption>Source: Feature Pyramid Networks paper</figcaption></figure>
<p>While the backbone described above works great, it can be improved upon. The <a href="https://arxiv.org/abs/1612.03144">Feature Pyramid Network (FPN)</a> was introduced by the same authors of Mask R-CNN as an extension that can better represent objects at multiple scales.</p>
<p>FPN improves the standard feature extraction pyramid by adding a second pyramid that takes the high-level features from the first pyramid and passes them down to lower layers. By doing so, it allows features at every level to have access to both lower- and higher-level features.</p>
<p>Our implementation of Mask R-CNN uses a ResNet101 + FPN backbone.</p>
<blockquote><p><strong>Code Tip:</strong><br>The FPN is created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L1840">MaskRCNN.build()</a>, in the section after building the ResNet. FPN introduces additional complexity: rather than the single backbone feature map of the standard backbone (i.e. the top layer of the first pyramid), in FPN there is a feature map at each level of the second pyramid. We pick which one to use dynamically depending on the size of the object. I'll continue to refer to the <strong>backbone feature map</strong> as if it's one feature map, but keep in mind that when using FPN, we're actually picking one out of several at runtime.</p></blockquote>
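<p>How that per-object choice works: here is a hedged sketch of the level-assignment heuristic from the FPN paper (the repo's PyramidROIAlign uses a variant of it); k0 = 4, the canonical 224 size, and the P2&#8211;P5 clamping come from the paper, not from this post:</p>
<pre>
# Sketch of the FPN-paper heuristic for picking a pyramid level per ROI:
# large boxes map to coarse levels (P5), small boxes to fine levels (P2).
import math

def fpn_level(box_width, box_height, k0=4, canonical=224, k_min=2, k_max=5):
    k = k0 + math.log2(math.sqrt(box_width * box_height) / canonical)
    return int(min(k_max, max(k_min, round(k))))

print(fpn_level(224, 224))   # 4 -> P4, the canonical ImageNet-sized box
print(fpn_level(64, 64))     # 2 -> P2, a small object uses a fine feature map
print(fpn_level(800, 800))   # 5 -> P5 (clamped), a large object uses a coarse map
</pre>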
<h1>2. Region Proposal Network (RPN)</h1>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_ESpJx0XLvyBa86TNo2BfLQ.png" alt="Simplified illustration showing 49 anchor boxes"><figcaption>Simplified illustration showing 49 anchor boxes</figcaption></figure>
<p>The RPN is a lightweight neural network that scans the image in a sliding-window fashion and finds areas that contain objects.</p>
<p>The regions that the RPN scans over are called <em>anchors</em>: boxes distributed over the image area, as shown on the left. This is a simplified view, though. In practice, there are about 200K anchors of different sizes and aspect ratios, and they overlap to cover as much of the image as possible.</p>
<p>How fast can the RPN scan that many anchors? Pretty fast, actually. The sliding window is handled by the convolutional nature of the RPN, which allows it to scan all regions in parallel (on a GPU). Further, the RPN doesn't scan over the image directly (even though we draw the anchors on the image for illustration). Instead, the RPN scans over the backbone feature map. This allows the RPN to reuse the extracted features efficiently and avoid duplicate calculations. With these optimizations, the RPN runs in about 10 ms, according to the <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN paper</a> that introduced it. In Mask R-CNN we typically use larger images and more anchors, so it might take a bit longer.</p>
<blockquote><p><strong>Code Tip:</strong><br>The RPN is created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L831">rpn_graph()</a>. Anchor scales and aspect ratios are controlled by RPN_ANCHOR_SCALES and RPN_ANCHOR_RATIOS in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/config.py">config.py</a>.</p></blockquote>
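<p>As a back-of-the-envelope check on the "about 200K" figure (my arithmetic, using the repo's default strides and ratios), counting one anchor set per feature-map cell over the five pyramid levels gives roughly 262K anchors for a 1024&#215;1024 input:</p>
<pre>
# Rough anchor count for a 1024x1024 input: one anchor per ratio per
# feature-map cell, summed over the five FPN levels.
feature_strides = [4, 8, 16, 32, 64]   # P2..P6, the repo's BACKBONE_STRIDES
ratios_per_cell = 3                    # RPN_ANCHOR_RATIOS = [0.5, 1, 2]
image_size = 1024

total = sum(ratios_per_cell * (image_size // s) ** 2 for s in feature_strides)
print(total)  # 261888
</pre>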
<p>The RPN generates two outputs for each anchor:</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_EMNE8bxOT4RI3HMjIqjCwQ.png" alt="Anchor refinement"><figcaption>3 anchor boxes (dotted) and the shift/scale applied to them to fit the object precisely (solid). Several anchors can map to the same object.</figcaption></figure>
<ol>
<li><strong>Anchor Class:</strong> One of two classes: foreground or background. The FG class implies that there is likely an object in that box.</li>
<li><strong>Bounding Box Refinement:</strong> A foreground anchor (also called a positive anchor) might not be centered perfectly over the object. So the RPN estimates a delta (% change in x, y, width, height) to refine the anchor box to fit the object better.</li>
</ol>
<p>Using the RPN predictions, we pick the top anchors that are likely to contain objects and refine their location and size. If several anchors overlap too much, we keep the one with the highest foreground score and discard the rest (a step referred to as non-max suppression). After that we have the final <em>proposals</em> (regions of interest) that we pass to the next stage.</p>
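<p>Here is a hedged sketch of those two steps, refinement then suppression. The (dy, dx, log(dh), log(dw)) delta parameterization follows the repo's apply_box_deltas_graph(), but the function below is my simplification:</p>
<pre>
# Sketch of anchor refinement + non-max suppression on two toy anchors.
import numpy as np
import tensorflow as tf

def apply_deltas(anchors, deltas):
    """anchors: [N, (y1, x1, y2, x2)], deltas: [N, (dy, dx, log(dh), log(dw))]."""
    h = anchors[:, 2] - anchors[:, 0]
    w = anchors[:, 3] - anchors[:, 1]
    cy = anchors[:, 0] + 0.5 * h
    cx = anchors[:, 1] + 0.5 * w
    # Shift the center and scale the size as predicted by the RPN.
    cy += deltas[:, 0] * h
    cx += deltas[:, 1] * w
    h *= np.exp(deltas[:, 2])
    w *= np.exp(deltas[:, 3])
    return np.stack([cy - 0.5 * h, cx - 0.5 * w,
                     cy + 0.5 * h, cx + 0.5 * w], axis=1)

anchors = np.array([[0, 0, 100, 100], [5, 5, 105, 105]], dtype=np.float32)
deltas = np.array([[0.1, 0.0, 0.0, 0.2], [0.1, 0.0, 0.0, 0.2]], dtype=np.float32)
scores = np.array([0.9, 0.8], dtype=np.float32)

boxes = apply_deltas(anchors, deltas)
keep = tf.image.non_max_suppression(boxes, scores, max_output_size=5,
                                    iou_threshold=0.7)
print(keep.numpy())  # [0]: the near-duplicate refined box is suppressed
</pre>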
<blockquote><p><strong>Code Tip:</strong><br>The <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L255">ProposalLayer</a> is a custom Keras layer that reads the output of the RPN, picks the top anchors, and applies bounding box refinement.</p></blockquote>
<h1>3. ROI Classifier &amp; Bounding Box Regressor</h1>
<p>This stage runs on the regions of interest (ROIs) proposed by the RPN. And just like the RPN, it generates two outputs for each ROI:</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_xQYuM_9mu5kt8nNN8Ms2TQ.png" alt="Illustration of stage 2"><figcaption>Illustration of stage 2. Source: Fast R-CNN (https://arxiv.org/abs/1504.08083)</figcaption></figure>
<ol>
<li><strong>Class:</strong> The class of the object in the ROI. Unlike the RPN, which has two classes (FG/BG), this network is deeper and has the capacity to classify regions into specific classes (person, car, chair, etc.). It can also generate a <em>background</em> class, which causes the ROI to be discarded.</li>
<li><strong>Bounding Box Refinement:</strong> Very similar to how it's done in the RPN, and its purpose is to further refine the location and size of the bounding box to encapsulate the object.</li>
</ol>
<blockquote><p><strong>Code Tip:</strong><br>The classifier and bounding box regressor are created in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L901">fpn_classifier_graph()</a>.</p></blockquote>
<h2>ROI Pooling</h2>
<p>There is a bit of a problem to solve before we continue. Classifiers don't handle variable input sizes very well. They typically require a fixed input size. But, due to the bounding box refinement step in the RPN, the ROI boxes can have different sizes. That's where ROI Pooling comes into play.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_bsT00ickNk7vaRJNrTvKPQ.png" alt="ROI pooling"><figcaption>The feature map here is from a low-level layer, for illustration, to make it easier to understand.</figcaption></figure>
<p>ROI pooling refers to cropping a part of a feature map and resizing it to a fixed size. It's similar in principle to cropping part of an image and then resizing it (though there are differences in implementation details).</p>
<p>The authors of Mask R-CNN suggest a method they named ROIAlign, in which they sample the feature map at different points and apply bilinear interpolation. In our implementation, we used TensorFlow's <a href="https://www.tensorflow.org/api_docs/python/tf/image/crop_and_resize">crop_and_resize</a> function for simplicity and because it's close enough for most purposes.</p>
<blockquote><p><strong>Code Tip:</strong><br>ROI pooling is implemented in the class <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L344">PyramidROIAlign</a>.</p></blockquote>
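<p>A minimal sketch of that pooling step with crop_and_resize (the feature-map sizes here are illustrative): every ROI, whatever its size, comes out as a fixed grid sampled bilinearly from the feature map.</p>
<pre>
# ROI pooling with TensorFlow's crop_and_resize: variable-size ROIs in,
# fixed 7x7 patches out, via bilinear sampling.
import tensorflow as tf

feature_map = tf.random.normal([1, 32, 32, 256])   # [batch, h, w, channels]
rois = tf.constant([[0.0, 0.0, 0.5, 0.5],          # normalized (y1, x1, y2, x2)
                    [0.25, 0.25, 1.0, 1.0]])
box_indices = tf.constant([0, 0])                  # both ROIs come from image 0

pooled = tf.image.crop_and_resize(feature_map, rois, box_indices,
                                  crop_size=[7, 7])
print(pooled.shape)  # (2, 7, 7, 256)
</pre>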
<h1>4. Segmentation Masks</h1>
<p>If you stop at the end of the last section, then you have a <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a> framework for object detection. The mask network is the addition that the Mask R-CNN paper introduced.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_l55WzUq1ZD2b5EGwW05LDA.png" alt="Mask branch"></figure>
<p>The mask branch is a convolutional network that takes the positive regions selected by the ROI classifier and generates masks for them. The generated masks are low resolution: 28&#215;28 pixels. But they are <em>soft</em> masks, represented by float numbers, so they hold more detail than binary masks. The small mask size helps keep the mask branch light. During training, we scale down the ground-truth masks to 28&#215;28 to compute the loss, and during inference we scale up the predicted masks to the size of the ROI bounding box. That gives us the final masks, one per object.</p>
<blockquote><p><strong>Code Tip:</strong><br>The mask branch is in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L957">build_fpn_mask_graph()</a>.</p></blockquote>
<h1>Let's Build a Color Splash Filter</h1>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/Color-Splash-Filter.gif" alt="Color Splash Filter"><figcaption>Sample generated by this project</figcaption></figure>
<p>Unlike most image editing apps that include this filter, ours will be a bit smarter: it finds the objects automatically. That becomes even more useful if you want to apply the filter to videos rather than to a single image.</p>
<h1>Training Dataset</h1>
<p>Typically, I'd start by searching for public datasets that contain the objects I need. But in this case, I wanted to document the full cycle and show how to build a dataset from scratch.</p>
<p>I searched for balloon images on Flickr, limiting the license type to "Commercial use &amp; mods allowed". This returned more than enough images for my needs. I picked a total of 75 images and divided them into a training set and a validation set. Finding images is easy. Annotating them is the hard part.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_Q4tCdhwrklvJLM9zn5aDhg.png" alt="Balloon images from Flickr"></figure>
<p>Wait! Don't we need, like, a million images to train a deep learning model? 
Sometimes you do, but often you don't. I'm relying on two main points to reduce my training requirements significantly:</p>
<p>First, <em>transfer learning</em>, which simply means that, instead of training a model from scratch, I start with a weights file that's been trained on the COCO dataset (we provide that in the GitHub repo). Although the COCO dataset does <strong>not</strong> contain a balloon class, it contains a lot of other images (~120K), so the trained weights have already learned a lot of the features common in natural images, which really helps. And, second, given the simple use case here, I'm not demanding high accuracy from this model, so the tiny dataset should suffice.</p>
<p>There are a lot of tools to annotate images. I ended up using <a href="http://www.robots.ox.ac.uk/~vgg/software/via/">VIA (VGG Image Annotator)</a> because of its simplicity. It's a single HTML file that you download and open in a browser. Annotating the first few images was very slow, but once I got used to the user interface, I was annotating at around an object a minute.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_6SICkQA-YCLp88A7GFM4Ag.png" alt="VGG Image Annotator"><figcaption>UI of the VGG Image Annotator tool</figcaption></figure>
<p>If you don't like the VIA tool, here is a list of the other tools I tested:</p>
<ul>
<li><a href="http://labelme2.csail.mit.edu/">LabelMe</a>: One of the best-known tools. The UI was a bit too slow, though, especially when zooming in on large images.</li>
<li><a href="https://rectlabel.com/">RectLabel</a>: Simple and easy to work with. Mac only.</li>
<li><a href="https://www.labelbox.io/">LabelBox</a>: Pretty good for larger labeling projects, with options for different types of labeling tasks.</li>
<li><a href="http://www.robots.ox.ac.uk/~vgg/software/via/">VGG Image Annotator (VIA)</a>: Fast, light, and really well designed. This is the one I ended up using.</li>
<li><a href="https://github.com/tylin/coco-ui">COCO UI</a>: The tool used to annotate the COCO dataset.</li>
</ul>
<h1>Loading the Dataset</h1>
<p>There isn't a universally accepted format for storing segmentation masks. Some datasets save them as PNG images, others store them as polygon points, and so on. To handle all these cases, our implementation provides a Dataset class that you inherit from and then override a few functions to read your data in whichever format it happens to be.</p>
<p>The VIA tool saves the annotations in a JSON file, and each mask is a set of polygon points. I didn't find documentation for the format, but it's pretty easy to figure out by looking at the generated JSON. I included comments in the code to explain how the parsing is done.</p>
<blockquote><p><strong>Code Tip:</strong><br>An easy way to write code for a new dataset is to copy <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/coco.py">coco.py</a> and modify it to your needs, which is what I did. 
I saved the new file as <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py">balloons.py</a>.</p></blockquote>
<p>My <em>BalloonDataset</em> class looks like this:</p>
<pre>
class BalloonDataset(utils.Dataset):

    def load_balloons(self, dataset_dir, subset):
        ...

    def load_mask(self, image_id):
        ...

    def image_reference(self, image_id):
        ...
</pre>
<p><em>load_balloons</em> reads the JSON file, extracts the annotations, and iteratively calls the internal <em>add_class</em> and <em>add_image</em> functions to build the dataset.</p>
<p><em>load_mask</em> generates bitmap masks for every object in the image by drawing the polygons.</p>
<p><em>image_reference</em> simply returns a string that identifies the image for debugging purposes. Here it simply returns the path of the image file.</p>
<p>You might have noticed that my class doesn't contain functions to load images or return bounding boxes. The default <em>load_image</em> function in the base <em>Dataset</em> class handles loading images, and bounding boxes are generated dynamically from the masks.</p>
<blockquote><p><strong>Code Tip:</strong><br>Your dataset might not be in JSON. My BalloonDataset class reads JSON because that's what the VIA tool generates. Don't convert your dataset to a format similar to COCO or the VIA format. Instead, write your own Dataset class to load whichever format your dataset comes in. See the <a href="https://github.com/matterport/Mask_RCNN/tree/master/samples">samples</a> and notice how each uses its own Dataset class.</p></blockquote>
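<p>To make load_mask concrete, here is a sketch of the polygon rasterization (my simplification; the all_points_x/all_points_y keys are what the VIA JSON stores for each region):</p>
<pre>
# Rasterize VIA polygon annotations into a bitmap mask of shape
# [height, width, num_instances], one channel per object.
import numpy as np
import skimage.draw

def polygons_to_mask(polygons, height, width):
    mask = np.zeros([height, width, len(polygons)], dtype=np.uint8)
    for i, p in enumerate(polygons):
        # Fill every pixel inside polygon i.
        rr, cc = skimage.draw.polygon(p["all_points_y"], p["all_points_x"])
        mask[rr, cc, i] = 1
    return mask

triangle = {"all_points_x": [10, 60, 10], "all_points_y": [10, 10, 60]}
mask = polygons_to_mask([triangle], height=100, width=100)
print(mask.shape, mask[:, :, 0].sum())  # (100, 100, 1) and the triangle's area
</pre>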
<h2>Verify the Dataset</h2>
<p>To verify that my new code is implemented correctly, I added this <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/inspect_balloon_data.ipynb">Jupyter notebook</a>. It loads the dataset, visualizes masks and bounding boxes, and visualizes the anchors to verify that my anchor sizes are a good fit for my object sizes. Here is an example of what you should expect to see:</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_OKE6wyZFfh2f_aZ3rd9BRw.png" alt="Sample from the inspect_balloon_data notebook"><figcaption>Sample from the inspect_balloon_data notebook</figcaption></figure>
<blockquote><p><strong>Code Tip:</strong><br>To create this notebook I copied <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_data.ipynb">inspect_data.ipynb</a>, which we wrote for the COCO dataset, and modified one block of code at the top to load the balloons dataset instead.</p></blockquote>
<h1>Configurations</h1>
<p>The configurations for this project are similar to the base configuration used to train the COCO dataset, so I just needed to override 3 values. 
As I did with the <em>Dataset</em> class, I inherit from the base <em>Config</em> class and add my overrides:</p>
<pre>
class BalloonConfig(Config):
    # Give the configuration a recognizable name
    NAME = "balloons"

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # Background + balloon

    # Number of training steps per epoch
    STEPS_PER_EPOCH = 100
</pre>
<p>The base configuration uses input images of size 1024&#215;1024 px for best accuracy. I kept it that way. My images are a bit smaller, but the model resizes them automatically.</p>
<blockquote><p><strong>Code Tip:</strong><br>The base Config class is in <a href="https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/config.py">config.py</a>, and BalloonConfig is in <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L61">balloons.py</a>.</p></blockquote>
<h1>Training</h1>
<p>Mask R-CNN is a fairly large model, especially since our implementation uses ResNet101 and FPN, so you need a modern GPU with 12GB of memory. It might work on less, but I haven't tried. I used <a href="https://aws.amazon.com/ec2/instance-types/p2/">Amazon's P2 instances</a> to train this model, and given the small dataset, training takes less than an hour.</p>
<p>Start the training with this command, running from the <code>balloon</code> directory. Here, we're specifying that training should start from the pre-trained COCO weights. The code will download the weights from our repository automatically:</p>
<pre>
python3 balloon.py train --dataset=/path/to/dataset --model=coco
</pre>
<p>And to resume training if it stopped:</p>
<pre>
python3 balloon.py train --dataset=/path/to/dataset --model=last
</pre>
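<p>Under the hood, the train command builds the model and fits only the head layers. Here is a sketch of that flow (the API calls follow the repo's samples, but the paths, epoch count, and the already-prepared dataset_train/dataset_val variables are illustrative):</p>
<pre>
# Sketch of what "balloon.py train" does internally.
from mrcnn import model as modellib

config = BalloonConfig()   # the subclass defined in the Configurations section
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Load COCO weights, skipping the head layers whose shapes depend on NUM_CLASSES.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# With COCO features already learned, training just the head layers is enough
# for a small dataset; the pretrained backbone stays frozen.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30, layers="heads")
</pre>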
<blockquote><p><strong>Code Tip:</strong><br>In addition to balloons.py, the repository has three more examples: <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/shapes/train_shapes.ipynb">train_shapes.ipynb</a>, which trains a toy model to detect geometric shapes; <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/coco.py">coco.py</a>, which trains on the COCO dataset; and <a href="https://github.com/matterport/Mask_RCNN/tree/master/samples/nucleus">nucleus</a>, which segments nuclei in microscopy images.</p></blockquote>
<h1>Inspecting the Results</h1>
<p>The <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/inspect_balloon_model.ipynb">inspect_balloon_model</a> notebook shows the results generated by the trained model. 
Check the notebook for more visualizations and a step-by-step walkthrough of the detection pipeline.</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_BvqnziHW514YyO20UNtS3g.png" alt="Detection results on balloon images"></figure>
<blockquote><p><strong>Code Tip:</strong><br>This notebook is a simplified version of <a href="https://github.com/matterport/Mask_RCNN/blob/master/samples/coco/inspect_model.ipynb">inspect_model.ipynb</a>, which includes visualizations and debugging code for the COCO dataset.</p></blockquote>
<h1>Color Splash</h1>
<p>Finally, now that we have object masks, let's use them to apply the color splash effect. The method is really simple: create a grayscale version of the image, and then, in areas marked by the object mask, copy back the color pixels from the original image. Here is an example:</p>
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_iPAtWFnShPhX5atbY3V0pQ.png" alt="Color splash example"></figure>
<blockquote><p><strong>Code Tip:</strong><br>The code that applies the effect is in the <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L201">color_splash()</a> function. And <a href="https://github.com/matterport/Mask_RCNN/blob/v2.1/samples/balloon/balloon.py#L221">detect_and_color_splash()</a> handles the whole process, from loading the image to running instance segmentation to applying the color splash filter.</p></blockquote>
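<p>A sketch of the effect itself, following the grayscale-then-copy-back description above (my condensed version of the balloon sample's color_splash()):</p>
<pre>
# Color splash: grayscale everywhere, original color wherever any
# instance mask is set.
import numpy as np
import skimage.color

def color_splash(image, mask):
    """image: [H, W, 3] uint8. mask: [H, W, N] bool, one channel per instance."""
    gray = skimage.color.gray2rgb(skimage.color.rgb2gray(image)) * 255
    if mask.shape[-1] > 0:
        keep_color = np.any(mask, axis=-1, keepdims=True)  # collapse instances
        return np.where(keep_color, image, gray).astype(np.uint8)
    return gray.astype(np.uint8)

image = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64, 1), dtype=bool)
mask[16:48, 16:48, 0] = True
print(color_splash(image, mask).shape)  # (64, 64, 3)
</pre>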
<figure><img src="https://shahaab-co.ir/mag/wp-content/uploads/2019/08/1_w_ownWZZ38QhiVjVU757DA.png" alt="Color splash applied to balloon images"></figure>
<p><a href="https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46">Source: Matterport</a></p>