Segmentation of Images using Deep Learning


In computer vision, image segmentation is the process of dividing an image into parts and extracting the regions of interest. More precisely, image segmentation assigns a label to every pixel in an image such that pixels with similar characteristics share the same label. Historically, most problems in computer vision were tackled by using image processing techniques for segmentation, followed by a machine learning technique for labeling the segments. However, after ImageNet 2012, deep learning became the go-to method for most computer vision tasks.
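
To make the definition concrete, here is a minimal sketch of segmentation as per-pixel labeling. The image is a made-up 3×3 grayscale array, and the segmentation is done by simple intensity thresholding, one of the classic image-processing techniques mentioned above:

```python
import numpy as np

# A tiny made-up grayscale image (values in [0, 1]).
image = np.array([
    [0.1, 0.2, 0.9],
    [0.1, 0.8, 0.9],
    [0.1, 0.1, 0.2],
])

# Segmentation as per-pixel labeling: every pixel gets a label,
# here 1 for "bright region" and 0 for background.
labels = (image > 0.5).astype(np.int64)
print(labels.tolist())  # → [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
```

Pixels with similar characteristics (here, similar intensity) end up with the same label, which is exactly the definition above.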


ImageNet 2012 was a major turning point in the history of computer vision. The comprehensive victory of the team led by Alex Krizhevsky established deep learning as the gold-standard technique for image classification. The following years showed significant improvements in classification accuracy, from 84.7% in 2012 to 96.92% in 2016. Alongside the semantic classification of whole images, progress was also being made on bounding box detection and other localization tasks. The next logical step was to label each pixel in an image.


The initial experiments in deep learning based image segmentation used a sliding window setup: a patch is taken around each pixel to provide local context, and the network predicts the label of the center pixel. One benefit of this idea is that the number of training samples (one patch per pixel) is much larger than the number of training images, which helps in domains like medical segmentation where labeled data is scarce. However, the method has two drawbacks: 1) the network has to be invoked separately for every pixel, and 2) there is a lot of redundancy between overlapping patches. Additionally, choosing the optimal patch size is difficult. A larger patch gives more contextual information, but it also requires more max-pooling layers, which lowers localization accuracy. Both of these problems were solved by fully convolutional networks such as FCN [2] and U-Net [1].
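
The sliding-window setup can be sketched as follows. `extract_patches` is a hypothetical helper written for illustration (it does not come from either cited paper), and reflect-padding at the image borders is just one of several possible choices:

```python
import numpy as np

def extract_patches(image, patch_size):
    """Yield one (patch, pixel_position) pair per pixel of `image`.

    Each patch is the square neighborhood centered on the pixel, which
    provides the local context; borders are handled by reflect-padding.
    The label of the center pixel would be the training target.
    """
    pad = patch_size // 2
    padded = np.pad(image, pad, mode="reflect")
    height, width = image.shape
    for i in range(height):
        for j in range(width):
            yield padded[i:i + patch_size, j:j + patch_size], (i, j)

image = np.arange(16, dtype=float).reshape(4, 4)
patches = list(extract_patches(image, patch_size=3))

# Drawback 1: one network invocation per pixel (16 patches for a 4x4 image).
print(len(patches))  # → 16
# Drawback 2: neighboring patches share most of their pixels, which is
# the redundancy mentioned above.
```

Doubling the patch size quadruples the context fed to the network, which is why deeper pooling stacks (and thus coarser localization) become necessary as patches grow.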


The main ideas behind these networks are:

  1. The later layers have features which explain what the image contains. However, for exact localization of objects of interest we also need high-resolution features. Therefore, skip connections are used to combine the high-resolution information from the earlier layers with the low-resolution information from the deeper layers. (The U-Net architecture can be seen in Figure 1.)
  2. Using transposed convolution (deconvolution) layers for upsampling.
  3. U-Net also introduced a modified cost function which gives higher weight to errors at segment borders. This leads to models with more precise segmentations.
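
The border-weighted loss in point 3 can be sketched in numpy as a pixel-wise binary cross-entropy scaled by a per-pixel weight map. This is a simplified illustration: the actual U-Net weight map also includes class balancing and a distance-to-border term (see [1]), and the arrays below are made up:

```python
import numpy as np

def weighted_pixel_cross_entropy(probs, targets, weight_map):
    """Binary cross-entropy averaged over pixels, weighted per pixel."""
    eps = 1e-7  # avoid log(0)
    probs = np.clip(probs, eps, 1 - eps)
    ce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float(np.sum(weight_map * ce) / np.sum(weight_map))

# Predicted foreground probabilities and ground-truth labels for a 2x2 image.
probs = np.array([[0.9, 0.6],
                  [0.4, 0.1]])
targets = np.array([[1.0, 1.0],
                    [0.0, 0.0]])

# The two center pixels sit on the object border, so they get double
# weight; errors there now cost more than errors elsewhere.
border_weights = np.array([[1.0, 2.0],
                           [2.0, 1.0]])
uniform = np.ones_like(border_weights)

loss_weighted = weighted_pixel_cross_entropy(probs, targets, border_weights)
loss_uniform = weighted_pixel_cross_entropy(probs, targets, uniform)
# The uncertain border predictions (0.6 and 0.4) dominate the weighted
# loss, so loss_weighted > loss_uniform.
```

During training, the higher penalty pushes the network to be confident exactly where adjacent objects touch, which is what produces the sharper borders.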

Fig 1) U-Net architecture. The arrows show the skip connections between the layers. The number of feature maps is shown on top of the boxes.

Fig 2) Performance of U-Net vs. an image-processing technique on the task of segmenting bright lesions in a fundus photograph. The image on the right is the U-Net prediction; compared to the image on the left, it has significantly higher sensitivity and specificity.

Like AlexNet in classification, both FCN and U-Net were significant milestones in the history of image segmentation. U-Net won the ISBI cell tracking challenge in 2015 with an IoU score 50% higher than that of the second-placed algorithm. Given how deep learning has grown in stature, it is easy to assume that image processing techniques no longer have a role to play in image segmentation tasks. (Figure 2 compares the performance of U-Net and an image processing based technique at identifying bright lesions in a fundus photograph.) However, deep learning still requires a significant amount of labeled data, which is not available in many domains. So at least in the near future, image processing techniques will continue to play a part in many computer vision tasks.



[1] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer International Publishing, 2015.

[2] Shelhamer, Evan, Jonathon Long, and Trevor Darrell. “Fully convolutional networks for semantic segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).