Deep convolutional networks (CNNs) have shown great success in various computer vision
tasks. However, improving the accuracy-speed tradeoff remains challenging. In this thesis, we
divide CNNs into two categories, static CNNs and dynamic CNNs, according to whether the
CNN architecture is conditioned on the input image. Specifically, we have two goals: one is
to improve the efficiency of static CNNs, and the other is to explore more effective dynamic
neural architectures.
For static CNNs, to improve the efficiency, we investigate the efficient CNN design guidelines
(e.g., ShuffleNetV2). Currently, the neural network architecture design is mostly guided
by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g.,
speed, also depends on other factors such as memory access cost and platform characteristics.
Thus, we propose to evaluate the direct metric on the target platform, beyond only considering
FLOPs. Based on a series of controlled experiments, this work derives several practical
guidelines for efficient network design. Accordingly, a new architecture is presented, called
ShuffleNet V2. Comprehensive ablation experiments verify that our model achieves a state-of-the-art
speed-accuracy tradeoff.
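To make the direct metric concrete, the following is a minimal PyTorch sketch of how one might time a network on the target platform instead of relying on FLOPs alone; the function name measure_latency, the warmup/run counts, and the input size are illustrative assumptions, not the exact protocol used in the thesis.

```python
import time
import torch

def measure_latency(model, input_size=(1, 3, 224, 224),
                    warmup=10, runs=50, device="cpu"):
    """Measure the direct metric (wall-clock latency per run) on the
    target platform, rather than the indirect metric (FLOPs)."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up caches and allocator
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()     # flush queued GPU kernels
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0  # ms per run
```

Two models with identical FLOPs can differ substantially under this measurement, since memory access cost and platform characteristics are invisible to the FLOPs count.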
For dynamic CNNs, we present three simple, efficient, and effective methods. First, we
present WeightNet that decouples the convolutional kernels and the convolutional computation.
This differs from the common practice in which all input samples share the same convolutional
kernel: there, the kernels are directly learnable parameters, whereas in our case
the kernels are generated by an additional simple network made of fully-connected layers. Our
approach is general in that it unifies two currently distinct and highly effective methods, SENet and CondConv, within the same framework in weight space. We use WeightNet, composed entirely
of (grouped) fully-connected layers, to directly output the convolutional weight. This simple
change has a large impact: it provides a meta-network design space, improves accuracy significantly,
and achieves state-of-the-art Accuracy-FLOPs and Accuracy-Parameter trade-offs.
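As a rough illustration of this decoupling, the sketch below generates per-sample convolution kernels with a small fully-connected network and applies them via the grouped-convolution trick. It uses plain FC layers and an arbitrary reduction factor in place of the grouped FC layers of the actual WeightNet, so it should be read as a conceptual sketch under those assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightNetConv(nn.Module):
    """Sketch: the convolution weight is not a shared learnable tensor
    but the output of a small FC network conditioned on each sample."""
    def __init__(self, in_ch, out_ch, kernel_size=3, reduction=16):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        hidden = max(in_ch // reduction, 4)
        # The weight-generating network: pooled features -> kernels.
        self.fc1 = nn.Linear(in_ch, hidden)
        self.fc2 = nn.Linear(hidden, out_ch * in_ch * kernel_size ** 2)

    def forward(self, x):
        b, c, h, w = x.shape
        pooled = x.mean(dim=(2, 3))                         # global average pool
        weight = self.fc2(torch.sigmoid(self.fc1(pooled)))  # per-sample kernels
        weight = weight.view(b * self.out_ch, self.in_ch, self.k, self.k)
        # Fold the batch into channels so that, with groups=b, each
        # sample is convolved with its own generated kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w), weight,
                       padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, h, w)
```

Because the kernels depend on the input, channel reweighting in the style of SENet and expert mixing in the style of CondConv both become special cases of what the FC layers compute, which is the sense in which they share one weight-space framework.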
Next, we present a new visual activation we call the funnel activation, which performs the nonlinear
transformation while simultaneously capturing spatial dependencies. Our method extends
ReLU by adding a spatial condition with negligible overhead to replace the hand-designed
zero in ReLU's max(x, 0), which helps capture complicated visual layouts with regular convolutions. Although
it seems a minor change, it has a large impact: it brings great improvements in many visual
recognition tasks and even outperforms the more complicated DeformableConv and SENet.
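A minimal PyTorch sketch of this activation follows: the zero in max(x, 0) is replaced by a cheap per-channel (depthwise) spatial condition T(x). The 3x3 depthwise convolution followed by batch normalization matches the common formulation of the funnel activation, but the exact configuration here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel activation: y = max(x, T(x)), where T is a cheap
    depthwise spatial condition replacing the zero in ReLU."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Depthwise convolution implements the spatial condition T(x).
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2,
                                 groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.max(x, self.bn(self.spatial(x)))
```

Dropping this in place of ReLU leaves the rest of the network unchanged, which is why it composes with regular convolutions rather than requiring specialized operators such as deformable convolution.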
Third, we present a simple, effective, and general activation function we term ACON, which
learns to activate the neurons or not. Interestingly, we find that Swish, the recently popular NAS-searched
activation, can be interpreted as a smooth approximation to ReLU. Intuitively, we
approximate the more general Maxout family in the same way, yielding our novel ACON family, which
remarkably improves performance and makes Swish a special case of ACON. Next, we
present meta-ACON, which explicitly learns the parameter that switches between the nonlinear
(activated) and linear (inactivated) regimes, and provides a new design space. By simply changing the
activation function, we show its effectiveness on both small models and highly optimized large
models (e.g. it improves the ImageNet top-1 accuracy rate by 6.7% and 1.8% on MobileNet-0.25 and ResNet-152, respectively). Moreover, our novel ACON can be naturally transferred to
object detection and semantic segmentation, showing that ACON is an effective alternative in a
variety of tasks.
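To make the ACON family concrete, the sketch below implements the ACON-C form (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x with per-channel learnable parameters; the initialization choices are assumptions for illustration. Setting p1 = 1 and p2 = 0 recovers Swish, beta -> 0 makes the unit linear (inactivated), and large beta makes it approach max(p1*x, p2*x), which is the switch that meta-ACON learns explicitly from the input.

```python
import torch
import torch.nn as nn

class AconC(nn.Module):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    beta switches between nonlinear (large beta) and linear (beta -> 0);
    p1 = 1, p2 = 0 recovers Swish."""
    def __init__(self, channels):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x
```

In meta-ACON, beta is no longer a free parameter but is produced per channel by a small network over the input (e.g., pooled features passed through FC layers), which is what lets the model decide, sample by sample, whether each neuron should activate.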