THESIS
2022
1 online resource (xiv, 133 pages) : illustrations (some color)
Abstract
Deep Learning has emerged as a milestone in the machine learning community due to its remarkable performance on a variety of tasks, such as computer vision and natural language processing. It has been demonstrated that the architecture of a neural network significantly influences its performance, and it is therefore important to determine the architecture carefully. Typically, methods for neural architecture design fall into two categories. The first designs neural architectures with search methods, which aim to discover promising architectures automatically; for example, the NASNet architecture is found within a defined search space using a reinforcement learning algorithm. The second designs neural architectures manually based on domain knowledge; most practical architectures, such as ResNet and the Transformer, are proposed based on prior knowledge. In this thesis, we provide a comprehensive discussion of neural architecture design from these two perspectives.
Firstly, we introduce a neural architecture search algorithm based on Bayesian optimization, named BONAS. In the search phase, a GCN embedding extractor and a Bayesian sigmoid regressor constitute the surrogate model for Bayesian optimization, and candidate architectures in the search space are selected according to the acquisition function. In the query phase, we merge the selected candidates into a super network and evaluate each architecture via a weight-sharing mechanism. The proposed BONAS discovers strong architectures while balancing exploitation and exploration.
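For illustration only, the following PyTorch sketch shows one way such a surrogate might be assembled (all names and hyperparameters here are assumptions, not the thesis implementation): a small GCN embeds a cell DAG given its normalized adjacency and one-hot operations, a Bayesian linear head with a sigmoid link predicts accuracy in logit space, and an upper-confidence-bound acquisition scores candidates.

import torch
import torch.nn as nn

class GCNEmbedder(nn.Module):
    """Two-layer GCN that embeds a cell DAG (normalized adjacency + one-hot ops)."""
    def __init__(self, num_ops, hidden=64):
        super().__init__()
        self.lin1 = nn.Linear(num_ops, hidden)
        self.lin2 = nn.Linear(hidden, hidden)

    def forward(self, adj, ops):
        # adj: (N, N) normalized adjacency; ops: (N, num_ops) one-hot operation matrix
        h = torch.relu(adj @ self.lin1(ops))
        h = torch.relu(adj @ self.lin2(h))
        return h.mean(dim=0)  # graph-level embedding

class BayesianSigmoidRegressor:
    """Bayesian linear regression on embeddings with a sigmoid link for accuracy."""
    def __init__(self, dim, noise=0.1, prior=1.0):
        self.dim, self.noise, self.prior = dim, noise, prior
        self.mean = torch.zeros(dim)
        self.cov = prior * torch.eye(dim)

    def fit(self, X, y):
        # Closed-form posterior over the linear weights on logit-transformed accuracies.
        y_logit = torch.logit(y.clamp(1e-4, 1 - 1e-4))
        precision = torch.eye(self.dim) / self.prior + X.T @ X / self.noise
        self.cov = torch.linalg.inv(precision)
        self.mean = self.cov @ X.T @ y_logit / self.noise

    def acquisition(self, x, beta=1.0):
        # Upper confidence bound in logit space, mapped back through the sigmoid.
        mu = x @ self.mean
        var = x @ self.cov @ x
        return torch.sigmoid(mu + beta * var.sqrt())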
Secondly, we focus on the self-attention module in the Transformer and propose a differentiable architecture search method to find important attention patterns. In contrast to prior work, we find that the diagonal elements in the attention map can be dropped without harming performance. To explain this observation, we provide a theoretical proof from the perspective of universal approximation. Furthermore, we obtain a series of attention masks for efficient architecture design with the proposed search method.
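A minimal sketch of this kind of differentiable mask search, under assumed names and a sigmoid relaxation (not necessarily the thesis formulation): every position in the attention map receives a learnable gate that is relaxed during search and thresholded afterwards, so uninformative connections such as the diagonal can be dropped.

import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Self-attention whose connections are gated by learnable mask logits."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.mask_logits = nn.Parameter(torch.zeros(seq_len, seq_len))  # architecture parameters
        self.scale = dim ** -0.5

    def forward(self, x, hard=False):
        # x: (batch, seq_len, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        gate = torch.sigmoid(self.mask_logits)
        if hard:
            # After search: keep only confident connections (the diagonal may be dropped).
            scores = scores.masked_fill(gate <= 0.5, float('-inf'))
        else:
            # During search: soft multiplicative mask, softmax(scores + log gate).
            scores = scores + torch.log(gate + 1e-9)
        return torch.softmax(scores, dim=-1) @ v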
Thirdly, we attempt to understand the feed-forward module in the Transformer within a unified framework. Specifically, we introduce the concept of memory tokens and establish the relationship between the feed-forward module and self-attention. Moreover, we propose a novel architecture named uni-attention, which contains all four types of attention connections in our framework. Uni-attention outperforms previous baselines given the same number of memory tokens.
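The memory-token view can be sketched roughly as follows (an assumed formulation for illustration, with hypothetical names): the feed-forward computation is read as attention from input tokens to a fixed set of learned memory slots, with one parameter matrix acting as keys and the other as values.

import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Input tokens attend to a fixed set of learned memory slots."""
    def __init__(self, dim, num_memory_tokens):
        super().__init__()
        self.mem_keys = nn.Parameter(torch.randn(num_memory_tokens, dim) * dim ** -0.5)
        self.mem_values = nn.Parameter(torch.randn(num_memory_tokens, dim) * dim ** -0.5)

    def forward(self, x):
        # x: (batch, seq, dim)
        scores = x @ self.mem_keys.T             # (batch, seq, num_memory_tokens)
        weights = torch.softmax(scores, dim=-1)  # analogue of the FFN activation
        return weights @ self.mem_values         # (batch, seq, dim)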
Finally, we investigate the over-smoothing phenomenon in the full Transformer architecture. We provide a theoretical analysis that relates self-attention to the graph domain. Specifically, we find that layer normalization plays an important role in the over-smoothing problem, and we verify this empirically. To alleviate this issue, we propose hierarchical fusion architectures so that the output representations are more diverse.
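As a rough illustration under assumed definitions (not the thesis code), the snippet below combines a simple token-similarity probe for over-smoothing with a hierarchical fusion head that mixes intermediate layer outputs using learned weights, keeping the final representation more diverse.

import torch
import torch.nn as nn
import torch.nn.functional as F

def token_similarity(h):
    # h: (seq, dim); mean pairwise cosine similarity -- values near 1 indicate over-smoothing.
    h = F.normalize(h, dim=-1)
    sim = h @ h.T
    n = h.size(0)
    return (sim.sum() - n) / (n * (n - 1))

class HierarchicalFusion(nn.Module):
    """Fuses intermediate layer outputs with learned softmax weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, dim) tensors, one per Transformer layer.
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * hi for wi, hi in zip(w, layer_outputs))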