THESIS
2021
1 online resource (x, 57 pages) : illustrations (chiefly color)
Abstract
We present a simple yet effective approach to modeling space-time correspondences in
the context of video object segmentation. Unlike most existing approaches, we establish
correspondences directly between frames without re-encoding the mask features for every object,
leading to an efficient and robust framework. With the correspondences, every node in the
current query frame is inferred by aggregating features from the past in an associative fashion.
We cast the aggregation process as a voting problem and find that the existing inner-product
affinity leads to poor use of memory with a small (fixed) subset of memory nodes dominating the
votes, regardless of the query. With our proposed negative squared Euclidean distance, every
memory node now has a chance to contribute, and such diversified voting is beneficial to both
memory efficiency and accuracy.
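The contrast between the two affinity functions can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation; the shapes and variable names are assumptions. Note that the query-norm term of the squared distance is constant per query and cancels under softmax, so the L2 affinity needs only one extra term over the inner product.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: C-dim keys for N memory nodes and M query nodes.
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 100))  # memory keys, C x N
Q = rng.standard_normal((64, 5))    # query keys,  C x M

# Inner-product affinity: votes weighted by dot-product similarity.
aff_dot = softmax(K.T @ Q, axis=0)  # N x M, columns sum to 1

# Negative squared Euclidean distance affinity:
# -||k - q||^2 = 2 k^T q - ||k||^2 - ||q||^2, and the ||q||^2 term is
# the same for every memory node, so it cancels inside the softmax.
key_norms = (K ** 2).sum(axis=0, keepdims=True).T  # N x 1
aff_l2 = softmax(2 * (K.T @ Q) - key_norms, axis=0)  # N x M
```

Under the inner product, memory nodes with large norms tend to dominate every column of the affinity matrix; subtracting the key norm removes that bias, so votes spread across more of the memory.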
Next, we present a novel modular paradigm to incorporate user interactions in the process by
decoupling interaction-to-mask and mask propagation, allowing for higher generalizability and
better performance. Trained separately, the interaction module converts user interactions to an
object mask, which is then temporally propagated by our propagation module. To effectively take
the user’s intent into account, a difference-aware fusion module is used to align target features
with space-time attention.
We also contribute a large-scale, pixel-accurate, and synthetic dataset BL30K which can be
used for pretraining for a further performance boost. The resultant model achieves state-of-the-art
results in both semi-supervised mask propagation and interactive video object segmentation
settings with a fast running time.