In this paper, we study a challenging but less-touched problem in cross-modal retrieval, i.e., partially mismatched pairs (PMPs). Specifically, in real-world scenarios, massive multimedia data (e.g., the Conceptual Captions dataset) are collected from the Internet, and thus it is inevitable to wrongly treat some irrelevant cross-modal pairs as matched. Undoubtedly, such a PMP problem will remarkably degrade the cross-modal retrieval performance. To tackle this problem, we derive a unified theoretical Robust Cross-modal Learning framework (RCL) with an unbiased estimator of the cross-modal retrieval risk, which aims to endow cross-modal retrieval methods with robustness against PMPs. In detail, our RCL adopts a novel complementary contrastive learning paradigm to address two challenges, i.e., overfitting and underfitting. On the one hand, our method utilizes only the negative information, which is much less likely to be false than the positive information, thus avoiding overfitting to PMPs. However, such a robust strategy could induce underfitting, making the model harder to train. On the other hand, to address the underfitting issue brought by weak supervision, we propose to leverage all available negative pairs to enhance the supervision contained in the negative information. Moreover, to further improve performance, we propose to minimize the upper bounds of the risk so as to pay more attention to hard samples. To verify the effectiveness and robustness of the proposed method, we carry out comprehensive experiments on five widely-used benchmark datasets, comparing with nine state-of-the-art approaches on the image-text and video-text retrieval tasks. The code is available at https://github.com/penghu-cs/RCL.
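To make the complementary contrastive idea concrete, below is a minimal PyTorch sketch of a loss that learns only from in-batch negatives; the function name, the temperature, and the specific -log(1 - p) form are illustrative assumptions, not the paper's exact unbiased estimator.

```python
import torch
import torch.nn.functional as F

def complementary_contrastive_loss(sim, tau=0.1):
    """Learn only from in-batch negatives (a sketch, not the paper's estimator).

    sim: (B, B) image-text similarity matrix whose diagonal holds the
    (possibly mismatched) "positive" pairs. Rather than pulling the diagonal
    together, which overfits PMPs, push every off-diagonal pair apart; all
    B*(B-1) negatives contribute, which counters the weak supervision.
    """
    p = F.softmax(sim / tau, dim=1)                       # row-wise matching probabilities
    neg = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return -torch.log1p(-p[neg].clamp(max=1 - 1e-6)).mean()  # -log(1 - p) on negatives
```

Driving the off-diagonal matching probabilities to zero implicitly raises the diagonal ones, so the model still learns to match without ever trusting a possibly false positive pair.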
2022
AAAI
Robust Domain Adaptation for Machine Reading Comprehension
Liang Jiang, Zhenyu Huang, Jia Liu, and 2 more authors
Most domain adaptation methods for machine reading comprehension (MRC) use a pre-trained question-answer (QA) construction model to generate pseudo QA pairs for MRC transfer. Such a process will inevitably introduce mismatched pairs (i.e., noisy correspondence) due to i) the unavailability of QA pairs in target documents, and ii) the domain shift when applying the QA construction model to the target domain. Undoubtedly, the noisy correspondence will degrade the performance of MRC, which however is neglected by existing works. To solve such an untouched problem, we propose to construct QA pairs by additionally using the dialogue related to the documents, together with a new domain adaptation method for MRC. Specifically, the proposed Robust Domain Adaptation for Machine Reading Comprehension (RMRC) method consists of an answer extractor (AE), a question selector (QS), and an MRC model. RMRC filters out irrelevant answers by estimating their correlation to the document via the AE, and extracts questions by fusing the candidate questions from multiple rounds of dialogue chats via the QS. With the extracted QA pairs, the MRC model is fine-tuned and provides feedback to optimize the QS through a novel reinforced self-training method. Thanks to the optimization of the QS, our method greatly alleviates the noisy correspondence problem caused by the domain shift. To the best of our knowledge, this could be the first study to reveal the influence of noisy correspondence in domain adaptation for MRC and to show a feasible way to achieve robustness against mismatched pairs. Extensive experiments on three datasets demonstrate the effectiveness of our method.
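As a rough illustration of the feedback loop between the MRC model and the QS, here is a hedged REINFORCE-style sketch; the function name, arguments, and the F1-gain reward are hypothetical, standing in only for the general shape of the reinforced self-training step.

```python
import torch

def reinforce_qs_update(qs_logits, chosen, reward, optimizer):
    """One QS update from MRC feedback (hypothetical names and reward).

    qs_logits: QS scores over candidate questions for one document;
    chosen:    index of the question used to fine-tune the MRC model;
    reward:    e.g., the MRC model's F1 gain after fine-tuning on that pair.
    REINFORCE raises the log-probability of selections that helped the MRC
    model, so mismatched questions are gradually selected less often.
    """
    log_prob = torch.log_softmax(qs_logits, dim=-1)[chosen]
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```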
CVPR
Learning With Twin Noisy Labels for Visible-Infrared Person Re-Identification
Mouxing Yang, Zhenyu Huang, Peng Hu, and 3 more authors
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
In this paper, we study an untouched problem in visible-infrared person re-identification (VI-ReID), namely, Twin Noise Labels (TNL), which refers to noisy annotations and noisy correspondence. In brief, on the one hand, it is inevitable to annotate some persons with the wrong identity due to the complexity of data collection and annotation, e.g., the poor recognizability in the infrared modality. On the other hand, the wrongly annotated data in a single modality will eventually contaminate the cross-modal correspondence, thus leading to noisy correspondence. To solve the TNL problem, we propose a novel method for robust VI-ReID, termed DuAlly Robust Training (DART). In brief, DART first computes the clean confidence of annotations by resorting to the memorization effect of deep neural networks. Then, the proposed method rectifies the noisy correspondence with the estimated confidence and further divides the data into four groups for further utilization. Finally, DART employs a novel dually robust loss consisting of a soft identification loss and an adaptive quadruplet loss to achieve robustness against the noisy annotations and noisy correspondence. Extensive experiments on the SYSU-MM01 and RegDB datasets verify the effectiveness of our method against the twin noisy labels compared with five state-of-the-art methods. The code could be accessed from https://github.com/XLearning-SCU/2022-CVPR-DART.
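The clean-confidence step leans on the memorization effect; a common way to operationalize it, sketched below under that assumption (the paper's exact estimator may differ), is to fit a two-component Gaussian mixture to per-sample losses.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_confidence(per_sample_loss):
    """Estimate per-annotation clean confidence from the memorization effect.

    Deep networks fit clean labels before noisy ones, so clean samples tend
    to show smaller losses early in training. Fit a two-component GMM to the
    normalized per-sample losses and return the posterior of the low-mean
    component as the probability that each annotation is clean.
    """
    x = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, reg_covar=5e-4).fit(x)
    clean = int(np.argmin(gmm.means_.ravel()))            # low-loss component
    return gmm.predict_proba(x)[:, clean]
```

Thresholding or soft-weighting these confidences is what lets the data be split into groups for the dually robust loss.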
2021
NeurIPS Oral
Learning with Noisy Correspondence for Cross-modal Matching
Zhenyu Huang, Guocheng Niu, Xiao Liu, and 4 more authors
In Proceedings of the 35th Conference on Neural Information Processing Systems, NeurIPS’2021, 2021
Cross-modal matching, which aims to establish the correspondence between two different modalities, is fundamental to a variety of tasks such as cross-modal retrieval and vision-and-language understanding. Although a huge number of cross-modal matching methods have been proposed and have achieved remarkable progress in recent years, almost all of them implicitly assume that the multimodal training data are correctly aligned. In practice, however, such an assumption is extremely expensive or even impossible to satisfy. Based on this observation, we reveal and study a latent and challenging direction in cross-modal matching, named noisy correspondence, which could be regarded as a new paradigm of noisy labels. Different from traditional noisy labels, which mainly refer to errors in category labels, our noisy correspondence refers to mismatched paired samples. To solve this new problem, we propose a novel method for learning with noisy correspondence, named Noisy Correspondence Rectifier (NCR). In brief, NCR divides the data into clean and noisy partitions based on the memorization effect of neural networks and then rectifies the correspondence via an adaptive prediction model in a co-teaching manner. To verify the effectiveness of our method, we conduct experiments using image-text matching as a showcase. Extensive experiments on Flickr30K, MS-COCO, and Conceptual Captions verify the effectiveness of our method. The code could be accessed from www.pengxi.me.
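One way to picture how a rectified correspondence can be consumed downstream is a triplet loss whose margin shrinks with the estimated confidence; the sketch below is illustrative only (names and the margin-rescaling form are assumptions, not the paper's exact formulation).

```python
import torch

def rectified_triplet_loss(s_pos, s_neg_i, s_neg_t, w, margin=0.2):
    """A triplet loss with a confidence-rescaled margin (illustrative only).

    s_pos:   similarity of each (possibly noisy) image-text pair
    s_neg_i: similarity of the image to its hardest in-batch negative text
    s_neg_t: similarity of the text to its hardest in-batch negative image
    w:       rectified correspondence confidence in [0, 1]
    A pair judged mismatched (w -> 0) gets a vanishing margin, so the model
    is no longer forced to pull a false pair together.
    """
    m = margin * w
    loss_i = torch.clamp(m - s_pos + s_neg_i, min=0)
    loss_t = torch.clamp(m - s_pos + s_neg_t, min=0)
    return (loss_i + loss_t).mean()
```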
TIP
Deep Spectral Representation Learning From Multi-View Data
Zhenyu Huang, Joey Tianyi Zhou, Hongyuan Zhu, and 3 more authors
IEEE Transactions on Image Processing, 2021
Multi-view representation learning (MvRL) aims to learn a consensus representation from diverse sources or domains to facilitate downstream tasks such as clustering, retrieval, and classification. Due to the limited representative capacity of the adopted shallow models, most existing MvRL methods may yield unsatisfactory results, especially when the labels of the data are unavailable. To enjoy the representative capacity of deep learning, this paper proposes a novel multi-view unsupervised representation learning method, termed Multi-view Laplacian Network (MvLNet), which could be the first deep version of multi-view spectral representation learning. Note that such an attempt is nontrivial, because simply combining Laplacian embedding (i.e., spectral representation) with neural networks will lead to trivial solutions. To solve this problem, MvLNet enforces an orthogonal constraint and reformulates it as a layer with the help of the Cholesky decomposition. The orthogonal layer is stacked on the embedding network so that a common space could be learned for the consensus representation. Compared with numerous recently proposed approaches, extensive experiments on seven challenging datasets demonstrate the effectiveness of our method in three multi-view tasks, including clustering, recognition, and retrieval. The source code could be found at www.pengxi.me.
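The Cholesky-based orthogonal layer admits a compact sketch; the version below is a minimal PyTorch rendering of the described mechanism (function name and the exact n*I scaling are assumptions).

```python
import torch

def orthogonal_layer(Y):
    """Differentiable orthogonalization via the Cholesky decomposition.

    Given embeddings Y (n x d), factor the Gram matrix Y^T Y = L L^T and
    right-multiply by L^{-T}, so the output satisfies the orthogonality
    constraint Y^T Y = n I used by spectral embeddings. Being differentiable,
    it can be stacked on the embedding network as an ordinary layer.
    """
    n, d = Y.shape
    gram = Y.t() @ Y + 1e-6 * torch.eye(d, device=Y.device)  # stabilized Gram matrix
    L = torch.linalg.cholesky(gram)
    return (n ** 0.5) * Y @ torch.linalg.inv(L).t()
```

Without such a layer, the spectral objective is minimized by collapsing all embeddings to a single point; the enforced orthogonality is what rules out that trivial solution.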
CVPR
Partially View-aligned Representation Learning with Noise-robust Contrastive Loss
Mouxing Yang, Yunfan Li, Zhenyu Huang, and 3 more authors
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR’2021, Jun 2021
In real-world applications, it is common that only a portion of the data is aligned across views due to spatial, temporal, or spatiotemporal asynchronism, thus leading to the so-called Partially View-aligned Problem (PVP). To solve such a less-touched problem without the help of labels, we propose simultaneously learning representation and aligning data using a noise-robust contrastive loss. In brief, for each sample from one view, our method aims to identify its within-category counterparts from the other views, so that the cross-view correspondence could be established. As contrastive learning needs data pairs as input, we construct positive pairs using the known correspondences and negative pairs using random sampling. To alleviate or even eliminate the influence of the false negatives caused by random sampling, we propose a noise-robust contrastive loss that could adaptively prevent the false negatives from dominating the network optimization. To the best of our knowledge, this could be the first successful attempt to make contrastive learning robust to noisy labels. In fact, this work might remarkably enrich the learning paradigm with noisy labels. More specifically, traditional noisy labels are defined as incorrect annotations for supervised tasks such as classification. In contrast, this work proposes that the view correspondence might be false, which is remarkably different from the widely-accepted definition of noisy labels. Extensive experiments show the promising performance of our method compared with 10 state-of-the-art multi-view approaches in the clustering and classification tasks. The code will be publicly released at https://pengxi.me.
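To show how a contrastive loss can be made robust to false negatives, here is one plausible instantiation (not the paper's exact loss): the negative term saturates, so heavily violated "negatives", the likely false ones, stop dominating the gradient.

```python
import torch

def noise_robust_contrastive(d, y, margin=1.0):
    """A saturating contrastive loss (one plausible form, not the paper's).

    d: cross-view feature distances; y: 1 for known-aligned pairs, 0 for
    randomly sampled "negatives". A false negative (actually the same
    category) sits close in feature space, so its hinge term margin - d is
    large; the bounded tanh caps that term and shrinks its gradient,
    preventing false negatives from dominating the optimization.
    """
    pos = y * d.pow(2)                                   # pull aligned pairs together
    hinge = torch.clamp(margin - d, min=0)               # standard negative hinge
    neg = (1 - y) * margin * torch.tanh(hinge / margin)  # saturates for large violations
    return (pos + neg).mean()
```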
2020
NeurIPS Oral
Partially View-aligned Clustering
Zhenyu Huang, Peng Hu, Joey Tianyi Zhou, and 2 more authors
In Proceedings of the 34th Conference on Neural Information Processing Systems, NeurIPS’2020, Dec 2020
In this paper, we study one challenging issue in multi-view data clustering. To be specific, for two data matrices $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ corresponding to two views, we do not assume that $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are fully aligned row-wise. Instead, we assume that only a small portion of the matrices has established correspondence in advance. Such a partially view-aligned problem (PVP) stems from the intensive labor of capturing or establishing aligned multi-view data, and it has rarely been touched so far to the best of our knowledge. To solve this practical and challenging problem, we propose a novel multi-view clustering method termed partially view-aligned clustering (PVC). To be specific, PVC uses a differentiable surrogate of the non-differentiable Hungarian algorithm and recasts it as a pluggable module. As a result, the category-level correspondence of the unaligned data could be established in a latent space learned by a neural network, while a common space across different views is learned using the "aligned" data. Extensive experimental results show the promising performance of our method in clustering partially view-aligned data.
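For intuition on what a differentiable surrogate of the Hungarian algorithm looks like, below is a Sinkhorn-normalization sketch; note PVC derives its own surrogate, and Sinkhorn is shown here only as a common stand-in.

```python
import torch

def sinkhorn_alignment(cost, eps=0.05, n_iter=50):
    """A differentiable stand-in for the Hungarian algorithm.

    Given a cross-view cost matrix, alternately normalize the rows and
    columns of exp(-cost / eps) so the result is doubly stochastic, i.e., a
    soft permutation that realigns the unaligned samples of the two views.
    Smaller eps sharpens the result toward a hard permutation.
    """
    P = torch.exp(-cost / eps)
    for _ in range(n_iter):
        P = P / P.sum(dim=1, keepdim=True)   # row normalization
        P = P / P.sum(dim=0, keepdim=True)   # column normalization
    return P
```

Because every step is differentiable, the alignment module can be plugged into the network and trained jointly with the common-space representation, which is exactly the "pluggable module" role described above.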
2019
IJCAI
Multi-view Spectral Clustering Network
Zhenyu Huang, Joey Tianyi Zhou, Xi Peng, and 3 more authors
In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI’2019, 10–16 Aug 2019
Multi-view clustering aims to cluster data from diverse sources or domains, and it has drawn considerable attention in recent years. In this paper, we propose a novel multi-view clustering method named multi-view spectral clustering network (MvSCN), which could be the first deep version of multi-view spectral clustering to the best of our knowledge. To deeply cluster multi-view data, MvSCN incorporates the local invariance within every single view and the consistency across different views into a novel objective function, where the local invariance is defined by a deep metric learning network rather than the Euclidean distance adopted by traditional approaches. In addition, we enforce and reformulate an orthogonal constraint as a novel layer stacked on an embedding network, which brings two advantages, i.e., jointly optimizing the neural network while performing matrix decomposition, and avoiding trivial solutions. Extensive experiments on four challenging datasets demonstrate the effectiveness of our method compared with 10 state-of-the-art approaches in terms of three evaluation metrics.
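A loss-form reading of the two ingredients named above might look like the following sketch; the exact terms, their weighting, and the function names are assumptions, with the affinity matrices taken as given from the deep metric network.

```python
import torch

def mvscn_objective(z1, z2, w1, w2, lam=1.0):
    """Local invariance plus cross-view consistency (a minimal sketch).

    Local invariance: within each view, samples the metric network deems
    similar (large w entries) should have close embeddings. Consistency:
    the two views of the same sample should agree in the common space.
    """
    def local_invariance(z, w):
        d = torch.cdist(z, z).pow(2)   # pairwise squared embedding distances
        return (w * d).mean()          # w: within-view affinity matrix
    consistency = (z1 - z2).pow(2).sum(dim=1).mean()
    return local_invariance(z1, w1) + local_invariance(z2, w2) + lam * consistency
```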
ICML Oral
COMIC: Multi-view Clustering Without Parameter Selection
Xi Peng, Zhenyu Huang, Jiancheng Lv, and 2 more authors
In Proceedings of the 36th International Conference on Machine Learning, ICML’2019, 09–15 Jun 2019
In this paper, we study two challenges in clustering analysis, namely, how to cluster multi-view data and how to perform clustering without parameter selection on the cluster size. To this end, we propose a novel objective function to project raw data into one space in which the projection embraces geometric consistency (GC) and cluster assignment consistency (CAC). To be specific, the GC aims to learn a connection graph from a projection space wherein data points are connected if and only if they belong to the same cluster. The CAC aims to minimize the discrepancy of pairwise connection graphs induced from different views, based on the view-consensus assumption, i.e., different views could produce the same cluster assignment structure as they are different portraits of the same object. Thanks to the view-consensus derived from the connection graph, our method could achieve promising performance in learning view-specific representations and eliminating the heterogeneous gaps across different views. Furthermore, with the proposed objective, it could learn almost all parameters, including the cluster number, from the data without labor-intensive parameter selection. Extensive experimental results show the promising performance achieved by our method on five datasets compared with nine state-of-the-art multi-view clustering approaches.
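The CAC term admits a very small sketch, shown below under the assumption that each view's connection graph is available as a matrix (the squared-discrepancy penalty is illustrative, not the paper's exact measure).

```python
import torch

def cac_loss(graphs):
    """Cluster assignment consistency as a loss (a minimal sketch).

    graphs: one (n x n) connection graph per view, where entry (i, j) encodes
    whether samples i and j are connected (same cluster). Under the
    view-consensus assumption the graphs from different views should agree,
    so penalize their pairwise squared discrepancy.
    """
    loss = 0.0
    for i in range(len(graphs)):
        for j in range(i + 1, len(graphs)):
            loss = loss + (graphs[i] - graphs[j]).pow(2).mean()
    return loss
```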
2018
TIE
Multiple Marginal Fisher Analysis
Zhenyu Huang, Hongyuan Zhu, Joey Tianyi Zhou, and 1 more author
IEEE Transactions on Industrial Electronics, Dec 2018
Dimension reduction is a fundamental task in machine learning and computer vision, and it is widely used in a variety of industrial applications. Over the past decades, a lot of unsupervised and supervised algorithms have been proposed. However, few of them can automatically determine the feature dimension so as to adapt to different data distributions. To obtain good performance, it is common to seek the optimal dimension by exhaustively enumerating a set of candidate values. Clearly, such a scheme is ad hoc and computationally expensive. Therefore, a method that can automatically estimate the feature dimension in an efficient and principled manner is of significant practical and theoretical value. In this paper, we propose a novel supervised subspace learning method called multiple marginal Fisher analysis (MMFA), which can automatically estimate the feature dimension. By maximizing the inter-class separability among marginal points while minimizing the within-class scatter, MMFA obtains low-dimensional representations with outstanding discriminative properties. Extensive experiments show that MMFA not only outperforms other algorithms on clean data but also shows robustness on corrupted and disguised data.
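For a sense of how a Fisher-style method can pick its own dimension, here is a hedged NumPy/SciPy sketch; the eigen-energy criterion and function name are assumptions standing in for MMFA's actual estimation rule.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_projection_auto_dim(S_b, S_w, energy=0.95):
    """Automatic dimension estimation for a Fisher-style objective.

    Solve the generalized eigenproblem S_b v = lambda * S_w v (between-class
    vs. within-class scatter) and keep the leading eigenvectors whose
    eigenvalues cover a fixed share of the spectrum, so the output dimension
    adapts to the data instead of being enumerated by hand.
    """
    lam, V = eigh(S_b, S_w + 1e-6 * np.eye(S_w.shape[0]))  # regularized solve
    order = np.argsort(lam)[::-1]                          # sort descending
    lam, V = lam[order], V[:, order]
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), energy)) + 1
    return V[:, :k]                                        # projection matrix, d x k
```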