Semi-Supervised Learning: A Summary

Aug 24, 2020

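Lab Seminar Notes (1)

I was asked to prepare a seminar on semi-supervised learning for my lab. Because the seminar was built around existing materials, this post does not go deep into the algorithms of each paper. Instead, it is organized so that the main concepts within semi-supervised learning are easy to grasp.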

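Before getting into the main content, a brief explanation: semi-supervised learning is an approach that uses unlabeled data for training when labeled data is insufficient.
In other words, the goal of semi-supervised learning is to use unlabeled data to push performance beyond what supervised learning alone achieves.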

As is well known, creating labeled data takes a great deal of time and money. Unlabeled data, on the other hand, is far more plentiful than labeled data, so trying to make use of it is a natural step.

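As the graph above shows, ML/DL practitioners in industry believe that the benefit of semi-supervised learning eventually shrinks as labeled data grows. This can be read as saying that labeled data is merely slow and expensive to obtain, and once there is enough of it there is no real need for unlabeled data. Researchers working on semi-supervised learning, however, say that removing this limit on what unlabeled data can contribute is their big dream and goal.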

Outline

Confidence and Entropy
Label Consistency
Regularization
Self-Training

I built the outline around the core concepts of semi-supervised learning.

Confidence and Entropy

Entropy Minimization

Suppose the class probabilities predicted by Classifier A are [0.1, 0.8, 0.1],
and those predicted by Classifier B are [0.1, 0.6, 0.3].

Then Classifier A is more confident and has lower entropy. (It is more certain about predicting class 1.)

Training the model to be more confident in this way is the goal of entropy minimization. Training proceeds with an entropy minimization term added to the loss. This term needs no labels, so it can also be applied to unlabeled data.

For labeled data, training appears to use a loss that adds the entropy minimization term to the objective function, while for unlabeled data only the entropy minimization term is used.
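To make the comparison between Classifier A and Classifier B concrete, here is a minimal sketch in plain Python. The function names and the `weight` parameter are my own choices for illustration, and the loss simply follows the reading above (cross-entropy on labeled data plus a weighted entropy term on all data), not the exact formulation in the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def entropy_regularized_loss(labeled, unlabeled, weight=0.5):
    """labeled: list of (probs, label) pairs; unlabeled: list of probs.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    # The entropy term needs no labels, so it covers labeled and unlabeled data.
    all_probs = [p for p, _ in labeled] + unlabeled
    mean_entropy = sum(entropy(p) for p in all_probs) / len(all_probs)
    return supervised + weight * mean_entropy

pred_a = [0.1, 0.8, 0.1]  # Classifier A
pred_b = [0.1, 0.6, 0.3]  # Classifier B
print(entropy(pred_a), entropy(pred_b))  # roughly 0.639 and 0.898
loss = entropy_regularized_loss([(pred_a, 1)], [pred_b])
```

The lower entropy of Classifier A matches the intuition that it is the more confident of the two.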

Pseudo-Label

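This method is widely used because it is very simple.
Starting from a model that has been sufficiently trained with supervised learning, pseudo-labels are attached to unlabeled data based on the model's predictions, using a simple rule such as a confidence threshold. The model is then retrained on the labeled data combined with the pseudo-labeled data.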
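The threshold rule can be sketched as follows. This is only an illustration: the function name `pseudo_label`, the 0.95 threshold, and the example predictions are all made up, and a real pipeline would take the predictions from the trained model.

```python
def pseudo_label(predictions, threshold=0.95):
    """Return (sample index, predicted class) pairs for confident predictions.

    A sample is kept only if its highest class probability reaches `threshold`.
    """
    selected = []
    for i, probs in enumerate(predictions):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((i, probs.index(confidence)))
    return selected

# Hypothetical model outputs for three unlabeled samples.
unlabeled_predictions = [
    [0.97, 0.02, 0.01],  # confident: kept with class 0
    [0.40, 0.35, 0.25],  # not confident: skipped
    [0.03, 0.01, 0.96],  # confident: kept with class 2
]
pseudo_labeled = pseudo_label(unlabeled_predictions)
print(pseudo_labeled)  # [(0, 0), (2, 2)]
```

The selected samples would then be merged with the labeled set for retraining.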

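(TMI: I have used this myself, and I think it only helps when the supervised model has been trained 'sufficiently'. In my case, the class imbalance was severe and there was too little labeled data, so fine-tuning did not go well and the results with pseudo-labeling were not much different.)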

Label Consistency

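Even if a slight transformation (augmentation, perturbation) is applied to the data, its class stays the same. That is, when the input is transformed, the model's output should remain close to its output for the original input. Training a model toward this objective is the concept called Label Consistency.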

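Π-Model

The Π-Model is the most basic label consistency model. It is trained with a supervised loss and an unsupervised loss.

Each input is augmented twice. (The augmentation here seems to mean the stochastic transformations commonly applied in basic image preprocessing, such as random rotation and random crop.)

The model's outputs for the two augmented copies will differ to some extent, because the augmentation and dropout are sampled randomly on each pass.

This difference (squared difference) becomes the unsupervised loss, which is combined with the supervised loss as a weighted sum for training.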
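The weighted sum can be sketched as below. The two prediction vectors stand in for the model's outputs on two augmented copies of the same input; the function names, the example numbers, and the fixed `weight` are assumptions for illustration (in the paper the weight is ramped up from zero during training).

```python
def squared_difference(z1, z2):
    """Mean squared difference between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(z1, z2)) / len(z1)

def pi_model_loss(supervised_loss, z1, z2, weight):
    """Supervised loss plus the weighted consistency (unsupervised) loss."""
    return supervised_loss + weight * squared_difference(z1, z2)

# Hypothetical outputs for two stochastic passes over the same input.
z_first = [0.7, 0.2, 0.1]
z_second = [0.6, 0.3, 0.1]
total = pi_model_loss(supervised_loss=0.5, z1=z_first, z2=z_second, weight=1.0)
```

For an unlabeled sample the supervised part is simply absent, so only the consistency term contributes.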

Temporal Ensembling

The difference from the preceding Π-Model is that the target ~z_i is an ensemble of the model's outputs from past epochs.
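As I understand the paper, the ensemble is an exponential moving average of each sample's predictions, with a bias correction for the early epochs. A minimal sketch, where `alpha` and the example values are assumptions:

```python
def update_ensemble(accumulated, current, alpha, epoch):
    """Update the running ensemble for one sample and return the new target.

    accumulated: running average Z_i (starts as all zeros)
    current:     this epoch's prediction z_i
    epoch:       1-based epoch counter, used for the startup bias correction
    """
    new_accumulated = [alpha * a + (1 - alpha) * c
                       for a, c in zip(accumulated, current)]
    correction = 1 - alpha ** epoch
    target = [a / correction for a in new_accumulated]
    return new_accumulated, target

Z = [0.0, 0.0, 0.0]
Z, target = update_ensemble(Z, [0.2, 0.7, 0.1], alpha=0.6, epoch=1)
# After the first epoch the corrected target equals the first prediction.
```

Because the targets are stored per sample, only one forward pass per input is needed in each epoch, unlike the Π-Model's two.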

Mean Teacher

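The previous methods use a single shared model, whereas Mean Teacher splits the model into a Teacher model and a Student model.

In typical teacher-student setups, the teacher is often a larger model with frozen weights. Here, however, the teacher's parameters are a weighted sum (an exponential moving average) of the student's parameters. (At each update, the teacher parameters become a weighted sum of their past values and the current student parameters, which seems to be why it is called a “Mean” Teacher.)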
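The teacher update can be sketched in a few lines. Parameters are flattened into plain lists here, and the name `ema_update` and the decay value are assumptions for illustration.

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """Move each teacher parameter toward the matching student parameter.

    teacher <- alpha * teacher + (1 - alpha) * student
    """
    return [alpha * t + (1 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, alpha=0.9)  # close to [0.9, 0.1]
```

Only the student is trained by gradient descent; the teacher is updated by this averaging step after each student update.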

Virtual Adversarial Training

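Instead of a simple transformation of the input, this method adopts an adversarial one. Put simply, an adversarial perturbation changes the input in the direction that hurts the loss the most, that is, the direction that increases it the most.

Ordinary adversarial training also uses label information, but virtual adversarial training does not, which makes it applicable to semi-supervised learning.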
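Written out, the idea looks roughly like this. The notation is my own shorthand rather than a quote from the paper: x is the input, r a perturbation bounded by ε, and θ̂ the current parameters treated as constants, so the model's own prediction plays the role of the label.

```latex
r_{\mathrm{adv}} = \arg\max_{\|r\|_2 \le \epsilon}
  \mathrm{KL}\big( p(y \mid x; \hat{\theta}) \,\|\, p(y \mid x + r; \theta) \big)

\mathcal{L}_{\mathrm{VAT}}(x) =
  \mathrm{KL}\big( p(y \mid x; \hat{\theta}) \,\|\, p(y \mid x + r_{\mathrm{adv}}; \theta) \big)
```

Since no true label y appears in either expression, the loss can be evaluated on unlabeled samples.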

Unsupervised Data Augmentation

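Instead of simple transformations such as Gaussian noise or dropout, this method uses state-of-the-art data augmentation techniques.

It is applicable to both images and text.

KL divergence is used as the loss on the difference between model outputs.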
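A minimal sketch of the KL-based consistency term, assuming the prediction on the original input is the fixed target and the prediction on the augmented input is the one being trained. The example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes.

    Assumes q is strictly positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pred_original = [0.1, 0.8, 0.1]   # hypothetical output on the original input
pred_augmented = [0.1, 0.6, 0.3]  # hypothetical output on the augmented input
consistency = kl_divergence(pred_original, pred_augmented)  # roughly 0.120
```

The term is zero only when the two predictions agree, so minimizing it pulls the augmented prediction toward the original one.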

Regularization

Data Augmentation

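Strictly speaking, this part is not semi-supervised learning, but I included it because the methods described next apply it to semi-supervised learning.

The strategy is to drop out (cut) parts of an image or to mix images together. It improves the network's ability to generalize.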
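Both ideas fit in a short sketch: a Cutout-style mask and a mixup-style blend. Images are plain nested lists here, and the function names, the `alpha` value, and the toy inputs are assumptions for illustration.

```python
import random

def cutout(image, top, left, size):
    """Return a copy of a 2-D image with a size x size square set to zero."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[r]))):
            out[r][c] = 0.0
    return out

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two flattened inputs and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

image = [[1.0] * 4 for _ in range(4)]
masked = cutout(image, top=1, left=1, size=2)

rng = random.Random(0)  # seeded only to make the sketch reproducible
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

CutMix combines the two: the cut region is filled with a patch from another image, and the labels are mixed in proportion to the patch area.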

Interpolation Consistency Training

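This method applies Mixup to semi-supervised learning.

The difference between (the model's output for the mixed-up input) and (the Mixup of the model's outputs for the individual unlabeled samples) becomes the consistency loss.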
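The two quantities being compared can be sketched as follows. `toy_model` is a hypothetical two-class classifier invented for this example, and the same model produces both sides here for simplicity (as far as I know, the paper takes the targets from a mean-teacher copy of the model).

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(x):
    """Hypothetical 2-feature, 2-class classifier, for illustration only."""
    return softmax([x[0] - x[1], x[1] - x[0]])

def mix(a, b, lam):
    """Elementwise interpolation lam * a + (1 - lam) * b."""
    return [lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)]

def ict_consistency(model, u1, u2, lam):
    """Mean squared gap between model(mix(u1, u2)) and mix(model(u1), model(u2))."""
    pred_of_mix = model(mix(u1, u2, lam))
    mix_of_preds = mix(model(u1), model(u2), lam)
    return sum((p - q) ** 2 for p, q in zip(pred_of_mix, mix_of_preds)) / len(pred_of_mix)

u1, u2 = [2.0, 0.0], [0.0, 1.0]  # two unlabeled samples
gap = ict_consistency(toy_model, u1, u2, lam=0.5)
```

Minimizing this gap encourages the model to behave linearly between unlabeled samples.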

MixMatch

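This method applies all of the ideas covered so far: entropy minimization, label consistency regularization, and mixup.

MixMatch takes labeled data and unlabeled data and produces combined, mixed data from them.

Entropy Minimization
Each unlabeled sample is augmented K times, the predictions are averaged, and the average is sharpened through temperature sharpening.
MixUp
The augmented labeled and unlabeled data are shuffled together, and MixUp is then applied between that shuffled set and the labeled data and the unlabeled data, respectively.
Label Consistency
As in the other models, training uses cross-entropy (CE) as the supervised loss and the difference between model outputs (L2) as the unsupervised loss.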
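The label-guessing step (average, then sharpen) can be sketched like this. The function names, K = 2, and the temperature of 0.5 are assumptions for illustration.

```python
def average_predictions(predictions):
    """Average the class probabilities over K augmented copies of one sample."""
    k = len(predictions)
    return [sum(column) / k for column in zip(*predictions)]

def sharpen(probs, temperature=0.5):
    """Raise each probability to 1/temperature and renormalize.

    Temperatures below 1 push the distribution toward its largest class,
    which lowers its entropy.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical predictions for K = 2 augmentations of one unlabeled sample.
preds = [[0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]
averaged = average_predictions(preds)       # close to [0.6, 0.25, 0.15]
guessed_label = sharpen(averaged, temperature=0.5)
```

The sharpened distribution serves as the target for the unlabeled sample in the MixUp and loss steps.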

Self-Training

Noisy Student

This method can be used when both labeled data and unlabeled data are plentiful.

A model sufficiently trained on the large labeled set becomes the teacher model and pseudo-labels the unlabeled data.

After that, the student model is trained noisily on all of the data (labeled + pseudo-labeled).
noisy: Data augmentation, Dropout, Stochastic depth

The trained student model then becomes the new teacher model and pseudo-labels the unlabeled data, and this procedure repeats.

After the initial pseudo-labeling by the first teacher model, the student model is in effect self-training.
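The loop can be sketched with a deliberately tiny stand-in model. Everything here is hypothetical: the "model" is a single threshold on 1-D inputs, and Gaussian input noise stands in for the data augmentation, dropout, and stochastic depth used in the paper.

```python
import random

def train(examples, noise_std=0.0, rng=None):
    """Fit a 1-D threshold classifier: the midpoint between the two class means."""
    rng = rng or random.Random(0)
    sums, counts = [0.0, 0.0], [0, 0]
    for x, y in examples:
        sums[y] += x + rng.gauss(0.0, noise_std)  # input noise stands in for augmentation
        counts[y] += 1
    return (sums[0] / counts[0] + sums[1] / counts[1]) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

def noisy_student(labeled, unlabeled, iterations=3, noise_std=0.1, seed=0):
    rng = random.Random(seed)
    teacher = train(labeled)  # the first teacher is trained without noise
    for _ in range(iterations):
        pseudo = [(x, predict(teacher, x)) for x in unlabeled]
        student = train(labeled + pseudo, noise_std, rng)  # noisy student training
        teacher = student  # the student becomes the next teacher
    return teacher

labeled = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
unlabeled = [0.5, 0.8, 1.2, 3.9, 4.4, 4.8]
final_threshold = noisy_student(labeled, unlabeled)
```

The structure is the point, not the model: pseudo-label with the teacher, train a noised student on everything, swap roles, and repeat.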

Reference (paper)

Grandvalet & Bengio. Semi-supervised Learning by Entropy Minimization. NIPS 2005
Dong-Hyun Lee. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. ICML workshop 2013
Laine and Aila. Temporal Ensembling for Semi-Supervised Learning. ICLR 2017
Oliver et al. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. NeurIPS 2018
Tarvainen and Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS 2017
Zhang et al. mixup: Beyond Empirical Risk Minimization. ICLR 2018
DeVries and Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv:1708.04552
Qizhe Xie et al. Unsupervised Data Augmentation for Consistency Training. arXiv:1904.12848
Sangdoo Yun et al. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. ICCV 2019
Vikas Verma et al. Interpolation Consistency Training for Semi-Supervised Learning. IJCAI 2019
David Berthelot et al. MixMatch: A Holistic Approach for Semi-Supervised Learning. NeurIPS 2019.
Qizhe Xie et al. Self-training with Noisy Student improves ImageNet Classification. CVPR 2020