This note expands on state-of-the-art methods for multi-label image classification tasks. It also touches on multi-label class-incremental learning, although, at the time of writing, the work on this topic is rather limited. The main papers this note draws on are listed below; a list of full references can be found at the bottom of the page.
- MLCIL: Dynamic Prompt Adjustment for Multi-Label Class-Incremental Learning (2024)
- REVIEW1: Multi-Label Lifelong Machine Learning: A Scoping Review of Algorithms, Techniques, and Applications
Existing Techniques for Multi-Label CIL (REVIEW1)
Regularization Techniques
Mitigates catastrophic forgetting by penalizing changes to parameters that were important for previously learned tasks.
Elastic Weight Consolidation[^1]: One of the most naive ways to do it; learning is slowed down by imposing a quadratic penalty on the discrepancy between the previously learned parameters and the parameters being updated for the current task, weighted by how important each parameter was to earlier tasks.
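Not from the review itself; a minimal PyTorch sketch of that penalty term, assuming a diagonal Fisher (importance) estimate `fisher` and a snapshot `old_params` of the previous task's parameters have already been computed, both keyed by parameter name:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Quadratic penalty that slows down changes to parameters that were
    important for previous tasks (importance given by the Fisher estimate)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:  # skip parameters with no importance estimate
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = task_loss + ewc_penalty(model, fisher, old_params)
```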
Learning without Forgetting (LwF)[^2]: Demonstrated with CNNs; uses knowledge distillation so that the current model's outputs for previous tasks stay close to those recorded from the old model and do not vary too much. The shared parameters and the current task's parameters are optimized jointly in this process.
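Again just a sketch rather than the authors' code: the distillation term in its standard single-label (softmax) form, where `old_logits` are the frozen previous model's outputs on the current task's data; for multi-label heads a sigmoid-based variant would be the natural substitute.

```python
import torch.nn.functional as F

def lwf_distillation(new_logits, old_logits, T=2.0):
    """Knowledge-distillation loss: keep the current model's outputs on the
    old task heads close to the recorded outputs of the previous model."""
    p_old = F.softmax(old_logits / T, dim=1)         # soft targets from the frozen old model
    log_p_new = F.log_softmax(new_logits / T, dim=1)
    return -(p_old * log_p_new).sum(dim=1).mean() * (T * T)

# total_loss = task_loss_on_new_labels + lambda_old * lwf_distillation(new_logits, old_logits)
```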
Rehearsal
Selective reintroduction of past data samples.
Exemplar Rehearsal
Involves storing a buffer of selected past data samples representative of the previously observed distribution.
[Examples]
Gradient Episodic Memory[^3]: Does the above, using the stored samples to constrain gradient updates so that the loss on previous tasks does not increase (though memory and computational costs need to be considered); see the sketch below.
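GEM itself solves a quadratic program with one constraint per previous task; the sketch below shows only the simpler single-constraint projection (as used in A-GEM) on flattened gradient vectors, to illustrate the idea:

```python
import torch

def project_gradient(grad, grad_mem):
    """Single-constraint simplification of GEM's projection (as in A-GEM).
    `grad` and `grad_mem` are flattened gradients of the current batch and of
    the episodic memory; if the update would increase the memory loss
    (negative dot product), project `grad` back onto the feasible half-space."""
    dot = torch.dot(grad, grad_mem)
    if dot < 0:
        grad = grad - (dot / torch.dot(grad_mem, grad_mem)) * grad_mem
    return grad
```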
Partitioning Reservoir Sampling[^4]: A modified version of the reservoir sampling technique, which maintains a reservoir of elements representing a true random sample as the data stream introduces new elements. Each element in the reservoir has an equal probability of being replaced at each time step by a new element that can better help represent the sample batch. Label frequencies are used to identify the majority classes.
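For reference, a minimal sketch of the underlying reservoir sampling update (Algorithm R); the per-label partitioning that Partitioning Reservoir Sampling adds on top of it is omitted here:

```python
import random

def reservoir_update(reservoir, item, n_seen, capacity):
    """Algorithm R: after processing `n_seen` stream items (1-based count,
    including `item`), every item seen so far has probability
    capacity / n_seen of being in the buffer."""
    if len(reservoir) < capacity:
        reservoir.append(item)             # buffer not full yet: always keep the item
    else:
        j = random.randint(0, n_seen - 1)  # uniform candidate slot in [0, n_seen)
        if j < capacity:
            reservoir[j] = item            # replace a uniformly chosen existing element
    return reservoir
```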
Optimizing Class Distribution in Memory (OCDM)[^5]: A memory-based technique that dynamically maintains a representative distribution of classes in memory (a replay buffer holding a small amount of previously seen data). The memory update mechanism is treated as an optimization problem: the memory is first filled with observed mini-batches of data, and at each subsequent time step the data in memory is replaced so that the resulting distribution is as close as possible to a target distribution.
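The paper solves that update greedily; the following is a simplified greedy sketch (not the exact algorithm), assuming each sample comes with a binary label vector and that a `target_dist` over labels is given:

```python
import numpy as np

def update_memory(pool_labels, target_dist, mem_size):
    """Greedy sketch of the OCDM idea: from the current memory plus the new
    mini-batch (`pool_labels`, one binary label vector per sample), repeatedly
    drop the sample whose removal brings the empirical label distribution
    closest to `target_dist`, until only `mem_size` samples remain."""
    pool = [np.asarray(y, dtype=float) for y in pool_labels]
    while len(pool) > mem_size:
        counts = np.sum(pool, axis=0)
        best_i, best_gap = 0, np.inf
        for i, y in enumerate(pool):
            remaining = counts - y                       # label counts if sample i is dropped
            p = remaining / max(remaining.sum(), 1e-12)  # resulting label distribution
            gap = np.abs(p - target_dist).sum()          # L1 distance to the target
            if gap < best_gap:
                best_gap, best_i = gap, i
        pool.pop(best_i)
    return pool
```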
Generative Replay
Uses generative models such as GANs and VAEs to create synthetic data samples, which are then combined with the current task's data during training.
[Examples]
Deep Generative Replay[^6]: Uses two models, a generative model that creates the synthetic data and a task solver. Working memory requirements are reduced since the old training samples can be deleted.
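A rough sketch of one training step in this style, written with multi-label (BCE) losses since that is this note's setting; the original paper uses a single-label cross-entropy setup, and `generator` and `old_solver` are assumed to be the frozen models kept from the previous task:

```python
import torch
import torch.nn.functional as F

def dgr_step(solver, old_solver, generator, x_new, y_new, optimizer,
             n_replay=64, z_dim=100, replay_weight=1.0):
    """One generative-replay step: sample pseudo-data from the old generator,
    label it with the old solver, and train the new solver on the mix."""
    with torch.no_grad():
        z = torch.randn(n_replay, z_dim)
        x_replay = generator(z)                          # synthetic samples (assumed interface)
        y_replay = torch.sigmoid(old_solver(x_replay))   # soft multi-label targets from the old solver

    optimizer.zero_grad()
    loss_new = F.binary_cross_entropy_with_logits(solver(x_new), y_new.float())
    loss_replay = F.binary_cross_entropy_with_logits(solver(x_replay), y_replay)
    loss = loss_new + replay_weight * loss_replay
    loss.backward()
    optimizer.step()
    return loss.item()
```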
Parameter Isolation
Modifies the underlying architecture either by updating a fixed number of parameters in the model or by dynamically growing the architecture, adding new neurons, modules, or layers for each new task. The fixed-parameter methodology is not worth dwelling on, since it could lead to overparameterization.
[Examples]
Progressive Neural Networks[^7]: A pool of pre-trained models (columns) is retained, one for each learned task. Lateral connections from the previously learned columns to the current task's column are learned to facilitate knowledge transfer.
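A two-layer sketch of the idea: the new task gets its own column, and frozen columns from earlier tasks feed into it through learned lateral connections (layer names and sizes here are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """Column for the current task; earlier columns are frozen and kept
    outside the optimizer, contributing only through lateral adapters."""
    def __init__(self, in_dim, hidden_dim, out_dim, prev_columns):
        super().__init__()
        self.prev_columns = prev_columns                 # plain list: frozen earlier columns
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.laterals = nn.ModuleList(                   # one lateral adapter per old column
            [nn.Linear(hidden_dim, hidden_dim) for _ in prev_columns])
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        for col, lat in zip(self.prev_columns, self.laterals):
            with torch.no_grad():
                h_prev = torch.relu(col.fc1(x))          # hidden features from a frozen old column
            h = h + lat(h_prev)                          # lateral connection into the new column
        return self.fc2(h)
```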
Clustering Techniques
Generating clusters from data points and predicting the target class by finding the closest centroid(s) among the clusters. Conventional kNN employs a criterion like majority voting to assign a class. Now how do we extend that to multi-label scenarios? Are there any other ways to cluster the data?
[Examples]
Roseberry et al. 2021[^8]: Introduce a self-adjusting algorithm to address multi-label drifting data streams, i.e., streams whose statistical properties change from one task to another. They perform kNN classification in the multi-label setting by using the frequency of each label's occurrence among the closest neighbours in the training set.
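A static, simplified sketch of that prediction rule (the paper's self-adjusting k and drift handling are omitted); `Y_train` is assumed to be a binary label matrix:

```python
import numpy as np

def multilabel_knn_predict(X_train, Y_train, x, k=5, threshold=0.5):
    """Predict a label set for `x` from how frequently each label occurs
    among the k nearest training neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nn_idx = np.argsort(dists)[:k]                # indices of the k closest neighbours
    label_freq = Y_train[nn_idx].mean(axis=0)     # per-label frequency among those neighbours
    return (label_freq >= threshold).astype(int), label_freq
```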
Masuyama et al.[^9]: They combine a Bayesian technique for tracking label probabilities with Adaptive Resonance Theory[^10], a cognitive neural theory, to address the stability-plasticity problem. They cluster instances with similar label patterns and identify underlying relations in the data, thereby reducing the size of the output space to be learnt. Their algorithm scales well as the number of labels grows.
Binary Classifiers
In this technique, each label is treated independently, which neglects the co-existence of other labels: multi-label classification is decomposed into separate binary classification tasks, one per label. Disregarding label relations in this way can cause sub-optimal performance. It is the opposite strategy to using pairwise relations for finding relevant and irrelevant label pairs; on the other hand, modelling high-order relations when the number of labels is high means learning from a vast output space, which binary classifiers avoid.
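A minimal PyTorch sketch of this decomposition on top of a shared feature extractor: one independent sigmoid / binary cross-entropy head per label (the class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class BinaryRelevanceHead(nn.Module):
    """One independent binary classifier per label: a single linear layer
    producing one logit per label, trained with per-label sigmoid losses."""
    def __init__(self, feature_dim, num_labels):
        super().__init__()
        self.heads = nn.Linear(feature_dim, num_labels)  # num_labels independent logits

    def forward(self, features):
        return self.heads(features)                      # raw logits, one per label

# Training: BCEWithLogitsLoss applies an independent binary loss to each label.
# criterion = nn.BCEWithLogitsLoss()
# loss = criterion(head(features), targets.float())      # targets: (batch, num_labels) in {0, 1}
```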
[When to use it]
When the number of labels is high and/or label dependencies are low for the dataset you're working on.
[Examples]
References
[^1]: J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, and A. Grabska-Barwinska, "Overcoming catastrophic forgetting in neural networks," Proc. Nat. Acad. Sci. USA, vol. 114, no. 13, pp. 3521–3526, 2017.
[^2]: Z. Li and D. Hoiem, "Learning without forgetting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2935–2947, Dec. 2018.
[^3]: D. Lopez-Paz and M. Ranzato, "Gradient episodic memory for continual learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[^4]: C. D. Kim, J. Jeong, and G. Kim, "Imbalanced continual learning with partitioning reservoir sampling," in Computer Vision—ECCV. Glasgow, U.K.: Springer, 2020, pp. 411–428.
[^5]: Y.-S. Liang and W.-J. Li, "Optimizing class distribution in memory for multi-label online continual learning," 2022, arXiv:2209.11469.
[^6]: H. Shin, J. K. Lee, J. Kim, and J. Kim, "Continual learning with deep generative replay," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–10.
[^7]: A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," 2016, arXiv:1606.04671.
[^8]: M. Roseberry, B. Krawczyk, Y. Djenouri, and A. Cano, "Self-adjusting K nearest neighbors for continual learning from multi-label drifting data streams," Neurocomputing, vol. 442, pp. 10–25, Jun. 2021.
[^9]: N. Masuyama, Y. Nojima, C. K. Loo, and H. Ishibuchi, "Multi-label classification via adaptive resonance theory-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8696–8712, Jul. 2023.
[^10]: G. A. Carpenter and S. Grossberg, "Adaptive resonance theory," Dept. Cogn. Neural Syst., Center Adapt. Syst., Boston Univ., Boston, MA, USA, Tech. Rep. CAS/CNS TR-98-029, 2010.