
Official code for "Rich Feature Construction for the Optimization-Generalization Dilemma"

Overview

In machine learning, defining a generalization goal (e.g., the invariance goal in out-of-distribution generalization) and finding a path to that goal (e.g., the many optimization tricks) are two key problems. There is usually a dilemma between the two: either the generalization goal is weak, or the optimization process is hard. This optimization-generalization dilemma is especially evident in the out-of-distribution (OOD) setting. This work tackles the dilemma by constructing a RICH and SIMPLE representation that makes the subsequent optimization easier, so that we can pursue a stronger generalization goal.

A short story

Two common questions in many areas are: "Where is our goal?" and "How do we reach the goal from our current position?" A successful project needs to answer both. The two questions, however, pull against each other in difficulty. When the goal is ambitious, the path to it is normally blurry; when the path is clear and certain, the goal is normally plain. For the goal of "making a cup of espresso", for instance, most people can lay out a clear, precise path immediately. On the other hand, "building a spacecraft to fly to Jupiter" is an ambitious goal, but most people have no idea how to achieve it.

Can we build a spacecraft purely by thinking about the "spacecraft"? No. A spacecraft is built on developments in diverse areas, such as materials, computing, and engines.

The story above suggests a path to hard problems: search and develop diverse areas (directions), and a clear path may then appear on top of them. Otherwise, keep searching.

The rule above is also the key idea of the proposed Rich Feature Construction (RFC) method.

Required environments

Optimization difficulties of OOD methods (ColoredMNIST)

OOD methods are sensitive to the network initialization. We test nine OOD methods (IRMv1, vREx, Fish, SD, IGA, LfF, RSC, CLOvE, Fishr) on the ColoredMNIST benchmark. Fig. 1 shows the OOD performance for different numbers of ERM pretraining epochs. None of the nine OOD methods works from a random initialization.

<p align="center"> <image src='figures/anneal_nll_full.png'/> </p> <p align="center"> <em> Fig1: Test performance of nine penalized OoD methods as a function of the number of epochs used to pre-train the neural network with ERM. The final OoD testing performance is very dependent on choosing the right number of pretraining epochs, illustrating the challenges of these optimization problems. </em> </p>

To reproduce the results, run:

```bash
bash script/coloredmnist/coloredmnist_anneal.sh
```
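For context, the recipe being swept in Fig. 1 is the standard one: train with plain ERM for some number of epochs, then switch on (or sharply increase) the OOD penalty. Below is a minimal sketch of that schedule, assuming an IRMv1-style gradient penalty; the function names and hyperparameter values are illustrative, not this repository's API.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    # IRMv1 penalty: squared gradient of the risk with respect to a
    # dummy classifier scale w = 1 (Arjovsky et al., 2019).
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    return torch.autograd.grad(loss, [scale], create_graph=True)[0].pow(2).sum()

def train_step(model, optimizer, envs, epoch,
               pretrain_epochs=100, penalty_weight=1e4):
    # Plain ERM before `pretrain_epochs`; afterwards the penalty dominates.
    # Fig. 1 sweeps `pretrain_epochs`, and the final OOD accuracy turns out
    # to be very sensitive to it.
    erm, penalty = 0.0, 0.0
    for x, y in envs:                       # one minibatch per environment
        logits = model(x).squeeze(-1)
        erm = erm + F.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + irmv1_penalty(logits, y)
    lam = penalty_weight if epoch >= pretrain_epochs else 1.0
    loss = (erm + lam * penalty) / max(lam, 1.0)  # rescale, as in the IRM reference code
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```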

Generalization difficulties of OOD methods (ColoredMNIST)

Starting from a 'perfect' initialization where the model uses only the robust feature (OOD performance is maximized), what happens if we keep training with these OOD methods? Do they maintain the robustness, or do they decay to a spurious/degenerate solution? Fig. 2 (top) gives the latter answer.

<p align="center"> <image src='figures/long_train_vstack.png'/> </p> <p align="center"> <em> Fig2: Test performance of OoD methods as a function of training epochs. Top: Six OoD methods are trained from a ‘perfect’ initialization where only the robust feature is well learned. The blue star indicates the initial test accuracy. Bottom: The OoD methods are trained from the proposed (frozen) RFC representation. </em> </p>

To reproduce the results (top), run:

```bash
bash script/coloredmnist/coloredmnist_perfect_initialization_longtrain.sh
```

The proposed RFC on ColoredMNIST

The proposed RFC method creates a rich & simple representation to resolve the optimization-generalization dilemma above. Tab. 1 compares random initialization (Rand), ERM-pretrained initialization (ERM), and RFC-pretrained initialization (RFC / RFC(cf)). The proposed RFC consistently boosts the OOD methods.

<p align="center"> <image src="figures/colormnist.png" width="500"/> </p> <p align="center"> <em>Tab1: OoD testing accuracy achieved on the COLORMNIST. The first six rows of the table show the results achieved by six OoD methods using respectively random initialization (Rand), ERM initialization (ERM), RFC initialization (RFC). The last column, RFC(cf), reports the performance achieved by running the OoD algorithm on top of the frozen RFC representations. The seventh row reports the results achieved using ERM under the same conditions. The last row reminds us of the oracle performance achieved by a network using data from which the spurious feature (color) has been removed. </em> </p>

To reproduce the results, run:

```bash
bash script/coloredmnist/coloredmnist_rfc.sh
```
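The "(cf)" variants above train only a classifier on top of the frozen RFC representation. Here is a conceptual sketch of that setting; the module shapes and names are illustrative (it assumes the 2×14×14 ColoredMNIST inputs of the usual IRM setup), not this repository's actual code.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 256                        # illustrative width

# Stand-in for the RFC-pretrained trunk; in the repo this representation
# comes out of the RFC pretraining scripts.
featurizer = nn.Sequential(
    nn.Flatten(),
    nn.Linear(2 * 14 * 14, FEATURE_DIM), nn.ReLU(),
    nn.Linear(FEATURE_DIM, FEATURE_DIM), nn.ReLU(),
)
for p in featurizer.parameters():
    p.requires_grad_(False)              # "(cf)": freeze the representation
featurizer.eval()

classifier = nn.Linear(FEATURE_DIM, 1)   # only this head is trained
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def predict(x):
    with torch.no_grad():
        z = featurizer(x)                # frozen rich features
    return classifier(z)                 # logits fed to the ERM/OOD objective
```

Because the representation is fixed, the OOD objective only has to select the robust directions inside it, which is a much easier optimization problem; this is the setting behind the RFC(cf) column and Fig. 2 (bottom).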

Aiming for the second easiest-to-find feature is not OOD generalization

A line of work seeks OOD generalization by discovering the second easiest-to-find feature, e.g., PI. Here we argue that the second easiest-to-find feature is not, in general, the robust feature. To showcase this, we create an 'InverseColoredMNIST' dataset in which the robust feature (digit shape) is more predictive than the spurious feature (color).

<p align="center"> <image src="figures/inversecoloredmnist.png" width="500" /> </p> <p align="center"> <em> Tab2: OoD test accuracy of PI and OOD/ERM methods on COLOREDMNIST and INVERSECOLOREDMNIST. The OOD/ERM methods are trained on top of a frozen RFC representation. </em> </p>

To reproduce the results, run:

```bash
bash script/coloredmnist/inversecoloredmnist_rfc.sh
```
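For reference, here is a sketch of how such datasets are typically built, following the standard ColoredMNIST recipe (label from the digit with some noise, color channel from the label with some noise). The InverseColoredMNIST noise levels below are illustrative placeholders, not the paper's exact values.

```python
import torch
from torchvision import datasets

def make_env(images, digits, label_noise, color_noise):
    # Binary label from the digit (< 5), flipped with prob `label_noise`;
    # color channel chosen from the label, flipped with prob `color_noise`.
    # The cue with the lower noise is the more predictive feature.
    labels = (digits < 5).float()
    labels = torch.logical_xor(labels.bool(),
                               torch.rand(len(labels)) < label_noise).float()
    colors = torch.logical_xor(labels.bool(),
                               torch.rand(len(labels)) < color_noise).long()
    x = torch.stack([images, images], dim=1).float() / 255.0
    x[torch.arange(len(x)), 1 - colors] = 0.0   # zero out the unused channel
    return x, labels

mnist = datasets.MNIST('~/data', train=True, download=True)
images, digits = mnist.data[:, ::2, ::2], mnist.targets  # 14x14, as in the IRM setup

# ColoredMNIST: color (noise 0.1 / 0.2) is more predictive than the digit (noise 0.25).
env1 = make_env(images[0::2], digits[0::2], label_noise=0.25, color_noise=0.1)
env2 = make_env(images[1::2], digits[1::2], label_noise=0.25, color_noise=0.2)

# InverseColoredMNIST (illustrative noise levels): swap the roles so that the
# digit is more predictive than the color.
inv1 = make_env(images[0::2], digits[0::2], label_noise=0.1, color_noise=0.25)
inv2 = make_env(images[1::2], digits[1::2], label_noise=0.2, color_noise=0.25)
```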

Camelyon17 experiments

| Network<br>Initialization | Methods | Test Acc<br>(IID Tune) | Test Acc<br>(OOD Tune) | Scripts | Comments |
|:---:|:---:|:---:|:---:|:---:|:---|
| - | ERM | 66.6±9.8 | 70.2±8.7 | A | |
| ERM | IRMv1 | 68.6±6.8 | 68.5±6.2 | B | |
| ERM | vREx | 69.1±8.1 | 69.1±13.2 | C | |
| ERM | ERM(cf) | - | - | - | |
| ERM | IRMv1(cf) | 69.6±10.5 | 70.7±10.0 | A, D | |
| ERM | vREx(cf) | 69.6±10.5 | 70.6±10.0 | A, E | |
| ERM | CLOvE(cf) | 69.6±10.5 | 69.2±9.5 | A, F | |
| 2-RFC | ERM | 72.8±3.2 | 74.7±4.3 | A, G, H, I | set lambda=0 in I |
| 2-RFC | IRMv1 | 71.6±4.2 | 75.3±4.8 | A, G, H, I | |
| 2-RFC | vREx | 73.4±3.3 | 76.4±5.3 | A, G, H, J | |
| 2-RFC | CLOvE | 74.0±4.6 | 76.6±5.3 | A, G, H, K | |
| 2-RFC | ERM(cf) | 78.2±2.6 | 78.6±2.6 | A, G, H, L | |
| 2-RFC | IRMv1(cf) | 78.0±2.1 | 79.1±2.1 | A, G, H, L | set lambda=0 in L |
| 2-RFC | vREx(cf) | 77.9±2.7 | 79.5±2.7 | A, G, H, M | |
| 2-RFC | CLOvE(cf) | 77.8±2.2 | 78.6±2.6 | A, G, H, N | |
| 3-RFC | ERM(cf) | 72.9±5.3 | 73.3±5.3 | A, G, O, P, Q | set lambda=0 in Q |
| 3-RFC | IRMv1(cf) | 72.7±5.5 | 75.5±3.8 | A, G, O, P, Q | |
| 3-RFC | vREx(cf) | 72.7±5.4 | 75.1±5.3 | A, G, O, P, R | |
| 3-RFC | vREx(cf) | 72.8±5.4 | 73.2±7.1 | A, G, O, P, S | |
<p align="center"> <image src='figures/lambda_valid_test_irm_vrex_clove.png'> </p>

Citation

If you find our code useful, please consider citing our work using the following BibTeX entry:

```bibtex
@inproceedings{zhang2022rich,
  title={Rich feature construction for the optimization-generalization dilemma},
  author={Zhang, Jianyu and Lopez-Paz, David and Bottou, L{\'e}on},
  booktitle={International Conference on Machine Learning},
  pages={26397--26411},
  year={2022},
  organization={PMLR}
}
```