Prefix-Tuning: Optimizing Continuous Prompts for Generation

Summary. This paper proposes prepending trainable virtual tokens (a "prefix") to language model inputs. The authors train only the continuous embeddings of these virtual tokens, keeping the language model frozen, to steer it toward specific tasks such as table-to-text generation or summarization. Notably, the authors experiment with GPT-2, which is difficult to steer using natural language prompts alone.
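
To make the idea concrete, here is a minimal sketch of the simplest variant: learn a small matrix of virtual-token embeddings, prepend it to the input embeddings, and update only the prefix while GPT-2 stays frozen. Note this is a simplification for illustration; the paper actually optimizes prefix activations at every layer (via past key-values) with an MLP reparameterization, and the prefix length of 10 below is a hypothetical choice.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LEN = 10  # number of virtual tokens (hypothetical choice)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():      # freeze the language model
    p.requires_grad = False

hidden = model.config.n_embd
# Trainable prefix embeddings: the only parameters that get updated.
prefix = nn.Parameter(torch.randn(PREFIX_LEN, hidden) * 0.02)

def forward_with_prefix(input_ids):
    tok_emb = model.transformer.wte(input_ids)            # (B, T, H)
    pre = prefix.unsqueeze(0).expand(input_ids.size(0), -1, -1)  # (B, P, H)
    inputs_embeds = torch.cat([pre, tok_emb], dim=1)      # prepend virtual tokens
    return model(inputs_embeds=inputs_embeds)

ids = tokenizer("name: Starbucks | type: coffee shop", return_tensors="pt").input_ids
out = forward_with_prefix(ids)

# Only the prefix is optimized for the downstream task.
optimizer = torch.optim.Adam([prefix], lr=1e-3)
```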

Discussion.

Relevance. The main motivation for this work is that GPT-2 is not good at following human instructions for tasks such as "convert this table to words". Now that large language models such as ChatGPT have become exceptionally good at following human instructions (i.e., performing zero-shot learning), do we still need prefix-tuning? We think it might still be beneficial, in particular when the difficulty of writing good prompts justifies the cost of training on a task-specific dataset. However, for any model, back-propagation generally costs about 3x a forward pass, so the training cost is not negligible even when only the prefix is updated.

Reasons for superior performance. Why does prefix-tuning outperform full-model fine-tuning in low-data regimes and in out-of-domain generalization? One possibility is the much smaller number of trainable parameters, which may mitigate overfitting: learning a large number of parameters in the low-data regime is particularly prone to it. Perhaps a fairer comparison would be to fine-tune the same number of model parameters as the prefix contains.
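
A rough parameter count makes the gap concrete. The shapes below are assumed GPT-2 Medium values (24 layers, hidden size 1024, roughly 345M total parameters) with a hypothetical prefix length of 10:

```python
# Back-of-the-envelope comparison of trainable parameter counts.
n_layers, hidden, total_lm_params = 24, 1024, 345_000_000
prefix_len = 10

# Prefix-tuning stores one key and one value vector per layer per virtual token.
prefix_params = prefix_len * 2 * n_layers * hidden
print(prefix_params)                    # ~0.5M trainable parameters
print(prefix_params / total_lm_params)  # ~0.0014, i.e. well under 1% of a full fine-tune
```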

Compositionality. When task instructions are given as natural language descriptions, they are composable. For example, to make the model first convert a table into words and then generate a summary, we can simply concatenate the two instructions. However, can we concatenate the fine-tuned virtual tokens of the two tasks to obtain virtual tokens for the composed task? Given that word embedding arithmetic often does not work as cleanly as one would hope (e.g. this post), it is unlikely that these virtual tokens will be composable. A sketch of this hypothetical experiment follows.
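
The sketch below only illustrates the operation being questioned: concatenating two independently learned prefixes and hoping the result encodes "table-to-text, then summarize". The tensors are stand-ins, and whether the composed prefix behaves like the composed instruction is exactly the open empirical question raised above.

```python
import torch

prefix_table_to_text = torch.randn(10, 1024)  # stand-in for a learned prefix
prefix_summarize     = torch.randn(10, 1024)  # stand-in for a learned prefix

# Naive composition: concatenate along the token dimension, giving shape (20, 1024).
composed_prefix = torch.cat([prefix_table_to_text, prefix_summarize], dim=0)
# Prepend this to the input and check whether the model actually performs the
# composed task -- nothing in the training objective guarantees it will.
```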