Which of the following techniques is most effective for adapting a pre-trained language model to a new domain with significantly different vocabulary and linguistic patterns, given only a limited amount of labeled data for the target task?
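For concreteness, one commonly cited approach in this setting is domain-adaptive (continued) pretraining on unlabeled in-domain text, optionally combined with vocabulary extension, before fine-tuning on the small labeled set. The sketch below is illustrative only, assuming `bert-base-uncased` as the base model and a toy stand-in for the domain corpus:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Extend the vocabulary with domain terms the base tokenizer fragments badly
# (the terms here are made-up examples), then resize the embedding matrix.
tokenizer.add_tokens(["nephrotoxicity", "immunohistochemistry"])
model.resize_token_embeddings(len(tokenizer))

# Tiny stand-in for a large *unlabeled* in-domain corpus.
corpus = Dataset.from_dict({"text": [
    "Signs of nephrotoxicity were absent on immunohistochemistry.",
    "The assay showed no evidence of nephrotoxicity.",
]})
corpus = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# Continued masked-LM pretraining on the domain corpus; supervised fine-tuning
# on the small labeled target-task set would follow from this checkpoint.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-checkpoint", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```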
Question 2
In the context of Transformer models, what is the primary computational bottleneck when processing very long input sequences, and how do some advanced architectures attempt to mitigate this?
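As a back-of-the-envelope illustration of the cost this question points at (the dimensions and window size below are arbitrary assumptions):

```python
import torch

n, d, w = 1024, 64, 128          # sequence length, head dim, local window (arbitrary)
q = torch.randn(n, d)
k = torch.randn(n, d)

# Full self-attention materializes an (n, n) score matrix, so time and memory
# grow quadratically with sequence length.
scores = q @ k.T
print(scores.shape, scores.numel())          # torch.Size([1024, 1024]) 1048576

# One common mitigation is sparse/local attention (e.g. a sliding window), where
# each position attends to at most w neighbours, i.e. O(n * w) scored pairs.
idx = torch.arange(n)
local = (idx[None, :] - idx[:, None]).abs() <= w // 2   # dense here purely to show the pattern
print(local.float().mean().item())           # fraction of pairs actually attended to
```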
Question 3
Consider a scenario where a pre-trained language model exhibits strong performance on general language understanding tasks but suffers from factual inconsistency and hallucination in generative tasks. Which of the following fine-tuning strategies would be most appropriate to address these specific issues?
Question 4
Which of the following statements accurately describes a fundamental difference in how BERT and GPT-style models handle context during pre-training?
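The distinction at stake can be made concrete with the attention masks used during pre-training; a toy sketch follows (the sentence and shapes are illustrative assumptions):

```python
import torch

tokens = ["the", "trail", "[MASK]", "was", "steep"]   # toy BERT-style masked-LM input
n = len(tokens)

# BERT-style masked-LM pre-training: every position can attend to every other
# position, so the masked token is predicted from both left and right context.
bidirectional_mask = torch.ones(n, n, dtype=torch.bool)

# GPT-style causal-LM pre-training: a lower-triangular mask lets position i see
# only positions <= i, and each token is predicted from its left context alone.
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

print(bidirectional_mask.int())
print(causal_mask.int())
```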
Question 5
When fine-tuning a pre-trained Transformer-based model for a sequence labeling task (e.g., Named Entity Recognition), what is the most common and effective modification made to the model's architecture, and why?
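For reference, the kind of change such questions typically have in mind can be sketched as follows (the encoder choice and label set are assumptions made for illustration):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

num_labels = 5                                   # e.g. O, B-PER, I-PER, B-LOC, I-LOC (assumed tag set)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")

# The usual modification: a per-token classification head over the encoder's
# final hidden states, in place of any single sequence-level output layer.
head = nn.Linear(encoder.config.hidden_size, num_labels)

batch = tokenizer(["Ada Lovelace lived in London"], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, hidden_size)
logits = head(hidden)                            # (batch, seq_len, num_labels): one label distribution per token
print(logits.shape)

# Training applies token-level cross-entropy, conventionally ignoring padding
# and subword-continuation positions (label id -100 in Hugging Face examples).
```

In practice this is what `AutoModelForTokenClassification` in the `transformers` library wires up automatically.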