aschern at SemEval-2020 Task 11: It Takes Three to Tango: RoBERTa, CRF, and Transfer Learning
Research Summary
The paper presents a comprehensive system for SemEval-2020 Task 11, which focuses on detecting propaganda techniques in news articles. The task is split into two subtasks: Span Identification (SI), which requires locating the textual spans that contain propaganda, and Technique Classification (TC), which assigns one or more propaganda technique labels to each identified span. The authors build their solutions around the RoBERTa-large pre-trained language model, augmenting it with a linear-chain Conditional Random Field (CRF) for the SI subtask, and with additional span-level features and transfer learning for the TC subtask. They also devise extensive rule-based post-processing steps and combine multiple models through ensembling.
For SI, the authors first convert the gold span annotations into a BIO tagging scheme (B-PROP, I-PROP, O). A RoBERTa-large model is fine-tuned to predict these tags token-wise. Since the token-level classification head predicts each tag independently, the authors add a CRF layer on top of the RoBERTa logits to model label transitions, so that illegal sequences such as O → I-PROP are penalized. Because RoBERTa uses byte-pair encoding, only tokens that start a word are fed to the CRF, effectively ignoring sub-word continuation tokens. After decoding the CRF output into spans, two post-processing operations are applied: (1) spans are trimmed or extended so that both boundaries are alphanumeric characters, correcting tokenization errors; (2) surrounding quotation marks are added when present, as propaganda often appears inside quotes. To improve robustness, two identical architectures are trained with different random seeds; at inference time their predictions are merged, with overlapping spans unified into a single superspan. This ensemble reduces variance across runs.
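The conversion from gold character-level spans to BIO tags can be sketched as follows. This is a hypothetical illustration, assuming we already have per-token character offsets from a tokenizer; the function name and overlap rule are illustrative, not taken from the paper's code.

```python
def spans_to_bio(token_offsets, gold_spans):
    """Map character-level propaganda spans to token-level BIO tags.

    token_offsets: list of (start, end) character offsets, one per token.
    gold_spans: list of (start, end) character offsets of propaganda spans.
    """
    tags = ["O"] * len(token_offsets)
    for span_start, span_end in gold_spans:
        inside = False  # becomes True once the first token of the span is tagged
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            # A token belongs to the span if their character ranges overlap.
            if tok_start < span_end and tok_end > span_start:
                tags[i] = "I-PROP" if inside else "B-PROP"
                inside = True
    return tags

# Toy example: tokens "Buy", "now", "!" with a gold span covering "Buy now".
tokens = [(0, 3), (4, 7), (7, 8)]
print(spans_to_bio(tokens, [(0, 7)]))  # → ['B-PROP', 'I-PROP', 'O']
```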
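The rule-based post-processing and the two-seed ensemble merge can be sketched like this. A minimal sketch under stated assumptions: the function names are hypothetical, spans are (start, end) character offsets, and only straight double quotes are handled in the quote rule.

```python
def trim_to_alnum(text, start, end):
    """Shrink span boundaries until both ends sit on alphanumeric characters."""
    while start < end and not text[start].isalnum():
        start += 1
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return start, end

def absorb_quotes(text, start, end):
    """Extend the span to include quotation marks that directly surround it."""
    if start > 0 and end < len(text) and text[start - 1] == '"' and text[end] == '"':
        return start - 1, end + 1
    return start, end

def merge_spans(spans_a, spans_b):
    """Union of two models' predictions; overlapping spans become one superspan."""
    merged = []
    for start, end in sorted(spans_a + spans_b):
        if merged and start <= merged[-1][1]:  # overlaps (or touches) previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

text = 'He said "buy now!" loudly.'
print(trim_to_alnum(text, 8, 18))                     # → (9, 16), i.e. "buy now"
print(merge_spans([(0, 5), (10, 15)], [(3, 8)]))      # → [(0, 8), (10, 15)]
```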
For TC, the problem is multi-label, but the training data contain very few multi-label instances. The authors therefore transform each multi-label example into multiple single-label copies, training a standard multi-class classifier. The input format is …
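The multi-label-to-single-label transformation amounts to duplicating each example once per label. A minimal sketch, assuming examples are (span_text, labels) pairs; the function name and the technique labels in the toy data are illustrative.

```python
def flatten_multilabel(examples):
    """Turn multi-label examples into single-label copies for a
    standard multi-class classifier."""
    flat = []
    for span_text, labels in examples:
        for label in labels:
            flat.append((span_text, label))
    return flat

# Toy data: the second span carries two technique labels.
data = [
    ("our glorious leader", ["Loaded_Language"]),
    ("they will destroy us all", ["Appeal_to_Fear", "Exaggeration"]),
]
print(flatten_multilabel(data))  # → three single-label training instances
```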