Multi-Level Analysis and Annotation of Arabic Corpora for Text-to-Sign Language MT
📝 Original Info
- Title: Multi-Level Analysis and Annotation of Arabic Corpora for Text-to-Sign Language MT
- ArXiv ID: 1605.07346
- Date: 2016-05-25
- Authors: Abdelaziz Lakhfif, Mohammed T. Laskri, Eric Atwell
📝 Abstract
In this paper, we present an ongoing effort in lexical semantic analysis and annotation of Modern Standard Arabic (MSA) text: a semi-automatic annotation tool covering the morphological, syntactic, and semantic levels of description.

📄 Full Content
In this paper, we present an ongoing effort in lexical semantic analysis and annotation of Modern Standard Arabic (MSA) text: a semi-automatic annotation tool covering the morphological, syntactic, and semantic levels of description. Besides providing a multi-level annotation tool for Arabic corpora, our goals are (1) to investigate the suitability of the Frame Semantics (FS) approach (Fillmore 1985) for representing and analysing Arabic text; (2) to provide corpus-attested linguistic material for frame-based contrastive text analysis between Arabic and English in terms of lexicalization patterns; and (3) to automatically derive mapping rules from annotated sentences. Such corpus-attested mapping rules between a linguistic form and its meaning can support semantic analysis in knowledge-based NLP systems such as machine translation and information extraction.
Following syntactically based annotation projects for English, serious attempts have been made to annotate Arabic corpora, such as the Penn Arabic Treebank (PATB) (Maamouri et al. 2004) and the Quranic Arabic Dependency Treebank (QADT) (Dukes et al. 2010).
However, semantically based annotation of Arabic corpora has not yet received the same attention.
Our semantic representations are based on the frame-semantic paradigm; they are currently used in an MT system from Arabic to Algerian Sign Language that aims to assist deaf children by bridging the gap between written Arabic texts and Algerian Sign Language (Lakhfif and Laskri 2010a,b, 2011).
Annotation outputs are available in an XML format compatible with the FrameNet project design (Fillmore and Petruck 2003) and can be ported to other NLP systems.
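As a rough illustration of what a layered XML annotation record could look like, the following Python sketch serializes one annotated sentence. The element and attribute names (`sentence`, `annotationSet`, `layer`, `label`) are assumptions chosen for readability; they are loosely inspired by FrameNet's organisation of labels into layers and do not reproduce the tool's actual schema or the official FrameNet schema. The example sentence and its labels are likewise illustrative.

```python
import xml.etree.ElementTree as ET


def span(text, phrase):
    """Return (start, end) character offsets of `phrase` in `text` (end exclusive)."""
    start = text.index(phrase)
    return start, start + len(phrase)


def annotation_to_xml(text, frame, target, layers):
    """Serialize one annotated sentence.

    `layers` maps a layer name (e.g. "FE", "GF") to a list of
    (start, end, label) tuples. All tag names are hypothetical.
    """
    root = ET.Element("sentence")
    ET.SubElement(root, "text").text = text
    aset = ET.SubElement(root, "annotationSet", frame=frame, target=target)
    for layer_name, spans in layers.items():
        layer = ET.SubElement(aset, "layer", name=layer_name)
        for start, end, label in spans:
            ET.SubElement(layer, "label", name=label,
                          start=str(start), end=str(end))
    return ET.tostring(root, encoding="unicode")


# Illustrative sentence: "The boy went to the school."
sentence = "ذهب الولد إلى المدرسة"
layers = {
    "FE": [(*span(sentence, "الولد"), "Theme"),
           (*span(sentence, "إلى المدرسة"), "Goal")],
    "GF": [(*span(sentence, "الولد"), "Ext"),
           (*span(sentence, "إلى المدرسة"), "Dep")],
}
xml_string = annotation_to_xml(sentence, "Motion", "ذهب", layers)
print(xml_string)
```

Storing character offsets rather than copies of the annotated phrases keeps every layer aligned with the same sentence text, so layers can be added or dropped without touching the others.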
During the last decade, the emergence of semantically rich lexical resources has been one of the most striking advances in the NLP domain.
The annotation also makes use of publicly available lexical semantic resources such as the Buckwalter Arabic Morphological Analyzer (BAMA) lexicon (Buckwalter 2002), FrameNet (Baker et al. 1998), and Arabic WordNet (AWN) (Elkateb et al. 2006), in order to facilitate and scale up to wide-coverage annotation tasks.
The FrameNet project is the most important online semantic lexicon resource for the English language, based on the tenets of Fillmore’s Frame Semantics. The database is validated with semantically and syntactically annotated texts from real linguistic data. The success of the Berkeley FrameNet for English has motivated similar projects to produce comparable frame-semantic lexicons for other languages such as German, Japanese, Spanish, Danish, and Swedish (Boas 2009). For our project, we are reusing frames from the Berkeley FrameNet project for English to construct an equivalent FrameNet for Arabic, following the process described in Fillmore and Petruck (2003). We have also used AWN to extend the FrameNet coverage for Arabic.
AWN is a lexical database of Arabic words organized as a large semantic network. It follows the methodology developed for EuroWordNet and is mapped to Princeton WordNet 2.0 and the SUMO ontology.
BAMA is a rule-based morphological analyzer that provides, for every lexicon entry, a morphological compatibility category, an English gloss, and occasional part-of-speech (POS) tags.
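To make the notion of a lexicon entry with a compatibility category concrete, here is a minimal Python sketch. It assumes a prefix/stem/suffix segmentation in which a candidate analysis is accepted only when the categories of its three parts are pairwise compatible. The tables, category names (`P0`, `P_CONJ`, `S_VERB`, `X0`), and transliterated forms are toy data invented for this example; they are not BAMA's actual lexicon or category inventory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    form: str                  # form in a Latin transliteration (toy data)
    category: str              # morphological compatibility category
    gloss: str                 # English gloss
    pos: Optional[str] = None  # POS tag, present only for some entries


# Hypothetical toy tables; category names are made up for illustration.
PREFIXES = {
    "": [LexiconEntry("", "P0", "")],
    "w": [LexiconEntry("w", "P_CONJ", "and", "CONJ")],
}
STEMS = {"ktb": [LexiconEntry("ktb", "S_VERB", "write", "VERB")]}
SUFFIXES = {"": [LexiconEntry("", "X0", "")]}

# Pairs of categories that are allowed to co-occur in one word.
COMPATIBLE = {
    ("P0", "S_VERB"), ("P_CONJ", "S_VERB"),
    ("S_VERB", "X0"),
    ("P0", "X0"), ("P_CONJ", "X0"),
}


def analyze(word):
    """Return every (prefix, stem, suffix) entry triple whose categories agree."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            for p in PREFIXES.get(pre, []):
                for s in STEMS.get(stem, []):
                    for x in SUFFIXES.get(suf, []):
                        pairs = {(p.category, s.category),
                                 (s.category, x.category),
                                 (p.category, x.category)}
                        if pairs <= COMPATIBLE:
                            analyses.append((p, s, x))
    return analyses


for prefix, stem, suffix in analyze("wktb"):
    print(prefix.gloss, "+", stem.gloss, stem.pos)
```

The optional `pos` field mirrors the fact that POS tags are given only occasionally, so downstream annotation code should not assume every entry carries one.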
In order to cover the linguistic variation of the language, we aim to use several Arabic corpora from different sources, such as Quranic texts and the Corpus of Contemporary Arabic (CCA) (Al-Sulaiti and Atwell 2006). We have started with a collection of Arabic texts from Algerian primary school educational books, currently used as a development corpus in our Arabic text-to-Sign Language MT system for Algerian deaf children. For sentence extraction, we follow the FrameNet project's approach to corpus data collection and organisation: for each target word, sentences containing the word in the appropriate sense are extracted from the different corpus sources and classified into sub-corpora by syntactic pattern. Each annotation level consists of label sets arranged in annotation layers, such as “FE” for Frame Element annotation, “GF” for Grammatical Function annotation, and “Sumo” for SUMO concept annotation. Annotation data are organized by Lexical Unit; that is, one file contains the data for all layers. The tool can also output annotation data for selected layers only, and annotation can be performed incrementally. In the pilot phase of the annotation, more than 1,000 Arabic sentences expressing Motion events were annotated with our tool.
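The organisation described above (sentences grouped by Lexical Unit, split into sub-corpora by syntactic pattern, with labels held in named layers) can be sketched as a small Python data structure. The class and function names (`AnnotatedSentence`, `build_subcorpora`, `select_layers`), the pattern strings, and the sample sentences and labels are all assumptions made for illustration; they do not describe the tool's internal design.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AnnotatedSentence:
    text: str
    lexical_unit: str   # target word, e.g. "ذهب.v" ("go")
    pattern: str        # syntactic pattern used to pick the sub-corpus
    # layer name -> list of (phrase, label); layers can be added incrementally
    layers: dict = field(default_factory=dict)


def build_subcorpora(sentences):
    """Group sentences as lexical unit -> syntactic pattern -> sentences."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s in sentences:
        grouped[s.lexical_unit][s.pattern].append(s)
    return grouped


def select_layers(sentence, wanted):
    """Return only the requested annotation layers of one sentence."""
    return {name: labels for name, labels in sentence.layers.items()
            if name in wanted}


# Illustrative data only; labels are not taken from the annotated corpus.
corpus = [
    AnnotatedSentence(
        "ذهب الولد إلى المدرسة", "ذهب.v", "V NP PP",
        {"FE": [("الولد", "Theme"), ("إلى المدرسة", "Goal")],
         "GF": [("الولد", "Ext"), ("إلى المدرسة", "Dep")],
         "Sumo": [("ذهب", "Motion")]}),
    AnnotatedSentence(
        "ذهبت البنت إلى السوق", "ذهب.v", "V NP PP",
        {"FE": [("البنت", "Theme"), ("إلى السوق", "Goal")]}),
    AnnotatedSentence(
        "ذهب الولد", "ذهب.v", "V NP",
        {"FE": [("الولد", "Theme")]}),
]

subcorpora = build_subcorpora(corpus)
fe_only = select_layers(corpus[0], {"FE"})
print({pattern: len(items) for pattern, items in subcorpora["ذهب.v"].items()})
```

Because each layer is an independent entry in `layers`, a second annotation pass can add a "GF" or "Sumo" layer to a sentence that so far carries only "FE" labels, which is one simple way to support incremental annotation.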
Besides diacritics that mark syntactic case, Arabic encodes nominal categories with inflectional features such as gender, number, and definiteness. The verb encodes gender, number, person, tense, voice, and mood features. Syntactically, Arabic is a pro-drop language. Th