Choosing a structure for linguistic resources and identifying the appropriate tags for a given problem is not a simple process; it depends on how well the problem is understood and on the way in which it is approached.
As mentioned in the second article, three basic dimensions must be present in machine learning training data: adequate data volume; data quality; and the ability of the data to represent the characteristics of the problem to be addressed.
It is this last dimension that presents the greatest challenge when building language resources. Even if the data is large and of high quality, this is not sufficient for successful machine learning.
Most machine learning processes rely on annotating corpora with specific tags. These tags represent the dimensions of the problem to be solved, as seen from the point of view of the person who chose them. When the chosen tags are unsuitable or fail to represent the dimensions of the problem, the machine will not learn how to solve the problem, or it will learn very little.
If we take, as an example, the simple task of identifying parts of speech that we discussed in article 2, we find that choosing a set of part-of-speech tags is the first challenge, and it comes before the labeling process itself. For example, will we use a single tag (VERB) for all verbs, or must we label each type of verb (VERB_PAST, VERB_PRESENT, VERB_IMPERATIVE)? If we choose the second option, do we consider what are traditionally called nominative pronouns (فعلتُ I did, فَعَلْنا we did, اِفْعَلي you do [imperative, fem.], etc.) parts of the verb or separate units? What about the present-tense letters, and the cases of the present-tense verb, i.e. nominative, subjunctive, and jussive? These questions, and dozens like them, cannot be skipped before verb labeling begins; otherwise the result, no matter how good it is, will still fall short of helping the machine learn correctly.
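The two tagging decisions described above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the tags VERB, VERB_PAST, VERB_PRESENT, and VERB_IMPERATIVE come from the text, while the PRON tag and the particular way the word is split are hypothetical choices made for the example, not a recommended scheme.

```python
# Sketch only: PRON and the segmentation below are hypothetical.
# Only VERB, VERB_PAST, VERB_PRESENT and VERB_IMPERATIVE come from the text.

FINE_TO_COARSE = {
    "VERB_PAST": "VERB",
    "VERB_PRESENT": "VERB",
    "VERB_IMPERATIVE": "VERB",
}


def to_coarse(tagged_tokens):
    """Collapse fine-grained verb tags into the single coarse VERB tag.

    Tags that are not in the mapping are returned unchanged.
    """
    return [(token, FINE_TO_COARSE.get(tag, tag)) for token, tag in tagged_tokens]


# Decision 1: one tag for all verbs, or one tag per verb type?
fine = [("فعلتُ", "VERB_PAST"), ("اِفْعَلي", "VERB_IMPERATIVE")]
coarse = to_coarse(fine)

# Decision 2: is the attached pronoun part of the verb or a separate unit?
pronoun_inside_verb = [("فعلتُ", "VERB_PAST")]
pronoun_as_separate_unit = [("فعل", "VERB_PAST"), ("تُ", "PRON")]

print(coarse)
print(len(pronoun_inside_verb), len(pronoun_as_separate_unit))
```

A fine-grained annotation can always be collapsed into the coarse one, as `to_coarse` does, but the reverse is impossible without relabeling the corpus. Likewise, the two segmentation choices yield a different number of labeled units for the same word, so the decision has to be fixed before any labeling starts.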
In other cases, the data is not tagged but is organized in a special way, as in parallel corpora. Here, too, determining the structure and characteristics of the dataset plays a crucial role in the machine’s ability to learn later.
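As a rough illustration of what "organized in a special way" can mean for a parallel corpus, the Python sketch below stores each source segment together with its aligned translation and refuses input whose two sides do not line up. The names `AlignedPair` and `build_parallel_corpus` are hypothetical, and the one-to-one, line-by-line alignment is an assumption of this example; real parallel corpora may be aligned at other levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned unit of a parallel corpus (hypothetical structure)."""
    source: str
    target: str


def build_parallel_corpus(source_lines, target_lines):
    """Pair the i-th source line with the i-th target line.

    Assumes a strict one-to-one alignment; raises ValueError when the two
    sides differ in length or when either side of a pair is empty.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    corpus = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            raise ValueError("every aligned pair needs text on both sides")
        corpus.append(AlignedPair(source=src, target=tgt))
    return corpus


corpus = build_parallel_corpus(["فعلتُ", "فَعَلْنا"], ["I did", "we did"])
for pair in corpus:
    print(pair.source, "->", pair.target)
```

The structural rule here, that every source unit has exactly one non-empty counterpart, carries no tags at all, yet it determines what the machine can learn from the data in the same way a tag set does.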
In fact, defining the data structure and assigning the appropriate set of tags is intrinsically linked to the process of defining, describing, and framing the problem. As we saw in the first article, neglecting that process is the main reason most Arabic language programs fail to reach high accuracy levels.
To summarize the four articles, most Arabic language programs fail to reach high accuracy levels for four reasons: moving directly to developing solutions before defining, describing, and framing the problem; the belief that a large amount of data alone is sufficient for successful machine learning; low data quality; and the failure to organize and label the data in a way that represents the linguistic problem to be solved.