For machine learning to succeed, it is not enough to have huge amounts of data. The data must also be of high quality and designed to represent the problem properly.
In the previous article, we discussed the first reason that we believe lies behind the low effectiveness of Arabic programs dealing with language problems. It was simply the tendency to move directly to developing solutions before the problem has been defined, described, and properly framed.
In this article, we will focus on the second reason, which is related to machine learning solutions. It is known that machine learning depends fundamentally on two components: algorithms and models on the one hand, and training data on the other hand.
We will not discuss the first component because it is outside our competence, and because these algorithms and models are – as far as we know – of a global nature and are not particularly related to one language more than another. We will focus our attention on the second component, which is the training data.
There are three basic dimensions that must exist in the training data of any machine learning model so that the machine can learn successfully and simulate a human's ability to perform linguistic tasks. These dimensions are data volume, data quality, and how accurately the data represents the features of the problem to be addressed.
Some people believe that obtaining massive data is all the machine needs to gain the human capacity for language. But this belief is invalidated at the first practical test. Data volume is undoubtedly very important, but it is definitely insufficient on its own. For example, training a machine on large amounts of data of a single type will limit its learning to that type only.
Let's take a practical example from a very simple task in language processing: the determination of the part of speech (PoS). This is one of the basic tasks of text analysis, and other, more complex tasks are built on it. In this task, the machine learning model is trained on a corpus annotated with PoS tags, so that the machine learns the different contexts in which words are used and which PoS each context entails.
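To make the idea of learning from a PoS-annotated corpus concrete, here is a minimal sketch in Python. It uses a tiny hand-made toy corpus (the words and tags are purely illustrative, standing in for a real annotated treebank) and learns the simplest possible model: the most frequent tag seen for each word, with a fallback for unseen words.

```python
from collections import Counter, defaultdict

# Toy annotated corpus: lists of (word, PoS tag) pairs.
# This is an illustrative stand-in for a real tagged corpus.
tagged_sentences = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("ran", "VERB")],
]

def train_baseline(sentences):
    """Learn the most frequent tag for each word seen in training."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag a word sequence; unseen words get a default tag."""
    return [(w, model.get(w, default)) for w in words]

model = train_baseline(tagged_sentences)
print(tag(model, ["the", "dog", "sat"]))
# → [('the', 'DET'), ('dog', 'NOUN'), ('sat', 'VERB')]
```

Real taggers also use surrounding context and word shape rather than word identity alone, but even this baseline shows the core dependency: the model can only know what its training corpus shows it.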
Unfortunately, when we test the majority of solutions available on the market for this simple problem, we find that they fail to process simple educational or literary texts and do not identify parts of speech with high accuracy. The reason is not an error in the algorithms or machine learning models, but the fact that most training corpora contain text from only one domain, news, chosen for its accessibility. The news domain is certainly important, but it is only a small part of the many domains of language.
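The domain mismatch described above can be made visible with a rough vocabulary-coverage check: what fraction of the words in an out-of-domain sentence were ever seen in training? The word lists below are invented for illustration; real corpora would be far larger, but the pattern is the same.

```python
# Hypothetical vocabulary learned from a news-only training corpus.
news_vocab = {"government", "minister", "announced", "budget", "the", "on"}

# An out-of-domain (literary) test sentence.
literary_sentence = ["the", "moonlight", "shimmered", "on", "the", "lake"]

# Fraction of test words covered by the training vocabulary:
# a crude but telling proxy for how well the model can cope.
seen = [w for w in literary_sentence if w in news_vocab]
coverage = len(seen) / len(literary_sentence)
print(f"vocabulary coverage: {coverage:.0%}")
# → vocabulary coverage: 50%
```

Here half the words are out of vocabulary, so the tagger must guess their PoS without ever having observed them, which is exactly where accuracy collapses on educational and literary texts.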