There is a great need for language resources that are built by specialist linguists and properly reviewed. There is also a need to coordinate the efforts of the different entities that build language resources.
In the previous article, we mentioned three basic dimensions that machine learning training data must have: sufficient data volume, data quality, and the ability of the data to represent the characteristics of the problem being addressed. In this article and the next, we focus on the last two dimensions, which are unfortunately often overlooked, and leave the first aside.
In theory, no one disputes the importance of high-quality data; it is obviously better to use good data than poor data. But when we actually look at Arabic data resources, the vast majority are of disappointingly low quality, which makes them hard to process adequately. So why is this the case?
There are many reasons for this poor quality; we mention only some of them here. The most important is that, to reduce costs, Arabic resources are often built by non-specialists rather than by language specialists. Another is the lack of a clear mechanism for reviewing resources as they are being built, which makes errors difficult to detect later. Moreover, a relatively large share of the resources is in fact built automatically and then claimed to have been built manually.
Another problem with Arabic language resources is the lack of agreed-upon standards for their construction, which often makes it impossible to combine language resources from different sources. In practice, each party (researcher, center, university) builds its linguistic resources in its own way, without coordinating with other parties working on similar resources. As a result, the efforts run in parallel rather than building on one another.