When you see a hard-coded constant in a machine learning formula, be suspicious ...
The constants in the Automated Readability Index are a fitted model: they suit the data set used to create it and the functional form chosen to represent it. Beyond that general suitability, the ARI has the additional advantage of being calibrated to school grade levels.
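As a concrete illustration, here is a minimal sketch of the ARI using its published constants (4.71, 0.5, 21.43), with naive regex tokenization standing in for a real tokenizer:

```python
import re

def automated_readability_index(text):
    """ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43.
    The output is calibrated to a US school grade level."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Note that the hard-coded constants encode the data the formula was fit on; a sentence full of rare but short words can still score as "easy".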
Your idea of adding word frequency to readability sounds like a great feature. After all, a single unfamiliar word can turn a grammatically simple sentence into an unreadable one.
You will need to choose how to represent a sentence in terms of its word frequencies. Candidates include the probability of the whole sentence under a unigram language model, the number of rare words, the minimum word frequency, and so on.
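Those candidate representations can be sketched as simple features. The `FREQ` table below is a made-up stand-in for real unigram frequencies estimated from a corpus:

```python
import math

# Hypothetical relative unigram frequencies; in practice, count them
# from a large corpus (or use a frequency resource).
FREQ = {"the": 0.05, "cat": 0.001, "sat": 0.0008, "on": 0.03,
        "mat": 0.0005, "ocelot": 1e-7}

def frequency_features(tokens, rare_threshold=1e-5, floor=1e-9):
    """Turn a tokenized sentence into frequency-based readability features."""
    freqs = [FREQ.get(t.lower(), floor) for t in tokens]
    return {
        # Log-probability of the whole sentence under a unigram model
        "log_prob": sum(math.log(f) for f in freqs),
        # Number of words below a rarity threshold
        "n_rare": sum(f < rare_threshold for f in freqs),
        # Frequency of the rarest word in the sentence
        "min_freq": min(freqs),
    }
```

Any of these features (or several together) can then feed the readability model; which one works best is an empirical question for the data set.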
Then you should build a data set and learn the model's parameters. The most straightforward approach would be a data set of sentences manually labeled for readability, but creating one is very time consuming.
You can work around this by using sources whose readability level is broadly known and labeling each sentence with its source's level. For example, sentences from Simple English Wikipedia should be more readable than sentences from English Wikipedia. Magazines and web forums can serve as further sources with roughly known readability levels. Manually label a small sample of these sentences to align and calibrate your readability scale.
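This source-based labeling scheme can be sketched in a few lines. The source names and scores here are illustrative assumptions, not part of any real data set:

```python
def label_by_source(sentences_by_source, source_scores):
    """Assign each sentence the (noisy) readability score of its source."""
    data = []
    for source, sentences in sentences_by_source.items():
        for s in sentences:
            data.append((s, source_scores[source]))
    return data

# Hypothetical corpora: "simplewiki" and "enwiki" stand in for sentence
# dumps of Simple English Wikipedia and English Wikipedia.
corpus = {
    "simplewiki": ["The sun is a star."],
    "enwiki": ["Stellar nucleosynthesis governs elemental abundance."],
}
# Assumed grade-like scores per source, to be calibrated against a small
# manually labeled sample.
labeled = label_by_source(corpus, {"simplewiki": 1.0, "enwiki": 3.0})
```

The resulting pairs can be fed directly to any regression or ranking model; the manual calibration step then anchors the arbitrary per-source scores to a meaningful scale.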
With this technique you trade label accuracy for label quantity. Since machine learning has been shown to work in the presence of label noise and even adversarial errors, this trade-off is usually worthwhile.