When you see a hard-coded constant in a machine learning formula, be suspicious ...
The constants in the Automated Readability Index are a fitted model: they suit the data set used to create it and the functional form chosen to represent it. Beyond that general suitability, the ARI has the additional advantage of being calibrated to school grade levels.
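As a concrete illustration, here is a minimal sketch of the ARI using its published constants (4.71, 0.5, 21.43), with naive regex tokenization standing in for a real tokenizer:

```python
import re

def automated_readability_index(text):
    """ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43.
    The output is calibrated to a US school grade level."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Note that the hard-coded constants encode the data the formula was fit on; a sentence full of rare but short words can still score as "easy".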
Your idea of adding word frequency to readability sounds like a great feature. After all, a single unfamiliar word can turn a grammatically simple sentence into an unreadable one.
You will need to choose how to represent a sentence in terms of its word frequencies. Candidates include the probability of the whole sentence under a unigram language model, the number of rare words, the minimum word frequency, and so on.
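Those candidate representations can be sketched as simple features. The `FREQ` table below is a made-up stand-in for real unigram frequencies estimated from a corpus:

```python
import math

# Hypothetical relative unigram frequencies; in practice, count them
# from a large corpus (or use a frequency resource).
FREQ = {"the": 0.05, "cat": 0.001, "sat": 0.0008, "on": 0.03,
        "mat": 0.0005, "ocelot": 1e-7}

def frequency_features(tokens, rare_threshold=1e-5, floor=1e-9):
    """Turn a tokenized sentence into frequency-based readability features."""
    freqs = [FREQ.get(t.lower(), floor) for t in tokens]
    return {
        # Log-probability of the whole sentence under a unigram model
        "log_prob": sum(math.log(f) for f in freqs),
        # Number of words below a rarity threshold
        "n_rare": sum(f < rare_threshold for f in freqs),
        # Frequency of the rarest word in the sentence
        "min_freq": min(freqs),
    }
```

Any of these features (or several together) can then feed the readability model; which one works best is an empirical question for the data set.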
Then you should build a data set and learn the model's parameters. The most straightforward approach would be a data set of sentences manually labeled for readability, but creating one is very time consuming.
You can work around this by using sources whose readability level is broadly known and labeling each sentence with its source's level. For example, sentences from Simple English Wikipedia should be more readable than sentences from English Wikipedia. Magazines and web forums can serve as further sources with roughly known readability levels. Manually label a small sample of these sentences to align and calibrate your readability scale.
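This source-based labeling scheme can be sketched in a few lines. The source names and scores here are illustrative assumptions, not part of any real data set:

```python
def label_by_source(sentences_by_source, source_scores):
    """Assign each sentence the (noisy) readability score of its source."""
    data = []
    for source, sentences in sentences_by_source.items():
        for s in sentences:
            data.append((s, source_scores[source]))
    return data

# Hypothetical corpora: "simplewiki" and "enwiki" stand in for sentence
# dumps of Simple English Wikipedia and English Wikipedia.
corpus = {
    "simplewiki": ["The sun is a star."],
    "enwiki": ["Stellar nucleosynthesis governs elemental abundance."],
}
# Assumed grade-like scores per source, to be calibrated against a small
# manually labeled sample.
labeled = label_by_source(corpus, {"simplewiki": 1.0, "enwiki": 3.0})
```

The resulting pairs can be fed directly to any regression or ranking model; the manual calibration step then anchors the arbitrary per-source scores to a meaningful scale.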
With this technique you trade label accuracy for label quantity. Since machine learning has been shown to work in the presence of label noise and even adversarial errors, this trade-off is usually worthwhile.