Identifying the keywords of a (programming) language

This is a continuation of my recent question (Code for identifying a programming language in a text file). I am very grateful for all the answers I received; they helped me a lot. My code for this task is complete and it works quite well: fast, with reasonable accuracy.

The method used is as follows: I have a "training" Perl script that identifies the most frequently used words in each language by building a histogram of words from a set of example files. This data is then loaded by the C++ program, which scans the given text, accumulates a score for each language based on the words found, and then simply picks the language with the highest score.
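In outline, the scoring step looks something like this (a minimal sketch; the `ScoreTable` layout, the `classify` name and the plain whitespace tokenization are illustrative assumptions, not my actual code):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical scoring table: word -> (language -> points).
// In the real setup this is loaded from the Perl-generated data file.
using ScoreTable = std::map<std::string, std::map<std::string, int>>;

std::string classify(const std::string& text, const ScoreTable& table,
                     int threshold) {
    std::map<std::string, int> totals;
    std::istringstream in(text);
    std::string word;
    while (in >> word)  // naive whitespace tokenization for illustration
        if (auto it = table.find(word); it != table.end())
            for (const auto& [lang, pts] : it->second)
                totals[lang] += pts;

    // The winner must beat the threshold, otherwise report "unknown".
    std::string best = "unknown";
    int bestScore = threshold;
    for (const auto& [lang, score] : totals)
        if (score > bestScore) { bestScore = score; best = lang; }
    return best;
}
```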

Now I would like to improve it further and work a little on the quality of identification. The problem is that I often get "unknown" as a result (many languages accumulate a small score, but none exceeds my threshold). After some debugging, research, etc., I found out that this is probably because all words are treated as equal. That means seeing "#include", for example, has the same effect as seeing "while": both indicate that the file may be C/C++ (for now I am ignoring the fact that "while" is used in many other languages), but of course a large .cpp file may contain a ton of "while" and in most cases only a few "#include".

So the fact that "#include" is more significant is ignored, because I could not figure out how to decide whether one word is more important than another. Bear in mind that the script that creates the data is pretty dumb: it is only a word histogram, and it assigns every selected word a score of 1. It does not even look at the words themselves (so if "#&|?/" appears often in a file, it may be chosen as a good word).

I would also like the data-creation part to be fully automated, so nobody should have to look at the data and tweak it, adjust the scores, change the words, etc. All the intelligence should live in the script and the C++ program.

Does anyone have a suggestion for how to identify keywords or, more generally, important words? Some things that might help: I have the number of occurrences of each word and the total number of words (so a ratio can be computed). I have also thought about stripping characters like ';', since the histogram script often picks, for example, "continue;" as a result, when the important word is "continue". One last note: all equality checks are exact, case-sensitive matches; there is no substring matching. This is mainly for speed, but substrings might help (or hurt, I don't know)...

NOTE: thanks to everyone who took the time to answer; you helped me a lot.

My work on this is almost finished, so I will describe what I did to get good results.

1) Get a decent training set, about 30-50 files per language from different sources, to avoid bias from a single coding style.
2) Write a Perl script that builds a word histogram. Implement a blacklist and a whitelist (more on this below).
3) Add junk words to the blacklist, for example "license"; such words often appear at the beginning of a file in license headers.
4) Add the five most important words of each language to the whitelist. These are words that are found in most source files of that language, but are not frequent enough to make it into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef and #endif on the whitelist.
5) Weight the beginning of the file more heavily, i.e. give extra points to words found in the first 50-100 lines.
6) When building the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should work well for most languages, I think (it gave me better results). The final result should contain roughly the 10-20 most common words for each language.
7) When the histogram is complete, remove all words found on the blacklist and add all words on the whitelist.
8) Write a program that processes a text file exactly as the script does, tokenizing with the same rules. If a word is found in the histogram data, add points to the matching language. Words in the histogram that correspond to only one language should add more points; those that belong to several languages should add fewer.
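The tokenization rule from step 6 and the weighting idea from step 8 can be sketched on the C++ side roughly like this (the regex is a port of the Perl split above; the 6-divided-by-languages point formula is an illustrative assumption, not the exact scoring used):

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Port of the Perl tokenizer @words = split(/[\s\(\){}\[\];.,=]+/, $_);
// The same rule must be used in training and classification so that
// histogram words and input tokens actually match.
std::vector<std::string> tokenize(const std::string& line) {
    static const std::regex sep(R"([\s(){}\[\];.,=]+)");
    std::sregex_token_iterator it(line.begin(), line.end(), sep, -1), end;
    std::vector<std::string> words;
    for (; it != end; ++it)
        if (it->length() > 0) words.push_back(*it);  // drop empty pieces
    return words;
}

// Step 8: a word unique to one language earns more points than a word
// shared by several. The constant 6 here is an arbitrary illustration.
int pointsFor(int nLanguagesUsingWord) {
    return nLanguagesUsingWord > 0 ? 6 / nLanguagesUsingWord : 0;
}
```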

Comments are welcome. Currently, out of about 1000 text files I get 80 unknowns (mostly on extremely short files, often JavaScript of just one or two lines), and about 20 files are identified incorrectly. The files range from 100 bytes to 100 KB, about 11 KB on average (11 MB in total). Processing them all takes one second, which is enough for me.

4 answers

I think you are approaching this from the wrong angle. From your description, it seems you are building a classifier. A good classifier needs to discriminate between the different classes; it does not need to precisely estimate how well the input corresponds to the most likely class.

Practically: your classifier does not need to determine exactly how close a given input is to C++; it only needs to determine whether the input is more like C than like C++. This simplifies your work: most of your current "unknown" cases will be close to one or two languages, even if they do not exceed your threshold.

Once you realize this, you also know what training your classifier needs: not arbitrary features of the sample files, but the features that set two languages apart. So when you analyze your C samples and your C++ samples, you will find that #include does not set them apart; however, class and template will be far more common in C++. On the other hand, #include does distinguish C++ from Java.

Of course, there are other features besides keywords that you can use. For example, the frequencies of { and ; are similarly distinctive. Another very useful feature for your classifier is the comment tokens of the different languages. The main problem, of course, is identifying them automatically. Again, hardcoding // , /* , ' , -- , # and ! as pseudo-keywords will help.

This also suggests another classification rule: SQL often has -- at the beginning of a line, whereas in C it usually appears elsewhere. So it may help if your classifier also takes context into account.


Use Google Code Search to find weights for a set of keywords: #include in C++ gets 672,000 hits, but in Python only about 5,000.

You can normalize the results by looking at the total number of results per language: a search for C++ returns about 770,000 files, while Python returns 120,000.

Thus, "#include" is extremely rare in Python files but appears in almost every C++ file. (Of course, you still need to learn to distinguish C++ from C.) All that remains is to reason correctly about the probabilities.
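Working through the arithmetic with the figures quoted above: 672,000 / 770,000 ≈ 0.87 of C++ files contain #include, versus 5,000 / 120,000 ≈ 0.04 of Python files. A trivial helper (name assumed) makes that comparison explicit:

```cpp
#include <cassert>

// Normalized keyword weight: raw hit count divided by the number of
// indexed files for that language, i.e. roughly the fraction of files
// in which the keyword appears.
double keywordWeight(double hits, double filesInLanguage) {
    return hits / filesInLanguage;
}
```

With these inputs, keywordWeight(672000, 770000) is about 0.87 and keywordWeight(5000, 120000) is about 0.04, so the same keyword carries very different evidence per language.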


You need exclusivity in your training data.
When training on the programming languages you expect, look for words that are typical of one or a few languages. If a word appears in many code files of one language but in few or none of the files of the other languages, that is a strong indicator for that language. So a word's score can be computed at training time by selecting words that are exclusive to a language or a group of languages. Find several such words, take the intersection of their languages while adding up the scores, and you have your language.
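One way to turn that idea into a number (a sketch; the own-frequency divided by total-frequency formula is an assumption, not a prescription from this answer):

```cpp
#include <cassert>
#include <map>
#include <string>

// docFreq maps language -> fraction of that language's training files
// containing a given word. Exclusivity is the share of that evidence
// belonging to one language: 1.0 means the word appears only there.
double exclusivity(const std::map<std::string, double>& docFreq,
                   const std::string& lang) {
    double own = 0.0, total = 0.0;
    for (const auto& [l, f] : docFreq) {
        if (l == lang) own = f;
        total += f;
    }
    return total > 0 ? own / total : 0.0;
}
```

A word like #include (seen in 90% of C files, 0% of Java files) scores 1.0 for C, while a shared word like while (seen equally often in both) scores only 0.5.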


In response to your other question, someone recommended a naive Bayes classifier. You should implement that suggestion, because the technique is good at partitioning according to distinguishing features. You mentioned the while keyword, but it is unlikely to be useful, because so many languages use it, and a Bayes classifier will not treat it as distinctive.
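A minimal naive Bayes scoring sketch illustrating that point (the probability table and the 1e-6 smoothing floor are illustrative assumptions): a shared word like while contributes roughly the same log-probability to every language, so it cannot change the winner.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// probs maps language -> (token -> P(token | language)).
// Scores are summed in log space; unseen tokens get a small floor
// probability so a single missing word does not zero out a language.
std::string naiveBayes(
    const std::vector<std::string>& tokens,
    const std::map<std::string, std::map<std::string, double>>& probs) {
    std::string best;
    double bestLog = -1e300;
    for (const auto& [lang, table] : probs) {
        double logp = 0.0;
        for (const auto& tok : tokens) {
            auto it = table.find(tok);
            logp += std::log(it != table.end() ? it->second : 1e-6);
        }
        if (logp > bestLog) { bestLog = logp; best = lang; }
    }
    return best;
}
```

Here "while" has the same probability under both languages, so only the distinctive tokens (#include, def) decide the outcome.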

An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated tokens are a decent rough start, but going much beyond that will be difficult.

