Yes, DBpedia can be a good choice for this kind of problem. You will need to:
- Align the DBpedia category structure to get the right granularity (for example, Pink Floyd is listed under Capitol Records artists and many other categories, but not under Music). Perhaps select a few large categories and try to determine whether your concepts fall under them indirectly.
- Normalize the text: Einstein is listed as Albert Einstein, not einstein.
- Deal with ambiguity caused by terms that denote several concepts and by concepts that belong to several top-level categories.
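
As a rough illustration of the first two points, here is a minimal Python sketch that works on a small in-memory copy of the data instead of querying DBpedia. Everything in it is an assumption for the example: the function names `build_label_index` and `top_level_categories` are hypothetical, and the category hierarchy is a toy one (only "Pink Floyd is in Capitol Records artists" comes from the text above; the parent links are made up). In practice you would fill these dictionaries from a DBpedia dump or endpoint.

```python
from collections import deque


def build_label_index(labels):
    """Map each casefolded label, and each of its tokens, to canonical labels.

    A lookup that returns more than one label signals an ambiguous term.
    """
    index = {}
    for label in labels:
        folded = label.casefold()
        for key in {folded, *folded.split()}:
            index.setdefault(key, set()).add(label)
    return index


def top_level_categories(entity, entity_categories, broader, targets, max_depth=5):
    """Walk up the category graph breadth-first and collect the targets reached."""
    found, seen = set(), set()
    queue = deque((cat, 0) for cat in entity_categories.get(entity, ()))
    while queue:
        cat, depth = queue.popleft()
        if cat in seen:
            continue  # category graphs can contain cycles
        seen.add(cat)
        if cat in targets:
            found.add(cat)
            continue
        if depth < max_depth:
            queue.extend((parent, depth + 1) for parent in broader.get(cat, ()))
    return found


# Toy data, NOT real DBpedia content: the parent links are invented.
LABELS = ["Albert Einstein", "Pink Floyd"]
ENTITY_CATEGORIES = {
    "Pink Floyd": ["Capitol Records artists"],
    "Albert Einstein": ["Theoretical physicists"],
}
BROADER = {
    "Capitol Records artists": ["Artists by record label"],
    "Artists by record label": ["Music"],
    "Theoretical physicists": ["Physicists"],
    "Physicists": ["Science"],
}
TARGETS = {"Music", "Science"}  # the few large categories you picked

index = build_label_index(LABELS)
for label in index.get("einstein", ()):
    cats = top_level_categories(label, ENTITY_CATEGORIES, BROADER, TARGETS)
    print(label, sorted(cats))
```

The depth limit matters on real data: the category graph is large and loosely organized, so an unbounded walk tends to reach almost every top-level category.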
These problems can be addressed with machine learning, but I can only see how that would work if you extract the terms together with corresponding features from running text. In that case, though, you could just as well classify the entire text into one of the categories you selected in step 1.
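
To make the "classify the entire text" option concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. It is only one possible technique, chosen here for brevity and not something the approach above prescribes; the function names, the whitespace tokenizer, and the four training sentences are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude tokenizer: casefold and split on whitespace."""
    return text.casefold().split()


def train(labelled_texts):
    """Count documents and words per category from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labelled_texts:
        tokens = tokenize(text)
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_category, best_score = None, -math.inf
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training data, one label per text.
TRAINING = [
    ("the band released a new album and toured", "Music"),
    ("guitar solo on the live album", "Music"),
    ("the physicist published a theory of relativity", "Science"),
    ("experiment confirms the quantum theory", "Science"),
]

model = train(TRAINING)
print(classify("a new album by the band", model))
print(classify("quantum theory experiment", model))
```

With real data you would train on texts already labelled with the categories from step 1, and a library classifier would likely do better than this hand-rolled one.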