NLP / Question Answering: retrieving information from a database

I have recently read a little about NLP, and so far I have a (very) basic idea of how everything works, from sentence parsing to POS tagging and knowledge representation.

I understand that there is a large number of NLP libraries (mainly in Java or Python), and I found a .NET implementation (SharpNLP). It is actually great: no need to write any custom processing logic; just use its functions and voilà, the user input is nicely tokenized and POS-tagged.

What I don't understand is where to go from here if my main goal is to build a question answering system (something like a chatterbot). What libraries (preferably .NET) are available to me? If I want to build my own knowledge base, how should I represent my knowledge? Do I need to parse the POS-tagged input into something else that my database can understand? And if I use MS SQL, is there a library that helps map POS-tagged data to database queries? Or do I need to write my own query logic along the lines of procedural semantics (which I have read about)?

The next step, of course, is formulating a well-constructed answer, but I think I can leave that for later. Right now I am frustrated by the lack of resources in this area (knowledge representation, NLP for KB/DB retrieval), and I would really appreciate it if someone could share their experience :)

1 answer

This is a very broad question, and as such it is hardly suitable for the StackOverflow format, but I would like to take a stab at it anyway.

First, a word about NLP
The wide availability of mature tools in the NLP domain is itself somewhat misleading. To be sure, all or most NLP functions, from, say, POS tagging or chunking to, say, automatic summarization or named entity recognition, are covered and usually well served by the logic and supporting data of various libraries. However, building real solutions from these building blocks is far from trivial. You need to:

  • architect the solution as a pipeline, or chain, in which the results of a particular transformation feed the input of subsequent processes;
  • tune the individual processes: their computational structure is well established, but they are extremely sensitive to the underlying data, such as the training/reference corpus, additional settings, etc.;
  • select and validate the right functions/processes.

The above is especially difficult for the parts of a solution concerned with extracting and processing semantic elements from text (Information Extraction in general, but also coreference resolution, relationship extraction, or sentiment analysis, to name a few). These NLP functions, and the corresponding implementations in the various libraries, are generally harder to configure and more sensitive to domain-specific patterns, to variations in register, or even to the "format" of the supporting corpora.

In a nutshell, NLP libraries provide essential building blocks for applications like the "question answering" mentioned in the question, but a lot of "glue" is required, along with much discretion about how and where to apply it (plus a good dose of technology other than NLP, such as the knowledge representation problem discussed below).

About knowledge representation
As outlined above, POS tagging alone is but one element of the NLP pipeline. Essentially, a POS tagger adds information about each word in the text, indicating the [probable] grammatical role of the word (Noun vs. Adjective vs. Verb vs. Pronoun, etc.). This POS information is very useful because it allows, for example, the subsequent grouping of the text into logically related chunks of words and/or a more precise lookup of individual words in dictionaries, taxonomies, or ontologies.
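To make the shape of that output concrete, here is a minimal, dictionary-based sketch in Python (chosen since most NLP libraries mentioned here are Java or Python, and the concept is language-neutral). The tiny lexicon and the tag names are invented for illustration; real taggers such as SharpNLP are statistical and context-sensitive, so this only mimics what the *result* of tagging looks like:

```python
# A toy, dictionary-based POS tagger. Each word is looked up in a tiny
# hand-made lexicon; unknown words default to NOUN. Real taggers use
# statistical models and surrounding context instead of a fixed table.

LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "mouse": "NOUN", "milk": "NOUN",
    "eats": "VERB", "wrote": "VERB", "produces": "VERB",
    "he": "PRON", "she": "PRON",
}

def pos_tag(sentence):
    """Return a list of (word, tag) pairs for a whitespace-split sentence."""
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in sentence.split()]

print(pos_tag("the cat eats the mouse"))
# [('the', 'DET'), ('cat', 'NOUN'), ('eats', 'VERB'), ('the', 'DET'), ('mouse', 'NOUN')]
```

It is exactly this per-word annotation that downstream steps (chunking, entity lookup) consume.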

To illustrate the kind of information extraction and the underlying knowledge representation that a "question answering system" may require, I will describe the general format used in various semantic search systems. Bear in mind, however, that this format is more conceptual than prescriptive for semantic search, and that other applications, such as expert systems or machine translation, call for yet other forms of knowledge representation.

The idea is to use NLP techniques along with supporting data (from simple lookup tables for basic vocabulary, to tree structures for taxonomies, to ontologies expressed in specialized languages) to extract entity triplets from text, with the following structure:

  • agent: the thing or person "doing" something
  • verb: what is being done
  • object: the person or thing the "doing" is performed upon (or, more generally, some complement of information about the "doing")

Examples:
cat/Agent eat/Verb mouse/Object
John-Grisham/Agent write/Verb The-Pelican-Brief/Object
cow/Agent produce/Verb milk/Object
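These triplets map naturally onto a small data structure. A minimal Python sketch using the examples above (the `Fact` class is purely illustrative, not part of any library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An agent/verb/object triplet extracted from text."""
    agent: str
    verb: str
    obj: str  # 'object' is a Python builtin name, hence 'obj'

facts = [
    Fact("cat", "eat", "mouse"),
    Fact("John-Grisham", "write", "The-Pelican-Brief"),
    Fact("cow", "produce", "milk"),
]

# A question like "who wrote The Pelican Brief?" becomes a simple filter:
print([f.agent for f in facts if f.verb == "write" and f.obj == "The-Pelican-Brief"])
# ['John-Grisham']
```

In a real system the triplets would live in database tables (e.g. in MS SQL) rather than a Python list, and the filter above becomes a WHERE clause on the verb and object columns.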

Furthermore, triplets of this type, sometimes called "facts", can be divided into different types corresponding to particular semantic patterns, usually organized around the semantics of the verb. For example, "Cause-Effect" facts have a verb expressing some causality, "Contains" facts have a verb implying a container-to-contents relationship, "Definition" facts correspond to patterns where the agent/subject is [at least partially] defined by the object (e.g. "cats are mammals"), and so on.
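A naive way to assign such fact types is a verb-to-category lookup. A hypothetical sketch (both the verb list and the category names are invented to illustrate the classification idea; a real system would use verb classes from an ontology or lexical resource):

```python
# Hypothetical verb -> fact-type table, invented for illustration.
VERB_CLASSES = {
    "cause": "Cause-Effect",
    "produce": "Cause-Effect",
    "contain": "Contains",
    "hold": "Contains",
    "be": "Definition",
}

def fact_type(verb):
    """Classify a fact by the semantics of its verb."""
    return VERB_CLASSES.get(verb, "Generic")

print(fact_type("be"))       # Definition  (as in "cats are mammals")
print(fact_type("produce"))  # Cause-Effect
```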

You can readily imagine how such fact databases can be queried to answer questions, and also to provide various capabilities and services such as synonym substitution or improving the relevance of answers (as compared with plain keyword matching).
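For instance, a fact store keyed on (verb, object) plus a small synonym table already answers rephrased questions that plain keyword matching would miss. A toy sketch with invented data:

```python
from typing import Optional

# Toy fact store keyed by (verb, object); contents are invented for illustration.
FACTS = {
    ("write", "The-Pelican-Brief"): "John-Grisham",
    ("produce", "milk"): "cow",
}
# Map question verbs onto the canonical verbs used in the store.
SYNONYMS = {"author": "write", "pen": "write", "make": "produce"}

def answer(verb: str, obj: str) -> Optional[str]:
    """Return the agent of a matching fact, normalizing the verb first."""
    verb = SYNONYMS.get(verb, verb)  # synonym substitution
    return FACTS.get((verb, obj))

print(answer("author", "The-Pelican-Brief"))  # John-Grisham
print(answer("make", "milk"))                 # cow
```

"Who authored The Pelican Brief?" thus finds the same fact as "who wrote ...", which a raw keyword match on "authored" would never hit.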

The real difficulty is extracting the facts from the text. Many NLP functions are involved in this. For example, one step in an NLP pipeline is replacing pronouns with the nouns they refer to (anaphora resolution, or more broadly coreference resolution, in NLP lingo). Another step is identifying named entities: names of people, geographical places, books, etc. (NER in NLP lingo). Yet another step may be rewriting sentences whose clauses are conjoined with ANDs, so as to produce facts by repeating the grammatical elements that are merely implied. For example, the John Grisham example above might come from a text excerpt such as: Author J. Grisham was born in Arkansas. He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992.

Extracting John-Grisham/Agent wrote/Verb The-Pelican-Brief/Object implies (among other things):

  • identifying "J. Grisham" and "The Pelican Brief" as specific entities.
  • replacing "He" with "John Grisham" in the second sentence;
  • rewriting the second sentence as two facts: "John Grisham wrote 'A Time to Kill' in 1989" and "John Grisham wrote 'The Pelican Brief' in 1992";
  • discarding the "in 1992" part (or, better still, producing another fact, a "time fact": The-Pelican-Brief/Agent is-related-to-time/Verb year-1992/Object; incidentally, this would also imply identifying 1992 as a year, i.e. a time entity).
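The pronoun-replacement and AND-splitting steps above can be caricatured with regular expressions. This is a deliberately naive sketch, nothing like a real coreference resolver, and it only handles this exact sentence shape:

```python
import re

def resolve_pronoun(sentence, antecedent):
    """Naively replace a leading 'He'/'She' with a known antecedent.
    (A real pipeline would use proper coreference resolution.)"""
    return re.sub(r"^(He|She)\b", antecedent, sentence)

def split_conjunction(sentence):
    """Naively rewrite 'X wrote A and B' as two complete clauses."""
    m = re.match(r"(.+? wrote )(.+?) and (.+)", sentence)
    if not m:
        return [sentence]
    head, first, second = m.groups()
    return [head + first, head + second]

text = 'He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992'
resolved = resolve_pronoun(text, "John Grisham")
for clause in split_conjunction(resolved):
    print(clause)
# John Grisham wrote "A Time to Kill" in 1989
# John Grisham wrote "The Pelican Brief" in 1992
```

Each resulting clause is then a candidate for agent/verb/object extraction; the brittleness of these two functions is precisely why real information extraction is hard.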

In short: information extraction is a complex task, even when applied to relatively restricted domains and even with the existing NLP functions available in libraries. It is certainly a much "messier" endeavor than merely telling nouns apart from adjectives and verbs ;-)
