What data model can be used to represent the "meaning" of a page or text?

I have read many discussions of this subject on the Internet:

How do you retrieve the meaning of a page?

I know that I am not experienced enough to even try to offer a solution. To me, this is the holy grail of web programming, or perhaps even of computing in general.

But with the power of imagination, suppose I wrote the ultimate script that does just that. For example, I enter this text:

Imagination led mankind through the dark ages to its current state of civilization. Imagination led Columbus to the discovery of America. Imagination prompted Franklin to discover electricity.

and my powerful script extracts the meaning and returns the following:

The ability of people to think makes them discover new things.

For the purposes of this example I used a string to express the meaning of the text. But if I needed to store this in a database, an array, or any other storage, what data type would I use?

Note that I might have a different text that uses a different analogy but still carries the same meaning, for example:

Imagination helps advance the human race.

Now I could enter a search query about the importance of imagination, and both of these results should appear. But how would they be matched? Would it be a string comparison? Some integers or floating-point numbers? Maybe even binary?

How would the meaning be stored? I would like to hear your thoughts.

Update: Let me narrow it down to a single question:

How do you represent meaning in data?

+4
7 answers

Assuming that our brains do not have access to some metaphysical cloud server, meaning is represented as a configuration of neural connections, hormone levels, electrical activity (possibly even quantum fluctuations), and the interactions between all of these, the outside world, and other brains. So that is the good news: we know there is at least one answer to your question (meaning is represented somewhere, somehow). The bad news is that most of us have no idea how it works, and those who believe they understand it have not managed to convince others, or each other. Being one of the ignorant, I cannot answer your question directly, but I can list some answers I have come across to smaller, degenerate versions of the big problem.

If you want to represent the meaning of lexical items (e.g. concepts, actions), you can use distributed representations such as vector space models. In these models, meaning generally has a geometric interpretation: each concept is represented as a vector, and concepts are placed in the space so that similar concepts end up closer to each other. A very simple way to build such a space is to pick a set of frequently used words (basis words) as the dimensions of the space, and simply count how often the target concept is observed together with these basis words in speech or text. Similar concepts are used in similar contexts, so their vectors will point in similar directions. On top of that you can apply various weighting, normalization, dimensionality reduction, and recombination techniques (e.g. tf-idf, http://en.wikipedia.org/wiki/Pointwise_mutual_information , SVD). A related, but probabilistic rather than geometric, approach is latent Dirichlet allocation and the other generative/Bayesian models already mentioned in another answer.
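As a rough illustration of the co-occurrence idea (the basis words, toy corpus, and function names below are invented for this sketch, not taken from any particular system):

    # Minimal count-based vector space sketch: represent a concept by how often it
    # co-occurs with a fixed set of basis words, then compare concepts by cosine similarity.
    from collections import Counter
    from math import sqrt

    basis = ["think", "discover", "new", "people", "science"]   # dimensions of the space

    def vectorize(concept, corpus):
        counts = Counter()
        for sentence in corpus:
            words = sentence.lower().split()
            if concept in words:
                counts.update(w for w in words if w in basis)
        return [counts[b] for b in basis]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    corpus = [
        "imagination makes people think and discover new things",
        "creativity helps people discover new ideas",
    ]
    print(cosine(vectorize("imagination", corpus), vectorize("creativity", corpus)))
    # close to 1.0: the two concepts occur in similar contexts

Weighting schemes such as tf-idf or PMI would replace the raw counts before the comparison.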

The vector space approach is good for discriminative tasks. You can decide whether two phrases are semantically related or not (e.g. matching queries to documents, or finding similar pairs of search queries to help the user expand a query). But it is not easy to incorporate syntax into these models, and I cannot see very clearly how you could represent the meaning of a whole sentence as a vector.

Grammar formalisms could help incorporate syntax and bring structure to meaning and to the relationships between concepts (for example, head-driven phrase structure grammar). If you build two agents that share a vocabulary and a grammar and have them communicate (i.e. transfer information from one to the other) through these mechanisms, you could say that they convey meaning. It is rather a philosophical question where and how meaning is represented when one robot tells another to pick "the red circle above the black box" through a built-in or emergent grammar and vocabulary, and the other successfully picks the intended object (see this very interesting experiment on grounding vocabulary: Talking Heads).

Another way of capturing meaning is to use networks. For example, by representing each concept as a node in a graph and the relationships between concepts as edges between nodes, you can arrive at a practical notion of meaning. ConceptNet is a project whose goal is to represent common-sense knowledge, and it can be viewed as a semantic network of common-sense concepts. In a sense, the meaning of a particular concept is represented by its position relative to the other concepts in the network.
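A minimal sketch of that idea (the concepts and relation labels below are invented, merely in the spirit of ConceptNet):

    # Semantic network sketch: concepts are nodes, labelled edges are relations.
    edges = [
        ("imagination", "UsedFor",   "discovery"),
        ("imagination", "PartOf",    "thinking"),
        ("discovery",   "Causes",    "progress"),
        ("columbus",    "CapableOf", "discovery"),
    ]

    def neighbours(concept):
        # A concept's "meaning" is (partly) its position relative to other concepts.
        outgoing = [(rel, dst) for src, rel, dst in edges if src == concept]
        incoming = [(rel, src) for src, rel, dst in edges if dst == concept]
        return outgoing + incoming

    print(neighbours("imagination"))
    # [('UsedFor', 'discovery'), ('PartOf', 'thinking')]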

Speaking of common sense, Cyc is another ambitious example of a project that tries to capture common-sense knowledge, but it does so in a very different way from ConceptNet. Cyc uses a well-defined symbolic language to represent the attributes of objects and the relationships between objects in an unambiguous way. Using a very large set of rules and concepts and an inference engine, you can derive conclusions about the world and answer questions like "Can a horse get hurt?", or handle requests like "Bring me a picture of a sad person."
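To make the rules-plus-inference idea concrete, here is a toy forward-chaining sketch; the predicates, facts, and rules are invented for illustration and are not CycL, Cyc's actual language:

    # Toy symbolic knowledge base with naive forward chaining.
    # Rules have the form: (pred, ?x, obj) implies (new_pred, ?x, new_obj),
    # where the subject slot (written as None) is the variable ?x.
    facts = {("isa", "horse", "animal")}
    rules = [
        (("isa", None, "animal"),    ("can-feel", None, "pain")),   # animals can feel pain
        (("can-feel", None, "pain"), ("can-be",   None, "hurt")),   # feeling pain => can be hurt
    ]

    def infer(facts, rules):
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for (pred, _, obj), (cpred, _, cobj) in rules:
                for f in list(derived):
                    if f[0] == pred and f[2] == obj:
                        new = (cpred, f[1], cobj)
                        if new not in derived:
                            derived.add(new)
                            changed = True
        return derived

    print(("can-be", "horse", "hurt") in infer(facts, rules))   # True: "Can a horse get hurt?"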

+6

I worked on a system that tried to do this at a previous company. We focused more on the question "which unstructured documents are most similar to this unstructured document", but part of that was defining the "meaning" of a document.

We used two different algorithms: PLSA (probabilistic latent semantic analysis) and PSVM (probabilistic support vector machines). Both extract topics that are significantly more common in the analyzed document than in the other documents in the collection.

The topics themselves just have numeric identifiers, and there is a cross-reference (xref) table from documents to topics. To determine how close two documents are, we look at the percentage of topics the two documents have in common.

Assuming your super-script can derive topics from the entered query, you could use a similar structure. It has the added benefit that the xref table contains only integers, so you are comparing integers rather than doing string operations.
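A minimal sketch of that cross-reference structure (the document ids and topic ids below are made up, and the topic extraction step is assumed to have already run):

    # Documents map to sets of integer topic ids; similarity is the share of topics in common.
    doc_topics = {
        1: {101, 102, 105},
        2: {101, 105, 201},
        3: {301, 302},
    }

    def similarity(a, b):
        ta, tb = doc_topics[a], doc_topics[b]
        return len(ta & tb) / len(ta | tb)   # shared topics as a fraction (Jaccard overlap)

    print(similarity(1, 2))   # 0.5 -> fairly similar
    print(similarity(1, 3))   # 0.0 -> unrelated

Everything compared here is an integer, which is the benefit mentioned above.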

+1

Semantics is a wide and deep field with many models, all of which have advantages and problems when it comes to an AI implementation. With so little background given, it is hard to make any recommendation beyond "study the literature, pick a theory that resonates with your intuition, and (if you succeed at all) replace it with a better one and collect the academic credit". Having said that, the freshman course material I can vaguely recall had good things to say about a recursive structure called a "frame", but that must have been 15 years ago.

0

Meaning, in the general case, is an abstract notion represented by the internal data structure of a black box, and it depends on the chosen algorithm. But that is not the interesting part. If you are doing semantic analysis, the common questions are about differences in meaning: for example, whether two documents talk about the same topic, how some documents differ, or how to group documents with similar meanings.

If you use a vector space model, meaning/semantics can be represented by a set of vectors corresponding to specific topics. One way to extract such patterns is http://en.wikipedia.org/wiki/Latent_semantic_analysis or http://en.wikipedia.org/wiki/Nonnegative_matrix_factorization . There are also more sophisticated statistical models that represent semantics as the parameters of certain probability distributions; a more recent method is http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation .
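As one small example of the vector/topic idea, here is a latent semantic analysis sketch using a plain SVD; the term-document counts are invented and far too small to be meaningful:

    import numpy as np

    # Toy term-document count matrix (rows: terms, columns: documents).
    terms = ["imagination", "discovery", "electricity", "banana"]
    td = np.array([[2, 1, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 0, 3]], dtype=float)

    # Low-rank SVD: keep the two strongest "topics" and express each document in that space.
    U, s, Vt = np.linalg.svd(td, full_matrices=False)
    k = 2
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cos(doc_vectors[0], doc_vectors[1]))   # high: documents 1 and 2 share topics
    print(cos(doc_vectors[0], doc_vectors[2]))   # near zero: document 3 is about something else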

0

I will talk about the Semantic Web because I think it offers the most advanced research and language implementations on the topic.

The Resource Description Framework (RDF) is one of several data models used in the Semantic Web to describe information.

RDF is an abstract model with several serialization formats (i.e. file formats), and so the particular way in which a resource or triple is encoded varies from format to format.

and

However, in practice, RDF data is often stored in relational databases or in native representations also called triplestores (or quad stores, if a context such as a named graph is also stored for each RDF triple).

RDF content can be retrieved using RDF query languages (for example, SPARQL).
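As a rough illustration of the triple idea (plain Python rather than a real triplestore or query engine; the data mirrors the N3 example further down):

    # RDF-style triples stored as (subject, predicate, object) tuples,
    # queried with a simple pattern match where None acts as a wildcard.
    triples = [
        ("http://en.wikipedia.org/wiki/Tony_Benn", "dc:title",     "Tony Benn"),
        ("http://en.wikipedia.org/wiki/Tony_Benn", "dc:publisher", "Wikipedia"),
    ]

    def match(s=None, p=None, o=None):
        return [t for t in triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

    print(match(p="dc:title"))
    # [('http://en.wikipedia.org/wiki/Tony_Benn', 'dc:title', 'Tony Benn')]

A real triplestore adds indexing and a query language such as SPARQL on top of this.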


"Topic Maps" is another model for storing and representing knowledge.

Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.

and

In 2000, Topic Maps was defined in an XML syntax, XTM. This is now commonly known as "XTM 1.0" and is still in fairly widespread use.

From the official Topic Maps Data Model:

The only atomic fundamental types defined in this part of ISO/IEC 13250 (in 4.3) are strings and null. Through the datatypes facility, data of any type can be represented in this model. All datatypes used must have a string representation of their value space, and it is this string representation that is stored in the topic map. Information about which datatype the value belongs to is stored separately, in the form of a locator identifying the datatype.
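In other words, a typed value is kept as a pair of strings, roughly like this (the field names are my own, not taken from the standard):

    # Sketch of the Topic Maps convention: the value as a string, plus a datatype locator.
    occurrence = {
        "value": "1925-04-03",
        "datatype": "http://www.w3.org/2001/XMLSchema#date",
    }
    print(occurrence["value"], occurrence["datatype"])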

There are many other formats; you can read more about them in this article.

I also want to point you to a recent answer I wrote on a similar topic, with many useful links.


After reading various articles, I believe that the common direction all of these approaches take is storing data in a text format; the relevant information can be stored in a database directly as text.

Having data in a plain-text format has several advantages, probably more than disadvantages.

Other semantic formats, such as Notation 3 (N3) or Turtle syntax, differ slightly in form but are still plain text.

Example N3

    @prefix dc: <http://purl.org/dc/elements/1.1/>.

    <http://en.wikipedia.org/wiki/Tony_Benn>
        dc:title "Tony Benn";
        dc:publisher "Wikipedia".

Finally, I would like to link you to a useful article that you should read: Standardizing Unstructured Text Data in a Semantic Web Format .

0

Suppose you found the ultimate algorithm that can extract the meaning of a text. You happened to choose a string representation, but since your algorithm found the meaning correctly, that meaning can be identified uniquely by the algorithm. Right?

So, for simplicity, suppose there is only one meaning for this particular text. In that case, it is uniquely identified even before the algorithm outputs a phrase describing it.

So basically, in order to store the meaning, we first need a unique identifier.

Meaning can exist only in relation to a subject: it is the meaning of that subject. For the subject to have meaning, we must know something about it. For the subject to have a unique meaning, it must be presented unambiguously to the observer (i.e. to the algorithm). For example, the statement "2 = 3" is false because mathematical notation is standardized. But a text written in a language foreign to us carries no meaning at all: there is nothing in it we can understand. For example: "What is the meaning of life?"

In conclusion: to build an algorithm that can extract the absolute meaning of any random text, we, as humans, must first know the absolute meaning of something. :)

In practice, you can only extract the meaning of a known text, written in a known language, in a known format. For that there are tools and research in the fields of neural networks, natural language processing, and so on.

0

Try storing it as a char* (a C-style string); it is easy to store in a database and easy to work with. Make it 50 characters long (about 10 words) or 75 (about 15 words).

EDIT: key both texts on the same word (imagination), then look for matching index entries and map them to a single word,

using

    SELECT * FROM Dictionary WHERE Index = 'Imagination'

Sorry, I'm not too experienced with SQL

-2
