Artificial Intelligence, text classifier

I am new to AI. I want to build an application that classifies text using machine learning. The application should classify the various parts of an HTML document; for example, most web pages have a head, menu, sidebar, footer, main content, etc. I want to use a text classifier to identify these parts of an HTML document and to recognize the different types of forms on the page.

  • It would be very helpful if someone could provide detailed recommendations on this.
  • Examples of such applications would also be very useful.

I am looking for additional technical suggestions regarding code and implementation.

I can assign labels based on HTML tag attributes such as class or id, for example:

<div class="menu-1"> <div id="entry"> <div id="content"> <div id="footer"> <div id="comment-12"> <div id="comment-title"> 

For the first element, the training call might look like this:

TrainClassifier (label: "Menu", value: "menu-1", attribute: "class", line position: "21%", tag: "div");

Inputs

  • "menu-1" (attribute value)
  • List item
  • "class" (attribute name)
  • "21" (tag position in line)
  • "div" (tag name)

Output

  • "Menu" (the assigned label)

Which neural network library can accept the above inputs and classify them into labels (for example, "Menu")?
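To make the inputs and output concrete, here is a rough sketch in Python with scikit-learn. The data, feature names, and library choice are only my guesses to illustrate the idea; which library to actually use is exactly what I am asking.

    # Each training example is the set of attributes described above.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical training data extracted from my HTML documents.
    examples = [
        {"value": "menu-1",  "attribute": "class", "position": 21, "tag": "div"},
        {"value": "content", "attribute": "id",    "position": 45, "tag": "div"},
        {"value": "footer",  "attribute": "id",    "position": 90, "tag": "div"},
    ]
    labels = ["Menu", "Content", "Footer"]

    # Turn the string features into numeric vectors the classifier understands.
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(examples)

    classifier = MultinomialNB()
    classifier.fit(X, labels)

    # Classify a new, unseen element.
    new_element = {"value": "comment-12", "attribute": "id", "position": 75, "tag": "div"}
    print(classifier.predict(vectorizer.transform([new_element])))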

Not all users can write regular expressions or XPath; they need a simpler approach. It is therefore important that the software is intelligent: the user selects a part of the HTML document in a web browser control and trains the software until it can work on its own.

But I don't know how to build such a program using AI.

The AI that I am looking for should be able to accept various inputs and classify based on them. As I said, I am a newcomer to AI and know little about it.

It would be helpful to get an answer to the question I actually asked, for example which library to use and how to implement this. Answers suggesting XPath, regex, or other methods do not address the question; it often happens that you get every suggestion except the one you need.

3 answers

Answer 1

I suggest that you first learn some simple algorithms that are easy to understand. Here are a few pointers:

  • Naive Bayes (you will find many implementations, but you can also write one yourself; the algorithm is simple to implement yet powerful enough; see the sketch just after this list).
  • Maximum entropy (e.g. SharpMaxEnt, which is open source).
  • SVM (e.g. LibSVM, which has a C# port).
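To show how little code a Naive Bayes classifier for categorical features actually needs, here is a minimal from-scratch sketch in Python. The HTML-style features and labels are invented for illustration, and a tested library implementation is still preferable in practice.

    import math
    from collections import defaultdict

    class NaiveBayes:
        """Categorical Naive Bayes with Laplace smoothing."""

        def fit(self, rows, labels):
            self.label_counts = defaultdict(int)
            # feature_counts[label][feature_name][feature_value] -> count
            self.feature_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
            for row, label in zip(rows, labels):
                self.label_counts[label] += 1
                for name, value in row.items():
                    self.feature_counts[label][name][value] += 1
            self.total = len(labels)

        def predict(self, row):
            best_label, best_score = None, float("-inf")
            for label, count in self.label_counts.items():
                # log P(label) + sum of log P(feature value | label)
                score = math.log(count / self.total)
                for name, value in row.items():
                    seen = self.feature_counts[label][name]
                    score += math.log((seen.get(value, 0) + 1) / (count + len(seen) + 1))
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    nb = NaiveBayes()
    nb.fit([{"tag": "div", "attr": "class", "value": "menu-1"},
            {"tag": "div", "attr": "id", "value": "footer"}],
           ["Menu", "Footer"])
    print(nb.predict({"tag": "div", "attr": "id", "value": "footer"}))   # -> Footer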

If you want to get an idea of how these work, download the WEKA toolkit:

http://sourceforge.net/projects/weka/

The commonly performed steps are the following (a sketch of the whole pipeline follows this list):

  • Define as many attributes/features as possible (and a set of labels).
  • Gather data as sets of {Label, Attribute1, A2, A3, ...}.
  • Choose a minimal set of important attributes using feature selection algorithms (also available in the WEKA toolkit).
  • Train the classifier using a standard algorithm.
  • Evaluate the system until you reach the required accuracy, recall, or other metrics.
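As a rough illustration of these steps, here is a sketch of the pipeline in Python with scikit-learn instead of WEKA, using made-up data; it is meant only to show the shape of the workflow, not a finished solution.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    # 1. Define attributes/features and labels (here: invented toy data).
    rows = [
        {"tag": "div", "attr": "class", "value": "menu-1"},
        {"tag": "div", "attr": "id",    "value": "content"},
        {"tag": "div", "attr": "id",    "value": "footer"},
        {"tag": "div", "attr": "id",    "value": "comment-12"},
    ] * 10                                       # repeated so the split has enough samples
    labels = ["Menu", "Content", "Footer", "Comment"] * 10

    # 2. Gather the data as {label, attribute1, attribute2, ...} records.
    X = DictVectorizer().fit_transform(rows)

    # 3. Keep only the most informative features (feature selection).
    X = SelectKBest(chi2, k=5).fit_transform(X, labels)

    # 4. Train a classifier with a standard algorithm.
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
    model = MultinomialNB().fit(X_train, y_train)

    # 5. Evaluate until accuracy/recall are good enough.
    print(classification_report(y_test, model.predict(X_test)))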

Good luck.

Answer 2

This is a very broad topic. There are several neural network libraries for C#; just search for them on Stack Overflow.

You will need to do supervised training before you can perform any kind of classification. For an ANN to understand what you are throwing at it, you will need to figure out how to parse the HTML to get the inputs you are looking for.

As an example, most websites use CSS to lay out their content in the browser, while other sites use tables. You will need to train on both.
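A sketch of what that parsing and training stage could look like, using only Python's standard HTML parser and scikit-learn's small neural network; the tag and feature choices are assumptions and would need tuning for real pages.

    from html.parser import HTMLParser
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.neural_network import MLPClassifier

    class FeatureExtractor(HTMLParser):
        """Collect one feature dict per div/table element in the document."""
        def __init__(self):
            super().__init__()
            self.rows = []

        def handle_starttag(self, tag, attrs):
            if tag in ("div", "table", "td"):
                attrs = dict(attrs)
                self.rows.append({
                    "tag": tag,
                    "class": attrs.get("class", ""),
                    "id": attrs.get("id", ""),
                })

    parser = FeatureExtractor()
    parser.feed('<div class="menu-1"></div><div id="footer"></div>')

    # These rows would be labelled by hand ("Menu", "Footer", ...) and then fed
    # to a small neural network; both CSS-based and table-based layouts need to
    # appear in the training data.
    labels = ["Menu", "Footer"]
    X = DictVectorizer().fit_transform(parser.rows)
    ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    ann.fit(X, labels)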

Your problem is not easy.

Answer 3

Classification can help you when you have chunks of data to which you need to assign labels, but that does not seem to be the case here. You would be better off manually writing XPath rules to mark up your documents.
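For example, with lxml a handful of hand-written XPath rules can label the common blocks directly; the rules below are invented and would have to be adapted to each site.

    from lxml import html

    # Hand-written rules: label -> XPath expression.
    RULES = {
        "Menu":    '//div[starts-with(@class, "menu")]',
        "Content": '//div[@id="content"]',
        "Footer":  '//div[@id="footer"]',
        "Comment": '//div[starts-with(@id, "comment-")]',
    }

    def label_blocks(page_source):
        tree = html.fromstring(page_source)
        for label, xpath in RULES.items():
            for element in tree.xpath(xpath):
                yield label, element

    page = '<div class="menu-1">...</div><div id="footer">...</div>'
    for label, element in label_blocks(page):
        print(label, element.get("class") or element.get("id"))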

