How to convert a text file to CoNLL format for MaltParser?

I am trying to use MaltParser with a pre-trained English model. However, I don't know how to convert a body of English sentences into the CoNLL format that MaltParser requires. I could not find any documentation for this on the site. How should I do it?

Update: I am referring to the post "Create a .conll file as Stanford Parser output", which shows how to create a .conll file. However, it uses the Stanford Parser.

1 answer

CoreNLP has a CoNLL output option, but unfortunately the format it produces does not match what MaltParser expects. (Confusingly, several different CoNLL data formats are in common use, from different years of the shared task.)

If you run CoreNLP from the command line with the option -outputFormat conll, you will get the output in the following TSV format (example output at the end of the answer):

INDEX    WORD    LEMMA    POS    NER    DEPHEAD    DEPREL
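To make the column layout concrete, here is a minimal Python sketch that reads this seven-column output into per-sentence lists of dictionaries. The function name `read_corenlp_conll` and the field names are labels chosen for this illustration, not part of CoreNLP; the sketch assumes tab-separated columns and a blank line between sentences.

```python
# Field names mirror the header above; they are labels chosen for this sketch.
COLUMNS = ["index", "word", "lemma", "pos", "ner", "dephead", "deprel"]


def read_corenlp_conll(text):
    """Split CoreNLP CoNLL text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
            )
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "1\tThis\tthis\tDT\t_\t_\t_\n"
    "2\tis\tbe\tVBZ\t_\t_\t_\n"
    "\n"
    "1\tOk\tok\tUH\t_\t_\t_\n"
)
for sentence in read_corenlp_conll(sample):
    print([token["word"] for token in sentence])
```

Raising an error on a wrong column count is a deliberate choice here: a silently misaligned column is much harder to diagnose once the data reaches the parser.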

MaltParser expects a slightly different format, but its input/output data format is configurable. Try placing the following content in maltparser/appdata/dataformat/myconll.xml:

<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
    <column name="ID" category="INPUT" type="INTEGER"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="LEMMA" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="NER" category="IGNORE" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
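As a quick sanity check before running the parser, the column list in such a data format file can be compared against a line of CoreNLP output. The following Python sketch uses only the standard library; the helper names `column_names` and `matches_format` are hypothetical, and the check only compares column counts, not column types.

```python
import xml.etree.ElementTree as ET

# Same definition as above, embedded as a string so the sketch is self-contained.
DATAFORMAT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
    <column name="ID" category="INPUT" type="INTEGER"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="LEMMA" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="NER" category="IGNORE" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
"""


def column_names(dataformat_xml):
    """Return the column names of a data format definition, in order."""
    root = ET.fromstring(dataformat_xml.encode("utf-8"))
    return [column.get("name") for column in root.findall("column")]


def matches_format(line, names):
    """True if a tab-separated line has one field per declared column."""
    return len(line.rstrip("\n").split("\t")) == len(names)


names = column_names(DATAFORMAT_XML)
print(names)
print(matches_format("1\tThis\tthis\tDT\t_\t_\t_", names))
```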

Then add the format to your MaltParser configuration file (see the example configuration in maltparser/examples/optionexample.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment>
    <optioncontainer>
...
        <optiongroup groupname="input">
            <option name="format" value="myconll"/>
        </optiongroup>
    </optioncontainer>
...
</experiment>

You should then provide CoreNLP CoNLL output as training data for MaltParser.
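An alternative to defining a custom data format is to rewrite the CoreNLP output into the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which, as far as I know, is the format MaltParser reads by default. Here is a minimal Python sketch, assuming tab-separated seven-column input as shown in this answer. The function names `to_conllx` and `convert` are hypothetical, the NER column is dropped, and copying the POS tag into both CPOSTAG and POSTAG is a simplification on my part.

```python
def to_conllx(line):
    """Convert one CoreNLP CoNLL line (7 columns) to a CoNLL-X line (10 columns)."""
    if not line.strip():
        return ""  # keep blank lines: they separate sentences
    index, word, lemma, pos, _ner, head, deprel = line.rstrip("\n").split("\t")
    # CoNLL-X order: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
    return "\t".join([index, word, lemma, pos, pos, "_", head, deprel, "_", "_"])


def convert(text):
    """Convert a whole CoreNLP CoNLL document, preserving sentence breaks."""
    return "\n".join(to_conllx(line) for line in text.splitlines()) + "\n"


sample = "1\tThis\tthis\tDT\t_\t_\t_\n2\tis\tbe\tVBZ\t_\t_\t_\n"
print(convert(sample), end="")
```

With this route no custom XML file is needed, at the cost of an extra preprocessing step and of losing the NER column.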

I have not tested this, but if the MaltParser documentation is accurate, it should work.


Example CoreNLP CoNLL output (annotators: tokenize,ssplit,pos):

$ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null

1   This    this    DT  _   _   _
2   is  be  VBZ _   _   _
3   a   a   DT  _   _   _
4   test    test    NN  _   _   _
5   .   .   .   _   _   _