My environment:
- CoreNLP 3.5.1
- Stanford-Chinese-corenlp-2015-01-30-models
- Chinese default property file StanfordCoreNLP-chinese.properties, with:
  annotators = segment, ssplit
My test text is "這是第一個句子。這是第二個句子。"
I retrieve the sentences with:
import scala.collection.JavaConverters._

var count = 0
val sentences = annotation.get(classOf[SentencesAnnotation]).asScala
for (sent <- sentences) {
  count += 1
  println(s"sentence$count = " + sent.get(classOf[TextAnnotation]))
}
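For context, my full pipeline setup looks roughly like this (a minimal sketch; I load the default Chinese properties from the models jar and only keep the two annotators):

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object SplitTest {
  def main(args: Array[String]): Unit = {
    // Load the defaults shipped in stanford-chinese-corenlp-*-models.jar,
    // then restrict the pipeline to segmentation + sentence splitting.
    val props = new Properties()
    props.load(getClass.getClassLoader.getResourceAsStream("StanfordCoreNLP-chinese.properties"))
    props.setProperty("annotators", "segment, ssplit")
    val pipeline = new StanfordCoreNLP(props)

    val annotation = new Annotation("這是第一個句子。這是第二個句子。")
    pipeline.annotate(annotation)

    var count = 0
    for (sent <- annotation.get(classOf[SentencesAnnotation]).asScala) {
      count += 1
      println(s"sentence$count = " + sent.get(classOf[TextAnnotation]))
    }
  }
}
```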
It always prints the entire test text as one sentence instead of the two sentences I expect:
sentence1 = 這是第一個句子。這是第二個句子。
expected:
expected sentence1 = 這是第一個句子。
expected sentence2 = 這是第二個句子。
I get the same result even if I add additional properties, for example:
ssplit.eolonly = false
ssplit.isOneSentence = false
ssplit.newlineIsSentenceBreak = always
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[!?]+
CoreNLP logs:
Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
Adding annotator segment
Loading Segmentation Model [edu/stanford/nlp/models/segmenter/chinese/ctb.gz]...Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 files:
edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
loading dictionaries from edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200
done [56.9 sec].
done. Time elapsed: 57041 ms
Adding annotator ssplit
Adding Segmentation annotation...output: [null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null]
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
這是第一個句子。這是第二個句子。
[這是, 第一, 個, 句子, 。, 這是, 第二, 個, 句子, 。]
done. Time elapsed: 419 ms
I have seen someone else get the following log line (with CoreNLP 3.5.0); oddly enough, it never appears for me:
Adding annotator ssplit edu.stanford.nlp.pipeline.AnnotatorImplementations:ssplit.boundaryTokenRegex=[.]|[!?]+|[。]|[!?]+
What is the problem? Is there a workaround? If it can't be fixed, I could split the text myself, but I don't know how to integrate my own splits into the CoreNLP pipeline.
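For the record, the manual split I have in mind is just plain Scala (a sketch, assuming the full-width terminators 。!? are the only sentence boundaries I care about); the part I don't know is how to feed these pieces back into the pipeline:

```scala
// Split on full-width sentence terminators, keeping each terminator
// attached to its sentence via a zero-width lookbehind.
def splitSentences(text: String): Seq[String] =
  text.split("(?<=[。!?])").toSeq.filter(_.nonEmpty)

// splitSentences("這是第一個句子。這是第二個句子。")
// → Seq("這是第一個句子。", "這是第二個句子。")
```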