I am trying to write my own XPath ContentHandler for Tika that can handle a complex XPath expression, based on the code in org/apache/tika/sax/BodyContentHandler.java (I already use Tika for other things).
This XPath works:
/xhtml:html/xhtml:body/descendant::node()
But this one does not:
//xhtml:div[@id='someid']/descendant::node()
I want to combine Tika's ContentHandler (because it repairs unbalanced HTML tags and invalid characters) with the XPath evaluator from javax.xml.xpath. What is the right way to do this? Is there any way to get the source back after Tika has parsed and corrected the HTML content?
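To make the question concrete, here is a minimal sketch of what I have in mind, using only JDK classes. As far as I can tell, Tika's own matcher (org.apache.tika.sax.xpath.XPathParser) only supports a very simple XPath subset, which would explain why the predicate expression is rejected. The idea is to collect the SAX events into a DOM tree with a TransformerHandler (which is itself a ContentHandler) and then run javax.xml.xpath on that tree. The class name XPathOverSax and its methods are hypothetical, and the JDK SAX parser below is only a stand-in for Tika; with Tika, I assume the same handler would be passed as the ContentHandler argument of Parser.parse(...).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Tika. Checked exceptions are wrapped to keep the sketch short.
class XPathOverSax {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns a ContentHandler that appends every SAX event it receives to the given DOM document. */
    static TransformerHandler newDomHandler(Document target) {
        try {
            SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = tf.newTransformerHandler(); // identity transform
            handler.setResult(new DOMResult(target));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a DOM from well-formed XHTML. The JDK SAX parser stands in for Tika here. */
    static Document toDom(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            // Assumption: with Tika, this handler would be given to Parser.parse(...) instead.
            reader.setContentHandler(newDomHandler(doc));
            reader.parse(new InputSource(new StringReader(xhtml)));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates a full XPath 1.0 expression, with the "xhtml" prefix bound to the XHTML namespace. */
    static NodeList select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return Collections.emptyIterator(); }
            });
            return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the DOM back to markup, i.e. the source as it looks after parsing. */
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With something like this, select(toDom(xhtml), "//xhtml:div[@id='someid']/descendant::node()") would return the matching nodes, and serialize(doc) would give back the markup that was actually parsed. The obvious trade-off is that the whole document is held in memory as a DOM instead of being streamed, so I am not sure this is the intended way to do it.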