I am trying to read an XML-based file format called mzXML using SAX in Java. It contains partially encoded mass spectrometry data (signals with intensities).
This is what the element of interest looks like (more info on this):
<peaks ... >eJwBgAN//EByACzkZJkHP/NlAceAXLJAckeQ4CIUJz/203q2...</peaks>
The full file that causes the error in my case can be downloaded here.
Each of these records holds about 500 pairs of doubles (signal and intensity), compressed and Base64-encoded on a single line. I decode and decompress them to get the values (the conversion to doubles is not shown in the example below). This all works fine on a small dataset. With a large one, I ran into a problem that I don't understand:
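For context, the decoding step that the example leaves out could look roughly like the sketch below. It is only an illustration under stated assumptions: the values are taken to be 64-bit big-endian doubles, `java.util.Base64` is used in place of `DatatypeConverter`, and the names `PeakDecodingSketch`, `inflate`, `decodePeaks` and `encodePeaks` are hypothetical. `encodePeaks` exists only so the sketch can be run as a round trip without the real file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PeakDecodingSketch {

    /** Inflates zlib data; throws instead of looping forever on truncated input. */
    public static byte[] inflate(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("compressed data is incomplete");
                }
                out.write(buffer, 0, count);
            }
        } finally {
            inflater.end(); // release native zlib resources
        }
        return out.toByteArray();
    }

    /** Base64 text -> zlib bytes -> doubles. Assumption: 64-bit big-endian values. */
    public static double[] decodePeaks(String base64) throws DataFormatException {
        byte[] raw = inflate(Base64.getDecoder().decode(base64));
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        double[] values = new double[raw.length / Double.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getDouble();
        }
        return values;
    }

    /** Inverse of decodePeaks, used here only to build test input. */
    public static String encodePeaks(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES)
                .order(ByteOrder.BIG_ENDIAN);
        for (double v : values) {
            buf.putDouble(v);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(buf.array());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        // Made-up values laid out as (signal, intensity) pairs.
        double[] pairs = {288.5, 1.25, 292.0, 1.5};
        double[] decoded = decodePeaks(encodePeaks(pairs));
        for (int i = 0; i + 1 < decoded.length; i += 2) {
            System.out.println(decoded[i] + " -> " + decoded[i + 1]);
        }
    }
}
```

The guard inside `inflate` matters for the problem described here: if the Base64 text is cut short, the inflater never reaches `finished()`, so a plain `while (!inflater.finished())` loop can spin without making progress.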
The characters(ch, start, length) method does not receive the full record from the line shown above. The length value seems to be too small.
I did not notice this problem when I just printed the peak records to the console, because each one is very long and I could not tell that characters were missing. But decompression fails when data is missing. When I run the program repeatedly, it always splits the same line at the same point, without throwing an exception. If I modify the mzXML file, for example by removing a scan, the split occurs at a different position. I found this using breakpoints in the characters() method while inspecting the contents of currentValue.
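The SAX API allows a parser to deliver the text of a single element in several characters() calls, which would match the symptoms described above. A minimal sketch of a handler that tolerates this is shown here: it appends every chunk to a StringBuilder and only uses the text once endElement() is reached. The class name `PeaksTextCollector`, its fields, and the synthetic input in `main` are hypothetical and stand in for the real mzXML file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PeaksTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inPeaks = false;
    private int chunks = 0;

    /** Complete text of every peaks element, in document order. */
    public final List<String> peaks = new ArrayList<>();
    /** Number of characters() calls that contributed to each entry of peaks. */
    public final List<Integer> chunkCounts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (qName.equalsIgnoreCase("peaks")) {
            inPeaks = true;
            text.setLength(0); // start a fresh buffer for this element
            chunks = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPeaks) {
            text.append(ch, start, length); // may be only part of the content
            chunks++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inPeaks && qName.equalsIgnoreCase("peaks")) {
            peaks.add(text.toString()); // the text is complete only here
            chunkCounts.add(chunks);
            inPeaks = false;
        }
    }

    /** Parses the source and returns the full text of each peaks element. */
    public static List<String> readPeaks(InputSource source) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PeaksTextCollector handler = new PeaksTextCollector();
        parser.parse(source, handler);
        return handler.peaks;
    }

    public static void main(String[] args) throws Exception {
        // Synthetic stand-in for one long Base64 record.
        StringBuilder payload = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            payload.append("eJwB");
        }
        String xml = "<scan><peaks>" + payload + "</peaks></scan>";

        PeaksTextCollector handler = new PeaksTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);

        System.out.println("characters() calls: " + handler.chunkCounts.get(0));
        System.out.println("text complete: "
                + handler.peaks.get(0).equals(payload.toString()));
    }
}
```

How many chunks the parser produces depends on the implementation and its internal buffers, so the sketch reports the count rather than relying on it.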
Here is the code snippet needed to reproduce the problem:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;
import javax.xml.bind.DatatypeConverter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLFile {

    // Inflates zlib-compressed bytes into a new byte array.
    public static byte[] decompress(byte[] data) throws IOException, DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length);
        byte[] buffer = new byte[data.length * 2];
        while (!inflater.finished()) {
            int count = inflater.inflate(buffer);
            outputStream.write(buffer, 0, count);
        }
        outputStream.close();
        byte[] output = outputStream.toByteArray();
        return output;
    }

    public static void main(String args[]) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                // True while the parser is inside a <peaks> element.
                boolean peaks = false;

                public void startElement(String uri, String localName, String qName,
                        Attributes attributes) throws SAXException {
                    if (qName.equalsIgnoreCase("PEAKS")) {
                        peaks = true;
                    }
                }

                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (peaks) {
                        peaks = false;
                    }
                }

                public void characters(char ch[], int start, int length) throws SAXException {
                    if (peaks) {
                        // This is where currentValue turns out to be shorter than the record.
                        String currentValue = new String(ch, start, length);
                        System.out.println(currentValue);
                        try {
                            byte[] array = decompress(DatatypeConverter.parseBase64Binary(currentValue));
                            System.out.println(array[1]);
                        } catch (IOException | DataFormatException e) {
                            e.printStackTrace();
                        }
                        peaks = false;
                    }
                }
            };
            saxParser.parse("file1_zlib.mzxml", handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Is there a safer way to read large XML files? Can you tell me where the error comes from or how to avoid it?
Thanks, Michael