Speeding up XML Schema validation of a batch of XML files against the same XML Schema (XSD)

I would like to speed up the process of validating a batch of XML files against the same XML Schema (XSD). The only limitation is that I'm in a PHP environment.

My current problem is that the schema I want to validate against imports the rather complex, 2755-line XHTML schema (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd). Even for very simple data, validation takes a long time (about 30 seconds per run). Since I have thousands of XML files in my batch, this does not scale very well.

To check the XML file, I use both of these methods from the standard php-xml libraries.

  • DOMDocument::schemaValidate
  • DOMDocument::schemaValidateSource

I think that the PHP implementation fetches the XHTML schema over HTTP and builds some internal representation (possibly a DOMDocument), and that this is discarded once validation is complete. I thought some option of the XML libraries might change this behavior so that something in this process is cached for reuse.

I created a simple test setup that illustrates my problem:

test-schema.xsd

    <xs:schema attributeFormDefault="unqualified"
               elementFormDefault="qualified"
               targetNamespace="http://myschema.example.com/"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:myschema="http://myschema.example.com/"
               xmlns:xhtml="http://www.w3.org/1999/xhtml">
      <xs:import schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
                 namespace="http://www.w3.org/1999/xhtml">
      </xs:import>
      <xs:element name="Root">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="MyHTMLElement">
              <xs:complexType>
                <xs:complexContent>
                  <xs:extension base="xhtml:Flow"></xs:extension>
                </xs:complexContent>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

test-data.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <Root xmlns="http://myschema.example.com/"
          xmlns:xhtml="http://www.w3.org/1999/xhtml"
          xmlns:xml="http://www.w3.org/XML/1998/namespace"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
      <MyHTMLElement>
        <xhtml:p>This is an XHTML paragraph!</xhtml:p>
      </MyHTMLElement>
    </Root>

schematest.php

    <?php
    $data_dom = new DOMDocument();
    $data_dom->load('test-data.xml');

    // Multiple validations using the schemaValidate method.
    for ($attempt = 1; $attempt <= 3; $attempt++) {
        $start = time();
        echo "schemaValidate: Attempt #$attempt returns ";
        if (!$data_dom->schemaValidate('test-schema.xsd')) {
            echo "Invalid!";
        } else {
            echo "Valid!";
        }
        $end = time();
        echo " in " . ($end - $start) . " seconds.\n";
    }

    // Loading schema into a string.
    $schema_source = file_get_contents('test-schema.xsd');

    // Multiple validations using the schemaValidateSource method.
    for ($attempt = 1; $attempt <= 3; $attempt++) {
        $start = time();
        echo "schemaValidateSource: Attempt #$attempt returns ";
        if (!$data_dom->schemaValidateSource($schema_source)) {
            echo "Invalid!";
        } else {
            echo "Valid!";
        }
        $end = time();
        echo " in " . ($end - $start) . " seconds.\n";
    }

Running this schematest.php file leads to the following output:

    schemaValidate: Attempt #1 returns Valid! in 30 seconds.
    schemaValidate: Attempt #2 returns Valid! in 30 seconds.
    schemaValidate: Attempt #3 returns Valid! in 30 seconds.
    schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
    schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
    schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.

Any help and suggestions to solve this problem are very welcome!

2 answers

You can safely subtract 30 seconds from the timing values as overhead.

Remote requests to the W3C servers are delayed because most libraries do not cache the documents (even though the HTTP headers suggest doing so). But read for yourself:

The W3C servers are slow to return DTDs. Is the delay intentional?

Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schemas (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.

W3.org tries to keep the number of requests low. That is understandable. PHP's DOMDocument is based on libxml, and libxml allows you to set an external entity loader. The whole Catalog support section of the libxml documentation is interesting in this case.

To solve the problem at hand, set up a catalog.xml file:

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
              uri="xhtml1-transitional.xsd"/>
      <system systemId="http://www.w3.org/2001/xml.xsd"
              uri="xml.xsd"/>
    </catalog>

Save a copy of the two .xsd files, under the names given in that catalog file, next to the catalog (relative paths as well as absolute file:///... paths work if you prefer a different directory).

Then make sure the environment variable XML_CATALOG_FILES is set to the file name of the catalog.xml file. When everything is set up, validation runs straight through:

    schemaValidate: Attempt #1 returns Valid! in 0 seconds.
    schemaValidate: Attempt #2 returns Valid! in 0 seconds.
    schemaValidate: Attempt #3 returns Valid! in 0 seconds.
    schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
    schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
    schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.

If it still takes a long time, that is just a sign that the environment variable is not set to the right location. I have covered setting the variable, as well as some edge cases, in a blog post. It should take care of various edge cases, such as file names containing spaces.
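
A quick way to see whether the variable points at the right place is to inspect it from PHP before validating. The following is a minimal sketch and not part of the blog post; the helper name findMissingCatalogs is hypothetical, and it assumes the variable holds a whitespace-separated list of plain paths or file:// URIs in which spaces are written as %20.

```php
<?php
/**
 * Return the entries of an XML_CATALOG_FILES value that do not point
 * to a readable file. Hypothetical helper, for diagnostics only.
 */
function findMissingCatalogs(string $value): array
{
    $missing = [];
    // Assumption: entries are separated by whitespace.
    foreach (preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY) as $entry) {
        // Strip an optional file:// scheme and decode %20 and similar escapes.
        $path = rawurldecode(preg_replace('~^file://~', '', $entry));
        if (!is_readable($path)) {
            $missing[] = $entry;
        }
    }
    return $missing;
}

$value = getenv('XML_CATALOG_FILES');
if ($value === false || trim($value) === '') {
    echo "XML_CATALOG_FILES is not set.\n";
} else {
    foreach (findMissingCatalogs($value) as $entry) {
        echo "Catalog not readable: $entry\n";
    }
}
```

If this reports nothing and validation is still slow, the catalog is probably found but does not contain an entry for the URL being requested.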

Alternatively, you can create a simple external entity loader callback function that uses a URL => local file mapping in the form of an array:

    $mapping = [
        'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd' => 'schema/xhtml1-transitional.xsd',
        'http://www.w3.org/2001/xml.xsd'                          => 'schema/xml.xsd',
    ];

As this shows, I have placed verbatim copies of these two XSD files in a subdirectory called schema. The next step is to use libxml_set_external_entity_loader to activate the callback function with the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters a non-file that has no mapping, a RuntimeException is thrown with a detailed message:

    libxml_set_external_entity_loader(
        function ($public, $system, $context) use ($mapping) {
            if (is_file($system)) {
                return $system;
            }

            if (isset($mapping[$system])) {
                return __DIR__ . '/' . $mapping[$system];
            }

            $message = sprintf(
                "Failed to load external entity: Public: %s; System: %s; Context: %s",
                var_export($public, 1),
                var_export($system, 1),
                strtr(var_export($context, 1), [" (\n " => '(', "\n " => '', "\n" => ''])
            );

            throw new RuntimeException($message);
        }
    );

After setting this external entity loader, there is no longer any delay from remote requests.

That's it. See the Gist. Take care: this external entity loader was written for loading the XML file to validate from disk and for "resolving" the XSD URIs to local file names. Other kinds of operations (such as DTD-based validation) may require changes or extensions to the code. The XML catalog is preferable. It also works with other tools.
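
For the batch case from the question, once schema resolution is local, the loop itself can stay simple. The following is a minimal sketch under stated assumptions: the function name validateBatch and the file names in the usage comment are hypothetical, and it relies only on DOMDocument::load, DOMDocument::schemaValidate and the libxml_* error functions.

```php
<?php
/**
 * Validate several XML files against one schema file and collect
 * the libxml error messages per file (empty array = valid).
 * Hypothetical helper, not taken from the Gist.
 */
function validateBatch(array $xmlFiles, string $schemaFile): array
{
    // Collect errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $results = [];

    foreach ($xmlFiles as $xmlFile) {
        $dom = new DOMDocument();
        $ok = $dom->load($xmlFile) && $dom->schemaValidate($schemaFile);

        $messages = [];
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
        }
        libxml_clear_errors();

        if (!$ok && $messages === []) {
            $messages[] = 'validation failed without a libxml message';
        }
        // Warnings reported for a valid document are not kept.
        $results[$xmlFile] = $ok ? [] : $messages;
    }

    // Restore the previous error handling mode.
    libxml_use_internal_errors($previous);
    return $results;
}

// Usage (hypothetical file names):
// $report = validateBatch(glob('batch/*.xml'), 'test-schema.xsd');
```

Note that schemaValidate is still called once per file here, so the schema is read again on every call; the sketch only shows how to run the batch and gather errors, not a way to reuse a compiled schema.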


As an alternative to @hakre's answer: download the external resource (DTD or XSD) once, then use the downloaded copy:

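    libxml_set_external_entity_loader(
        function ($public, $system, $context) {
            // Local files are used directly.
            if (is_file($system)) {
                return $system;
            }

            // Deterministic cache path per URL. tempnam() is not usable here:
            // it creates a new, empty file with a unique name on every call.
            $cached_file = sys_get_temp_dir() . '/' . md5($system);
            if (is_file($cached_file)) {
                return $cached_file;
            }

            // First request: download the resource once, then reuse the copy.
            copy($system, $cached_file);
            return $cached_file;
        }
    );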