How to shred (flatten) arbitrary XML into a flat data structure?

Not actually a duplicate of "Importing Custom XML into SQL Server".

My company has 20 GB of XML files that it wants to analyze. The analytics tool will be SAS, which I have never used; someone else will do the actual production work. My task is to find a way to convert the XML files into a relatively flat data structure that can be imported into SAS. The files come from half a dozen different sources and span six years. Although they all nominally describe the same thing (the very detailed results of a credit request), they do not follow an agreed format, not even files from the same source, because the document version has changed significantly over time. No XSL, XSD or XSLT documents are available.

It seems that the answer is "you want a document database", but apparently SAS needs either something flat, such as a CSV or another wide-table structure, or something relational. My experience is primarily with SQL Server, but if there are solutions on other platforms, we are definitely open to them. We even tried Microsoft Excel, but it does not interpret the files usefully (it parses them without errors, but gives the columns meaningless names).
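To make the "wide table with meaningful column names" idea concrete, here is a minimal Python sketch, not a production converter. It assumes each document is small enough to parse in memory, and the names `flatten` and `to_wide_csv` are hypothetical. Column names are built from the element path plus an occurrence number, so repeated elements get their own columns instead of anonymous ones.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

def flatten(xml_text):
    """Return {column_name: value} with one column per leaf-element occurrence."""
    record, seen = {}, Counter()

    def walk(elem, parent):
        path = f"{parent}.{elem.tag}" if parent else elem.tag
        if len(elem) == 0:  # leaf element: no child elements
            seen[path] += 1
            record[f"{path}_{seen[path]}"] = (elem.text or "").strip()
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return record

def to_wide_csv(documents):
    """Write one CSV row per document; columns are the union of all names seen."""
    records = [flatten(d) for d in documents]
    columns = sorted({name for r in records for name in r})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Hypothetical sample documents with differing structure.
docs = ["<r><a>1</a><b>x</b></r>", "<r><a>2</a><a>3</a></r>"]
wide = to_wide_csv(docs)
```

Documents that lack a column simply get an empty cell. The column names would probably still need shortening or renaming before a SAS import, which this sketch does not attempt.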

I looked at the idea of writing C# code to create a SQL schema based on the XML data, hoping that, at least within a single source, the structures can be made consistent enough to fit all files. I also looked at using SQLXML Bulk Load to create the tables, but that requires a SQL-annotated XSD schema, and there is no tool to generate one.
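Before generating any schema, it may help to measure how consistent the structures within a source really are. The following Python sketch is one possible way to do that, under the assumption that the files can be parsed one at a time; `element_paths` and `path_inventory` are hypothetical names. It lists every element path and counts how many documents contain it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_paths(xml_text):
    """Return a Counter of slash-separated element paths found in one document."""
    counts = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

def path_inventory(documents):
    """Map each path to the number of documents in which it appears at least once."""
    seen_in = Counter()
    for doc in documents:
        seen_in.update(set(element_paths(doc)))
    return seen_in

# Hypothetical samples standing in for two versions of one source's format.
docs = [
    "<Response><Customer><Name>A</Name></Customer><Score>700</Score></Response>",
    "<Response><Customer><Name>B</Name><Dob>1980</Dob></Customer></Response>",
]
inventory = path_inventory(docs)
```

Paths that appear in every document are candidates for fixed columns; paths that appear in only a few documents point to version drift that the schema has to tolerate.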

We considered using the xml data type methods to get the data into a table like this:

CREATE TABLE ResponseData
(
    CustomerID INT,
    NodePath VARCHAR(500),
    Position SMALLINT,
    Value VARCHAR(500)
)

but we feel that there must be a way to get a more useful shredding of the information than this.
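For reference, rows of the shape (CustomerID, NodePath, Position, Value) can be produced without any schema. The Python sketch below is an illustration only, not a recommendation over the xml data type methods. The function name `shred` is hypothetical, and Position is assumed here to mean "the n-th occurrence of this path within the document", which is only one possible reading of that column.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def shred(xml_text, customer_id):
    """Return (customer_id, node_path, position, value) rows for leaves and attributes."""
    rows = []
    occurrences = Counter()  # how many times each path has been emitted so far

    def emit(path, value):
        occurrences[path] += 1
        rows.append((customer_id, path, occurrences[path], value))

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        for name, value in elem.attrib.items():
            emit(f"{path}/@{name}", value)
        text = (elem.text or "").strip()
        if text or len(elem) == 0:  # element with text, or an empty leaf
            emit(path, text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return rows

# Hypothetical sample document.
sample = (
    '<Response><Applicant id="7"><Name>Ann</Name></Applicant>'
    "<Applicant><Name>Bob</Name></Applicant></Response>"
)
rows = shred(sample, customer_id=42)
```

One limitation of this shape: the rows do not record which parent instance a value belongs to, so if one Applicant lacks a Name, the positions of Name no longer line up with the positions of Applicant.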



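Here are two SAS macros that use regular expressions to search an XML blob held in a character variable. prxCount returns the number of times an element occurs, and prxExtract extracts the value (and optionally the attributes) of a given occurrence.

Note that this only works when the XML fits within 32767 characters, the maximum length of a SAS character variable.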
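Because a SAS character variable holds at most 32767 bytes, it can be worth checking up front which documents fit. This is a small Python sketch of such a check; the function name is hypothetical and the UTF-8 encoding is an assumption, since the real byte count depends on the SAS session encoding. Line breaks are removed first because the macros below strip CR and LF as well.

```python
SAS_MAX_CHAR_BYTES = 32767  # maximum length of a SAS character variable

def fits_in_sas_char(xml_text, encoding="utf-8"):
    """True if the document, with CR/LF removed, fits in one SAS character variable."""
    one_line = xml_text.replace("\r", "").replace("\n", "")
    return len(one_line.encode(encoding)) <= SAS_MAX_CHAR_BYTES

# Hypothetical documents: one tiny, one well over the limit.
small = "<test>\n  <ANI>7606256091</ANI>\n</test>"
large = "<test>" + "<ANI>7606256091</ANI>" * 2000 + "</test>"
```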

/*****************************************************************************
**  PROGRAM: MACROS.PRXCOUNT.SAS
**
**  RETURNS THE NUMBER OF TIMES A SEGMENT IS FOUND IN AN XML FILE.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iXMLField     : The name of the field that contains the XML blob to parse.
**  iDelimiterType: (1, 2 or 3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES []. 3 USES THE MASKED HTML CODES FOR < AND >.
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 25-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 14-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
*****************************************************************************/

%macro prxCount(iElement=, iXMLField=, iDelimiterType=1);

  %local delim_open delim_close;

  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField,crLf,);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxCount.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  prx_id = prxparse("/&delim_open&iElement(\s+.*?&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_count = 0;
  prx_start = 1;
  prx_stop  = length(&iXMLField);
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_count = prx_count + 1;
    call prxposn(prx_id, 1, prx_pos, prx_length);
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx_:;

%mend;

/*****************************************************************************
**  PROGRAM: PRXEXTRACT.SAS
**
**  SEARCHES THROUGH AN XML (OR HTML) FILE FOR AN ELEMENT AND EXTRACTS THE 
**  VALUE BETWEEN AN ELEMENTS TAGS.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iField        : The fieldname to save the result to.
**  iType         : (N or C) for Numeric or Character.
**  iLength       : The length of the field to create.  
**  iXMLField     : The name of the field that contains the XML blob to parse.
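**  iDelimiterType: (1, 2 or 3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES []. 3 USES THE MASKED HTML CODES FOR < AND >.
**  iSequence     : Which occurrence of the element to extract. Defaults to 1.
**  iAttributesField: (Optional) The fieldname to save the element's attributes to.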
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 14-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 16-FEB-2011  BY:RP
**  - ADDED OPTION TO CHANGE DELIMITERS FROM <> TO []
**  1.1 MODIFIED: 17-FEB-2011  BY:RP
**  - CORRECTED ERROR WHEN MATCH RETURNS A LENGTH OF ZERO
**  - CORRECTED MISSING AMPERSAND FROM IDELIMITERTYPE CHECK.
**  - ADDED ESCAPING QUOTES TO [] DELIMITER TYPE
**  - CORRECTED WARNING WHEN MATCH RETURNS MISSING NUMERIC FIELD
**  1.2 MODIFIED: 25-FEB-2011  BY:RP
**  - ADDED DELIMITER TYPES TO WORK WITH MASKED HTML CODES
**  1.3 MODIFIED: 11-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
**  1.4 MODIFIED: 14-MAR-2011  BY:RP
**  - CORRECTED TO REMOVE FALSE MATCHES FROM PRIOR VERSION. ADDED EXAMPLE.
**  1.5 MODIFIED: 10-APR-2012  BY:RP
**  - CORRECTED PROBLEM WITH ZERO LENGTH STRING MATCHES
**  1.6 MODIFIED: 22-MAY-2012  BY:RP
**  - ADDED ABILITY TO CAPTURE ATTRIBUTES
*****************************************************************************/

%macro prxExtract(iElement=, iField=, iType=, iLength=, iXMLField=, iDelimiterType=1, iSequence=1, iAttributesField=);

  %local delim_open delim_close;

  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField,crLf,);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxExtract.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  %if %sysfunc(index(&iField,[)) %then %do;
    /* DONT DO THIS IF ITS AN ARRAY */
  %end;
  %else %do;
    %if "%upcase(&iType)" eq "N" %then %do;
      attrib &iField length=&iLength format=best.;
    %end;
    %else %do;
      attrib &iField length=$&iLength format=$&iLength..;
    %end;
  %end;

  /*
  ** BREAKDOWN OF REGULAR EXPRESSION (EXAMPLE USES < AND > AS DELIMS AND ANI AS THE ELEMENT BEING LOOKED FOR):
  **
  ** &delim_open&iElement                            -->  FINDS <ANI
  ** ((\s+.*?)&delim_close|&delim_close){1}?         -->  FINDS THE SHORTEST SINGLE INSTANCE OF EITHER:
  **                                                      - ONE OR MORE SPACES FOLLOWED BY ANYTHING UNTIL A > CHARACTER
  **                                                        (THE SPACES AND ATTRIBUTES ARE CAPTURED INTO BUFFER 2)
  **                                                      - OR JUST A > CHARACTER
  **                                                      THE OUTER ( AND ) FORM BUFFER 1.
  ** (.*?)                                           -->  FINDS WHAT WE ARE SEARCHING FOR AND CAPTURES IT INTO BUFFER 3.
  ** &delim_open                                     -->  FINDS <
  ** \/                                              -->  FINDS THE / CHARACTER. THE FIRST SLASH ESCAPES IT SO IT KNOWS ITS NOT A SPECIAL REGEX SLASH
  ** &iElement&delim_close                           -->  FINDS ANI>
  */
  prx_id = prxparse("/&delim_open&iElement((\s+.*?)&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_start = 1;
  prx_stop = length(&iXMLField);
  prx_sequence = 0;
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_sequence = prx_sequence + 1;
    if prx_sequence = &iSequence then do;
      if prx_length > 0 then do;

        call prxposn(prx_id, 3, prx_pos, prx_length);
        %if "%upcase(&iType)" eq "N" %then %do;
          length prx_tmp_n $200;
          prx_tmp_n = substr(&iXMLField, prx_pos, prx_length);
          if cats(prx_tmp_n) ne "" then do;
            &iField = input(substr(&iXMLField, prx_pos, prx_length), ?best.);
          end;
        %end;
        %else %do;          
          if prx_length ne 0 then do;
            &iField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iField = "";
          end;
        %end;

        **
        ** ALSO SAVE THE ATTRIBUTES TO A FIELD IF REQUESTED
        *;
        %if "%upcase(&iAttributesField)" ne "" %then %do;
          call prxposn(prx_id, 2, prx_pos, prx_length);
          if prx_length ne 0 then do;
            &iAttributesField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iAttributesField = "";
          end;
        %end;

      end;
    end;
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx:;

%mend;

Example usage, extracting a single element and its attributes:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><XNI xniattrib=1>7606256091</XNI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=xni, iField=my_xni, iType=c, iLength=15, iXMLField=xml, iSequence=1, iAttributesField=my_xni_attribs);

run;

A second example, extracting a repeated element into arrays:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=ani2digits, iField=ani2digits, iType=c, iLength=50, iXMLField=xml);

  length ani1-ani6 $15;
  length attr1-attr6 $100;
  array arrani [1:6] $ ani1-ani6;
  array arrattr [1:6] $ attr1-attr6;
  %prxCount  (iElement=ani, iXMLField=xml, iDelimiterType=1);
  do cnt=1 to prx_count;
    %prxExtract(iElement=ani, iField=arrani[cnt], iType=c, iLength=15, iXMLField=xml, iSequence=cnt, iAttributesField=arrattr[cnt]);
  end;

run;
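
For readers without SAS at hand, the same pattern can be tried in Python's `re` module. This is an illustrative port for delimiter type 1 only, not the macros themselves; `prx_count` and `prx_extract` are hypothetical names that mirror the macros, and the XML string is the one from the second example.

```python
import re

def _pattern(element):
    # Same shape as the prxparse pattern in prxExtract for delimiter type 1;
    # the {1}? quantifier is omitted because it does not change what is matched.
    name = re.escape(element)
    return re.compile(rf"<{name}((\s+.*?)>|>)(.*?)</{name}>", re.IGNORECASE)

def prx_count(element, xml):
    """Count non-overlapping occurrences of the element, like the prxCount macro."""
    return sum(1 for _ in _pattern(element).finditer(xml))

def prx_extract(element, xml, sequence=1):
    """Return (value, attributes) for the n-th occurrence, or None if there is none."""
    for n, match in enumerate(_pattern(element).finditer(xml), start=1):
        if n == sequence:
            # Group 3 is the element value, group 2 the raw attribute text.
            return match.group(3), (match.group(2) or "")
    return None

xml = (
    "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI>"
    "<ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"
)
count = prx_count("ani", xml)
third = prx_extract("ani", xml, sequence=3)
```

Note that `<ANI2Digits>` is not counted as an `ANI` element, because the pattern requires whitespace or the closing delimiter directly after the element name.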