"Malformed UTF-8 character (fatal)" error when parsing XML with XML::LibXML

I am parsing XML files using XML::LibXML. For the following XML entry, I get an error:

Malformed UTF-8 character (fatal) at C:/Perl64/site/lib/XML/LibXML/Error.pm line 217 

which is the line

 $context=~s/[^\t]/ /g; 

The XML entry is as follows:

 <MedlineCitation Owner="NLM" Status="MEDLINE">
   <PMID Version="1">15177811</PMID>
   <DateCreated>
     <Year>2004</Year>
     <Month>06</Month>
     <Day>04</Day>
   </DateCreated>
   <DateCompleted>
     <Year>2004</Year>
     <Month>08</Month>
     <Day>11</Day>
   </DateCompleted>
   <DateRevised>
     <Year>2011</Year>
     <Month>04</Month>
     <Day>07</Day>
   </DateRevised>
   <Article PubModel="Print">
     <Journal>
       <ISSN IssnType="Print">0278-2626</ISSN>
       <JournalIssue CitedMedium="Print">
         <Volume>55</Volume>
         <Issue>2</Issue>
         <PubDate>
           <Year>2004</Year>
           <Month>Jul</Month>
         </PubDate>
       </JournalIssue>
       <Title>Brain and cognition</Title>
       <ISOAbbreviation>Brain Cogn</ISOAbbreviation>
     </Journal>
     <ArticleTitle>Efficiency of orientation channels in the striate cortex for distributed categorization process.</ArticleTitle>
     <Pagination>
       <MedlinePgn>352-4</MedlinePgn>
     </Pagination>
     <Affiliation>Cognitive Science Department, Université de Liège, Belgium. mmermillod@ulg.ac.be</Affiliation>
     <AuthorList CompleteYN="Y">
       <Author ValidYN="Y">
         <LastName>Mermillod</LastName>
         <ForeName>Martial</ForeName>
         <Initials>M</Initials>
       </Author>
       <Author ValidYN="Y">
         <LastName>Chauvin</LastName>
         <ForeName>Alan</ForeName>
         <Initials>A</Initials>
       </Author>
       <Author ValidYN="Y">
         <LastName>Guyader</LastName>
         <ForeName>Nathalie</ForeName>
         <Initials>N</Initials>
       </Author>
     </AuthorList>
     <Language>eng</Language>
     <PublicationTypeList>
       <PublicationType>Journal Article</PublicationType>
     </PublicationTypeList>
   </Article>
   <MedlineJournalInfo>
     <Country>United States</Country>
     <MedlineTA>Brain Cogn</MedlineTA>
     <NlmUniqueID>8218014</NlmUniqueID>
     <ISSNLinking>0278-2626</ISSNLinking>
   </MedlineJournalInfo>
   <CitationSubset>IM</CitationSubset>
   <CommentsCorrectionsList>
     <CommentsCorrections RefType="ErratumIn">
       <RefSource>Brain Cogn. 2005 Jul;58(2):245</RefSource>
     </CommentsCorrections>
     <CommentsCorrections RefType="RepublishedIn">
       <RefSource>Brain Cogn. 2005 Jul;58(2):246-8</RefSource>
       <PMID Version="1">16044513</PMID>
     </CommentsCorrections>
   </CommentsCorrectionsList>
   <MeshHeadingList>
     <MeshHeading>
       <DescriptorName MajorTopicYN="Y">Neural Networks (Computer)</DescriptorName>
     </MeshHeading>
     <MeshHeading>
       <DescriptorName MajorTopicYN="N">Neurons</DescriptorName>
       <QualifierName MajorTopicYN="N">physiology</QualifierName>
     </MeshHeading>
     <MeshHeading>
       <DescriptorName MajorTopicYN="N">Orientation</DescriptorName>
       <QualifierName MajorTopicYN="Y">physiology</QualifierName>
     </MeshHeading>
     <MeshHeading>
       <DescriptorName MajorTopicYN="N">Pattern Recognition, Visual</DescriptorName>
       <QualifierName MajorTopicYN="Y">physiology</QualifierName>
     </MeshHeading>
     <MeshHeading>
       <DescriptorName MajorTopicYN="N">Visual Cortex</DescriptorName>
       <QualifierName MajorTopicYN="Y">physiology</QualifierName>
     </MeshHeading>
   </MeshHeadingList>
 </MedlineCitation>

All I want from this entry is PMID, DateRevised, PubDate, ArticleTitle, CommentsCorrectionsList and MeshHeadingList. If I delete the Affiliation element, which contains non-ASCII characters, the error goes away. How can I fix this error?

perl parsing utf-8 xml-libxml
Oct 05
1 answer

You can either convert the file to the encoding the parser expects (UTF-8), or declare the encoding the file actually uses, for example <?xml version="1.0" encoding="cp1252"?>.
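To see why the parser complains: an XML document with no encoding declaration is read as UTF-8, and in cp1252 the "é" of "Université" is the single byte 0xE9, which is not a valid UTF-8 sequence on its own. The sketch below assumes, as this answer does, that the file is really cp1252. The helper name to_utf8_bytes is made up for illustration; it uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the input as UTF-8 bytes.
# Bytes that already form valid UTF-8 are returned unchanged;
# anything else is assumed to be in $fallback (an assumption, e.g. 'cp1252').
sub to_utf8_bytes {
    my ($raw, $fallback) = @_;

    # Strict decode: croaks on the first malformed sequence.
    # LEAVE_SRC keeps decode() from modifying $raw in place.
    my $is_utf8 = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $raw if $is_utf8;

    # Not UTF-8: reinterpret the bytes in the fallback encoding.
    return encode('UTF-8', decode($fallback, $raw));
}

# The affiliation text as a cp1252 file would store it.
my $cp1252_bytes = "Universit\xE9 de Li\xE8ge";
my $utf8_bytes   = to_utf8_bytes($cp1252_bytes, 'cp1252');

# Show the bytes in hex so the change is visible on any terminal.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
```

Each accented letter becomes a two-byte UTF-8 sequence, which is what the parser expects when no other encoding is declared.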

Notepad can be used to convert to UTF-8, and so can Perl:

 perl -pe"
    BEGIN {
       binmode STDIN,  ':encoding(cp1252)';
       binmode STDOUT, ':encoding(UTF-8)';
    }
 " < file.cp1252 > file.UTF-8

(You will need to remove the line breaks that I added for readability.)
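Once the data is UTF-8, the fields listed in the question can be read with XPath. This is a rough sketch rather than a complete solution: it assumes XML::LibXML is installed, and it parses a trimmed copy of the entry from an inline string (with the accented letters written as UTF-8 byte escapes) instead of a real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the entry, as UTF-8 bytes (\xC3\xA9 = e-acute, \xC3\xA8 = e-grave).
my $xml = <<"XML";
<MedlineCitation Owner="NLM" Status="MEDLINE">
  <PMID Version="1">15177811</PMID>
  <DateRevised><Year>2011</Year><Month>04</Month><Day>07</Day></DateRevised>
  <Article PubModel="Print">
    <ArticleTitle>Efficiency of orientation channels in the striate cortex for distributed categorization process.</ArticleTitle>
    <Affiliation>Universit\xC3\xA9 de Li\xC3\xA8ge, Belgium.</Affiliation>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName MajorTopicYN="Y">Neural Networks (Computer)</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName MajorTopicYN="N">Neurons</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
XML

# No encoding declaration, so the parser reads the bytes as UTF-8.
my $doc      = XML::LibXML->load_xml(string => $xml);
my $citation = $doc->documentElement;

my $pmid    = $citation->findvalue('./PMID');
my $title   = $citation->findvalue('./Article/ArticleTitle');
my $revised = join '-', map { $citation->findvalue("./DateRevised/$_") } qw(Year Month Day);
my @mesh    = map { $_->textContent }
              $citation->findnodes('./MeshHeadingList/MeshHeading/DescriptorName');

# Values come back as character strings, so encode them on output.
binmode STDOUT, ':encoding(UTF-8)';
print "PMID:    $pmid\n";
print "Revised: $revised\n";
print "Title:   $title\n";
print "MeSH:    $_\n" for @mesh;
```

For a converted file on disk, load_xml(location => 'file.UTF-8') would replace the string argument.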

Oct 05 '11 at 18:56