How to replace text inside an XML element in R?

I have one input XML file.

cat sample.xml

<Text> &lt;p&gt;ABC &lt;/p&gt; </Text> 

R script

    library(XML)
    doc = xmlTreeParse("sample.xml", useInternal = TRUE)
    top <- xmlRoot(doc)
    sub("&lt;", "<", top[[1]])

How can I fix the above problem?

Error message: Error in as.vector(x, "character") : cannot coerce type 'externalptr' to vector of type 'character'

Edit: The goal is to use the readHTMLTable() function on a specific node in the XML. That node contains an HTML table whose > and < characters are escaped as XML entities (&gt; and &lt;). These need to be converted back first, since readHTMLTable cannot process the escaped markup.

3 answers

If your question is how to replace a string in the contents of an XML node, the following code shows one way, using the provided sample.xml file:

    library(XML)

    ## Parse the XML file
    doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
    ## Select the nodes we want to update
    nodes <- getNodeSet(doc, "//Text")
    ## For each node, apply gsub on the content of the node
    lapply(nodes, function(n) {
      xmlValue(n) <- gsub("ABC", "foobar", xmlValue(n))
    })

This gives you:

    R> doc
    <?xml version="1.0"?>
    <Text> &lt;p&gt;foobar &lt;/p&gt; </Text>

Here you can see that "ABC" has been replaced by "foobar".

But if you try the substitution you actually want (replacing "&lt;" with "<"), it apparently does not work:

    doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
    nodes <- getNodeSet(doc, "//Text")
    lapply(nodes, function(n) {
      xmlValue(n) <- gsub("&lt;", "<", xmlValue(n))
    })

This gives you:

    R> doc
    <?xml version="1.0"?>
    <Text> &lt;p&gt;ABC &lt;/p&gt; </Text>

Why? If you are working with XML files, you should be aware that some characters, mainly <, >, and &, are reserved because they are part of the basic XML syntax. They cannot appear literally in the contents of a node, otherwise parsing would fail. They are therefore replaced by entities, which are an encoding of these characters: for example, "<" is encoded as "&lt;", "&" is encoded as "&amp;", etc.

So the content of your node contains a "<" character, which was automatically converted to its entity "&lt;". Your code tries to replace "&lt;" with "<", which R will happily do for you, but since this is the textual content of the node, the XML package immediately converts it back to "&lt;".
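The round trip described above can be observed directly. The following is a minimal sketch that parses the sample content from an inline string (an assumption made so the example is self-contained) and compares what xmlValue() returns with what saveXML() writes out:

```r
library(XML)

# Same content as sample.xml, supplied inline for a self-contained example
doc <- xmlParse("<Text> &lt;p&gt;ABC &lt;/p&gt; </Text>", asText = TRUE)
node <- xmlRoot(doc)

# The text as R sees it: entities are already decoded
value <- xmlValue(node)

# The text as it is written back out: reserved characters are escaped again
serialised <- saveXML(node)

print(value)
print(serialised)
```

So a gsub() on the node value never sees "&lt;" at all, and any "<" placed in the text content is escaped again on output.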

So if you want to convert your string "&lt;p&gt;ABC &lt;/p&gt;" into a new XML node "<p>ABC </p>", you cannot do it this way. The solution would be to parse your text string, determine the node name (here, "p"), create a new node with xmlNode(), give it the text content "ABC", and replace the string with the created node.
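A rough sketch of that idea follows. It rests on several assumptions: the escaped text holds exactly one simple element with text content, the document was parsed with internal nodes, and newXMLNode() is used in place of xmlNode() because it can attach the new element to an internal parent directly. The inline input string is also only for illustration.

```r
library(XML)

# Same content as sample.xml, supplied inline for a self-contained example
doc <- xmlParse("<Text> &lt;p&gt;ABC &lt;/p&gt; </Text>", asText = TRUE)
node <- getNodeSet(doc, "//Text")[[1]]

# xmlValue() yields the decoded markup; parse it as its own small document
fragment <- xmlParse(trimws(xmlValue(node)), asText = TRUE)
inner <- xmlRoot(fragment)

# Remove the old text child, then create a real element under <Text>
removeChildren(node, kids = xmlChildren(node))
newXMLNode(xmlName(inner), xmlValue(inner), parent = node)

result <- saveXML(node)
print(result)
```

Nested markup or attributes inside the escaped string would need more work than this, for example copying children recursively.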

Another quick and dirty way is to replace all the entities in your file first, without parsing the XML. Something like this:

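    library(XML)

    ## Read the raw file and undo the escaping as plain text
    txt <- readLines("sample.xml")
    txt <- gsub("&lt;", "<", txt)
    txt <- gsub("&gt;", ">", txt)
    ## Write the result to a new file and parse that instead
    writeLines(txt, "sample2.xml")
    doc2 <- xmlTreeParse("sample2.xml", useInternal = TRUE)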

This gives:

    R> doc2
    <?xml version="1.0"?>
    <Text> <p>ABC </p> </Text>

But this is dangerous: if there is a "real" "&lt;" entity in your file, one that stands for a literal "<" in the text rather than for markup, parsing will fail.
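To illustrate the risk, here is a small sketch with a hypothetical input in which "&lt;" stands for a literal less-than sign. The input string is invented for the example, and tryCatch() is used only to show that the parse raises an error:

```r
library(XML)

# Hypothetical input: the entity represents a literal "<" in running text
txt <- "<Text>1 &lt; 2</Text>"

# Parsed as-is, the entity is decoded without trouble
ok <- xmlValue(xmlRoot(xmlParse(txt, asText = TRUE)))

# After the blanket replacement the string is no longer well-formed XML
broken <- gsub("&lt;", "<", txt, fixed = TRUE)
parsed <- tryCatch(
  xmlParse(broken, asText = TRUE),
  error = function(e) NULL  # parsing failed
)

print(ok)
print(is.null(parsed))
```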


And now the answer to your real question:

sample.xml with an encoded table:

    <Text>
    &lt;table&gt;
    &lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;
    &lt;/table&gt;
    </Text>

Read it in:

    > library(XML)
    > doc = xmlTreeParse("sample.xml", useInternal = TRUE)
    > top <- xmlRoot(doc)

Convert to text:

    > table = xmlValue(top)
    > table
    [1] "\n<table>\n<tr><td>1</td><td>2</td></tr>\n<tr><td>2</td><td>8</td></tr>\n<tr><td>4</td><td>32</td></tr>\n</table>\n"

Now it is ready to pass to readHTMLTable. No string conversion is required:

    > readHTMLTable(table)
    $`NULL`
      V1 V2
    1  1  2
    2  2  8
    3  4 32
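Since the question asks about a specific node inside a larger document, the same steps can be combined with an XPath lookup. This is only a sketch: the surrounding <Report> and <Title> elements are invented for illustration, and htmlParse(..., asText = TRUE) is used to make it explicit that the string is HTML content rather than a file name.

```r
library(XML)

# Hypothetical document: only the <Text> node carries the escaped table
xml_text <- paste0(
  "<Report><Title>demo</Title><Text>",
  "&lt;table&gt;",
  "&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;",
  "&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;",
  "&lt;/table&gt;",
  "</Text></Report>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Pick the node by XPath; xmlValue() returns the decoded HTML markup
html <- xmlValue(getNodeSet(doc, "//Text")[[1]])

# Parse that markup as HTML and extract its tables
tables <- readHTMLTable(htmlParse(html, asText = TRUE))
first_table <- tables[[1]]
print(first_table)
```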

Howzat?


Get the node value with xmlValue and replace it. Here I am going to replace ABC with DEF:

    > top <- xmlRoot(doc)
    > top
    <Text> &lt;p&gt;ABC &lt;/p&gt; </Text>
    > xmlValue(top) = sub("ABC", "DEF", xmlValue(top))
    > top
    <Text> &lt;p&gt;DEF &lt;/p&gt; </Text>

The reason I'm not trying to replace &lt; is that these character sequences are interpreted at some point by the XML code:

    > substr(xmlValue(top), 6, 6) == "<"
    [1] TRUE

I tried various xmlTreeParse parameters and other functions of the XML package, but I could not stop xmlValue from interpreting them ...

