If your question is how to replace a string in the contents of an XML node, you can try the following code on the provided sample.xml file:
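    library(XML)
    doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
    nodes <- getNodeSet(doc, "//Text")
    # replace "ABC" with "foobar" in the text of every Text node
    lapply(nodes, function(n) {
      xmlValue(n) <- gsub("ABC", "foobar", xmlValue(n))
    })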
This gives:
    R> doc
    <?xml version="1.0"?>
    <Text>
    &lt;p&gt;foobar &lt;/p&gt;
    </Text>
Here you can see that "ABC" has been replaced by "foobar".
But if you try the substitution you actually want (replace "&lt;" with "<"), it apparently won't work:
    doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
    nodes <- getNodeSet(doc, "//Text")
    # try to turn the entity "&lt;" back into a literal "<"
    lapply(nodes, function(n) {
      xmlValue(n) <- gsub("&lt;", "<", xmlValue(n))
    })
This gives:
    R> doc
    <?xml version="1.0"?>
    <Text>
    &lt;p&gt;ABC &lt;/p&gt;
    </Text>
Why? If you are working with XML files, you should be aware that some characters, mainly <, > and &, are reserved because they are part of the basic XML syntax. They cannot appear literally in the contents of a node, otherwise parsing would fail. Instead they are replaced by entities, which are an encoding of these characters: for example, "<" is encoded as "&lt;", "&" is encoded as "&amp;", etc.
So the contents of your node contain a "<" character, which was automatically converted to its entity "&lt;". What your code tries to do is replace "&lt;" back with "<", which R will happily do for you, but since this is the textual content of the node, the XML package immediately converts it back to "&lt;".
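The round trip can be seen on a small in-memory document. This is only a minimal sketch, assuming the XML package is installed; the inline string is an assumed stand-in for sample.xml, not the original file.

```r
library(XML)

# Assumed stand-in for sample.xml: a Text node whose content is escaped markup
doc <- xmlParse("<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>", asText = TRUE)
node <- getNodeSet(doc, "//Text")[[1]]

# xmlValue() hands back the decoded text, with real "<" and ">" characters
decoded <- xmlValue(node)

# Whatever string is stored as text is escaped again when the node is serialized
xmlValue(node) <- gsub("ABC", "foobar", decoded)
serialized <- saveXML(node)

print(decoded)
print(serialized)
```

Comparing the two printed values shows the same content in its decoded and its escaped form, which is why a gsub() on the node value cannot turn the text into markup.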
So, if you want to convert your string "&lt;p&gt;ABC &lt;/p&gt;" into a new XML node "<p>ABC </p>", you cannot do it this way. The solution would be to parse your text string, determine the node name (here, "p"), create a new node with xmlNode(), give it the text content "ABC" and replace the string with the created node.
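The steps just described could look roughly like this. It is a sketch under assumptions, not a general solution: it uses newXMLNode(), the internal-node counterpart of xmlNode(), because the document is parsed with internal nodes; the inline string stands in for sample.xml; and it only handles text that holds exactly one element without attributes.

```r
library(XML)

# Assumed stand-in for sample.xml
doc <- xmlParse("<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>", asText = TRUE)

for (n in getNodeSet(doc, "//Text")) {
  # Parse the decoded text as its own small document
  fragment <- xmlRoot(xmlParse(xmlValue(n), asText = TRUE))
  tag <- xmlName(fragment)       # element name found in the text
  content <- xmlValue(fragment)  # text inside that element

  # Remove the old text child, then add a real element in its place
  removeNodes(getNodeSet(n, "./text()"))
  newXMLNode(tag, content, parent = n)
}

result <- saveXML(xmlRoot(doc))
print(result)
```

Text containing several sibling elements, attributes or nested markup would need a more careful copy of the parsed fragment.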
Another quick and dirty way to do this is to replace all the entities in your file first, before parsing the XML. Something like this:
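    # read the raw file and replace the entities as plain text
    txt <- readLines("sample.xml")
    txt <- gsub("&lt;", "<", txt)
    txt <- gsub("&gt;", ">", txt)
    # write the result to a new file and parse that one
    writeLines(txt, "sample2.xml")
    doc2 <- xmlTreeParse("sample2.xml", useInternal = TRUE)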
This gives:
    R> doc2
    <?xml version="1.0"?>
    <Text>
    <p>ABC </p>
    </Text>
But this is dangerous: if there is a "real" "&lt;" entity in your file, one that stands for a literal less-than sign in the text, the substitution produces malformed XML and parsing will fail.
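A small illustration of that failure, again only as a sketch: the input line is an assumed example, not taken from sample.xml, and the error is caught with tryCatch() so the script keeps running.

```r
library(XML)

# Assumed example: here "&lt;" is a genuine less-than sign in the text
txt <- "<Text>a &lt; b</Text>"

# Parses fine while the entity is intact
ok <- xmlParse(txt, asText = TRUE)

# The blanket substitution leaves a bare "<", which is not well-formed XML
broken <- gsub("&lt;", "<", txt)
result <- tryCatch(xmlParse(broken, asText = TRUE), error = function(e) e)

print(xmlValue(xmlRoot(ok)))
print(inherits(result, "error"))
```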