Decoding base64 encoded data from an XML document

I get several xml files with base64 encoded images that I need to decode and save as files.

An example of an unmodified (except zipped) such file can be downloaded below:

20091123-125320.zip (60KB)

However, I get errors such as "Invalid length for Base-64 char array" and "Invalid character in Base-64 string." I have marked the line in the code where the error occurs.

The file may look like this:

    <?xml version="1.0" encoding="windows-1252"?>
    <mediafiles>
      <media media-type="image">
        <media-reference mime-type="image/jpeg"/>
        <media-object encoding="base64"><![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]></media-object>
        <media.caption>What up</media.caption>
      </media>
    </mediafiles>

And the code to handle is as follows:

    var xd = new XmlDocument();
    xd.Load(filename);
    var nodes = xd.GetElementsByTagName("media");
    foreach (XmlNode node in nodes)
    {
        var mediaObjectNode = node.SelectSingleNode("media-object");
        // The line below is where the errors occur
        byte[] imageBytes = Convert.FromBase64String(mediaObjectNode.InnerText);
        // Do stuff with the byte array to save the image
    }

The XML data comes from the corporate newspaper system, so I am fairly sure the files are in order and the fault must lie in the way I process them. Maybe a problem with the encoding?

I tried writing out the contents of mediaObjectNode.InnerText, and it is the base64 encoded data, so navigating the XML document is not the problem.

I have searched everywhere I could think of, and cried, and found no solution... Help!

Edit:

Added an actual example file (and a bounty). Please note that the downloadable file uses a slightly different schema, since I simplified the example above by removing unnecessary elements...

+4
7 answers

For a first try, I did not use any programming language, just Notepad++.

I opened the XML file in it and copied the raw base64 content into a new file (without the square brackets).

Then I selected everything (Ctrl+A) and used the option Plugins - MIME Tools - Base64 Decode. This raised an error about the wrong text length (it must be a multiple of 4). So I just added two equals signs ('=') as padding at the end to get the correct length.

I tried again and it decoded successfully into "something." Just save the file as .jpg and it opens like a charm in any image viewer.
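The manual check above can also be done in code. What follows is only a minimal sketch: `JpegCheck`, `LooksLikeJpeg` and `SaveIfJpeg` are hypothetical names that do not appear in the question, and the sketch assumes the payload is a JPEG, whose data begins with the bytes FF D8 (the `/9j/` prefix in the sample decodes to FF D8 FF).

```csharp
using System;
using System.IO;

static class JpegCheck
{
    // A JPEG stream begins with the two-byte start-of-image marker FF D8.
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0xFF && data[1] == 0xD8;
    }

    // Writes the bytes to the given path only if they look like a JPEG.
    // Returns false (and writes nothing) otherwise.
    public static bool SaveIfJpeg(byte[] data, string path)
    {
        if (!LooksLikeJpeg(data)) return false;
        File.WriteAllBytes(path, data);
        return true;
    }
}
```

For instance, `JpegCheck.LooksLikeJpeg(Convert.FromBase64String("/9j/4AAQ"))` checks just the first eight characters of the sample. The marker test says nothing about whether the rest of the image is intact.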

So, I would say that something is wrong with the data you receive. It simply does not have the right number of equals signs at the end to pad the text out to a length that can be split into groups of 4 characters.

An "easy" way would be to add equals signs one at a time until decoding no longer produces an error. A better way is to count the number of characters (minus CR/LFs!) and add the needed ones in one step.

Further research

After some coding and reading up on the Convert function, the problem turns out to be incorrect padding with equals signs by the producer. Notepad++ has no problem with tons of equals signs, but Microsoft's Convert function only accepts zero, one, or two of them. So if you pad an already padded string with additional equals signs, you get an error too! To make this damn thing work, you have to strip all existing padding, calculate how many signs are needed, and add them again.

Just for the bounty, here is my code (not bulletproof, but enough for a good starting point) ;-)

    using System;
    using System.Linq;
    using System.Xml.Linq;
    using System.Xml.XPath;

    class Program
    {
        static void Main(string[] args)
        {
            var elements = XElement
                .Load("test.xml")
                .XPathSelectElements("//media/media-object[@encoding='base64']");
            foreach (XElement element in elements)
            {
                var image = AnotherDecode64(element.Value);
            }
        }

        static byte[] AnotherDecode64(string base64Encoded)
        {
            // Strip whatever padding the producer attached
            string temp = base64Encoded.TrimEnd('=');
            // Count only the characters that matter (the decoder ignores whitespace)
            int asciiChars = temp.Length - temp.Count(c => Char.IsWhiteSpace(c));
            switch (asciiChars % 4)
            {
                case 1:
                    // This would always produce an exception,
                    // regardless of what (or what not) you attach to your string!
                    // Better would be some kind of throw new Exception()
                    return new byte[0];
                case 0: asciiChars = 0; break;
                case 2: asciiChars = 2; break;
                case 3: asciiChars = 1; break;
            }
            // Re-attach the correct number of padding characters
            temp += new String('=', asciiChars);
            return Convert.FromBase64String(temp);
        }
    }
+9

The base64 string is invalid. As Oliver said, its length must be a multiple of 4 after removing whitespace. If you look at the end of the base64 data (see below), you will see that the last line is shorter than the rest.

 RRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z= 

If you delete this line, your program will work, but the resulting image will be missing a section in the lower right corner. You need to pad this line so that the total length of the string is correct. By my calculations, if you add 3 characters, it should work.

 RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRX//Z= 
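The length check described here can be automated. The sketch below is illustrative only: `Base64Length` and its two methods are hypothetical names, and it assumes that every non-whitespace character (including any '=') counts toward the multiple-of-4 rule.

```csharp
using System;
using System.Linq;

static class Base64Length
{
    // Counts the characters that matter to the length rule (everything except whitespace).
    public static int SignificantLength(string text)
    {
        return text.Count(c => !char.IsWhiteSpace(c));
    }

    // Number of characters missing to reach the next multiple of 4 (0 if already aligned).
    public static int MissingCharacters(string text)
    {
        return (4 - SignificantLength(text) % 4) % 4;
    }
}
```

Calling `Base64Length.MissingCharacters(element.Value)` on the extracted text reports the shortfall only. It cannot tell which characters were lost, or where in the data they belong.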
+1

Delete the last 2 characters repeatedly until the string decodes into an image:

    // Requires: using System; using System.Drawing; using System.IO;
    public Image Base64ToImage(string base64String)
    {
        // Convert the Base64 string to byte[], dropping 2 trailing
        // characters after each failed attempt
        byte[] imageBytes = null;
        while (imageBytes == null)
        {
            try
            {
                imageBytes = Convert.FromBase64String(base64String);
            }
            catch (FormatException)
            {
                // Rethrow once fewer than 2 characters are left to drop
                if (base64String.Length < 2) throw;
                base64String = base64String.Substring(0, base64String.Length - 2);
            }
        }

        // Convert byte[] to Image; the stream starts at position 0,
        // so no extra Write call is needed
        var ms = new MemoryStream(imageBytes);
        Image image = Image.FromStream(ms, true);
        pictureBox1.Image = image;
        return image;
    }
+1

Try using LINQ to XML:

    using System;
    using System.Xml.Linq;
    using System.Xml.XPath;

    class Program
    {
        static void Main(string[] args)
        {
            var elements = XElement
                .Load("test.xml")
                .XPathSelectElements("//media/media-object[@encoding='base64']");
            foreach (var element in elements)
            {
                byte[] image = Convert.FromBase64String(element.Value);
            }
        }
    }

UPDATE:

After loading the XML file and analyzing the value of the media-object node, it is clear that this is not a valid base64 string:

    string value = "PUT HERE THE BASE64 STRING FROM THE XML WITHOUT THE NEW LINES";
    byte[] image = Convert.FromBase64String(value);

throws a System.FormatException saying that the length is not valid for a Base-64 string. Even when I remove the \n characters from the string, it does not work:

    var elements = XElement
        .Load("20091123-125320.xml")
        .XPathSelectElements("//media/media-object[@encoding='base64']");
    foreach (var element in elements)
    {
        string value = element.Value.Replace("\n", "");
        byte[] image = Convert.FromBase64String(value);
    }

also throws a System.FormatException .

0

I also had a problem decoding a Base64 encoded string from an XML document (specifically, an Office OpenXML package document).

It turned out that the string carried an additional layer of encoding: HTML encoding. Performing the HTML decoding first and the Base64 decoding afterwards does the trick:

    private static byte[] DecodeHtmlBase64String(string value)
    {
        return System.Convert.FromBase64String(System.Net.WebUtility.HtmlDecode(value));
    }

Posting this just in case someone else is facing the same problem.

0

Well, it is all very simple. CDATA is a node in its own right, so mediaObjectNode.InnerText actually produces <![CDATA[/9j/4AAQ[...snip...]P4Vm9zOR//Z=]]>, which is obviously not valid Base64 data.

To get it right, use mediaObjectNode.ChildNodes[0].Value and pass that value to Convert.FromBase64String.
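To check this suggestion against a real file, the two accessors can be compared side by side. This is a minimal sketch with hypothetical names (`CDataProbe`, `FirstChildValue`, `InnerText`), and it assumes the document contains at least one `media-object` element. When the element's only child is a CDATA section, both methods are expected to return the same text, without the `<![CDATA[` and `]]>` markers.

```csharp
using System;
using System.Xml;

static class CDataProbe
{
    // Loads the XML and returns the first <media-object> element found.
    static XmlNode FindMediaObject(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode("//media-object");
    }

    // The accessor suggested above: the value of the element's first child node.
    public static string FirstChildValue(string xml)
    {
        return FindMediaObject(xml).ChildNodes[0].Value;
    }

    // The accessor used in the question: the concatenated text of all child nodes.
    public static string InnerText(string xml)
    {
        return FindMediaObject(xml).InnerText;
    }
}
```

If the two results differ for your file, the element has more than one child node (for example, whitespace text around the CDATA section), which is worth knowing before decoding.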

-1

Is the character encoding correct? The error sounds like a problem that causes invalid characters to appear in the array. Try copying the text and decoding it manually to make sure the data is valid.

(For the record, windows-1252 is not exactly the same as iso-8859-1, so this could be the cause of the problem, barring other sources of corruption.)
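One way to test for stray characters without decoding by hand is to scan the text directly. The sketch below is an illustration with hypothetical names (`Base64Alphabet`, `InvalidPositions`). It assumes the standard base64 alphabet plus '=', and treats space, tab, CR and LF as ignorable, which matches the whitespace that Convert.FromBase64String is documented to skip.

```csharp
using System;
using System.Collections.Generic;

static class Base64Alphabet
{
    const string Allowed =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

    // Returns the zero-based positions of characters that are neither
    // ignorable whitespace nor part of the standard base64 alphabet.
    public static List<int> InvalidPositions(string text)
    {
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool ignorable = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (!ignorable && Allowed.IndexOf(c) < 0)
            {
                positions.Add(i);
            }
        }
        return positions;
    }
}
```

An empty result means every character is legal, so any remaining error would have to come from the length or the padding rather than from the character set.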

-2