Split an XML-like string into tokens in C# or SQL

I want to split an XML-like string into tokens in C# or SQL. For example, the input line looks like this:

<entry><AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, "<TITLE>Reducing Communication </TITLE>",<DATE>1995</DATE>. </entry> 

and I want this output:

    C              AUTHOR
    .              AUTHOR
    Qiao           AUTHOR
    and
    R              AUTHOR
    .              AUTHOR
    Melhem         AUTHOR
    ,
    "
    Reducing       TITLE
    Communication  TITLE
    "
    ,
    1995           DATE
    .
2 answers

Here is a first attempt at the problem, under the following assumptions:
1. The XML string is valid (i.e. there are no stray characters between the tags), like this:

 string xml = @"<ENTRY><AUTHOR>C. Qiao</AUTHOR> <AUTHOR>R.Melhem</AUTHOR> <TITLE>Reducing Communication </TITLE> <DATE>1995</DATE> </ENTRY>"; 

2. Tokens are separated by a space ' '.

    // Requires: using System; using System.Xml.Linq;
    string xml = @"<ENTRY><AUTHOR>C. Qiao</AUTHOR> <AUTHOR>R.Melhem</AUTHOR> <TITLE>Reducing Communication </TITLE> <DATE>1995</DATE> </ENTRY>";
    XElement doc = XElement.Parse(xml);

    // Visit each child element of <ENTRY> and split its text on spaces
    foreach (XElement element in doc.Elements())
    {
        var values = element.Value.Split(' ');
        foreach (string value in values)
        {
            Console.WriteLine(element.Name + " " + value);
        }
    }

This prints:

    AUTHOR C.
    AUTHOR Qiao
    AUTHOR R.Melhem
    TITLE Reducing
    TITLE Communication
    TITLE
    DATE 1995

EDIT:

To split on both "." and a space, it is best to use a regular expression, like this:

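    // Requires: using System.Linq; using System.Text.RegularExpressions;
    // The capturing group keeps the separators in the result;
    // empty and whitespace-only pieces are filtered out.
    var values = Regex.Split(element.Value, @"(\.| )");
    foreach (string value in values.Where(x => !String.IsNullOrWhiteSpace(x)))
    {
        Console.WriteLine(element.Name + " " + value);
    }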

You can add more separators if you want. The example above prints:

    AUTHOR C
    AUTHOR .
    AUTHOR Qiao
    AUTHOR R
    AUTHOR .
    AUTHOR Melhem
    TITLE Reducing
    TITLE Communication
    DATE 1995
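
The "." shows up as a token of its own because the separator pattern is wrapped in a capturing group: `Regex.Split` in .NET includes captured text in the returned array. A minimal sketch of the difference follows; the class name `SplitDemo` and both method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the answer's code.
static class SplitDemo
{
    // Pattern wrapped in a capturing group: the matched separators are
    // returned too, so "." survives as a token. Whitespace-only and empty
    // pieces are filtered out afterwards.
    public static string[] Tokens(string text) =>
        Regex.Split(text, @"(\.| )")
             .Where(t => !string.IsNullOrWhiteSpace(t))
             .ToArray();

    // Same separators without the capturing group: they are discarded.
    public static string[] TokensWithoutDelimiters(string text) =>
        Regex.Split(text, @"\.| ")
             .Where(t => !string.IsNullOrWhiteSpace(t))
             .ToArray();
}
```

With these definitions, `SplitDemo.Tokens("C. Qiao")` yields `C`, `.`, `Qiao`, while `SplitDemo.TokensWithoutDelimiters("C. Qiao")` yields only `C` and `Qiao`.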

Edit 2:
And here is an example that works with your source string. It is most likely not the best approach, since it does not keep the tokens in their original order, but it should be pretty close:

    // Requires: using System; using System.Linq;
    //           using System.Text.RegularExpressions; using System.Xml.Linq;
    string xml = @" <entry> <AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, ""<TITLE>Reducing Communication </TITLE>"" ,<DATE>1995</DATE>. </entry>";

    // Parse the XML into an XDocument
    XDocument doc = XDocument.Parse(xml);

    // Get the first element (we only have one)
    XElement element = doc.Descendants().FirstOrDefault();

    // Keep a copy of the element so its child elements can be processed later
    XElement copyElement = new XElement(element);

    // Remove all child elements from the root, leaving only its own text
    element.Elements().Remove();

    // Split the remaining text on the specified separators
    var values = Regex.Split(element.Value, @"(\.| |\,|\"")");
    foreach (string value in values.Where(x => !String.IsNullOrWhiteSpace(x)))
    {
        Console.WriteLine(value);
    }

    // Split the text of each child element the same way
    foreach (XElement elem in copyElement.Elements())
    {
        var val = Regex.Split(elem.Value, @"(\.| |\,|\"")");
        foreach (string value in val.Where(x => !String.IsNullOrWhiteSpace(x)))
        {
            Console.WriteLine(value + " " + elem.Name);
        }
    }

    // You can try to play with DescendantsAndSelf to see
    // if you can do it in a single pass with the order preserved.
    //foreach (XElement elem in element.DescendantsAndSelf())
    //{
    //    //....
    //}

This will print the following:

    and
    ,
    "
    "
    ,
    .
    C AUTHOR
    . AUTHOR
    Qiao AUTHOR
    R AUTHOR
    . AUTHOR
    Melhem AUTHOR
    Reducing TITLE
    Communication TITLE
    1995 DATE

Edit: I just noticed that I misread the question. I copied the formatted XML from the first answer rather than from the question, so I did not notice the mixed-content nodes inside the string. This makes the process easier. The solution might look like this:

    using System;
    using System.Linq;
    using System.Text;
    using System.Xml;
    using System.Xml.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            var xml = @"<entry><AUTHOR>C. Qiao</AUTHOR> and <AUTHOR>R.Melhem</AUTHOR>, ""<TITLE>Reducing Communication </TITLE>"",<DATE>1995</DATE>. </entry>";
            var elem = XElement.Parse(xml);

            // Tokenizes one text node; each token is tagged with the name of
            // the parent element, unless the parent is the root <entry>.
            var tokFunc = new Func<XNode, string>(node =>
            {
                // Pad "." and "," with spaces so they become separate tokens
                var s = node.ToString().Replace(".", " . ").Replace(",", " , ");
                var nodeName = node.Parent != null
                               && node.Parent.NodeType == XmlNodeType.Element
                               && node.Parent.Name.LocalName.ToUpper() != "ENTRY"
                    ? node.Parent.Name.LocalName
                    : "";
                var sb = new StringBuilder();
                s.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries).ToList()
                    .ForEach(e => sb.AppendFormat("{0}\t{1}\n", e, nodeName));
                return sb.ToString();
            });

            // Walk all text nodes in document order, so token order is preserved
            elem.DescendantNodes().Where(e => e.NodeType == XmlNodeType.Text).ToList()
                .ForEach(c => Console.Write(tokFunc(c)));
        }
    }

Which produces the desired result:

    C              AUTHOR
    .              AUTHOR
    Qiao           AUTHOR
    and
    R              AUTHOR
    .              AUTHOR
    Melhem         AUTHOR
    ,
    "
    Reducing       TITLE
    Communication  TITLE
    "
    ,
    1995           DATE
    .