How can I split text into lines based on a regular expression?

I have snippets of text that I would like to split into logical lines. The problem is that the text is already formatted (hard-wrapped), so I can't simply split it on newlines as I usually do:

 _text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToArray();

Here is a sample text:

 adj 1: around the middle of a scale of evaluation of physical
        measures; "an orange of average size"; "intermediate
        capacity"; "a plane with intermediate range"; "medium
        bombers" [syn: {average}, {intermediate}]
 2: (of meat) cooked until there is just a little pink meat
    inside
 n 1: a means or instrumentality for storing or communicating
      information
 2: the surrounding environment; "fish require an aqueous
    medium"
 3: an intervening substance through which signals can travel as
    a means for communication
 4: (bacteriology) a nutrient substance (solid or liquid) that
    is used to cultivate micro-organisms [syn: {culture medium}]
 5: an intervening substance through which something is
    achieved; "the dissolving medium is called a solvent"
 6: a liquid with which pigment is mixed by a painter
 7: (biology) a substance in which specimens are preserved or
    displayed
 8: a state that is intermediate between extremes; a middle
    position; "a happy medium"

The format is always the same:

  • an optional word of 1-3 letters
  • a number from 1 to 10
  • a colon
  • a space
  • text that may wrap onto several lines.

So, in this case, a new line should start at something like: an optional word of 1-3 characters, followed by a number of 1-2 digits, and then a colon.

Can someone advise me on how to do this with Split or some other method?

Update: here is what I tried:

    public parser(string text)
    {
        //_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // .ToArray();

        string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Success)
            {
                string entry = string.Join(Environment.NewLine,
                    m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
                // ...
            }
        }
    }

:

"medium\n adj 1: \n ," \ ", \" \n \ "; \" \ "; \" \n \ "[: {}, {}]\n 2: (), , \n \nn 1: \n \n 2: ; \" \n \ "\n 3: , \na \n 4: ( ) ( ), \n [: { ]]\n 5: , - ,\ \" \n 6: , \n 7: () , spec imens \n \n 8: , ; \n ;\ " \" \n 9: -, ; " " [: {]]\n 10: , \n [syn: {mass medium}]\n 11: , ; \in \[syn: {metier}]\n [: {media} (pl)]\n "


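A Regex can do this. Add a negative lookahead so that a continuation line cannot swallow the start of the next entry: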

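public parser(string text)
{
    // First line of an entry: a space, an optional 1-3 letter word, a number
    // (one digit, or two digits starting with 1), a colon and a space.
    // Continuation lines: any following indented line that does NOT itself
    // look like the start of an entry (the negative lookahead checks this).
    string pattern = @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    // Regex.Matches only returns successful matches, so no Success check is needed.
    foreach (Match m in Regex.Matches(text, pattern))
    {
        // Each capture of "line" is one physical line; join them into one logical line.
        entries.Add(string.Join(" ",
            m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value)));
    }
    _text = entries.ToArray();
}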
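Since the question also asks about Split, here is a minimal sketch of the same idea using Regex.Split: break only at newlines that are followed by the start of a new entry. The names EntrySplitter and SplitEntries are made up for illustration, and the pattern assumes entries always begin with optional indentation, an optional 1-3 letter word, a 1-2 digit number, and a colon plus space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntrySplitter   // hypothetical helper, not from the original post
{
    // Split only at newlines that are followed by the start of a new entry:
    // optional indentation, an optional 1-3 letter word, a 1-2 digit number, ": ".
    // The group is non-capturing because Regex.Split would otherwise
    // include the captured text in the result array.
    private static readonly Regex EntryBreak =
        new Regex(@"\r?\n(?=[ \t]*(?:\w{1,3} )?\d{1,2}: )");

    public static string[] SplitEntries(string text)
    {
        return EntryBreak.Split(text)
            // Unwrap each entry: collapse line breaks and indentation to single spaces.
            .Select(e => Regex.Replace(e, @"\s+", " ").Trim())
            .Where(e => e.Length > 0)
            .ToArray();
    }
}
```

If the text begins with a headword line (such as "medium") before the first entry, that line comes back as its own first element. A wrapped line that happens to begin with something like "ab 12: " would be treated as a new entry, so this is only as reliable as the format is regular.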

// Alternative: read the file line by line and parse each entry into its parts.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace ConsoleApplication106
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            string inputLine = "";
            List<Data> data = new List<Data>();
            // Optional word prefix, optional whitespace, a number, a colon, then the text.
            // Anchored at the start so that wrapped lines do not match by accident.
            string pattern = @"^(?'prefix'\w*)?\s*?(?'index'\d+):(?'text'.*)";
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    if (inputLine.Length == 0) continue;
                    Match match = Regex.Match(inputLine, pattern);
                    if (match.Success)
                    {
                        // The line starts a new entry.
                        Data newData = new Data();
                        data.Add(newData);
                        newData.prefix = match.Groups["prefix"].Value;
                        newData.index = int.Parse(match.Groups["index"].Value);
                        newData.text = match.Groups["text"].Value.Trim();
                    }
                    else if (data.Count > 0)
                    {
                        // A wrapped line: append it to the entry it continues.
                        data[data.Count - 1].text += " " + inputLine;
                    }
                }
            }
        }
    }
    public class Data
    {
        public string prefix { get; set; }
        public int index { get; set; }
        public string text { get; set; }
    }
}
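The question works with a string in memory rather than a file, so the same line-by-line idea can be sketched with a StringReader. This is only an illustration under the format described in the question; LineGrouper and Group are hypothetical names, and the entry-start pattern is an assumption based on the sample text.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineGrouper   // hypothetical helper, not from the original post
{
    // A trimmed line starts a new entry if it begins with an optional
    // 1-3 letter word, a 1-2 digit number, and a colon.
    private static readonly Regex EntryStart =
        new Regex(@"^(?:\w{1,3} )?\d{1,2}:(\s|$)");

    public static List<string> Group(string text)
    {
        var entries = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                line = line.Trim();
                if (line.Length == 0) continue;          // skip blank lines

                if (EntryStart.IsMatch(line) || entries.Count == 0)
                    entries.Add(line);                   // start a new entry
                else
                    entries[entries.Count - 1] += " " + line; // wrapped line: append
            }
        }
        return entries;
    }
}
```

Calling LineGrouper.Group(text).ToArray() (with System.Linq imported) would give the same kind of string[] that the original Split call produced, with each wrapped entry joined into one line.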