Split multilingual string using regex for monolingual tokens

I want to split a multilingual string into monolingual tokens using a regex.

For example, given this English-Arabic line:

'his name was محمد, and his mother was آمنه.'

The result should be as follows:

  • 'his name was'
  • 'محمد,'
  • 'and his mother was'
  • 'آمنه.'
2 answers

This isn't perfect (you'll definitely need to try it on some real-world examples to see whether it works), but it's a start:

// requires: using System.Text.RegularExpressions;
string[] splitArray = Regex.Split(subjectString,
    @"(?<=\p{IsArabic})    # (if the previous character is Arabic)
    [\s\p{P}]+             # split on whitespace/punctuation
    (?=\p{IsBasicLatin})   # (if the following character is Latin)
    |                      # or
    (?<=\p{IsBasicLatin})  # vice versa
    [\s\p{P}]+
    (?=\p{IsArabic})",
    RegexOptions.IgnorePatternWhitespace);

This splits on a run of whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
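To see what this does with the sample line from the question, here is a minimal console sketch. The class name MixedScriptSplitter and its Split helper are made up for illustration, and the pattern uses [\s\p{P}]+ in both branches. Because Regex.Split discards whatever the pattern matches, the comma after محمد is dropped, so this should give 'his name was', 'محمد', 'and his mother was' and 'آمنه.' rather than exactly the tokens asked for.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MixedScriptSplitter
{
    // Hypothetical helper: splits wherever whitespace/punctuation sits
    // between an Arabic-block character and a Basic Latin character.
    private static readonly Regex Boundary = new Regex(
        @"(?<=\p{IsArabic})    # previous character is in the Arabic block
        [\s\p{P}]+             # run of whitespace/punctuation to split on
        (?=\p{IsBasicLatin})   # next character is in the Basic Latin block
        |                      # or the other way round
        (?<=\p{IsBasicLatin})
        [\s\p{P}]+
        (?=\p{IsArabic})",
        RegexOptions.IgnorePatternWhitespace);

    // The matched separators are not part of the returned tokens.
    public static string[] Split(string text) => Boundary.Split(text);

    public static void Main()
    {
        string line = "his name was محمد, and his mother was آمنه.";
        foreach (string token in Split(line))
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

The final '.' stays attached to آمنه because nothing follows it, so neither lookahead can succeed there. If the separators need to be kept, one option would be to wrap the [\s\p{P}]+ parts in capturing groups, since Regex.Split then includes the captured text in the result.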

using System.Linq;
using System.Text.RegularExpressions;

// Match each run of Latin words, plus any surrounding whitespace, parentheses, or colons.
var regex = new Regex(@"([\s\(\:]*[a-zA-Z]+[\s\)\:]*)+");
var matches = regex.Matches(input).Cast<Match>().ToList();
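A small sketch of how this second snippet behaves on the sample line from the question. LatinRunFinder and Find are hypothetical names, and the Trim call is an addition, not part of the snippet. The pattern only matches the Latin runs, so on the sample it should return 'his name was' and 'and his mother was'; the Arabic tokens would have to be recovered from the gaps between the matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LatinRunFinder
{
    // Same pattern as in the snippet above: runs of ASCII letters together
    // with surrounding whitespace, parentheses, and colons.
    private static readonly Regex LatinRun =
        new Regex(@"([\s\(\:]*[a-zA-Z]+[\s\)\:]*)+");

    // Returns the Latin runs only, trimmed of surrounding whitespace.
    public static List<string> Find(string input) =>
        LatinRun.Matches(input)
                .Cast<Match>()
                .Select(m => m.Value.Trim())
                .ToList();

    public static void Main()
    {
        string line = "his name was محمد, and his mother was آمنه.";
        foreach (string run in Find(line))
        {
            Console.WriteLine("[" + run + "]");
        }
    }
}
```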