Regex to match Egyptian hieroglyphs

Question

I want to know the regex to match Egyptian hieroglyphs. I am completely unfamiliar with them and need your help.

I cannot post the characters themselves, because Stack Overflow does not seem to recognize them.

So, can anyone tell me the Unicode range for these characters?

+67

regex unicode internationalization

user4628064 Mar 06 '15 at 9:59

2 answers

georg · Answer 1 · 2015-03-06 10:17

TL;DR: \p{Egyptian_Hieroglyphs}

Javascript

Egyptian hieroglyphs belong to an astral plane, which means each character needs more than 16 bits to encode. JavaScript, as of ES5, does not support astral planes (more on this), so you have to use surrogate pairs. The first character is

 U+13000 = d80c dc00

the last one is

 U+1342E = d80d dc2e

which gives

 re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g
 t = document.getElementById("pyramid").innerHTML
 document.write("<h1>Found</h1>" + t.match(re))

 <div id="pyramid"> some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮 </div>
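The surrogate values quoted above follow from the standard UTF-16 encoding rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal Python sketch of that arithmetic (the helper name `to_surrogate_pair` is made up for this illustration and is not part of any library):

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for an astral code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


first = to_surrogate_pair(0x13000)
last = to_surrogate_pair(0x1342E)
print("%04x %04x" % first)  # d80c dc00
print("%04x %04x" % last)   # d80d dc2e
```

Because the two ends of the range have different high surrogates (d80c and d80d), the regex needs one alternative per high surrogate.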

[Image: the output as rendered with the Noto Sans Egyptian Hieroglyphs font]

Other languages

On platforms that support UCS-4, you can use the Egyptian code points 13000 to 1342F directly, but the syntax differs from system to system. For example, in Python (3.3 and up), this would be [\U00013000-\U0001342E]:

 >>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
 >>> s
 'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
 >>> import re
 >>> re.findall('[\U00013000-\U0001342E]', s)
 ['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']

Finally, if your regex engine supports Unicode properties, you can (and should) use them instead of hard-coded ranges. For example, in PHP/PCRE:

 $str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮";
 preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
 print_r($m);

prints

 [0] => Array
     (
         [0] => 𓀀
         [1] => 𓀁
         [2] => 𓐬
         [3] => 𓐭
         [4] => 𓐮
     )

nhahtdh · Answer 2 · 2015-03-07 18:29

Unicode encodes Egyptian hieroglyphs in the range U+13000 to U+1342F (outside the Basic Multilingual Plane).

In this case, there are two ways to write a regular expression:

1. By specifying the character range U+13000 to U+1342F.
   While specifying a character range for characters in the BMP is as easy as [a-z], depending on the language support, it may not be so easy for characters in the astral planes.
2. By specifying the Unicode block for Egyptian hieroglyphs.
   Since we are matching any character in the Egyptian Hieroglyphs block, this is the preferred way to write the regex where support is available.
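As a rough illustration of the first approach, here is a Python 3 sketch that matches runs of characters in the whole block by code point range. The sample text and the name `HIEROGLYPH_RUN` are invented for the example. The standard `re` module has no `\p{...}` support, so the second approach is not shown here; the third-party `regex` package advertises Unicode property support, but that is not exercised in this sketch.

```python
import re

# Approach 1: an explicit code point range covering the whole block
# (U+13000 to U+1342F, the range stated in this answer).
HIEROGLYPH_RUN = re.compile("[\U00013000-\U0001342F]+")

# Hypothetical sample text: two adjacent hieroglyphs, then a single one.
text = "some \U00013000\U00013001 really old \U0001342C stuff"

runs = HIEROGLYPH_RUN.findall(text)
print(len(runs))  # 2
```

The range is written with `\U` escapes rather than literal characters so the intent stays readable even where the glyphs do not render.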

Java

(Currently, I do not know how other implementations of the Java Class Library handle astral plane characters in the Pattern class.)

Sun/Oracle implementation

I'm not sure it makes sense to talk about matching characters in the astral planes in Java 1.4, since support for characters outside the BMP was only added in Java 5, by retrofitting the existing String implementation (which uses UCS-2 for its internal string representation) with code-point-aware methods.

Since Java continues to allow lone surrogates (surrogates that cannot form a pair with another surrogate) to be specified in a String, the result is a mess: surrogates are not real characters, and lone surrogates are invalid in UTF-16.

The Pattern class saw a major overhaul from Java 1.4.x to Java 5, as the class was rewritten to support matching Unicode characters in the astral planes: the pattern string is converted to an array of code points before it is parsed, and the input string is traversed using the code-point-aware methods in the String class.

You can read more about the madness in Java regex in this answer by tchrist.

I have written a detailed explanation of how to match a range of characters that includes astral plane characters in this answer, so I am only going to include the code here. That answer also contains some counter-examples of incorrect attempts to write a regex matching astral plane characters.

Java 5 (and above)

 "[\uD80C\uDC00-\uD80D\uDC2F]"

Java 7 (and higher)

 "[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
 "[\\x{13000}-\\x{1342F}]"

Since we are matching any code point that belongs to the Unicode block, it can also be written as:

 "\\p{InEgyptian_Hieroglyphs}"
 "\\p{InEgyptian Hieroglyphs}"
 "\\p{InEgyptianHieroglyphs}"
 "\\p{block=EgyptianHieroglyphs}"
 "\\p{blk=Egyptian Hieroglyphs}"

Java has supported the \p syntax for Unicode blocks since 1.4, but support for the Egyptian Hieroglyphs block was only added in Java 7.

PCRE (used in PHP)

A PHP example is already given in georg's answer:

 '~\p{Egyptian_Hieroglyphs}~u'

Note that the u flag is required if you want to match by code points instead of by code units.

I am not sure whether there is a better post on Stack Overflow, but I have written some explanation of the effect of the u flag (UTF mode) in this answer of mine.
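The difference between matching by code point and matching on the underlying encoded units can be seen with a small Python analogy. This is not PCRE itself: it only contrasts a pattern applied to text with the same pattern applied to the UTF-8 bytes of that text, which is assumed here to be roughly the situation of a byte-oriented engine running without UTF mode.

```python
import re

glyph = "\U00013000"  # first code point of the Egyptian Hieroglyphs block

# Matching on text: '.' consumes the whole character (one code point).
as_code_points = re.findall(".", glyph)

# Matching on the UTF-8 bytes: '.' consumes a single byte at a time.
as_code_units = re.findall(b".", glyph.encode("utf-8"))

print(len(as_code_points))  # 1
print(len(as_code_units))   # 4
```

A character class or property test can only see the hieroglyph as one unit in the first case; in the second it sees four unrelated bytes.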

It should be noted that Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version no earlier than PCRE 7.90).

Alternatively, you can specify a range of characters with the syntax \x{h...hh} :

 '~[\x{13000}-\x{1342F}]~u'

Again, note the mandatory u flag.

The \x{h...hh} syntax has been supported since at least PCRE 4.50.

JavaScript (ECMAScript)

ES5

The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex below is slightly modified to cover the entire block, including the reserved unassigned code point.

 /(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/

The solution above demonstrates the technique for matching a range of characters in the astral planes, and also the limitations of JavaScript RegExp.

JavaScript suffers from the same string representation problem as Java. While Java did fix the Pattern class in Java 5 so that it works with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
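The surrogate-pair pattern above can be derived mechanically from the two ends of the range. The following Python sketch generates the pattern source for an arbitrary astral range; the function names (`surrogate_pair`, `esc`, `es5_astral_range`) are hypothetical helpers written for this illustration, not part of any library, and the sketch does not handle ranges that start in the BMP.

```python
def surrogate_pair(code_point):
    """Split an astral code point into its (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def esc(unit):
    """Format one 16-bit code unit as a JavaScript-style hex escape."""
    return "\\u{:04X}".format(unit)


def es5_astral_range(start, end):
    """Build an ES5-style pattern source for astral code points start..end."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an astral range with start <= end")
    hi_s, lo_s = surrogate_pair(start)
    hi_e, lo_e = surrogate_pair(end)
    if hi_s == hi_e:
        # Both ends share a high surrogate: one alternative is enough.
        parts = ["{}[{}-{}]".format(esc(hi_s), esc(lo_s), esc(lo_e))]
    else:
        # First high surrogate: from the start's low surrogate up to DFFF.
        parts = ["{}[{}-{}]".format(esc(hi_s), esc(lo_s), esc(0xDFFF))]
        if hi_e - hi_s > 1:
            # Any high surrogates strictly in between take every low surrogate.
            parts.append("[{}-{}][{}-{}]".format(
                esc(hi_s + 1), esc(hi_e - 1), esc(0xDC00), esc(0xDFFF)))
        # Last high surrogate: from DC00 up to the end's low surrogate.
        parts.append("{}[{}-{}]".format(esc(hi_e), esc(0xDC00), esc(lo_e)))
    return "(?:" + "|".join(parts) + ")"


print(es5_astral_range(0x13000, 0x1342F))
# (?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])
```

For the Egyptian Hieroglyphs block this reproduces the hand-written pattern shown above; the output is only a pattern source string and still has to be placed in a RegExp literal or constructor on the JavaScript side.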

ES6

This is about to change. If all goes well, support for code point matching will likely be added in ECMAScript 6, made available via the u flag so as not to break existing implementations written for earlier versions of ECMAScript.

Check the Support section from the second link above for the list of browsers providing experimental support for ES6 RegExp.

With the introduction of the syntax \u{h...hh} in ES6, the range of characters can be rewritten similarly to Java 7:

 /[\u{13000}-\u{1342F}]/u

Or you can specify the characters directly in the RegExp literal, though the intent is not as clear-cut as with [a-z]:

 /[𓀀-𓐯]/u

Note the u modifier in both regular expressions above.

Still stuck with ES5? Don't worry, you can transpile an ES6 Unicode RegExp to an ES5 RegExp with regexpu.
