Node.js Emoji Parsing

I am trying to parse an input string to determine if it contains any non-emojis.

I went through this wonderful article from Mathias, and I use both the native punycode module for encoding/decoding and regenerate for generating regular expressions. I also use EmojiData for my emoji dictionary.

With all that said, some emoji continue to be annoying little buggers and refuse to comply. For some emoji, I keep getting two code points instead of one.

    // Example of a single code point:
    console.log(punycode.ucs2.decode('💩'));
    >> [ 128169 ]

    // Example of a paired code point:
    console.log(punycode.ucs2.decode('⌛️'));
    >> [ 8987, 65039 ]

Mathias addresses this in his article (and gives an example of punycode handling it), but even using his example, I get the wrong answer:

    function countSymbols(string) {
      return punycode.ucs2.decode(string).length;
    }

    console.log(countSymbols('💩'));
    >> 1

    console.log(countSymbols('⌛️'));
    >> 2

What is the best way to determine whether a string consists entirely of emoji? This is for a proof of concept, so the solution can be as brute-force as necessary.

--- UPDATE ---

A bit more context on my annoying emoji above.

They are visually identical but are actually different Unicode sequences (the second is the one from the example above):

    ⌛ // \u231b
    ⌛️ // \u231b\ufe0f

The first works fine, the second does not. Unfortunately, the second version is what iOS seems to be using (if you copy and paste from iMessage, you get the second, and the same when you get text from Twilio).

javascript unicode punycode emoji
2 answers

U+FE0F is not a combining mark; it is a variation selector that controls how the glyph is displayed (see this answer). Removing such selectors can change the appearance of a character, for example: U+231B + U+FE0E (⌛︎).
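To make the effect of the variation selector visible, here is a minimal sketch that lists the code points of a string and optionally drops U+FE0E and U+FE0F before counting. The helper names `codePoints` and `stripVariationSelectors` are made up for this illustration, and it uses the built-in `Array.from` and `codePointAt` rather than punycode. Because removing a selector can change how the character is rendered, the stripped string is probably best used only for comparison and never for display.

```javascript
// Hypothetical helper: list the code points of a string.
// Array.from iterates by code point, so surrogate pairs stay together.
function codePoints(str) {
  return Array.from(str, function (symbol) {
    return symbol.codePointAt(0);
  });
}

// Hypothetical helper: drop the text (U+FE0E) and emoji (U+FE0F)
// variation selectors. Use the result for matching only, not for display.
function stripVariationSelectors(str) {
  return str.replace(/[\uFE0E\uFE0F]/g, '');
}

const hourglassWithSelector = '\u231B\uFE0F';

console.log(codePoints(hourglassWithSelector));
console.log(codePoints(stripVariationSelectors(hourglassWithSelector)));
```

The first call reports two code points (the hourglass plus the selector) and the second reports one, which matches the behavior described in the question.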

In addition, emoji sequences can be made up of several code points. For example, U+0032 (2) by itself is not an emoji, but U+0032 + U+20E3 (2⃣) or U+0032 + U+20E3 + U+FE0F (2⃣️) is, whereas U+0041 + U+20E3 (A⃣) is not. A complete list of emoji sequences is maintained by the Unicode Consortium in the emoji-data.txt file (the emoji-data-js library seems to have this information).

To check whether a string contains emoji, you need to check whether any single character is listed in emoji-data.txt, or whether any substring matches one of the sequences in it.
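The lookup described above can be sketched as a brute-force scan. This is only an illustration under stated assumptions: `SAMPLE_EMOJI` is a tiny hand-picked table standing in for the real list (which would be generated from the Unicode data file or taken from a library), and `isAllEmoji` is a hypothetical name. The scan always takes the longest sequence that matches at the current position and does not backtrack.

```javascript
// Hypothetical sample table; a real one would be built from the Unicode
// emoji data rather than typed by hand.
const SAMPLE_EMOJI = [
  '\u{1F4A9}',     // pile of poo
  '\u231B',        // hourglass
  '\u231B\uFE0F',  // hourglass followed by the emoji variation selector
  '2\u20E3',       // digit two + combining enclosing keycap
  '2\u20E3\uFE0F'  // the same keycap with a trailing variation selector
];

// Longest entries first, so a full sequence wins over its own prefix.
const BY_LENGTH = SAMPLE_EMOJI.slice().sort(function (a, b) {
  return b.length - a.length;
});

// Returns true only if the whole input can be consumed by table entries.
function isAllEmoji(input) {
  if (input.length === 0) {
    return false;
  }
  let position = 0;
  while (position < input.length) {
    const match = BY_LENGTH.find(function (sequence) {
      return input.startsWith(sequence, position);
    });
    if (match === undefined) {
      return false; // something here is not in the table
    }
    position += match.length;
  }
  return true;
}

console.log(isAllEmoji('\u{1F4A9}\u231B\uFE0F')); // both entries are in the table
console.log(isAllEmoji('A\u20E3'));               // "A" is not in the table
```

With this sample table the first call returns `true` and the second returns `false`. Listing both the bare and the selector-suffixed forms is one way to accept the iOS variant from the question without stripping anything from the input.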


If you know which non-emoji characters you expect to encounter, you can use a little lodash magic through its toArray or split functions, which are emoji-aware. For example, if you want to see whether a string contains alphanumeric characters, you can write a function like this:

    // Assumes lodash is available as `_`.
    function containsAlphaNumeric(string) {
      // toArray splits the string into symbols, keeping each emoji whole.
      return _(string).toArray().filter(function (char) {
        return char.match(/[a-zA-Z0-9]/);
      }).value().length > 0;
    }