Efficiently list all characters in a given Unicode category

Question

Often you need to list all the characters in a given Unicode category.

You can build this list by iterating over all Unicode code points and testing each one for the desired category (Python 3):

 [c for c in map(chr, range(0x110000)) if unicodedata.category(c) in ('Ll',)]

or by using regular expressions:

 re.findall(r'\s', ''.join(map(chr, range(0x110000))))

But these methods are slow. Is there a way to look up the list of characters in a category without having to iterate over all code points?

A related question for Perl: How to get a list of all Unicode characters that have a given property?

python unicode character-properties

Mechanical snail Jan 9 '13 at 20:30

1 answer

Martijn Pieters · Answer 1 · 2013-01-09T20:38:37+0000

If you need to do this often, it's easy enough to create a reusable map for yourself:

 import sys
 import unicodedata
 from collections import defaultdict

 unicode_category = defaultdict(list)
 for c in map(chr, range(sys.maxunicode + 1)):
     unicode_category[unicodedata.category(c)].append(c)

From there on, use this map to translate a category back into the series of characters that belong to it:

 alphabetic = unicode_category['Ll']

If this is too costly at start-up time, consider dumping that structure to a file; loading the mapping from a JSON file or another format that parses quickly into a dict should not be too painful.
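As a rough sketch of the dump-to-file idea, the mapping could be cached as JSON along these lines. The function names `build_category_map` and `load_category_map` and the cache path are made up for illustration. The sketch stores integer code points rather than the characters themselves, on the assumption that this is the safest way to round-trip lone surrogates (category `Cs`) through JSON.

```python
import json
import os
import sys
import unicodedata
from collections import defaultdict


def build_category_map():
    """Scan every code point once and group the code points by category."""
    categories = defaultdict(list)
    for codepoint in range(sys.maxunicode + 1):
        categories[unicodedata.category(chr(codepoint))].append(codepoint)
    return dict(categories)


def load_category_map(cache_path):
    """Return {category: [characters]}, using the JSON cache when it exists.

    Hypothetical helper: the cache holds integer code points, which are
    converted back to one-character strings after loading.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="ascii") as f:
            codepoints = json.load(f)
    else:
        codepoints = build_category_map()
        with open(cache_path, "w", encoding="ascii") as f:
            json.dump(codepoints, f)
    return {cat: [chr(cp) for cp in cps] for cat, cps in codepoints.items()}
```

A call such as `load_category_map('unicode_category.json')['Ll']` would then return the same list as the in-memory version. Since category assignments depend on the Unicode database bundled with the interpreter (`unicodedata.unidata_version`), the cache should probably be rebuilt after a Python upgrade.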

Once you have the mapping, looking up a category takes constant time, of course.
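To mirror the tuple of categories used in the question, several lookups can be combined. The following is only a sketch; `chars_in` is a hypothetical helper, not part of any library. Its cost is proportional to the number of characters returned, not to the full code point range.

```python
import sys
import unicodedata
from collections import defaultdict

# One-time scan, as in the answer above.
unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)


def chars_in(*categories):
    """Concatenate the precomputed lists for the given categories."""
    result = []
    for category in categories:
        # .get() avoids creating empty entries for unknown category names
        result.extend(unicode_category.get(category, []))
    return result


# All cased letters: uppercase, lowercase and titlecase.
cased_letters = chars_in('Lu', 'Ll', 'Lt')
```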
