Efficiently list all characters in a given Unicode category

You often need to list all the characters in a given Unicode category.

One way to build this list is to iterate over all Unicode code points and test each one for the desired category (Python 3):

 import unicodedata
 [c for c in map(chr, range(0x110000)) if unicodedata.category(c) in ('Ll',)]

or to use regular expressions:

 import re
 re.findall(r'\s', ''.join(map(chr, range(0x110000))))

But these methods are slow. Is there a way to look up the list of characters in a category without having to iterate over all of them?

A related question for Perl: How to get a list of all Unicode characters that have a given property?

1 answer

If you need to do this often, it's easy enough to build a reusable map yourself:

 import sys
 import unicodedata
 from collections import defaultdict

 unicode_category = defaultdict(list)
 for c in map(chr, range(sys.maxunicode + 1)):
     unicode_category[unicodedata.category(c)].append(c)

From there on, use that map to get the list of characters for a given category:

 alphabetic = unicode_category['Ll'] 

If this is too costly at start-up time, consider dumping that structure to a file; loading the mapping from a JSON file or another quick-to-parse format should not be too painful.
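A minimal sketch of that caching idea, assuming a JSON file is acceptable as the cache format. The helper names `build_category_map` and `load_category_map` and the cache location are hypothetical, chosen only for illustration; the map is built once, written to disk, and read back on later runs:

```python
import json
import os
import sys
import tempfile
import unicodedata
from collections import defaultdict


def build_category_map():
    """Group every code point by its general category (the slow part)."""
    categories = defaultdict(list)
    for c in map(chr, range(sys.maxunicode + 1)):
        categories[unicodedata.category(c)].append(c)
    return dict(categories)


def load_category_map(cache_path):
    """Return the map stored at cache_path, building and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="ascii") as f:
            return json.load(f)
    categories = build_category_map()
    # json.dump escapes non-ASCII characters as \uXXXX by default
    # (ensure_ascii=True), so the file itself is plain ASCII.
    with open(cache_path, "w", encoding="ascii") as f:
        json.dump(categories, f)
    return categories


# Hypothetical cache location. The Unicode database version is part of the
# file name so that a different Python build does not reuse a stale cache.
CACHE_PATH = os.path.join(
    tempfile.gettempdir(),
    "unicode_category_{}.json".format(unicodedata.unidata_version),
)

unicode_category = load_category_map(CACHE_PATH)
lowercase = unicode_category["Ll"]
```

Each character is stored as its own one-character string, matching the lists in the original map. Whether reading the file is actually faster than rebuilding the map depends on your machine, so it is worth timing both before committing to the cache.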

Once you have the mapping, looking up a category takes constant time, of course.
