A smart way to index a list, where each element has an interpretation?

TL;DR: see the bold text.

I work with an image dataset that comes with boolean "one-hot" image annotations (CelebA, to be specific). The annotations encode facial features such as bald, male, and young. Now I want to build a custom one-hot list (to test my GAN model), and I want to provide a literate interface. Instead of specifying features[12]=True, knowing that 12 (counting from zero) corresponds to the male feature, I want something like features[male]=True or features.male=True.

Suppose the header of my .txt file is

 Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young 

and I want to encode Young, Bald, and Chubby. The expected result is

 [ 0. 0. 0. 1. 0. 1. 0. 0. 1.] 

since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting the user to know that Bald is the fourth entry, etc.?

I am looking for the Pythonic way, not necessarily the fastest way.

Ideal features

In rough order of importance:

  1. A way to achieve my stated goal that is already standard in the Python community takes precedence.
  2. The user/programmer should not need to count to an attribute in the .txt header. This is the point of what I am trying to design.
  3. The user should not be expected to have non-standard libraries such as aenum.
  4. The user/programmer should not need to consult the .txt header for attribute names or available attributes. One example: if the user wants to specify the gender attribute but does not know whether to use male or female, it should be easy to find out.
  5. The user/programmer should be able to find the available attributes through documentation (ideally generated by Sphinx api-doc). That is, point 4 should be possible while reading as little code as possible. Exposing the attributes with dir() satisfies this point sufficiently.
  6. The programmer should find the indexing tool natural. In particular, zero-indexing should be preferred over subtracting from one-indexing.
  7. Between two otherwise completely identical solutions, the one with the better performance wins.

Examples:

I am going to compare and contrast the approaches that immediately came to mind. All the examples use:

    import numpy as np

    header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
              "Bald Bangs Chubby Male Wearing_Necktie Young")
    NUM_CLASSES = len(header.split())  # 9

1: Dict comprehension

Obviously, we could use a dictionary for this:

    binary_label = np.zeros([NUM_CLASSES])
    classes = {head: idx for (idx, head) in enumerate(header.split())}
    binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
    print(binary_label)

For what it's worth, this has the fewest lines of code and is the only approach that relies solely on builtins rather than the standard library. On the negative side, it is not entirely self-documenting. To see the available options you must print(classes.keys()); they are not exposed by dir(). This borders on failing feature 5, because as far as I know the user has to know that classes is a dict in order to list the features.
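One way to soften the discoverability concern with the dict approach is a small helper that reports the valid names when a lookup fails. The sketch below is only an illustration: `encode` is a hypothetical name, not something from the question, and a plain Python list stands in for the NumPy array so the snippet needs no third-party packages.

```python
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")

# Map each header name to its zero-based position.
classes = {name: idx for idx, name in enumerate(header.split())}


def encode(*names):
    """Return a multi-hot list with 1.0 at the position of each given name.

    Hypothetical helper for illustration only.
    """
    unknown = [name for name in names if name not in classes]
    if unknown:
        # List the valid options so the user never has to open the .txt file.
        raise ValueError(
            f"unknown feature(s) {unknown}; choose from {sorted(classes)}")
    label = [0.0] * len(classes)
    for name in names:
        label[classes[name]] = 1.0
    return label


print(encode("Young", "Bald", "Chubby"))
```

This keeps the plain dict but still does not expose the names through dir(), so it only partly addresses feature 5.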

2: Enum:

Since I'm learning C++ right now, Enum is the first thing that came to mind:

    import enum

    binary_label = np.zeros([NUM_CLASSES])
    Classes = enum.IntEnum("Classes", header)
    features = [Classes.Young, Classes.Bald, Classes.Chubby]
    zero_idx_feats = [feat-1 for feat in features]
    binary_label[zero_idx_feats] = True
    print(binary_label)

This gives dot notation, and the image options are exposed with dir(Classes). However, enum uses one-indexing by default (the reason is documented). The workaround makes me feel that enum is not the Pythonic way to do this, and it does not fully satisfy feature 6.

3: Named tuple
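For reference, since Python 3.5 the functional Enum API accepts a `start` keyword, which avoids the subtraction workaround. The following is a minimal sketch under that assumption; a plain list stands in for the NumPy array.

```python
import enum

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")

# start=0 makes the functional API number the members from zero.
Classes = enum.IntEnum("Classes", header, start=0)

binary_label = [0.0] * len(Classes)
for feat in (Classes.Young, Classes.Bald, Classes.Chubby):
    # IntEnum members are ints, so they can be used directly as indices.
    binary_label[feat] = 1.0

print(binary_label)
```

The members still show up in dir(Classes), so the dot notation and discoverability of the original example are kept.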

Here's another one of the Python standard library:

    import collections

    binary_label = np.zeros([NUM_CLASSES])
    clss = collections.namedtuple(
        "Classes", header)._make(range(NUM_CLASSES))
    binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
    print(binary_label)

Using namedtuple, we again get dot notation and self-documentation with dir(clss). But the namedtuple class is heavier than enum. By that I mean namedtuple has functionality that I don't need. This solution seems to be the leader among my examples, but I don't know whether it satisfies feature 1, or whether an alternative could "win" via feature 7.

4: Custom enum

I could really break my back:
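As a small illustration of the self-documentation point, a namedtuple also exposes its names through the documented `_fields` attribute and `_asdict()` method, in header order. This sketch uses only the standard library:

```python
import collections

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

# Each field holds its own zero-based position in the header.
Classes = collections.namedtuple("Classes", names)
clss = Classes._make(range(len(names)))

print(clss._fields)                        # available names, in header order
print(clss._asdict())                      # name -> zero-based index
print(clss.Bald, clss.Chubby, clss.Young)  # indices used for the label
```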

    binary_label = np.zeros([NUM_CLASSES])

    class Classes(enum.IntEnum):
        Arched_Eyebrows = 0
        Attractive = 1
        Bags_Under_Eyes = 2
        Bald = 3
        Bangs = 4
        Chubby = 5
        Male = 6
        Wearing_Necktie = 7
        Young = 8

    binary_label[
        [Classes.Young, Classes.Bald, Classes.Chubby]] = True
    print(binary_label)

This has all the benefits of example 2, but there are obvious drawbacks. I have to write out all the features (there are 40 in the real dataset) just to get zero-indexing! Sure, this is how you make an enum in C++ (AFAIK), but it shouldn't be necessary in Python. This is a slight failure on feature 6.

Summary

There are many ways to achieve literate zero-indexing in Python. Would you provide a code snippet showing how you would accomplish what I'm after, and tell me why your way is right?

(Edit:) Or explain why one of my examples is the right tool for the job?


Status update:

I am not ready to accept an answer yet, in case anyone wants to address the following feedback/update or a new solution appears. Maybe another 24 hours? All the answers have been helpful, so I have upvoted everyone so far. You can look over this repo, which I use to test the solutions. Feel free to tell me whether my following comments are accurate or unfair:

zero-enum:

Oddly enough, Sphinx documents this incorrectly (one-indexed in the docs), but it does document it! I suppose that "issue" does not fail any ideal feature.

dotdict:

I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers who got this solution working with dir(). However, it does not appear to "work seamlessly" with Sphinx.

Numpy record:

As written, this solution takes significantly longer than the others. It comes in 10x slower than namedtuple (the fastest after a plain dict) and 7x slower than a standard IntEnum (the slowest after the numpy record). That is not drastic at the current scale, nor is it a priority, but a quick Google search indicates that np.in1d is in fact slow. Let's stick with

    _label = np.zeros([NUM_CLASSES])
    _label[[header_rec[key].item()
            for key in ["Young", "Bald", "Chubby"]]] = True

unless I have implemented something wrong in the linked repo. This brings the execution speed into a range comparable with the other solutions. Again, no Sphinx.

namedtuple (and rassar's critiques)

I am not convinced by your criticism of Enum. It seems to me that you believe I am approaching the problem incorrectly. That is fine with me, but I don't understand how using namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Did I misunderstand you?

Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, it scores exactly the same as the zero-indexed enum and profiles ahead of it.

Rationale for accepting

I accepted the zero-enum answer because it gave me the best challenger to namedtuple. By my standards, namedtuple is the best solution. But salparadise wrote the answer that helped me feel confident in that assessment. Thanks to all who responded.
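A rough way to check lookup-speed claims like these locally is a `timeit` comparison. The harness below is a hypothetical sketch, not the code from the linked repo; it only times a single attribute or key lookup with the standard library, and absolute numbers will vary by machine and Python version.

```python
import collections
import enum
import timeit

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

# Three zero-indexed containers built from the same header.
as_dict = {name: idx for idx, name in enumerate(names)}
as_tuple = collections.namedtuple("Classes", names)._make(range(len(names)))
as_enum = enum.IntEnum("Classes", [(name, idx) for idx, name in enumerate(names)])

candidates = {
    "dict": lambda: as_dict["Young"],
    "namedtuple": lambda: as_tuple.Young,
    "IntEnum": lambda: as_enum.Young,
}

# Time 100,000 lookups per candidate and print them fastest first.
timings = {label: timeit.timeit(fn, number=100_000)
           for label, fn in candidates.items()}
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label:>10}: {seconds:.4f} s per 100k lookups")
```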

4 answers

How about a factory function that creates a zero-indexed IntEnum? It is the object that suits your needs, and Enum provides the flexibility to build it:

    from enum import IntEnum

    def zero_indexed_enum(name, items):
        # splits on space, so it won't take any iterable. Easy to change depending on need.
        return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))

Then:

    In [43]: header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
        ...:           "Bald Bangs Chubby Male Wearing_Necktie Young")

    In [44]: Classes = zero_indexed_enum('Classes', header)

    In [45]: list(Classes)
    Out[45]:
    [<Classes.Arched_Eyebrows: 0>,
     <Classes.Attractive: 1>,
     <Classes.Bags_Under_Eyes: 2>,
     <Classes.Bald: 3>,
     <Classes.Bangs: 4>,
     <Classes.Chubby: 5>,
     <Classes.Male: 6>,
     <Classes.Wearing_Necktie: 7>,
     <Classes.Young: 8>]
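To connect the factory back to the question's goal, here is a short usage sketch. It repeats the factory so that it runs on its own, looks members up by name with `Classes["..."]`, and uses a plain list in place of the NumPy array; the variable names `wanted` and `binary_label` are chosen for illustration.

```python
from enum import IntEnum


def zero_indexed_enum(name, items):
    # Same factory as above: split on whitespace and number from zero.
    return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))


header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
Classes = zero_indexed_enum("Classes", header)

# Discover the available names without opening the .txt file.
print(list(Classes.__members__))

# Members can be looked up by string as well as by attribute.
wanted = [Classes[name] for name in ("Young", "Bald", "Chubby")]

binary_label = [0.0] * len(Classes)
for feat in wanted:
    binary_label[feat] = 1.0  # IntEnum members index like plain ints

print(binary_label)
```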

You can use a custom class that I like to call DotMap or, as it is called in this SO discussion, Map:

  • fooobar.com/questions/47184 / ... ( Map , more complete version)
  • fooobar.com/questions/47184 / ... ( dotdict , shorter lighter version)

About Map :

  • It has the features of a dictionary, since the input to a Map/DotMap is a dict. You can access attributes using features['male'].
  • Alternatively, you can access attributes using dot notation, i.e. features.male, and the attributes are listed when you call dir(features).
  • It is only as heavy as necessary to enable the dot functionality.
  • Unlike namedtuple, you do not need to predefine it, and you can add or remove keys at will.
  • The Map class described in the SO question is not compatible with Python 3 because it uses iteritems(). Just replace it with items().

About dotdict :

  • dotdict provides the same benefits as Map, except that it does not override the __dir__() method, so you cannot get the attributes for documentation. @SigmaPiEpsilon has provided a fix for this here.
  • It uses the dict.get method instead of dict.__getitem__, so it returns None instead of raising a KeyError when you access attributes that don't exist.
  • It does not recursively apply dotdict-ness to nested dicts, so you cannot use features.foo.bar.

Here's an updated version of dotdict that solves the first two problems:

    class dotdict(dict):
        __getattr__ = dict.__getitem__  # __getitem__ instead of get
        __setattr__ = dict.__setitem__
        __delattr__ = dict.__delitem__

        def __dir__(self):  # by @SigmaPiEpsilon for documentation
            return self.keys()
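One caveat with aliasing `__getattr__` to `dict.__getitem__`: a missing key raises KeyError rather than AttributeError, so in Python 3 `hasattr()` propagates the error instead of returning False. A variant that translates the exception might look like the sketch below; the name `DotDict` and the sample keys are chosen for illustration and are not taken from the linked answers.

```python
class DotDict(dict):
    """dict with attribute access that raises AttributeError for missing keys."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    def __dir__(self):
        # Expose only the stored keys, for documentation purposes.
        return list(self.keys())


features = DotDict(Bald=3, Chubby=5, Young=8)
print(features.Young)
print(hasattr(features, "Male"))  # False instead of an uncaught KeyError
print(dir(features))
```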

Update

Map and dotdict do not have the same behavior, as @SigmaPiEpsilon pointed out, so I added separate descriptions for both.


From your examples, 3 is the most pythonic answer to your question.

1, as you said, does not even answer your question, since the names are not explicit.

2 uses enums, which, although they are in the standard library, are not Pythonic and are not generally used in Python in scenarios like this. (Edit): In this case, you only really need two different constants: the target values and the others. Enum will provide separate values for each constant, which is not the goal of your program and seems to be a roundabout way of approaching the problem.

4 is simply not maintainable if a client wants to add options, and even as it stands it is painstaking work.

3 uses well-known classes from the standard library in a readable and concise way. It also has no drawbacks, since it is completely explicit. Being too "heavy" doesn't matter if you don't care about performance, and in any case the lag will be unnoticeable at your input size.


Your requirements, if I understand correctly, can be divided into two parts:

  • Access the positions of the header elements in the .txt file by name in the most Pythonic way possible and with minimal external dependencies

  • Enable dot access to the data structure containing the header names, so that you can call dir() and set up an easy interface with Sphinx

Pythonic way (no external dependencies)

The most Pythonic way to solve the problem is, of course, the dictionary approach (dictionaries are at the heart of Python). Looking up a value by key in a dictionary is also much faster than the other methods. The only problem is that this prevents dot access. Another answer mentions Map and dotdict as alternatives. dotdict is simpler, but it only enables dot access; it does not help with the documentation aspect via dir(), because dir() calls the __dir__() method, which is not overridden in these cases. It therefore returns only the attributes of the Python dict, not the header names. See below:

    >>> class dotdict(dict):
    ...     __getattr__ = dict.get
    ...     __setattr__ = dict.__setitem__
    ...     __delattr__ = dict.__delitem__
    ...
    >>> somedict = {'a' : 1, 'b': 2, 'c' : 3}
    >>> somedotdict = dotdict(somedict)
    >>> somedotdict.a
    1
    >>> 'a' in dir(somedotdict)
    False

There are two ways around this problem.

Option 1: Override the __dir__() method as shown below. This only works when you call dir() on instances of the class. For the change to apply to the class itself, you need to create a metaclass for the class. See here

    #add this to dotdict
    def __dir__(self):
        return self.keys()

    >>> somedotdictdir = dotdictdir(somedict)
    >>> somedotdictdir.a
    1
    >>> dir(somedotdictdir)
    ['a', 'b', 'c']
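The metaclass remark in Option 1 can be illustrated with a minimal sketch. Because dir() looks up `__dir__` on the type of its argument, calling dir() on a class consults its metaclass. The names `FieldDirMeta`, `Features`, and `_fields` below are hypothetical and chosen only for this example.

```python
class FieldDirMeta(type):
    """Metaclass whose classes list their declared fields in dir()."""

    def __dir__(cls):
        # dir(SomeClass) ends up here because type(SomeClass) is this metaclass.
        return list(cls._fields)


class Features(metaclass=FieldDirMeta):
    # Hypothetical field list; in practice this could come from the header.
    _fields = ("Bald", "Chubby", "Young")


print(dir(Features))  # called on the class itself, not on an instance
```

Note that this replaces the usual dir() listing for the class entirely, which may or may not be what a documentation tool expects.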

Option 2: The second option, which brings it much closer to a user-defined object with attributes, is to update the __dict__ attribute of the created object. This is what Map uses. A regular Python dict does not have this attribute. If you add it, you can call dir() to get the attributes/keys as well as all the additional Python dict methods/attributes. If you only need the stored attributes and values, you can use vars(somedotdictdir), which is also useful for documentation.

    class dotdictdir(dict):
        def __init__(self, *args, **kwargs):
            dict.__init__(self, *args, **kwargs)
            self.__dict__.update({k : v for k,v in self.items()})
        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)
            self.__dict__.update({key : value})
        __getattr__ = dict.get #replace with dict.__getitem__ if want raise error on missing key access
        __setattr__ = __setitem__
        __delattr__ = dict.__delitem__

    >>> somedotdictdir = dotdictdir(somedict)
    >>> somedotdictdir
    {'a': 3, 'c': 6, 'b': 4}
    >>> vars(somedotdictdir)
    {'a': 3, 'c': 6, 'b': 4}
    >>> 'a' in dir(somedotdictdir)
    True

Numpy way

Another option would be to use a numpy record array, which allows dot access. I noticed in your code that you are already using numpy. In this case, too, __dir__() has to be overridden to get the attributes. This may result in faster operations (not tested) for data with a large number of other numerical values.

    >>> headers = "Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young".split()
    >>> header_rec = np.array([tuple(range(len(headers)))], dtype = zip(headers, [int]*len(headers)))
    >>> header_rec.dtype.names
    ('Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 'Chubby', 'Male', 'Wearing_Necktie', 'Young')
    >>> np.in1d(header_rec.item(), [header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]).astype(int)
    array([0, 0, 0, 1, 0, 1, 0, 0, 1])

In Python 3, you will need to use dtype=list(zip(headers, [int]*len(headers))), since zip returns an iterator there rather than a list.