Tl;dr is in bold text.
I work with an image dataset that comes with boolean "one-hot" image annotations (CelebA, to be specific). The annotations encode facial features such as bald, male, and young. Now I want to build a custom one-hot list (to test my GAN model). I want to provide a literate interface. Instead of specifying features[12]=True, knowing that 12 (counting from zero) corresponds to the male feature, I want something like features[male]=True or features.male=True.
Suppose the header of my .txt file is
Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young
and I want to encode Young, Bald and Chubby. The expected result is
[ 0. 0. 0. 1. 0. 1. 0. 0. 1.]
since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting the user to know that Bald is the fourth entry, etc.?
I am looking for the Pythonic way, not necessarily the fastest way.
Ideal features
In rough order of importance:
1. A way to achieve my stated goal that is already standard in the Python community will take precedence.
2. The user/programmer should not have to count to an attribute's position in the .txt header. Avoiding this is the whole point of what I am trying to create.
3. The user should not be expected to have non-standard libraries such as aenum.
4. The user/programmer should not have to refer to the .txt header for attribute names or available attributes. One example: if the user wants to specify the gender attribute but does not know whether to use male or female, this should be easy to find out.
5. The user/programmer should be able to find the available attributes through documentation (ideally generated by Sphinx api-doc). That is, point 4 should be possible while reading as little code as possible. Exposing the attributes with dir() satisfies this point sufficiently.
6. The programmer should find the indexing tool natural. In particular, zero-indexing should be preferred over subtracting from one-indexing.
7. Between two otherwise identical solutions, the one with the better performance wins.
Examples:
I am going to compare and contrast the approaches that immediately occurred to me. All examples use:
    import numpy as np

    header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
              "Bald Bangs Chubby Male Wearing_Necktie Young")
    NUM_CLASSES = len(header.split())
1: Dict Comprehension
Obviously, we could use a dictionary for this:
    binary_label = np.zeros([NUM_CLASSES])
    # Map each header name to its zero-based position.
    classes = {head: idx for (idx, head) in enumerate(header.split())}
    binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
    print(binary_label)
For what it's worth, this has the fewest lines of code and is the only approach that relies on builtins alone rather than the standard library. As for the negatives, it is not exactly self-documenting. To view the available options, you must print(classes.keys()); they are not exposed by dir(). This borders on failing feature 5, because it requires the user to know that classes is a dict in order to expose the features, AFAIK.
2: Enum:
Since I'm learning C++ right now, Enum is the first thing that came to mind:
    import enum

    binary_label = np.zeros([NUM_CLASSES])
    # The functional API numbers members from 1 by default.
    Classes = enum.IntEnum("Classes", header)
    features = [Classes.Young, Classes.Bald, Classes.Chubby]
    # Shift back to zero-based indices.
    zero_idx_feats = [feat-1 for feat in features]
    binary_label[zero_idx_feats] = True
    print(binary_label)
This gives dot notation, and the image options are exposed by dir(Classes). However, Enum uses one-indexing by default (the reason is documented). The workaround makes me feel that Enum is not the Pythonic way to do this, and it does not fully satisfy feature 6.
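For comparison, the functional Enum API also accepts a start argument, so the subtraction can be avoided. The following is only a minimal sketch under that assumption; the name ZeroClasses is made up for illustration.

```python
import enum

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
NUM_CLASSES = len(header.split())

# start=0 makes the functional API number the members from zero.
# "ZeroClasses" is a hypothetical name used only in this sketch.
ZeroClasses = enum.IntEnum("ZeroClasses", header, start=0)

binary_label = np.zeros([NUM_CLASSES])
binary_label[[ZeroClasses.Young, ZeroClasses.Bald, ZeroClasses.Chubby]] = True
print(binary_label)
```

Members are still reachable with dot notation and still show up in dir(ZeroClasses).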
3: Named Tuple
Here's another one from the Python standard library:
    import collections

    binary_label = np.zeros([NUM_CLASSES])
    # Field names come from the header; values are the zero-based positions.
    clss = collections.namedtuple(
        "Classes", header)._make(range(NUM_CLASSES))
    binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
    print(binary_label)
Using namedtuple, we again get dot notation and self-documentation with dir(clss). But the namedtuple class is heavier than Enum. By that I mean namedtuple has functionality that I don't need. This solution seems to be the leader among my examples, but I don't know whether it satisfies feature 1, or whether an alternative could "win" through feature 7.
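To illustrate the discoverability point, here is a rough sketch of how the namedtuple could be wrapped. The helper one_hot is hypothetical and not part of any library; _fields is the standard namedtuple attribute that lists the field names.

```python
import collections

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
fields = header.split()

Classes = collections.namedtuple("Classes", fields)
clss = Classes._make(range(len(fields)))


def one_hot(*names):
    """Hypothetical helper: build a label vector from attribute names."""
    label = np.zeros(len(clss))
    label[[getattr(clss, name) for name in names]] = 1.0
    return label


print(clss._fields)  # available attribute names, in header order
print(one_hot("Young", "Bald", "Chubby"))
```

A misspelled name raises AttributeError from getattr, so a typo fails loudly instead of silently setting the wrong position.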
4: Custom Enum
I could really go to great lengths:
    import enum  # already imported in example 2; repeated so this runs alone

    binary_label = np.zeros([NUM_CLASSES])

    class Classes(enum.IntEnum):
        Arched_Eyebrows = 0
        Attractive = 1
        Bags_Under_Eyes = 2
        Bald = 3
        Bangs = 4
        Chubby = 5
        Male = 6
        Wearing_Necktie = 7
        Young = 8

    binary_label[
        [Classes.Young, Classes.Bald, Classes.Chubby]] = True
    print(binary_label)
This has all the benefits of Ex. 2, but there are obvious drawbacks. I have to write out all the features (there are 40 in the real dataset) just to zero-index! Sure, this is how to write an enum in C++ (AFAIK), but it should not be necessary in Python. This is a slight failure on feature 6.
Summary
There are many ways to achieve literate zero-indexing in Python. Would you provide a code snippet of how you would accomplish what I am after and tell me why your way is right?
(edit:) Or explain why one of my examples is the right tool for the job?
Status update:
I am not ready to accept an answer yet, in case anyone wants to address the following feedback/update or a new solution appears. Maybe another 24 hours? All the answers have been helpful, so I have upvoted everyone so far. You may want to look over this repo I am using to test the solutions. Feel free to tell me if my following remarks are inaccurate or unfair:
zero-enum:
Oddly enough, Sphinx documents this incorrectly (one-indexed in the docs), but it does document it! I suppose that "issue" does not fail any ideal feature.
dotdict:
I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers for getting this solution to work with dir(). However, it does not appear to "work seamlessly" with Sphinx.
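For readers without the answers at hand, a dotdict of this kind might look roughly like the sketch below. This is not the answerers' code; the class name DotDict and its details are assumptions made for illustration only.

```python
class DotDict(dict):
    """Hypothetical dict subclass with attribute-style read access."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Expose the keys alongside the usual dict attributes.
        return list(super().__dir__()) + list(self.keys())


header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
classes = DotDict((head, idx) for idx, head in enumerate(header.split()))

print(classes.Bald)
print("Male" in dir(classes))
```

Overriding __dir__ is what makes the keys visible to dir(); a plain dict subclass with only __getattr__ would not list them.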
Numpy record:
As written, this solution takes significantly longer than the other solutions. It comes in 10x slower than namedtuple (the fastest after a pure dict) and 7x slower than a standard IntEnum (the slowest after the numpy record). This is not drastic at the current scale, nor a priority, but a quick Google search indicates that np.in1d is in fact slow. Let's stick with
    _label = np.zeros([NUM_CLASSES])
    _label[[header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]] = True
unless I have implemented something wrong in the linked repo. This brings the execution speed into a range comparable with the other solutions. Again, no Sphinx.
namedtuple (and rassar's critiques)
I am not convinced by your critique of Enum. It seems to me that you believe I am approaching the problem incorrectly. That is fine with me, but I don't see how using namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Have I misunderstood you?
Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, it scores exactly the same as zero-enum and profiles ahead of zero-enum.
Accepted Rationale
I accepted the zero-enum answer because it gave me the best challenger to namedtuple. By my standards, namedtuple is the best solution. But salparadise wrote an answer that helped me feel confident in this assessment. Thanks to all who responded.