Beginning with:
df: val1 val2 0 NaN 10 1 10.18 1 2 25.16 1 3 44.48 1 4 85.24 1 5 36.71 1 6 77.09 1 7 81.88 1 8 22.92 1 9 44.31 1 10 15.79 1
and
xml = """ <metaGroups> <Groups> <GroupID>age</GroupID> <description>age</description> <groupname> <Name>0 - <10</Name> <min>0</min> <minInc>TRUE</minInc> <max>10</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>10 - <18</Name> <min>10</min> <minInc>TRUE</minInc> <max>18</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>18 - <35</Name> <min>18</min> <minInc>TRUE</minInc> <max>35</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>35 - <50</Name> <min>35</min> <minInc>TRUE</minInc> <max>50</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>50 - <65</Name> <min>50</min> <minInc>TRUE</minInc> <max>65</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>65 - <75</Name> <min>65</min> <minInc>TRUE</minInc> <max>75</max> <maxInc>FALSE</maxInc> </groupname> <groupname> <Name>&ge;75</Name> <min>75</min> <minInc>TRUE</minInc> </groupname> </Groups> </metaGroups> """
You can use BeautifulSoup to extract bin parameters, build labels, and apply pd.cut() :
from bs4 import BeautifulSoup as Soup from itertools import chain soup = Soup(xml, 'html.parser') bins = [] for message in soup.findAll('groupname'): min = message.find('min').text try: max = message.find('max').text bins.append([min, max]) except: bins.append([min])
at this moment we have
bins [['0', '10'], ['10', '18'], ['18', '35'], ['35', '50'], ['50', '65'], ['65', '75'], ['75']]
Then we flatten the list of list , get rid of duplicates and add the upper border:
labels = bins bins = list(np.unique(np.fromiter(chain.from_iterable(bins), dtype='int'))) last = bins[-1] bins.append(int(df.val1.max() + 1))
which gives:
[0, 10, 18, 35, 50, 65, 75, 86]
Labeling:
labels = ['[{0} - {1}]'.format(label[0], label[1]) if len(label) > 1 else '[ > {} ]'.format(label[0]) for label in labels]
and using pd.cut() :
df['binned'] = pd.cut(df.val1, bins=bins, labels=labels)
gives:
val1 val2 binned 1 10.18 1 [10 - 18] 2 25.16 1 [18 - 35] 3 44.48 1 [35 - 50] 4 85.24 1 [>= 75] 5 36.71 1 [35 - 50] 6 77.09 1 [>= 75] 7 81.88 1 [>= 75] 8 22.92 1 [18 - 35] 9 44.31 1 [35 - 50] 10 15.79 1 [10 - 18]