Parsing words into a tree

Question

Parsing words into a tree

I have a list of words. For example:

reel road root curd

I would like to save this data in a way that reflects the following structure:

 Start -> r -> e -> reel -> o -> a -> road o -> root c -> curd

It is obvious to me that I need to implement a tree. From this tree, I should be able to easily get statistics, such as node height, number of node descendants, node search, etc. Adding a node should “automatically” add it to the correct position in the tree, since this position is unique.

He would also like to be able to visualize the data in the form of a real graphic tree. Since the tree will be huge , I will need zoom / pan controls for visualization. And of course, beautiful visualization is always better than ugly.

Does anyone know of a Python package that would allow me to achieve all this simply? Writing the code itself will take quite a while. Do you think http://packages.python.org/ete2/ is suitable for this task?

I'm on Python 2.x, btw.

I found that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I cannot find examples anywhere! For example, how to add words to trie?

+4

python data-structures parsing tree

Velvet ghost Jun 04 '12 at 9:26

source share

2 answers

peixe · Answer 1 · 2012-06-04T09:44:21+0000

ETE2 is an environment for studying wood, basically made for viewing, building and studying phylogenetic trees, and I used it for a long time for these purposes. But it’s possible that if you set up your data correctly, you can do it.

You just need to place the apiaries wherever you need to split the tree and create a branch. See the following example taken from an ETE document. If you change these "(A, B, (C, D)); for your words / letters this should be done.

 from ete2 import Tree unrooted_tree = Tree( "(A,B,(C,D));" ) print unrooted_tree

exit:

  /-A | ----|--B | | /-C \---| \-D

... and this package will allow you to perform most of the operations that you want, giving you the opportunity to select each branch individually and work with it in a simple way. I recommend that you take a look at the tutorial anyway, but rather difficult :)

jhc · Answer 2 · 2012-08-01T11:05:47+0000

I think the following example does pretty much what you want with the ETE toolkit.

 from ete2 import Tree words = [ "reel", "road", "root", "curd", "curl", "whatever","whenever", "wherever"] #Creates a empty tree tree = Tree() tree.name = "" # Lets keep tree structure indexed name2node = {} # Make sure there are no duplicates words = set(words) # Populate tree for wd in words: # If no similar words exist, add it to the base of tree target = tree # Find relatives in the tree for pos in xrange(len(wd), -1, -1): root = wd[:pos] if root in name2node: target = name2node[root] break # Add new nodes as necessary fullname = root for letter in wd[pos:]: fullname += letter new_node = target.add_child(name=letter, dist=1.0) name2node[fullname] = new_node target = new_node # Print structure print tree.get_ascii() # You can also use all the visualization machinery from ETE # (http://packages.python.org/ete2/tutorial/tutorial_drawing.html) # tree.show() # You can find, isolate and operate with a specific node using the index wh_node = name2node["whe"] print wh_node.get_ascii() # You can rebuild words under a given node def recontruct_fullname(node): name = [] while node.up: name.append(node.name) node = node.up name = ''.join(reversed(name)) return name for leaf in wh_node.iter_leaves(): print recontruct_fullname(leaf) /n-- /e-- /v-- /e-- /-r /e--| /w-- /h--| \r-- /e-- /v-- /e-- /-r | | | \a-- /t-- /e-- /v-- /e-- /-r | | /e-- /e-- /-l ----|-r--| | | /o-- /-t | \o--| | \a-- /-d | | /-d \c-- /u-- /r--| \-l

Parsing words into a tree

More articles: