Initialize / create / populate a dict of dict of dicts in Python

I have used dictionaries in Python before, but I'm still new to Python. This time I need a dictionary of dictionaries of dictionaries, i.e. a three-layer dict, and wanted to check before programming it.

I want to store all the data in this three-layer dict and wondered what would be a good Pythonic way of initializing it, and then of reading a file and writing it into such a data structure.

The dictionary I want is of the following type:

{'geneid': {'transcript_id': {col_name1:col_value1, col_name2:col_value2} } } 

Data of this type:

    geneid\ttx_id\tcolname1\tcolname2\n
    hello\tNR432\t4.5\t6.7
    bye\tNR439\t4.5\t6.7

Any ideas on how to do this in a good way?

Thanks!

+4
3 answers

First, start with csv to handle line parsing:

    import csv

    # In Python 3 the csv module expects a text-mode file opened with newline=''
    with open('mydata.txt', newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            print(row)

This will print:

    {'geneid': 'hello', 'tx_id': 'NR432', 'colname1': '4.5', 'colname2': '6.7'}
    {'geneid': 'bye', 'tx_id': 'NR439', 'colname1': '4.5', 'colname2': '6.7'}

So, now you just need to reorganize this into your preferred structure. This is almost trivial, except that the first time you see a given geneid you need to create a new empty dict for it, and likewise the first time you see a given tx_id within a geneid. You can solve this with setdefault:

    import csv

    genes = {}
    with open('mydata.txt', newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            # setdefault returns the existing inner dict, or inserts a new empty one
            gene = genes.setdefault(row['geneid'], {})
            transcript = gene.setdefault(row['tx_id'], {})
            transcript['colname1'] = row['colname1']
            transcript['colname2'] = row['colname2']

You can make this more readable with defaultdict:

    import csv
    from collections import defaultdict
    from functools import partial

    # Outer level creates defaultdict(dict); that inner level creates plain dicts
    genes = defaultdict(partial(defaultdict, dict))
    with open('mydata.txt', newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            genes[row['geneid']][row['tx_id']]['colname1'] = row['colname1']
            genes[row['geneid']][row['tx_id']]['colname2'] = row['colname2']

The trick here is that the top-level dict is a special one that returns an empty dict whenever it first sees a new key, and the empty dict it returns is itself a special one that returns an empty dict. The only tricky part is that defaultdict takes a function that returns the right kind of object, and a function that returns defaultdict(dict) has to be written with partial, a lambda, or an explicit function. (There are recipes on ActiveState and modules on PyPI that will give you an even more general version of this, which creates new dictionaries all the way down as needed.)
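
As a minimal sketch of the alternatives mentioned above, the same factory can be spelled with a lambda or a named function instead of partial. The sample text, the helper names `new_gene` and `load`, and the in-memory `io.StringIO` standing in for the real file are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample data in the question's format (an in-memory stand-in for mydata.txt).
SAMPLE = (
    "geneid\ttx_id\tcolname1\tcolname2\n"
    "hello\tNR432\t4.5\t6.7\n"
    "bye\tNR439\t4.5\t6.7\n"
)


def new_gene():
    """Hypothetical named factory: one gene maps tx_id -> dict of columns."""
    return defaultdict(dict)


def load(text, factory):
    """Fill a two-level defaultdict whose leaves are plain dicts."""
    genes = defaultdict(factory)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Pop the two key columns; whatever is left is stored in the leaf dict.
        geneid = row.pop("geneid")
        tx_id = row.pop("tx_id")
        genes[geneid][tx_id].update(row)
    return genes


genes_lambda = load(SAMPLE, lambda: defaultdict(dict))
genes_named = load(SAMPLE, new_gene)
print(genes_lambda["hello"]["NR432"]["colname1"])  # prints 4.5
```

Both spellings build the same structure. Storing every remaining column, rather than naming colname1 and colname2 explicitly, is a design choice of this sketch and not part of the answer above.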

+4

I also looked for alternatives and came across this excellent answer on Stack Overflow:

What is the best way to initialize a dict of dicts in Python?

Basically in my case:

    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                # Missing key: create a nested instance, store it, and return it
                value = self[item] = type(self)()
                return value
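
A short usage sketch, assuming the class exactly as given above (it is repeated so the example runs on its own). The keys and values are taken from the question's sample data; the "typo" key is made up to show the main caveat.

```python
class AutoVivification(dict):
    """Same class as above, repeated so this sketch is self-contained."""

    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value


genes = AutoVivification()
# Intermediate levels are created on first access, so no setup is needed.
genes["hello"]["NR432"]["colname1"] = 4.5
genes["hello"]["NR432"]["colname2"] = 6.7
genes["bye"]["NR439"]["colname1"] = 4.5

# Caveat: merely reading a missing key also creates it.
_ = genes["typo"]
print("typo" in genes)  # prints True
```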
+2

I have to do this regularly when coding for my research. You will want to use defaultdict from the collections module because it lets you add key: value pairs at any level with a simple assignment. I will show you how after answering your question. This comes directly from one of my programs. Focus on the last 4 lines that are not comments, and trace the variables back through the rest of the block to see what it does:

    from astropy.io import fits  # this package handles the image data I work with
    import numpy as np
    import os
    from collections import defaultdict

    # HDUs in order of extension no.: header, flux, flux error, flag mask,
    # wavelength, sky flux, error in sky flux, telluric flux, telluric flux errors,
    # wavelength solution coefficients, & line-spread function
    klist = ['hdr', 'F', 'Ferr', 'flag', 'lmda', 'sky', 'skyerr', 'tel', 'telerr', 'wco', 'lsf']
    dtess = []
    for fname in os.listdir(os.getcwd()):
        if fname.startswith("apVisit"):
            meff = fits.open(fname, mode='readonly', ignore_missing_end=True)
            hdr = meff[0].header
            oid = str(hdr["OBJID"])  # object ID
            mjd = int(hdr["MJD5"].strip(' '))  # 5-digit observation date
            for k, v in enumerate(klist):
                # the header extension works differently from the rest of the image cube;
                # it is not relevant to populating dictionaries
                if k == 0:
                    dtess.append([oid, mjd, v, hdr])
                else:
                    dtess.append([oid, mjd, v, meff[k].data])

    dtree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for s, t, u, v in dtess:
        dtree[s][t][u].append(v)
    # once you've added all the keys you want to your dictionary,
    # set the default_factory attribute to None
    dtree.default_factory = None

Here is the summary.

  • First, for an n-level dictionary you need to sort and dump everything into a list of (n+1)-tuples of the form [key_1, key_2, ..., key_n, value].
  • Then, to initialize the n-level dictionary, you simply type "defaultdict(lambda:" (minus the quotation marks) n-1 times, stick "defaultdict(list)" (or some other data type) at the end, and close the parentheses.
  • Append to the lists with a for loop. * Note: because each lowest-level value is a list, when you go to access the data you will probably have to type my_dict[key_1][key_2][...][key_n][0] to get the actual value rather than the list that wraps it.
  • * Edit: once your dictionary contains everything you want, set default_factory to None.
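
The steps above can be sketched for any depth with a small recursive helper. The name `nested_defaultdict` and the sample records are hypothetical, chosen only to mirror the [key_1, key_2, key_3, value] shape described in the list.

```python
from collections import defaultdict


def nested_defaultdict(depth, leaf=list):
    """Return a defaultdict nested `depth` levels deep with `leaf` values."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return defaultdict(leaf)
    return defaultdict(lambda: nested_defaultdict(depth - 1, leaf))


# Hypothetical records in the form [key_1, key_2, key_3, value].
records = [
    ["obj1", 55000, "F", 1.0],
    ["obj1", 55000, "F", 2.0],
    ["obj1", 55001, "sky", 0.5],
    ["obj2", 55000, "F", 3.0],
]

tree = nested_defaultdict(3)
for key_1, key_2, key_3, value in records:
    tree[key_1][key_2][key_3].append(value)

print(tree["obj1"][55000]["F"])        # prints [1.0, 2.0]
print(tree["obj1"][55001]["sky"][0])   # prints 0.5
```

Because every leaf is a list, a key that was filled only once still needs the trailing [0] to get at the single stored value.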

If you have not set default_factory to None, you can add to your nested dictionary later by typing something like my_dict[key_1][key_2][...][new_key] = new_value or by using append(). You can even add additional dictionaries, as long as the ones you add with these forms of assignment are not nested themselves.

* WARNING! The last line of this code snippet, added later, in which you set the default_factory attribute to None, is very important. While default_factory is set, every lookup of a missing key silently creates a new empty entry instead of raising KeyError, so a typo or an exploratory read makes the dictionary grow. In a long-running program this can look like a memory leak. Note that the assignment only affects the defaultdict it is applied to, so the inner levels keep their factories unless you reset them as well. I learned this the hard way after I wrote this answer. The problem bothered me for several months, and I was not even the one who figured it out in the end, because I didn't understand anything about memory allocation.
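
A small sketch of switching off auto-creation at every level rather than only at the top: assigning default_factory = None affects only the defaultdict it is set on, so nested levels would otherwise still create entries on lookup. The helper name `freeze` and the sample keys are hypothetical.

```python
from collections import defaultdict


def freeze(node):
    """Disable auto-creation on this defaultdict and on every nested one."""
    if isinstance(node, defaultdict):
        node.default_factory = None
        for child in node.values():
            freeze(child)
    return node


tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
tree["obj1"][55000]["F"].append(1.0)

freeze(tree)

try:
    tree["obj1"][55000]["typo"]
except KeyError:
    print("missing key now raises KeyError")

print(len(tree["obj1"][55000]))  # prints 1
```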

+2