Handle a large text file in Python

I have a very large file (3.8 GB), which is an export of users from a system at my school. I need to process this file so that it contains only each user's ID and email address, separated by a comma.

I have very little experience with this and would like to use it as a learning exercise for Python.

There are entries in the file that look like this:

    dn: uid=123456789012345,ou=Students,o=system.edu,o=system
    LoginId: 0099886
    mail: fflintstone@system.edu
    dn: uid=543210987654321,ou=Students,o=system.edu,o=system
    LoginId: 0083156
    mail: brubble@system.edu

I am trying to get a file that looks like this:

    0099886, fflintstone@system.edu
    0083156, brubble@system.edu

Any hints or code?

+4
4 answers

This looks like an LDIF file. The python-ldap library includes a pure-Python LDIF parsing module, which can help if your file contains some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, line folding, etc.

You can use it like this:

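    import csv
    import ldif  # pure-Python LDIF module shipped with python-ldap

    class ParseRecords(ldif.LDIFParser):
        def __init__(self, input_file, csv_writer):
            # The parser itself needs the input file
            ldif.LDIFParser.__init__(self, input_file)
            self.csv_writer = csv_writer

        def handle(self, dn, entry):
            # Each attribute maps to a list of values (bytes under Python 3)
            self.csv_writer.writerow([entry['LoginId'][0].decode('utf-8'),
                                      entry['mail'][0].decode('utf-8')])

    with open('/path/to/large_file') as ldif_file, \
            open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])
        ParseRecords(ldif_file, csv_writer).parse()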
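To make the two gotchas mentioned above concrete, here is a minimal standard-library sketch of what an LDIF parser has to do: join folded lines (a continuation line starts with a single space) and decode values marked with a double colon as Base64. The helper names `unfold` and `parse_attr` are made up for illustration and are not part of python-ldap; the real `ldif` module handles many more cases.

```python
import base64

def unfold(lines):
    """Join LDIF continuation lines (starting with one space) onto the previous line."""
    logical = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" ") and logical:
            # Drop exactly one leading space and append to the previous logical line
            logical[-1] += line[1:]
        else:
            logical.append(line)
    return logical

def parse_attr(line):
    """Split 'attr: value' or 'attr:: base64value' into an (attr, value) pair."""
    attr, _, rest = line.partition(":")
    if rest.startswith(":"):
        # Double colon: the value is Base64-encoded
        return attr, base64.b64decode(rest[1:].strip()).decode("utf-8")
    return attr, rest.strip()

# Sample record from the question, with the dn folded and the mail Base64-encoded
sample = [
    "dn: uid=123456789012345,ou=Students,\n",
    " o=system.edu,o=system\n",
    "LoginId: 0099886\n",
    "mail:: ZmZsaW50c3RvbmVAc3lzdGVtLmVkdQ==\n",
]
records = [parse_attr(line) for line in unfold(sample)]
print(records)
```

A plain line-by-line `startswith` approach would misread both the folded `dn` and the encoded `mail` value, which is the main reason to prefer a real LDIF parser if your export contains them.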

Edit

So, to extract the data from a live LDAP directory using the python-ldap library, you would want to do something like this:

    import csv
    import ldap

    con = ldap.initialize('ldap://server.fqdn.system.edu')
    # if your LDAP directory requires authentication
    # con.bind_s(username, password)

    try:
        with open('output_file', 'w', newline='') as output:
            csv_writer = csv.writer(output)
            csv_writer.writerow(['LoginId', 'Mail'])

            for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system',
                                          ldap.SCOPE_SUBTREE,
                                          attrlist=['LoginId', 'mail']):
                # Each attribute maps to a list of values (bytes under Python 3)
                csv_writer.writerow([attrs['LoginId'][0].decode('utf-8'),
                                     attrs['mail'][0].decode('utf-8')])
    finally:
        # even if you don't have credentials, it's usually good to unbind
        con.unbind_s()

It is probably worth reading through the documentation for the ldap module, especially the example.

Note that in the example above I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the WHERE clause in an SQL statement; it restricts which objects are returned. Microsoft actually has a good guide to LDAP filters. The canonical reference for LDAP filters is RFC 4515.
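As a rough illustration of what such a filter could look like, the sketch below builds a filter string that matches only entries having a LoginId and a mail address in a given domain. The function names and the choice of filter are assumptions made for this example; python-ldap also ships `ldap.filter.escape_filter_chars` for the escaping step, which is reimplemented here only to keep the sketch free of dependencies.

```python
def escape_filter_value(value):
    """Escape the characters that RFC 4515 treats as special inside a filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)

def students_with_mail_filter(domain):
    """Hypothetical helper: entries that have a LoginId and a mail address in `domain`."""
    return "(&(LoginId=*)(mail=*@%s))" % escape_filter_value(domain)

flt = students_with_mail_filter("system.edu")
print(flt)

# With python-ldap, the string would be passed as the filterstr argument, e.g.:
# con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE,
#              filterstr=flt, attrlist=['LoginId', 'mail'])
```

Filtering on the server this way also means entries without a mail attribute never reach the CSV-writing loop.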

Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, although using it would again make the example more complex. Hopefully this is enough to get you started, but if anything comes up, feel free to ask or open a new question.

Good luck.

+9

Assuming that the structure of each record will always be the same, just do something like this:

    import csv

    # Open the input file and create the output file; "with" closes both when done
    with open("/path/to/large.file", "r") as f, \
            open("/desired/path/to/final/file", "w", newline="") as output_file:
        # Use the CSV module to make use of existing functionality.
        final_file = csv.writer(output_file)

        # Write the header row - can be skipped if headers not needed.
        final_file.writerow(["LoginID", "EmailAddress"])

        # Set up our temporary cache for a user
        current_user = []

        # Iterate over the large file
        # Note that we are avoiding loading the entire file into memory
        for line in f:
            # The attribute is spelled "LoginId" in the sample; "LoginId: " is 9 characters
            if line.startswith("LoginId"):
                current_user.append(line[9:].strip())
            # If more information is desired, simply add it to the conditions here
            # (additional elif should do)
            # and add it to the current user.

            elif line.startswith("mail"):
                current_user.append(line[6:].strip())
                # Once you know you have reached the end of a user entry
                # write the row to the final file
                # and clear your temporary list.
                final_file.writerow(current_user)
                current_user = []

            # Skip lines that aren't interesting.
            else:
                continue

+5

Again, assuming your file is well-formed:

    # inputfilename and outputfilename are paths that you supply
    with open(inputfilename) as inputfile, open(outputfilename, 'w') as outputfile:
        mail = loginid = ''
        for line in inputfile:
            # Split on the first colon only, so values containing ':' stay intact
            fields = line.split(':', 1)
            if fields[0] not in ('LoginId', 'mail'):
                continue
            if fields[0] == 'LoginId':
                loginid = fields[1].strip()
            if fields[0] == 'mail':
                mail = fields[1].strip()
            if mail and loginid:
                outputfile.write(loginid + ',' + mail + '\n')
                mail = loginid = ''

Essentially equivalent to other methods.
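One weakness shared by these line-based approaches is that a record with a missing field can leak its LoginId into the next record. A possible variation, sketched here under the assumption that every record starts with a `dn:` line as in the sample, resets the state at each `dn:` and yields pairs from a generator so the file is still read one line at a time. The function name `iter_users` and the middle record in the sample (which has no mail) are made up for illustration.

```python
import csv
import io

def iter_users(lines):
    """Yield (login_id, mail) pairs, one per complete record, reading line by line."""
    login_id = mail = None
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "dn":
            # A new record starts: forget anything left over from an incomplete one
            login_id = mail = None
        elif key == "LoginId":
            login_id = value
        elif key == "mail":
            mail = value
        if login_id and mail:
            yield login_id, mail
            login_id = mail = None

# In-memory stand-in for the large file; the second record is hypothetical and has no mail
sample = io.StringIO(
    "dn: uid=123456789012345,ou=Students,o=system.edu,o=system\n"
    "LoginId: 0099886\n"
    "mail: fflintstone@system.edu\n"
    "dn: uid=111111111111111,ou=Students,o=system.edu,o=system\n"
    "LoginId: 0000001\n"
    "dn: uid=543210987654321,ou=Students,o=system.edu,o=system\n"
    "LoginId: 0083156\n"
    "mail: brubble@system.edu\n"
)
out = io.StringIO()
csv.writer(out).writerows(iter_users(sample))
print(out.getvalue())
```

For the real file, `sample` would be replaced by the open input file and `out` by an output file opened with `newline=""`; the incomplete record is simply skipped.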

+1

To open the file, you'll want to use the with statement to ensure that it is closed properly, even if something goes wrong:

    with open(<your_file>, "r") as f:
        # Do stuff

As for the actual parsing of this information, I would recommend creating a dictionary of ID/email pairs. You will also need a variable for the uid and one for the email.

    data = {}
    uid = 0
    email = ""

To actually parse the file (this code runs while the file is open), you can do something like this:

    for line in f:
        if "uid=" in line:
            # Parse the user id out by grabbing the substring between the first = and ,
            uid = line[line.find("=")+1:line.find(",")]
        elif "mail:" in line:
            # Parse the email out by grabbing everything after the ": " (strip removes the newline character)
            email = line[line.find(": ")+2:].strip()
            # Given the formatting you've provided, this comes second so we can make an entry into the dict here
            data[uid] = email

Using the csv writer (do not forget to import csv at the beginning of the file), we can write the output like this:

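    # csv.writer needs an open file object, not a filename
    with open(<filename>, "w", newline="") as out:
        writer = csv.writer(out)
        # writerow takes a list with one item per column
        writer.writerow(["User", "Email"])
        for uid, mail in data.items():
            writer.writerow([uid, mail])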

Another option is to open the writer before reading the file, write the header, and then write each row to the CSV as the lines are read from the file. This avoids holding all the information in memory, which can be very desirable. So, putting all this together, we get:

    import csv

    with open(<filename>, "w", newline="") as out, open(<your_file>, "r") as f:
        writer = csv.writer(out)
        writer.writerow(["User", "Email"])
        for line in f:
            if "uid=" in line:
                # Parse the user id out by grabbing the substring between the first = and ,
                uid = line[line.find("=")+1:line.find(",")]
            elif "mail:" in line:
                # Parse the email out by grabbing everything after the ": " (strip removes the newline character)
                email = line[line.find(": ")+2:].strip()
                # Given the formatting you've provided, this comes second so we can write the row here
                writer.writerow([uid, email])

0
