Parsing EDGAR documents

I would like to use Python 2.7 to strip everything that is not document text from EDGAR filings (which are available online as .txt files). An example of what the files look like:

Example

EDGAR provides its document type definitions starting on page 48 of this file:

DTD

The first part of my program downloads the .txt file from the EDGAR online database into a local file, which I named "parseme.txt". I would like to know how to use the DTD to parse the .txt file. I would use an off-the-shelf parsing module such as BeautifulSoup, but the EDGAR format seems unique, and I hope to avoid writing a lot of regular expressions to get the job done.

    import os

    filename = 'parseme.txt'
    with open(filename) as f:
        lines = f.readlines()

My question is related to "Parse SGML with open arbitrary labels in Python 3" and "Use lxml to parse a text file with a bad header in Python", but I think it is different, since my question is about Python 2.7 and I'm not interested in the header. I'm only interested in the text of the file.

3 answers

Check out the OpenSP toolkit, which has programs for handling SGML files. Probably your easiest option is to use the osx program to get an XML version of the input file, after which you can use XML processing tools.
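As a rough illustration of the second step, assuming osx has already produced an XML version, the standard library's xml.etree.ElementTree can pull out the character data. The sample markup, the TEXT element name, and the extract_text helper are placeholders for illustration. The real element names depend on the EDGAR DTD and on how osx is run.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML that osx would produce; the real
# element names depend on the EDGAR DTD and on osx's options.
SAMPLE_XML = (
    "<SUBMISSION><DOCUMENT><TYPE>10-K</TYPE>"
    "<TEXT>Annual report body.</TEXT></DOCUMENT></SUBMISSION>"
)

def extract_text(xml_string, tag="TEXT"):
    """Return the text content of every element named `tag`."""
    root = ET.fromstring(xml_string)
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]

print(extract_text(SAMPLE_XML))
```

This uses only calls that exist in both Python 2.7 and Python 3, so it should fit the Python 2.7 constraint in the question.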

Some tweaking may be needed at first, since the OpenSP package does not come with the EDGAR DTD or its SGML declaration (the SGML declaration is the first part of the material at page 48 of your link, starting with <!SGML "ISO 8879-1986" ). You will have to get both as text files and add them to the directories where the SP parser can find them.

UPDATE: This document seems to be a more recent version. However, a casual Google search does not turn up any version that is immediately machine-processable. You may need to copy and paste from a PDF.

If you do this, some extraneous formatting will come along that you will need to remove: there appear to be page-break markers labelled "C-1", "C-2", and so on. They are not part of the SGML and must be removed.
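A minimal sketch of that cleanup, assuming each marker sits alone on its own line (which may not hold for every PDF copy-paste). The strip_page_markers name and the sample text are made up for illustration.

```python
import re

# Assumption: each page-break marker sits alone on its own line and
# looks like "C-1", "C-2", ... (optionally surrounded by whitespace).
PAGE_MARKER = re.compile(r"^\s*C-\d+\s*$")

def strip_page_markers(text):
    """Drop lines that consist solely of a page marker such as C-12."""
    kept = [line for line in text.splitlines() if not PAGE_MARKER.match(line)]
    return "\n".join(kept)

sample = "<!ELEMENT a - - (b)>\nC-1\n<!ELEMENT b - - (#PCDATA)>\n  C-2  "
print(strip_page_markers(sample))
```

A marker that ended up in the middle of a line would not be caught by this pattern and would still need to be removed by hand.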

You can either add the SGML declaration and the EDGAR DTD to the directory (in which case the DTD file should contain only the part inside the brackets, that is, what follows <!DOCTYPE submission [ up to the matching ] at the end), or you can create a "prolog" file consisting of both parts together, as is (i.e. including <!DOCTYPE submission [ and ]> ), and run whichever program from the toolkit you need on the prolog and your SGML file. That is, put both names on the command line with the prolog file first, so that the parser reads both files in the correct order.

To understand what is happening, you need to know that an SGML parser needs three pieces of information in order to parse: an SGML declaration, which sets some environment and processing parameters; then a DTD, which describes the structural constraints on the document; and finally the document itself.
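The prolog approach could be sketched roughly as follows. The helper names and file names are hypothetical, build_prolog assumes you have the inner part of the DTD (without the <!DOCTYPE submission [ ... ]> wrapper) and adds the wrapper itself, and convert assumes osx is installed and on the PATH.

```python
import subprocess

def build_prolog(sgml_declaration, dtd_body, doctype="submission"):
    """Join the SGML declaration and the DTD, wrapped in a DOCTYPE, into one prolog."""
    return "%s\n<!DOCTYPE %s [\n%s\n]>\n" % (
        sgml_declaration.strip(), doctype, dtd_body.strip())

def osx_command(prolog_path, document_path):
    """Command line with the prolog first, so the parser reads it before the document."""
    return ["osx", prolog_path, document_path]

def convert(prolog_path, document_path, xml_path):
    """Run osx (assumed to be installed and on PATH) and save its stdout as XML."""
    with open(xml_path, "wb") as out:
        return subprocess.call(osx_command(prolog_path, document_path), stdout=out)
```

For example, after writing the result of build_prolog to a file such as prolog.sgml, calling convert("prolog.sgml", "parseme.txt", "parseme.xml") would hand both files to osx in the order described above.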


The pysec project looks promising. It is a basic Django application that downloads the EDGAR index, and then lets you download specific filings and extract financial parameters from XBRL.


Below is a library that parses EDGAR filings into a SQLite DB. It contains functionality for pulling Form10k and Form8Qk filings from the EDGAR FTP site for the years you specify and loading them in a normalized format into SQLite DB tables. Given how poorly the filing standard is followed, writing your own parsing script would be a significant undertaking. With this library and code similar to the example below, you download the filings for the requested quarters, and from there you can simply query a table for the data you are looking for.

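    import edgar  # the library linked below

    edgar.database.create()

    # Load quarterly master index files into the local SQLite DB.
    # Assumption: load() accepts a list of (year, quarter) tuples.
    quarters = []
    quarters.append((2009, 3))  # Q3 2009
    quarters.append((2008, 3))  # Q3 2008
    edgar.database.load(quarters)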

http://rf-contrib.googlecode.com/svn/trunk/ha/src/main/python/edgar/

