Pythonic way to extract values from this text file

I have an output file from an old piece of software, which is shown below. I want to extract values from it so that, for example, I can set a variable named direct_solar_irradiance to 648.957 and target ground pressure to 1013.00.

So far I have been extracting individual lines and processing them as shown below (repeated many times for the different values that I want to extract):

    values = lines[97].split()
    self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values

However, I have now found that extra lines are added to the middle of the output when certain parameters are selected. This means, of course, that line 97 will no longer hold the values that I need.

Is there a good Pythonic way to extract these values, given that additional lines may be added under certain circumstances? I think I need to look for known snippets of text in the file and then extract the numbers they refer to, but the only ways I can think of are very awkward.

So:

  • Is there a good Pythonic way of finding these strings and retrieving the values I want?

  • If not, is there any other way to reasonably do this? (for example, some cool library for analyzing text files that I don’t know anything about).

    ******************************* 6sV version 1.0B ******************************
    * *
    * geometrical conditions identity *
    * ------------------------------- *
    * user defined conditions *
    * *
    * month: 14 day : 1 *
    * solar zenith angle: 10.00 deg solar azimuthal angle: 20.00 deg *
    * view zenith angle: 30.00 deg view azimuthal angle: 40.00 deg *
    * scattering angle: 159.14 deg azimuthal angle difference: 20.00 deg *
    * *
    * atmospheric model description *
    * ----------------------------- *
    * atmospheric model identity : *
    * midlatitude summer (uh2o=2.93g/cm2,uo3=.319cm-atm) *
    * aerosols type identity : *
    * Maritime aerosol model *
    * optical condition identity : *
    * visibility : 8.49 km opt. thick. 550 nm : 0.5000 *
    * *
    * spectral condition *
    * ------------------ *
    * monochromatic calculation at wl 0.400 micron *
    * *
    * Surface polarization parameters *
    * ---------------------------------- *
    * *
    * *
    * Surface Polarization Q,U,Rop,Chi 0.00000 0.00000 0.00000 0.00 *
    * *
    * *
    * target type *
    * ----------- *
    * homogeneous ground *
    * monochromatic reflectance 1.000 *
    * *
    * target elevation description *
    * ---------------------------- *
    * ground pressure [mb] 1013.00 *
    * ground altitude [km] 0.000 *
    * *
    * plane simulation description *
    * ---------------------------- *
    * plane pressure [mb] 1013.00 *
    * plane altitude absolute [km] 0.000 *
    * atmosphere under plane description: *
    * ozone content 0.000 *
    * h2o content 0.000 *
    * aerosol opt. thick. 550nm 0.000 *
    * *
    * atmospheric correction activated *
    * -------------------------------- *
    * BRDF coupling correction *
    * input apparent reflectance : 0.500 *
    * *
    *******************************************************************************
    *******************************************************************************
    * *
    * integrated values of : *
    * -------------------- *
    * *
    * apparent reflectance 1.1287696 appar. rad.(w/m2/sr/mic) 588.646 *
    * total gaseous transmittance 1.000 *
    * *
    *******************************************************************************
    * *
    * coupling aerosol -wv : *
    * -------------------- *
    * wv above aerosol : 1.129 wv mixed with aerosol : 1.129 *
    * wv under aerosol : 1.129 *
    *******************************************************************************
    * *
    * integrated values of : *
    * -------------------- *
    * *
    * app. polarized refl. 0.0000 app. pol. rad. (w/m2/sr/mic) 0.000 *
    * direction of the plane of polarization 0.00 *
    * total polarization ratio 0.000 *
    * *
    *******************************************************************************
    * *
    * int. normalized values of : *
    * --------------------------- *
    * % of irradiance at ground level *
    * % of direct irr. % of diffuse irr. % of enviro. irr *
    * 0.351 0.354 0.295 *
    * reflectance at satellite level *
    * atm. intrin. ref. background ref. pixel reflectance *
    * 0.000 0.000 1.129 *
    * *
    * int. absolute values of *
    * ----------------------- *
    * irr. at ground level (w/m2/mic) *
    * direct solar irr. atm. diffuse irr. environment irr *
    * 648.957 655.412 544.918 *
    * rad at satel. level (w/m2/sr/mic) *
    * atm. intrin. rad. background rad. pixel radiance *
    * 0.000 0.000 588.646 *
    * *
    * *
    * sol. spect (in w/m2/mic) *
    * 1663.594 *
    * *
    *******************************************************************************
    *******************************************************************************
    * *
    * integrated values of : *
    * -------------------- *
    * *
    * downward upward total *
    * global gas. trans. : 1.00000 1.00000 1.00000 *
    * water " " : 1.00000 1.00000 1.00000 *
    * ozone " " : 1.00000 1.00000 1.00000 *
    * co2 " " : 1.00000 1.00000 1.00000 *
    * oxyg " " : 1.00000 1.00000 1.00000 *
    * no2 " " : 1.00000 1.00000 1.00000 *
    * ch4 " " : 1.00000 1.00000 1.00000 *
    * co " " : 1.00000 1.00000 1.00000 *
    * *
    * *
    * rayl. sca. trans. : 0.84422 1.00000 0.84422 *
    * aeros. sca. " : 0.94572 1.00000 0.94572 *
    * total sca. " : 0.79616 1.00000 0.79616 *
    * *
    * *
    * *
    * rayleigh aerosols total *
    * *
    * spherical albedo : 0.23410 0.12354 0.29466 *
    * optical depth total: 0.36193 0.55006 0.91199 *
    * optical depth plane: 0.00000 0.00000 0.00000 *
    * reflectance I : 0.00000 0.00000 0.00000 *
    * reflectance Q : 0.00000 0.00000 0.00000 *
    * reflectance U : 0.00000 0.00000 0.00000 *
    * polarized reflect. : 0.00000 0.00000 0.00000 *
    * degree of polar. : nan 0.00 nan *
    * dir. plane polar. : -45.00 -45.00 -45.00 *
    * phase function I : 1.38819 0.27621 0.71751 *
    * phase function Q : -0.09117 -0.00856 -0.04134 *
    * phase function U : -1.34383 0.02142 -0.52039 *
    * primary deg. of pol: -0.06567 -0.03099 -0.05762 *
    * sing. scat. albedo : 1.00000 0.98774 0.99261 *
    * *
    * *
    *******************************************************************************
    *******************************************************************************
    *******************************************************************************
    * atmospheric correction result *
    * ----------------------------- *
    * input apparent reflectance : 0.500 *
    * measured radiance [w/m2/sr/mic] : 260.747 *
    * atmospherically corrected reflectance *
    * Lambertian case : 0.52995 *
    * BRDF case : 0.52995 *
    * coefficients xa xb xc : 0.00241 0.00000 0.29466 *
    * y=xa*(measured radiance)-xb; acr=y/(1.+xc*y) *

+4
5 answers

You can create your own mini-language to automate the extraction. I did the following to automate the analysis of the output of a proprietary program:

    # Tokens are matched in the order written here.
    tokens = ["num_ref_frames", "Max QP", "Min QP", "Avg QP", "I4x4",
              "I16x16", "SkipZero", "SkipMV", "16x16", "16x8", "8x16",
              "8x8", "8x4", "4x8", "4x4"]
    special = ["Quarterpel MVs"]

    # This dictionary (hash table) maps each search string from the tokens
    # list to a list whose first element is the index of the field to extract
    # when building the result matrix (0 = 1st field, 1 = 2nd field,
    # 2 = 3rd field, etc.).
    # Named field_index rather than dict to avoid shadowing the built-in.
    field_index = {tokens[0]: [1], tokens[1]: [1], tokens[2]: [1],
                   tokens[3]: [1], tokens[4]: [2], tokens[5]: [2],
                   tokens[6]: [2], tokens[7]: [2], tokens[8]: [2],
                   tokens[9]: [2], tokens[10]: [2], tokens[11]: [2],
                   tokens[12]: [2], tokens[13]: [2], tokens[14]: [2]}

Then I just looped over the input and checked each line for the tokens; if a match is found, I split the line and use the dictionary entry to pick out the correct field.

special was handled in the same way, except that it is a special variable whose value has to be read from several lines.
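
A minimal sketch of that loop, applied to the 6S output from the question rather than to the original proprietary format. The `FIELDS` table, the labels in it, and `extract_fields` are hypothetical names chosen for illustration, and the sketch assumes that each label and its values sit on the same line.

```python
# Hypothetical table: label to search for -> index of the wanted field
# among the whitespace-separated fields that follow the label.
FIELDS = {
    "ground pressure [mb]": 0,
    "ground altitude [km]": 0,
    "coefficients xa xb xc :": 2,
}


def extract_fields(lines, fields=FIELDS):
    """Return {label: float} for every label found in `lines`."""
    found = {}
    for line in lines:
        # Collapse runs of whitespace so column alignment does not matter.
        normalized = " ".join(line.split())
        for label, index in fields.items():
            pos = normalized.find(label)
            if pos == -1:
                continue
            # Keep what follows the label, drop the '*' frame, then split.
            rest = normalized[pos + len(label):].strip("* ").split()
            found[label] = float(rest[index])
    return found


sample = [
    "*   ground pressure  [mb]        1013.00              *",
    "*   Lambertian case :      0.52995                    *",
    "*   coefficients xa xb xc  :  0.00241 0.00000 0.29466 *",
]
values = extract_fields(sample)
print(values)
```

Because the lookup is by label rather than by line number, extra lines inserted elsewhere in the output do not affect it. Labels that are not present (here, the ground altitude) are simply missing from the result.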

Update

Clone git://gist.github.com/1037403.git to get a copy of the code.

 usage: ./parser.py all_dec.txt 

Hope this helps!

+2

A more complete, possibly more robust solution would require either a parser built on a custom grammar (pyparsing) or some kind of FSM processor (TextFSM).

Both options, however, would be nontrivial to use with this output. A (possibly) easier solution would be to identify each line by a known label and then extract the values accordingly (as suggested by other posters).

There are several ways to implement this. I would suggest mapping each known line label to an "extractor" function, then iterating over the lines and calling the associated extractor. Each extractor can take the lines and a context object/dict as arguments and add attributes to the context as needed. Something like https://gist.github.com/1035938
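
A minimal sketch of the label-to-extractor idea, written independently of the linked gist (whose contents are not reproduced here). All names (`EXTRACTORS`, `parse`, the extractor functions, and the context keys) are hypothetical, and the sample lines only imitate the layout of the 6S output.

```python
def _numbers(line):
    """Return the floats found on a line, ignoring the '*' frame and words."""
    values = []
    for word in line.strip("* \n").split():
        try:
            values.append(float(word))
        except ValueError:
            pass  # not a number, e.g. part of a label
    return values


def extract_ground_pressure(lines, n, context):
    # The value sits on the same line as the label.
    context["ground_pressure"] = _numbers(lines[n])[-1]


def extract_irradiance(lines, n, context):
    # The three values sit on the line after the label.
    direct, diffuse, env = _numbers(lines[n + 1])
    context["irradiance_direct"] = direct
    context["irradiance_diffuse"] = diffuse
    context["irradiance_env"] = env


# Known label -> extractor to call when a line contains that label.
EXTRACTORS = {
    "ground pressure": extract_ground_pressure,
    "direct solar irr.": extract_irradiance,
}


def parse(lines):
    context = {}
    for n, line in enumerate(lines):
        normalized = " ".join(line.split())  # collapse runs of whitespace
        for label, extractor in EXTRACTORS.items():
            if label in normalized:
                extractor(lines, n, context)
    return context


sample = [
    "*   ground pressure  [mb]     1013.00                           *",
    "*   direct solar irr.    atm. diffuse irr.    environment  irr  *",
    "*       648.957              655.412             544.918        *",
]
context = parse(sample)
print(context)
```

Supporting a new value means adding one small function and one dictionary entry, and each extractor decides for itself whether to read the labelled line or a neighbouring one.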

+3

Well, if you need a general-purpose parsing library, there is pyparsing, but in this case it is probably overkill.

This appears to be a fairly regular plain-text file that is not very large, so it is probably best to loop over the lines looking for the text that identifies what you are after.

So something like:

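    # Read all lines into a list so they can be indexed
    # (a file object cannot be subscripted).
    with open('file.txt', 'r') as f:
        lines = f.readlines()

    for n, line in enumerate(lines):
        if 'direct solar irr. atm. diffuse irr. environment irr' in line:
            # The values are on the line after the label; strip the '*' frame.
            values = lines[n + 1].strip('* \n').split()
            self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values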

Then you can add further if statements, etc. to get the other data. If you have a lot of data, though, you will probably want to generalize the code a bit (probably a dictionary with the text to match as the key and, as the value, a function to call when the key matches).

You can also use a regex to match the label so that you can better handle varying amounts of whitespace. Otherwise, just one space too many or too few will throw it off.
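
A minimal sketch of the regex variant, assuming the label and its three values appear on consecutive lines as in the question's output. The names `NUMBER`, `LABEL`, `IRRADIANCE`, and `find_irradiance` are hypothetical.

```python
import re

# A plain decimal number such as 648.957 (assumed format of the values).
NUMBER = r"([-+]?\d+\.\d+)"

# \s+ accepts any run of whitespace between the words of the label.
LABEL = r"direct\s+solar\s+irr\.\s+atm\.\s+diffuse\s+irr\.\s+environment\s+irr"

# After the label: skip whitespace, '*' frame characters and the newline,
# then capture the three numbers on the following line.
IRRADIANCE = re.compile(
    LABEL + r"[\s*]+" + NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER
)


def find_irradiance(text):
    """Return (direct, diffuse, env) as floats, or None if not found."""
    match = IRRADIANCE.search(text)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


sample = (
    "*   direct solar irr.    atm. diffuse irr.    environment  irr  *\n"
    "*       648.957              655.412             544.918        *\n"
)
result = find_irradiance(sample)
print(result)
```

Searching the whole file contents as one string means the pattern can span the line break, so no line counting is needed at all.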

+1

The best way, IMHO, is to mmap the file and then use a regex to find what you are looking for.

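    import mmap
    import re

    with open('file.txt', 'rb') as f:
        # Map the whole file read-only; the result behaves like a bytes object.
        text = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        match = re.search(pattern, text)  # pattern must be a bytes pattern, e.g. rb'...'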

The mmap module exposes the file as a string-like (bytes) object, so you can perform almost any operation on it that you would perform on a string. And a regex is the best way to search for something. Simple and efficient.

0

If you only need to find specific strings, just treat the whole file as one string and run a few regular expressions over it to dig out your gems.

If you need to extract more data, I believe that with a little work you can create a good parser for your data. As a start, I would use the following functions:

    def extract_screens(text):
        """
        Returns a list of screens (divided by asterisks).
        Each screen is a list of strings stripped of asterisks.
        """
        ...

    def process_screen(screen):
        """
        Returns a list of screen divisions as tuples: [(heading, body)...]
        heading is a string, body is a list of strings
        blank lines are filtered out.
        """
        ...

You should now have an indexed list of text snippets. You can loop over them and apply a simple, specialized parser method to each section.
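
One possible way to fill in those two functions, as a sketch only. It assumes that a screen divider is any line starting with at least two asterisks and that a heading is a line underlined by a row of dashes; neither rule is stated in the answer, and lines that appear before the first heading of a screen are dropped.

```python
def extract_screens(text):
    """Split text into screens at divider lines (rows of asterisks)."""
    screens, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("**"):  # divider row or banner (assumed rule)
            if current:
                screens.append(current)
            current = []
        elif line:
            # Remove the '*' frame and the padding inside it.
            current.append(line.strip("*").strip())
    if current:
        screens.append(current)
    return screens


def process_screen(screen):
    """Return [(heading, body), ...] for one screen."""
    lines = [line for line in screen if line]  # filter out blank lines
    sections = []
    for i, line in enumerate(lines):
        if set(line) == {"-"}:
            continue  # the underline row itself
        is_heading = i + 1 < len(lines) and set(lines[i + 1]) == {"-"}
        if is_heading:
            sections.append((line, []))
        elif sections:
            sections[-1][1].append(line)
    return sections


sample = """\
*******************************************
*                                         *
*   target elevation description          *
*   ----------------------------          *
*   ground pressure  [mb]     1013.00     *
*   ground altitude  [km]        0.000    *
*                                         *
*******************************************
*   atmospheric correction result         *
*   -----------------------------         *
*   input apparent reflectance :  0.500   *
"""
screens = extract_screens(sample)
for screen in screens:
    for heading, body in process_screen(screen):
        print(heading, body)
```

With the sections keyed by heading, a per-section parser only has to deal with the handful of lines in its own body, which also makes each one easy to unit test.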

Tip: use unit tests to keep yourself sane.

0