Extract a specific section from a LaTeX file using Python

I have a set of LaTeX files. I would like to extract the "abstract" section from each of them:

\begin{abstract} ..... \end{abstract} 

I tried the suggestion here: How to parse LaTex file

And tried:

 A = re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data) 

Here data contains the text of a LaTeX file, but A is just an empty list. Any help would be greatly appreciated!

2 answers

.* does not match newlines unless the re.S flag is specified:

 re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data, re.S) 

Example

Consider this test file:

    \documentclass{report}
    \usepackage[margin=1in]{geometry}
    \usepackage{longtable}
    \begin{document}
    Title maybe
    \begin{abstract}
    Good stuff
    \end{abstract}
    Other stuff
    \end{document}

This gets the abstract:

    >>> import re
    >>> data = open('a.tex').read()
    >>> re.findall(r'\\begin{abstract}(.*?)\\end{abstract}', data, re.S)
    ['\nGood stuff\n']
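Since the question mentions a set of files, the same pattern could be wrapped in a small helper. This is only a sketch: the names ABSTRACT_RE, extract_abstract and sample are made up for illustration, and it assumes each file has at most one abstract environment that should be returned without surrounding whitespace.

```python
import re

# Hypothetical helper, not part of the original answer.
# re.DOTALL lets '.' match newlines, so the abstract may span several lines.
ABSTRACT_RE = re.compile(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', re.DOTALL)


def extract_abstract(tex_source):
    """Return the text of the first abstract environment, or None if there is none."""
    match = ABSTRACT_RE.search(tex_source)
    return match.group(1).strip() if match else None


# Inline stand-in for the contents of a .tex file (assumed sample data).
sample = (
    "\\begin{document}\n"
    "Title maybe\n"
    "\\begin{abstract}\n"
    "Good stuff\n"
    "\\end{abstract}\n"
    "Other stuff\n"
    "\\end{document}\n"
)

print(extract_abstract(sample))
```

For several files, the helper could be called on the contents of each one, for example by reading every path returned by pathlib.Path('.').glob('*.tex') with read_text().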

Documentation

From the documentation of the re module:

re.S
re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
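The effect of the flag can be checked side by side. The snippet below is a small sketch using an inline sample string (the variable names are illustrative, not from the question). It also shows the inline (?s) form, which should behave the same as passing re.DOTALL when placed at the start of the pattern.

```python
import re

# Assumed sample data: an abstract that spans several lines.
data = "\\begin{abstract}\nGood stuff\n\\end{abstract}"
pattern = r'\\begin\{abstract\}(.*?)\\end\{abstract\}'

# Without the flag, '.' stops at the first newline, so nothing matches.
without_flag = re.findall(pattern, data)

# With re.DOTALL (alias re.S), '.' also matches newlines.
with_flag = re.findall(pattern, data, re.DOTALL)

# The inline flag (?s) at the start of the pattern turns on the same behaviour.
inline_flag = re.findall(r'(?s)' + pattern, data)

print(without_flag)  # []
print(with_flag)     # ['\nGood stuff\n']
print(inline_flag)   # ['\nGood stuff\n']
```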


By default, . does not match the newline character. However, you can pass the re.DOTALL flag to make it match newlines as well.

Example:

    import re

    s = r"""\begin{abstract}
    this is a test of the
    linebreak capture.
    \end{abstract}"""

    pattern = r'\\begin\{abstract\}(.*?)\\end\{abstract\}'
    re.findall(pattern, s, re.DOTALL)
    # output: ['\nthis is a test of the\nlinebreak capture.\n']