Python: find the download link for a file on a web page

I need a regular expression that returns the text contained between double quotes, where that text starts with a specified prefix and ends with a specific file extension (e.g. .txt). I use urllib2 to fetch the HTML page (the HTML is pretty simple).

Basically, if I have something like

<tr> <td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td> <td><a href="Client-8.txt">new_Client-8.txt</a></td> <td align="right">27-Jun-2012 18:02 </td> </tr> 

it should return just

 Client-8.txt 

That is, the value to return is the one enclosed in double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I have been playing with re.search(regex, string), where the input string is the HTML of the page, but I am bad at regular expressions.

Thanks!

2 answers

You should not use regular expressions for this task. It is much easier to write a script with BeautifulSoup to process the HTML code and find the element you need.

In your case, you need to look for all the <a> elements whose href attribute starts with Client- and ends with .txt. This will give you a list of all the files.
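The filter described above can also be sketched without installing anything, using the standard library's html.parser in place of BeautifulSoup. This is only a minimal illustration: the class name LinkCollector and its prefix/suffix parameters are assumptions made for this example, and the sample markup is the row from the question. The code is written for Python 3.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a prefix and a suffix."""

    def __init__(self, prefix, suffix):
        super().__init__()
        self.prefix = prefix  # hypothetical parameter, e.g. "Client-"
        self.suffix = suffix  # hypothetical parameter, e.g. ".txt"
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href.endswith(self.suffix):
            self.links.append(href)


# Sample row taken from the question
page = (
    '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td>'
    '<td><a href="Client-8.txt">new_Client-8.txt</a></td>'
    '<td align="right">27-Jun-2012 18:02 </td></tr>'
)

collector = LinkCollector("Client-", ".txt")
collector.feed(page)
print(collector.links)  # ['Client-8.txt']
```

Because the check runs on the parsed href attribute rather than on the raw text, the link text new_Client-8.txt is never considered a candidate.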

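    from bs4 import BeautifulSoup

    html = '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02 </td></tr>'
    soup = BeautifulSoup(html, 'html.parser')

    # Print the href of every <a> tag whose target starts with "Client-" and ends with ".txt"
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('Client-') and href.endswith('.txt'):
            print(href)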
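For completeness, the regular expression the question originally asked for could look roughly like the sketch below. It is only reasonable because the page is as simple as the question says; on arbitrary HTML a parser is the safer choice. The names pattern and html are illustrative assumptions, and the code is written for Python 3.

```python
import re

# Match a double-quoted value that starts with "Client-" and ends with ".txt".
# [^"]* keeps the match inside a single pair of quotes.
pattern = re.compile(r'"(Client-[^"]*\.txt)"')

# Fragment of the row from the question
html = '<td><a href="Client-8.txt">new_Client-8.txt</a></td>'

match = pattern.search(html)
if match:
    print(match.group(1))  # Client-8.txt

# findall returns every quoted match on the page
print(pattern.findall(html))  # ['Client-8.txt']
```

The link text new_Client-8.txt is not matched because it is not enclosed in double quotes.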