Python find link to download file on web page

Question


I need a regular expression that returns the text between double quotes when that text starts with a specified prefix and ends with a specific file extension (e.g. .txt). I use urllib2 to fetch the HTML page (the HTML is pretty simple).

Basically, if I have something like

<tr> <td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td> <td><a href="Client-8.txt">new_Client-8.txt</a></td> <td align="right">27-Jun-2012 18:02 </td> </tr>

it should return just

 Client-8.txt

The return value is the part contained in double quotes. I know that the file name starts with "Client-" and that the file extension is ".txt".

I have been playing with re.search(regex, string), where the input string is the HTML of the page, but I am bad at regular expressions.

Thanks!

+4

python regex web-scraping urllib2 beautifulsoup

Zacatttt Jun 29 '12 at 20:54


2 answers

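 from bs4 import BeautifulSoup

 html = '<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="Client-8.txt">new_Client-8.txt</a></td><td align="right">27-Jun-2012 18:02 </td>'
 soup = BeautifulSoup(html, 'html.parser')

 # Print the href of every <a> element whose href contains '.txt'
 for link in soup.find_all('a'):
     if '.txt' in link['href']:
         print(link['href'])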

+1

Ashwini chaudhary Jun 29 '12 at 21:05


Simeon visser · Accepted Answer · 2012-06-29T20:56:27+0000

You should not use regular expressions for this task. It is much easier to write a script with BeautifulSoup to process the HTML code and find the element you need.

In your case, you need to look for all the <a> elements whose href attribute starts with Client- and ends with .txt. This will give you a list of all the files.
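As a minimal sketch of that approach, assuming BeautifulSoup 4 is installed as the bs4 package: the function name find_client_files and its prefix and suffix parameters are hypothetical, chosen only for illustration, and the sample HTML is the row from the question.

```python
from bs4 import BeautifulSoup

# Sample row taken from the question.
html = ('<tr> <td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td> '
        '<td><a href="Client-8.txt">new_Client-8.txt</a></td> '
        '<td align="right">27-Jun-2012 18:02 </td> </tr>')


def find_client_files(html, prefix="Client-", suffix=".txt"):
    """Return the href of every <a> whose href starts with prefix and ends with suffix.

    The function name and its parameters are illustrative assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> elements that have no href attribute at all.
    links = soup.find_all("a", href=True)
    return [
        a["href"]
        for a in links
        if a["href"].startswith(prefix) and a["href"].endswith(suffix)
    ]


print(find_client_files(html))  # prints ['Client-8.txt']
```

Plain startswith and endswith checks are used here instead of a regular expression because the prefix and extension are fixed strings. The HTML string fetched with urllib2 could be passed to the function in the same way.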
