Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code below. I am new to BeautifulSoup, but found some useful tutorials on scraping blog RSS feeds with it. Full disclosure: this code is adapted from a video tutorial that was extremely helpful in getting it off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE .

Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment on this so people can quickly spot it; it's the last comment, starting with # Here's where I'm lost ...), but I can't figure it out on my own.

Currently, the program takes the text of the last article it reads and writes it out to however many .txt files the listIterator variable specifies. So, in this case, I believe 20 .txt files get written, but they all contain the text of the last article that was looped over. What I want the program to do is loop over every article and write each article's text out to its own .txt file. Sorry for the verbosity, but any insight would be greatly appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and 
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working. 
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the 
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to 
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost; I have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()
1 answer

As you noted, the problem is that the same text gets written to all 20 files. Take a look at this part of your code:

for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
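
To make the failure mode concrete, here is a stripped-down sketch of the same nested-loop pattern, with made-up strings standing in for the article text:

articles = ['first', 'second', 'third']    # stand-ins for your article texts
for text in articles:                      # outer loop: one pass per article
    for n in range(len(articles)):         # inner loop runs on EVERY pass...
        with open(str(n) + '.txt', 'w') as f:  # ...and 'w' truncates the file
            f.write(text)
# Afterwards 0.txt, 1.txt and 2.txt all contain 'third':
# only the final pass of the outer loop survives.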

The same thing happens in your script: on every pass of the outer loop, the inner loop rewrites all 20 files with the current para_string, so after the last article they all hold the same text. Instead, append each article's para_string to a list, say paraStringList, and write the files in a single pass afterwards:

for i, var in enumerate(paraStringList):  # Enumerate creates a tuple
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
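
enumerate pairs each item with its index, which is what numbers the files 0.txt, 1.txt, and so on, and the with statement closes each file automatically, even if the write raises. For reference:

>>> list(enumerate(['first', 'second']))
[(0, 'first'), (1, 'second')]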

The key point is that this write loop sits outside the main loop, i.e. after the whole for i in listIterator:(...) block. The full corrected script:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
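
Since you also want the files saved to a particular directory (your desktop), one option, sketched here assuming a hypothetical ~/Desktop/articles folder, is to replace that final loop with paths built by os.path.join:

import os

# Hypothetical output folder; point this wherever you like.
out_dir = os.path.expanduser('~/Desktop/articles')
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)  # create the folder on the first run

for i, var in enumerate(paraStringList):
    # Join the folder and the file name so the files land in out_dir
    # instead of the current working directory.
    with open(os.path.join(out_dir, "{0}.txt".format(i)), 'w') as writer:
        writer.write(var)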