Web data scraper (news comments) using Scrapy (Python)

I want to scrape comment data from online news sites, purely for research, and I have realized that I need to learn Scrapy to do it.

I usually program in Python, so I thought it would be easy to pick up, but I'm having problems.

I want to scrape the comments on the news article at http://news.yahoo.com/congress-wary--but-unlikely-to-blow-up-obama-s-iran-deal-230545228.html .

The problem is that the comments only appear after clicking a button (> View Comments (452)). On top of that, I want to scrape all the comments on the article, and I have to click another button (Show comments) each time to load 10 more.

How can I deal with this problem?

The code I made is below. Sorry for the bad code.

#############################################
import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["news.yahoo.com"]

    start_urls = ["http://news.yahoo.com/blogs/oddnews/driver-offended-by-%E2%80%9Cwh0-r8x%E2%80%9D-license-plate-221720503.html"]

    def parse(self, response):
        # Every <p> that is a direct child of a <div>
        sites = response.xpath('//div/p')
        items = []
        for site in sites:
            item = DmozItem()
            # Relative path: an absolute '/text()' starts at the document root and matches nothing
            item['title'] = site.xpath('./text()').extract()
            items.append(item)
        return items

As you can see, I still have a long way to go to solve my problem, and I am short on time. I will do my best anyway.

+4
2 answers

Since you seem to be the try-first, ask-questions-later type (which is a very good thing), I will not give you an answer, but a (very detailed) guide to finding the answer.

To see how Yahoo loads the comments, inspect the page in your browser. Chrome Developer Tools is a good choice, as is Firebug for Firefox.

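The comments are not in the HTML that Scrapy downloads, so they must reach the page some other way. There are two likely possibilities:

  • They are produced by javascript running in the page.
  • They are fetched with an AJAX request (that is, requested from the server after the page has loaded).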

Start with the button itself. In the page source it looks like this:

<span>View Comments (2077)</span>

Open Chrome devtools, watch the network requests, and click the button. Among the requests that devtools records, this one stands out:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1

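See the _xhr and get_comments parts of that URL? That is the request that fetches the comments. The response is JSON (which becomes a dictionary in python). In other words, the request boils down to this: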

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
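
As a sketch of the step above, a captured request URL can be split into its base address and parameters with the standard library, then rebuilt for a different page. The helper names `split_request` and `build_request` are illustrative only, and the values used for the next page (offset=30, pageNumber=3) are an assumption that should be checked against what the browser actually sends. The endpoint is simply the one quoted in this answer and may no longer respond.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL captured from the browser's network panel (quoted from this answer).
CAPTURED_URL = (
    "http://news.yahoo.com/_xhr/contentcomments/get_comments/"
    "?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10"
    "&sortBy=highestRated&isNext=true&offset=20&pageNumber=2"
    "&_media.modules.content_comments.switches._enable_view_others=1"
    "&_media.modules.content_comments.switches._enable_mutecommenter=1"
    "&enable_collapsed_comment=1"
)


def split_request(url):
    """Return (base_url, params) for a URL that carries a query string."""
    parts = urlsplit(url)
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base, dict(parse_qsl(parts.query))


def build_request(base, params, **overrides):
    """Rebuild the URL, replacing any parameters given as keyword arguments."""
    merged = {**params, **{key: str(value) for key, value in overrides.items()}}
    return base + "?" + urlencode(merged)


base, params = split_request(CAPTURED_URL)

# Assumption: the page after offset=20 / pageNumber=2 is offset=30 / pageNumber=3.
next_url = build_request(base, params, offset=30, pageNumber=3)
print(next_url)
```

Keeping the parameters in a dictionary makes it easy to change only the paging values while leaving the switches the site expects untouched.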

A few things to keep in mind:

  • The content_id identifies the article, so you need to find the right value for every article you want to scrape.

  • Each request returns only 10 comments, so you have to repeat it with different paging parameters until you have them all.

  • In devtools you can see that the /get_comments/ request is issued by javascript built on the YUI library.

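  • The response is JSON (a python dictionary once parsed), and the comments inside it are an HTML fragment that you still have to parse (for example with BeautifulSoup).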
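
The last point, a JSON response that wraps an HTML fragment, could be handled roughly as below. This is only a sketch: the JSON key `commentList` and the CSS class `comment-content` are hypothetical names chosen for illustration, so inspect the real response for the actual ones. It also uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages.

```python
import json
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CommentTextParser(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.comments[-1] += data


def extract_comments(raw_json, html_key="commentList", css_class="comment-content"):
    """Parse the JSON, then pull comment text out of the embedded HTML."""
    payload = json.loads(raw_json)
    parser = CommentTextParser(css_class)
    parser.feed(payload[html_key])
    parser.close()
    return [text.strip() for text in parser.comments]


# Made-up sample response, shaped like the hypothetical structure described above.
sample = json.dumps({
    "commentList": '<ul><li><p class="comment-content">First <b>bold</b> one</p></li>'
                   '<li><p class="comment-content">Second one</p></li></ul>'
})
print(extract_comments(sample))
```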

+3

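Yahoo has changed how it serves comments since the answer above was written. There are now 3 URLs to work with: one for the comments on an article, one for the replies to a comment, and one for a user's message history.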

urlComments = 'https://www.yahoo.com/news/_td/api/resource/canvass.getMessageListForContext_ns;context=%1s;count=10;index=%1s;lang=en-US;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;rankingProfile=canvassHalfLifeDecayProfile;region=US;sortBy=popular;type=null;userActivity=true'
urlReply = 'https://www.yahoo.com/news/_td/api/resource/canvass.getReplies_ns;context=%1s;count=10;index=%1s;lang=en-US;messageId=%1s;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;tags='
urlUser = 'https://www.yahoo.com/news/_td/api/resource/canvass.getUserMessageHistory;count=10;index=%1s;lang=en-US;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;userId=%1s'

Each URL contains %1s placeholders that you fill in to build the final URL. You also need to send these parameters:

params = {'bkt': ["news-d-202","newsdmcntr"],
    'device': 'desktop',
    'feature': 'cacheContentCanvas,videoDocking,newContentAttribution,livecoverage,featurebar,deferModalCluster,specRetry,newLayout,sidepic,canvassOffnet,ntkFilmstrip,autoNotif,CanvassTags',
    'intl': 'us',
    'lang': 'en-US',
    'partner': 'none',
    'prid': '5t11qvhclanab',
    'region': 'US',
    'site': 'fp',
    'tz': 'America/PICKACITY',  # insert a city
    'ver': '2.0.7765',
    'returnMeta': 'true'}
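
The %1s placeholders in the URLs above can be filled with Python's % operator, as in the minimal sketch below. The function name `comments_url` is illustrative, and the sample values are the "pstaid" and "index" strings quoted later in this answer. Treating the "pstaid" as the context value is an assumption.

```python
# Comment-list URL template, copied from this answer (split over several lines).
URL_COMMENTS = (
    'https://www.yahoo.com/news/_td/api/resource/'
    'canvass.getMessageListForContext_ns;context=%1s;count=10;index=%1s;'
    'lang=en-US;namespace=yahoo_content;'
    'oauthConsumerKey=frontpage.oauth.canvassKey;'
    'oauthConsumerSecret=frontpage.oauth.canvassSecret;'
    'rankingProfile=canvassHalfLifeDecayProfile;region=US;sortBy=popular;'
    'type=null;userActivity=true'
)


def comments_url(context, index):
    """Fill the two placeholders: first the context, then the index."""
    return URL_COMMENTS % (context, index)


# Assumption: the article's "pstaid" is the value the context placeholder expects.
url = comments_url('0efc85df-eb0b-373e-b6f3-4c513ed2a415',
                   'v=1:s=popular:sl=1498836633:off=1000')
print(url)
```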

With requests, fetching the data looks like this:

import requests

response = requests.get(u, params=params)  # u is one of the URLs above, with its placeholders filled in
coms = response.json()['data']['canvassMessages']  # drop ['canvassMessages'] if you want the replies to a thread

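coms now holds 10 comments (because of count=10 in the URL; the count can be raised, but the maximum appears to be around 30). To get the next 10, take coms[-1]['index'] and put it into the URL as the index. Watch the index once you pass about 1000 comments: for comments 1000-1009 the "index" looks like v=1:s=popular:sl=1498836633:off=1000, and the same pattern continues for 1010-1019. The last value you need is the article's "pstaid", which looks like 0efc85df-eb0b-373e-b6f3-4c513ed2a415 and goes into the URL.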
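
The paging described above, where the last comment's 'index' is fed back into the next request, can be sketched as a generator. The names `iter_comments` and `make_fetcher` are illustrative, the starting index is an assumption that must be taken from a real request, and the fake pages at the end exist only to show the loop without touching the network.

```python
def iter_comments(fetch_page, start_index, max_pages=200):
    """Yield comments page by page, following coms[-1]['index'] each time.

    fetch_page(index) must return a list of comment dictionaries.
    max_pages is a safety cap so a misbehaving server cannot loop forever.
    """
    index = start_index
    for _ in range(max_pages):
        coms = fetch_page(index)
        if not coms:
            return
        yield from coms
        index = coms[-1]['index']


def make_fetcher(url_template, context, params):
    """Build a fetch_page function that calls the real endpoint with requests."""
    import requests  # third-party; imported here so the paging logic runs without it

    def fetch_page(index):
        response = requests.get(url_template % (context, index), params=params)
        response.raise_for_status()
        return response.json()['data']['canvassMessages']

    return fetch_page


# Offline demonstration with made-up pages keyed by the index that requests them.
fake_pages = {
    'start': [{'index': 'a', 'text': 'c1'}, {'index': 'b', 'text': 'c2'}],
    'b': [{'index': 'c', 'text': 'c3'}],
}
texts = [c['text'] for c in iter_comments(lambda i: fake_pages.get(i, []), 'start')]
print(texts)
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP details, so it can be exercised without sending any requests to yahoo.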

0