The scraped result is repeated for every article.

I am currently using simple_html_dom to scrape a website (see here for the site I am scraping). Everything is returned correctly, except that it keeps outputting the same content for every single article it scrapes. See here for a demo.

 $page = (isset($_GET['p']) && $_GET['p'] != 0) ? (int) $_GET['p'] : '';
 $html = file_get_html('http://screenrant.com/movie-news/' . $page);

 foreach ($html->find('#site-top > div.site-wrapper > div.top-content > article > section > ul > li > div.info > h2 > a') as $element) {
     print '<br><br>';
     echo $url = '' . $element->href;
     $html2 = file_get_html($url);

     $image = $html2->find('meta[property=og:image]', 0);
     $news['image'] = $image->content;
     #print '<br><br>';
     // Ending The Featured Image

     #site-top > div.site-wrapper > div.top-content > article > section > ul > li:nth-child(2)
     $title = $html2->find('#site-top > div.site-wrapper > div.top-content > article > header.single-header > h1', 0);
     $news['title'] = $title->plaintext;
     // Ending the titles
     print '<br>';

     #site-top > div.site-wrapper > div.top-content > article > div
     $articles = $html2->find('#site-top > div.site-wrapper > div.top-content > article > div > p');
     foreach ($articles as $article) {
         #echo "$article->plaintext<p>";
         $news['content'] = $news['content'] . $article->plaintext . "<p>";
     }

     print '<pre>';print_r($news);print '</pre>';
     print '<br><br>';

     // mysqli_query($DB,"INSERT INTO `wp_scraped_news` SET
     // `hash` = '".$news['title']."',
     // `title` = '".$news['title']."',
     // `image` = '".$news['image']."',
     // `content` = '".$news['content']."'");
     // print '<pre>';print_r($news);print '</pre>';
 }

I have no idea where I am going wrong, but I assume it is one of two things, and I have tried fixing both without luck.

1. I'm doing something wrong in the way my foreach loop is set up.

2. The website changes selectors for each new article.

I'm probably wrong on both counts, but I have been tinkering with both for 2 hours and am on the verge of giving up. Any help is greatly appreciated.

php
1 answer

The problem is that you never clear the old content from $news['content']. As a result, when you process the second article page, its text is appended to the text of the first one. The third is appended to that again, and so on.

Place

 $news['content'] = ''; 

before

 foreach ($articles as $article) { 
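
To see the effect of the reset in isolation, here is a minimal sketch that needs neither simple_html_dom nor network access. The $scraped array and the buildNewsItems() function are hypothetical stand-ins for the fetched article pages and are not part of the original code; only the reset of $news['content'] mirrors the fix described above.

```php
<?php
// Hypothetical stand-in for the scraped pages: each article is a title plus
// the plaintext of its <p> elements (assumed sample data, not real output).
$scraped = [
    ['title' => 'First',  'paragraphs' => ['a1', 'a2']],
    ['title' => 'Second', 'paragraphs' => ['b1']],
];

// Build one $news array per article, resetting 'content' on every pass
// of the outer loop so text from earlier articles cannot leak through.
function buildNewsItems(array $scraped): array
{
    $items = [];
    foreach ($scraped as $page) {
        $news = [];                      // clean array for each article
        $news['title'] = $page['title'];
        $news['content'] = '';           // the fix: clear before appending
        foreach ($page['paragraphs'] as $text) {
            $news['content'] .= $text . '<p>';
        }
        $items[] = $news;
    }
    return $items;
}

$items = buildNewsItems($scraped);
print_r($items);
```

With the reset line removed and $news declared once outside the outer loop, the second item's content would begin with the first article's paragraphs, which is the behaviour described in the question.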