file_get_contents('http://en.wikipedia.org/wiki/Category:Upcoming_singles') returns stale results

file_get_contents('http://en.wikipedia.org/wiki/Category:Upcoming_singles');

returns a different result (2 results),

whereas visiting the same address with Chrome returns 4 results.

After checking, I suspect this may be due to the

Saved in parser cache with key ... timestamp ...

comment in the returned HTML. The timestamp is older when I use file_get_contents().

Any ideas on how to get the latest information using file_get_contents()?


+4

4 answers

Assuming file_get_contents makes an HTTP request, it would be worth checking which user agent is being sent.

I've heard of problems fetching data with certain user agents. Take a look at this question.

You can specify other parameters (including the user agent) using the stream context:

    <?php
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "Cookie: foo=bar\r\n"
        )
    );
    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $file = file_get_contents('http://www.example.com/', false, $context);

Take a look at the file_get_contents docs.

Also, as Jack said, cURL is the best option.

EDIT:

You're wrong. What you have to add is a different user agent. For example, using the user agent from Mozilla Firefox, you get 4 results:

    <?php
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; es-AR; rv:1.9.2.23) " .
                        "Gecko/20110921 Ubuntu/10.10 (maverick) Firefox/3.6.23"
        )
    );
    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $file = file_get_contents('http://en.wikipedia.org/wiki/Category:Upcoming_singles', false, $context);
    print $file;

But I think it's not quite "legal"; it's not good to spoof a browser. There should be some user agent that Wikipedia accepts for external applications extracting its data.

+5

The Wikimedia User-Agent policy requires that all requests identify themselves. I strongly recommend not faking a browser user agent; it isn't needed.

Millions of bots regularly access Wikipedia and the other Wikimedia Foundation projects. Just identify yourself and your script; it's not complicated!

    // Identify yourself: your bot, your script, your company, whatever
    ini_set( 'user_agent', 'MyBot/1.0; John Doe (contact: info@example.org )' );

    // Fetch the page using the user agent set above
    $contents = file_get_contents( 'http://en.wikipedia.org/wiki/Sandbox' );
    echo $contents;
+2

In any case, you really should use the MediaWiki API instead of screen-scraping the pages meant for human readers. For example, try this query using list=categorymembers.

Some notes:

  • Choose the appropriate result format (for PHP, that is probably format=php).
  • The default limit is 10 results per query, but you can raise it to 500 with cmlimit=max. Beyond that, you will need to use query continuation (see the sketch after this list).
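
For illustration, here is a minimal sketch of such a query. It assumes the standard api.php endpoint; the continuation handling follows the older query-continue response shape, so check the API documentation for the exact field names on your wiki's version:

    <?php
    // Identify your script, per the Wikimedia User-Agent policy.
    ini_set('user_agent', 'MyBot/1.0 (contact: info@example.org)');

    $base = 'http://en.wikipedia.org/w/api.php?action=query'
          . '&list=categorymembers'
          . '&cmtitle=' . urlencode('Category:Upcoming singles')
          . '&cmlimit=max'
          . '&format=php';   // format=php returns a serialized PHP array

    $continue = '';
    do {
        $result = unserialize(file_get_contents($base . $continue));
        foreach ($result['query']['categorymembers'] as $member) {
            echo $member['title'], "\n";
        }
        // Follow the continuation token if the result set was truncated.
        $continue = isset($result['query-continue'])
            ? '&cmcontinue=' . urlencode($result['query-continue']['categorymembers']['cmcontinue'])
            : null;
    } while ($continue !== null);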

You can also use one of the existing MediaWiki API client libraries to take care of these and other small details for you.

And finally, please be nice to the Wikimedia servers: don't send multiple simultaneous requests, and cache results locally if you'll need them again in the near future. It's also recommended to include your contact information (a URL or email address) in the User-Agent header so that Wikimedia system administrators can easily contact you if your code causes excessive server load.
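
As one way to do the local caching, here is a naive sketch; the helper name and TTL are made up for illustration:

    <?php
    // Hypothetical helper: cache fetched pages on disk so repeated runs
    // don't hit the Wikimedia servers again within $ttl seconds.
    function cached_get($url, $ttl = 3600) {
        $cacheFile = sys_get_temp_dir() . '/wikicache_' . md5($url);
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
            return file_get_contents($cacheFile);
        }
        $body = file_get_contents($url);
        if ($body !== false) {
            file_put_contents($cacheFile, $body);
        }
        return $body;
    }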

+2

Try using cURL and setting a header so you get the latest content rather than a cached copy (sorry, I can't remember the exact header to set).
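
A sketch of what this might look like; the Cache-Control/Pragma headers are my guess at the header this answer means, and the user agent string is a placeholder:

    <?php
    // Fetch the page with cURL, asking caches along the way for a fresh copy.
    $ch = curl_init('http://en.wikipedia.org/wiki/Category:Upcoming_singles');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyBot/1.0 (contact: info@example.org)');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cache-Control: no-cache',
        'Pragma: no-cache',
    ));
    $html = curl_exec($ch);
    curl_close($ch);
    echo $html;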

+1
