How to save source HTML files using Apache Nutch

I am new to search engines and web crawlers. I want to store all the source pages of a specific website as HTML files, but with Apache Nutch I only get binary database files. How do I get the HTML source files with Nutch?

Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are even better.)

+4
search-engine web-crawler nutch
5 answers

Well, Nutch writes the crawled data in binary form, so if you want it saved in HTML format you will have to modify the code (which will be painful if you are new to Nutch).

If you need a quick and easy way to get the HTML pages:

  • If the list of pages/URLs you intend to fetch is fairly small, you are better off doing it with a script that calls wget for each URL (a minimal sketch follows this list).
  • OR use the HTTrack tool.
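
If you would rather drive wget from Java, a minimal sketch could look like the following. The class name WgetFetcher, the input file urls.txt, and the output directory pages are illustrative assumptions, and wget is assumed to be on your PATH.

 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;
 import java.util.List;

 public class WgetFetcher {
     public static void main(String[] args) throws IOException, InterruptedException {
         // Hypothetical input file: one URL per line.
         List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
         Path outDir = Paths.get("pages");
         Files.createDirectories(outDir);
         for (String url : urls) {
             // -P sets wget's download directory; the remote file name is kept.
             Process p = new ProcessBuilder("wget", "-P", outDir.toString(), url)
                     .inheritIO()
                     .start();
             p.waitFor();
         }
     }
 }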

EDIT:

Writing your own Nutch plugin would be great: your problem gets solved, and you can contribute the work back to the community! If you are new to Nutch (in terms of code and design), you will have to spend a lot of time building a new plugin... but it is still easy to do.

A few pointers to help your initiative:

Here is a page that talks about writing your own Nutch plugin.

Start with Fetcher.java. See lines 647-648: this is where you can get hold of the downloaded content for each URL (for those pages that were fetched successfully).

 pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
 updateStatus(content.getContent().length);

You must add code immediately after this point to call your plugin, passing it the content object. At that point, content.getContent() holds the raw content for the URL in question. Inside the plugin code, write it to a file. The file name must be derived from the URL, otherwise it will be hard to work with. The URL can be obtained via fit.url.
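
As a minimal sketch of the "write it to a file named after the URL" step (the PageSaver class and the savedsites directory are made-up names for illustration, not part of Nutch):

 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;

 public class PageSaver {
     // Replace every character that is unsafe in a file name with '-'.
     static String fileNameFor(String url) {
         return url.replaceAll("[^A-Za-z0-9._-]", "-") + ".html";
     }

     // Write the raw page bytes under outDir, named after the URL.
     static void save(String url, byte[] content, Path outDir) throws IOException {
         Files.createDirectories(outDir);
         Files.write(outDir.resolve(fileNameFor(url)), content);
     }
 }

From the spot in Fetcher.java shown above, the call would then be something like PageSaver.save(fit.url.toString(), content.getContent(), Paths.get("savedsites")).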

+9

You must first make the changes needed to run Nutch in Eclipse.

Once it runs, open Fetcher.java and add the lines between the two "content saver" comment markers shown below.

 case ProtocolStatus.SUCCESS:        // got a page
   pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
   updateStatus(content.getContent().length);

   //--------------------------- content saver ---------------------------\\
   String filename = "savedsites//" + content.getUrl().replace('/', '-');
   File file = new File(filename);
   file.getParentFile().mkdirs();
   boolean exist = file.createNewFile();
   if (!exist) {
     System.out.println("File exists.");
   } else {
     FileWriter fstream = new FileWriter(file);
     BufferedWriter out = new BufferedWriter(fstream);
     String html = content.toString();
     int start = html.indexOf("<!DOCTYPE html");
     // Guard against pages whose content lacks an HTML5 doctype.
     out.write(start >= 0 ? html.substring(start) : html);
     out.close();
     System.out.println("File created successfully.");
   }
   //--------------------------- content saver ---------------------------\\
+6

To update this answer:

You can process the data in your crawl's segment folder and read the content (the HTML, along with the other stored data) directly:

 // segment points at one segment directory, e.g. crawl/segments/<timestamp>
 Configuration conf = NutchConfiguration.create();
 FileSystem fs = FileSystem.get(conf);
 Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
 SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
 try {
   Text key = new Text();
   Content content = new Content();
   while (reader.next(key, content)) {
     System.out.println(new String(content.getContent()));
   }
 } catch (Exception e) {
   e.printStackTrace();
 } finally {
   reader.close();
 }
+5

In Apache Nutch 2.3.1:
You can save the raw HTML by editing the Nutch code. First run Nutch in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch runs in Eclipse, edit FetcherReducer.java, add this code to its output method, then run ant eclipse again to rebuild the classes.

Finally, the raw HTML will be stored in the reprUrl column of your database.

 if (content != null) {
   ByteBuffer raw = fit.page.getContent();
   if (raw != null) {
     ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(
         raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
     Scanner scanner = new Scanner(arrayInputStream);
     scanner.useDelimiter("\\Z"); // read the whole stream as one String
     String data = "";
     if (scanner.hasNext()) {
       data = scanner.next();
     }
     fit.page.setReprUrl(StringUtil.cleanField(data));
     scanner.close();
   }
 }
0

The answers here are out of date. Nowadays you can get plain HTML files simply with the nutch dump command. See this answer.
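
The exact flags depend on your Nutch version (check the usage output of bin/nutch dump), but an invocation along these lines, with illustrative paths, dumps the fetched pages as plain files:

 bin/nutch dump -segment crawl/segments -outputDir dump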

0
