How to get HTML content from Nutch

Is there any way to get the HTML content of each web page while crawling it with Nutch?

+6
nutch
4 answers

Yes, you can export the content of the crawl segments. It is not straightforward, but it works well for me. First, create a Java project with the following code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

import java.io.File;
import java.io.FileOutputStream;

public class NutchSegmentOutputParser {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: segmentdir (-local | -dfs <namenode:port>) outputdir");
            return;
        }
        try {
            Configuration conf = NutchConfiguration.create();
            FileSystem fs = FileSystem.get(conf);

            String segment = args[0];

            File outDir = new File(args[1]);
            if (!outDir.exists()) {
                if (outDir.mkdir()) {
                    System.out.println("Creating output dir " + outDir.getAbsolutePath());
                }
            }

            // The raw fetched content is stored as a Hadoop SequenceFile under
            // <segment>/content/part-00000/data
            Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

            Text key = new Text();
            Content content = new Content();

            while (reader.next(key, content)) {
                // Build a flat output file name from the URL
                String filename = key.toString()
                        .replaceFirst("http://", "")
                        .replaceAll("/", "___")
                        .trim();

                // Write the raw fetched bytes of this page to its own file
                File f = new File(outDir.getCanonicalPath() + "/" + filename);
                FileOutputStream fos = new FileOutputStream(f);
                fos.write(content.getContent());
                fos.close();
                System.out.println(f.getAbsolutePath());
            }
            reader.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I recommend using Maven; add the following dependencies:

  <dependency>
      <groupId>org.apache.nutch</groupId>
      <artifactId>nutch</artifactId>
      <version>1.5.1</version>
  </dependency>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>0.23.1</version>
  </dependency>

and build a jar package (e.g. NutchSegmentOutputParser.jar).

You need to install Hadoop on your computer. Then run:

 $ /hadoop-dir/bin/hadoop --config \
     NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar \
     NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir

where nutch-crawled-dir/2012xxxxxxxxx/ is the crawled segment directory you want to extract content from (the timestamped directory that contains the content subdirectory), and outdir is the output directory. Output file names are generated from the URL, with slashes replaced by "___".
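Note that the program above reads only part-00000. If a segment was written by several reducers there will be additional part-XXXXX directories under content/. A hypothetical helper along these lines (my own variation, not part of the original program) walks all of them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Sketch only: iterate over every part-XXXXX/data file under <segment>/content
// so that segments produced by multiple reducers are fully exported.
public class SegmentParts {
    public static void readAllParts(FileSystem fs, Configuration conf, String segment) throws Exception {
        Path contentDir = new Path(segment, Content.DIR_NAME);
        for (FileStatus part : fs.listStatus(contentDir)) {
            Path data = new Path(part.getPath(), "data");
            if (!fs.exists(data)) {
                continue; // skip entries that are not part-XXXXX directories
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            Text key = new Text();
            Content content = new Content();
            while (reader.next(key, content)) {
                // write content.getContent() to a file, exactly as in the program above
            }
            reader.close();
        }
    }
}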

Hope this helps.

+8

Try the following:

 public ParseResult filter(Content content, ParseResult parseResult,
                           HTMLMetaTags metaTags, DocumentFragment doc) {
     Parse parse = parseResult.get(content.getUrl());
     LOG.info("parse.getText: " + parse.getText());
     return parseResult;
 }

Then check the contents in hadoop.log.
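For context, that filter method belongs to an HtmlParseFilter plugin. A minimal sketch of a complete filter class could look like the one below (the class name and logger setup are illustrative, and the plugin still has to be declared in a plugin.xml and enabled via the plugin.includes property):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.DocumentFragment;

// Illustrative class name; any HtmlParseFilter implementation is wired up the same way.
public class LoggingParseFilter implements HtmlParseFilter {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingParseFilter.class);

    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Log the extracted text for this URL; content.getContent() would hold the raw fetched bytes.
        Parse parse = parseResult.get(content.getUrl());
        LOG.info("parse.getText: " + parse.getText());
        return parseResult;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}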

+1

It's super basic.

 public ParseResult getParse(Content content) {
     LOG.info("getContent: " + new String(content.getContent()));
     // ... rest of the parser's getParse implementation
 }

The Content object has a getContent() method that returns a byte array. Just build a Java String from that byte array, and you have the raw HTML of whatever Nutch fetched.
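For example, a minimal sketch (the helper name is mine, and UTF-8 is an assumption; the real encoding of a page may differ and can be checked via its Content-Type header):

import java.nio.charset.StandardCharsets;
import org.apache.nutch.protocol.Content;

public class RawHtmlExample {
    // Hypothetical helper: turns the fetched bytes into a String.
    // UTF-8 is assumed here; the actual page encoding may differ.
    public static String rawHtml(Content content) {
        return new String(content.getContent(), StandardCharsets.UTF_8);
    }
}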

I am using Nutch 1.9.

Here's the JavaDoc for org.apache.nutch.protocol.Content: https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/protocol/Content.html#getContent()

0

Yes, there is a way. Take a look at cache.jsp to see how it displays cached data.

-2
