Import
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Create a POJO class for the URL. I would recommend modelling the whole log line instead, with the URL, date, time, method, target, etc. as members.
public static class Url implements Serializable {
    private String value;

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }
}
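As a rough sketch of the log-line recommendation, a bean along these lines could replace `Url`. The class name `LogLine`, the wrapper `LogLineSketch`, and the choice of plain `String` fields are all assumptions for illustration; adjust them to the actual log format. It follows the same getter/setter pattern as `Url`, so it can be handed to `createDataFrame` in the same way.

```java
import java.io.Serializable;

// Hypothetical wrapper so the sketch compiles on its own;
// in practice the bean would sit next to Url in the driver class.
class LogLineSketch {

    // Assumed fields: one String member per column of the log line.
    public static class LogLine implements Serializable {
        private String url;
        private String date;
        private String time;
        private String method;
        private String target;

        public String getUrl() { return url; }
        public void setUrl(String url) { this.url = url; }

        public String getDate() { return date; }
        public void setDate(String date) { this.date = date; }

        public String getTime() { return time; }
        public void setTime(String time) { this.time = time; }

        public String getMethod() { return method; }
        public void setMethod(String method) { this.method = method; }

        public String getTarget() { return target; }
        public void setTarget(String target) { this.target = target; }
    }
}
```

Each getter/setter pair would then show up as one column of the resulting DataFrame.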
Create an RDD of Url objects from a text file
JavaRDD<Url> urlsRDD = spark.read()
    .textFile("/Users/karuturi/Downloads/log.txt")
    .javaRDD()
    .map(new Function<String, Url>() {
        @Override
        public Url call(String line) throws Exception {
            // Fields are tab-separated; the URL is the first one
            String[] parts = line.split("\\t");
            Url url = new Url();
            // Strip literal '[' characters (replace() does not treat its argument as a regex)
            url.setValue(parts[0].replace("[", ""));
            return url;
        }
    });
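The per-line logic inside the map step can also be pulled out into a small helper and checked without a Spark session. This is only a sketch: `UrlLineParser` and `extractUrl` are hypothetical names, and it assumes the URL is the first tab-separated field, possibly with a leading `[`. It uses `String.replace`, which takes a literal, because `replaceAll` interprets its first argument as a regular expression and a lone `[` is not a valid pattern.

```java
// Hypothetical helper, not part of Spark: the parsing done inside call(String line).
final class UrlLineParser {
    private UrlLineParser() {}

    /** Returns the first tab-separated field of the line with every '[' removed. */
    static String extractUrl(String line) {
        // Limit 2: only the first field matters, and a positive limit
        // guarantees at least one element even for lines made up of tabs.
        String[] parts = line.split("\\t", 2);
        // Literal replacement; no regex escaping needed
        return parts[0].replace("[", "");
    }
}
```

Inside the map function the body would then reduce to `url.setValue(UrlLineParser.extractUrl(line));`.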
Create DataFrame from RDD
Dataset<Row> urlsDF = spark.createDataFrame(urlsRDD, Url.class);
RDD to DataFrame - Spark 2.0
RDD to DataFrame - Spark 1.6
mrsrinivas