How to configure Spark on Windows?

I am trying to configure Apache Spark on Windows.

After some searching, I understand that standalone mode is what I want. Which binaries do I download to run Apache Spark on Windows? I see distributions with Hadoop and CDH on the Spark download page.

I could not find any references to this on the web. A step-by-step guide would be much appreciated.

+57
windows apache-spark
Aug 25 '14 at 7:50
9 answers

I found that the easiest solution on Windows is to build from the source code.

You can pretty much follow this guide: http://spark.apache.org/docs/latest/building-spark.html

Download and install Maven and set MAVEN_OPTS to the value specified in the manual.
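For reference, a minimal sketch of the build from a Windows command prompt (the MAVEN_OPTS value below follows what the guide recommended for older Spark versions; treat the exact flags as an assumption and check the guide for your version):

    rem Give Maven enough memory for the build (value per the building-spark guide)
    set MAVEN_OPTS=-Xmx2g -XX:ReservedCodeCacheSize=512m
    rem Build Spark, skipping tests
    mvn -DskipTests clean package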

But if you are just playing with Spark and don't actually need it to run on Windows for any reason other than that your own machine runs Windows, I strongly recommend installing Spark on a Linux virtual machine. The simplest way to get started is to download the ready-made images from Cloudera or Hortonworks, and either use the bundled version of Spark, or install your own from source or from the compiled binaries you can get from the Spark website.

+26
Aug 25 '14 at 12:19

Steps to install Spark in local mode (a consolidated environment-variable sketch follows the list):

  • Install Java 7 or later. To verify the Java installation, open a command prompt, type java, and press Enter. If you get the message 'java' is not recognized as an internal or external command, you need to set the JAVA_HOME and PATH environment variables to point to the JDK.

  • Download and install Scala.

    Set SCALA_HOME via Control Panel\System and Security\System, go to "Advanced system settings", and add %SCALA_HOME%\bin to the PATH variable in the environment variables.

  • Install Python 2.6 or later from the Python download page.

  • Download SBT. Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.
  • Download winutils.exe from the HortonWorks repo or the git repo. Since we do not have a local Hadoop installation on Windows, download winutils.exe and place it in a bin directory under a Hadoop home directory you create. Then set HADOOP_HOME = <<Hadoop home directory>> as an environment variable.
  • We will use a pre-built Spark package, so choose a Spark package pre-built for Hadoop from the Spark download page. Download and extract it.

    Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in the environment variables.

  • Run the command: spark-shell

  • Open http://localhost:4040/ in a browser to see the SparkContext web interface.
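As mentioned above, here is a consolidated sketch of the environment-variable setup from a command prompt. The paths are placeholders for wherever you installed each component, and setx writes user-level variables that only take effect in new console windows (it is also subject to a length limit on PATH):

    rem Placeholder paths - adjust to your actual install locations
    setx JAVA_HOME "C:\Java\jdk1.8.0"
    setx SCALA_HOME "C:\scala"
    setx SBT_HOME "C:\sbt"
    setx HADOOP_HOME "C:\hadoop"
    setx SPARK_HOME "C:\spark"
    rem Append the bin directories to PATH, then open a new prompt
    setx PATH "%PATH%;C:\Java\jdk1.8.0\bin;C:\scala\bin;C:\spark\bin;C:\hadoop\bin"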

+93
Aug 03 '16 at 5:36

You can download Spark here:

http://spark.apache.org/downloads.html

I recommend this version: pre-built for Hadoop 2 (HDP2, CDH5).

Starting with version 1.0.0, there are .cmd scripts for running Spark on Windows.

Unzip it using 7-Zip or similar.

You can run bin\spark-shell.cmd --master local[2] to get started.
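For example, from the directory where you extracted Spark (local[2] means a local master with two worker threads; the second command assumes the bundled examples are present in your distribution):

    rem Interactive Scala shell on a local master with two threads
    bin\spark-shell.cmd --master local[2]
    rem Run the bundled SparkPi example via the run-example launcher
    bin\run-example.cmd SparkPi 10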

To configure your instance, you can go to the following link: http://spark.apache.org/docs/latest/

+19
Aug 26 '14 at 7:24

You can use the following installation methods for Spark:

  • Building from source
  • Using a pre-built release

There are various ways to build Spark from source.
I first tried building the Spark source with SBT, but that requires Hadoop. To avoid those problems, I used a pre-built release.

Instead of building from source, I downloaded a pre-built release for Hadoop 2.x and ran it. You need to install Scala as a prerequisite.

I have collected all the steps here:
How to run Apache Spark on Windows7 in standalone mode

Hope this helps!

+15
Apr 16 '15 at 7:46

When trying to work with spark-2.x.x, building the Spark source code did not work for me.

  • So, although I am not going to use Hadoop, I downloaded Spark pre-built with embedded Hadoop: spark-2.0.0-bin-hadoop2.7.tar.gz

  • Point SPARK_HOME at the extracted directory, then add %SPARK_HOME%\bin to PATH.

  • Download the winutils executable from the Hortonworks repository.

  • Create a directory for the winutils.exe executable, for example C:\SparkDev\x64\bin, and put winutils.exe there. Set the %HADOOP_HOME% environment variable to point to the parent directory (here, C:\SparkDev\x64), then add %HADOOP_HOME%\bin to PATH.

  • Using the command line, create a directory:

     mkdir C:\tmp\hive 
  • Using the winutils executable you downloaded, grant full permissions to the directory you just created, using the Unix-style path:

     %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive 
  • Run the following command:

     %SPARK_HOME%\bin\spark-shell 

The Scala prompt should appear automatically.

Note: you do not need to configure Scala separately; it is bundled with Spark.
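A quick sanity check that winutils is wired up correctly (winutils has an ls subcommand that prints Unix-style permissions; exact output may vary by winutils build):

    rem Should list \tmp\hive with the rwxrwxrwx permissions set above
    %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive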

+4
Aug 30 '16 at 16:42

Here are the fixes to get it working on Windows without rebuilding everything - for example, if you don't have a recent version of MS-VS. (You will need a Win32 C++ compiler, but you can install the free MS VS Community Edition.)

I tried this with Spark 1.2.2 and Mahout 0.10.2, as well as with the latest versions in November 2015. There are a number of problems, including the fact that the Scala code tries to run a bash script (mahout/bin/mahout) that does not work, the sbin scripts were not ported to Windows, and winutils is missing if Hadoop is not installed.

(1) Install Scala, then unzip Spark/Hadoop/Mahout to the root of C: under their respective product names.

(2) Rename \mahout\bin\mahout to mahout.sh.was (we will not need it).

(3) Compile the following Win32 C++ program and copy the executable to a file named C:\mahout\bin\mahout (that's right - no .exe suffix, like a Linux executable):

    // Prints the MAHOUT_CP environment variable (the classpath) to stdout,
    // mimicking what the bash launcher script would have done.
    #include "stdafx.h"
    #define BUFSIZE 4096
    #define VARNAME TEXT("MAHOUT_CP")
    int _tmain(int argc, _TCHAR* argv[]) {
        DWORD dwLength;
        LPTSTR pszBuffer = (LPTSTR)malloc(BUFSIZE * sizeof(TCHAR));
        dwLength = GetEnvironmentVariable(VARNAME, pszBuffer, BUFSIZE);
        if (dwLength > 0) {
            _tprintf(TEXT("%s\n"), pszBuffer);
            return 0;
        }
        return 1;
    }

(4) Create the script \mahout\bin\mahout.bat and paste in the contents below, although the exact names of the jars in the _CP classpaths will depend on your Spark and Mahout versions. Update all paths for your installation. Use 8.3 path names with no spaces in them. Note: you cannot use wildcards/asterisks in the classpaths here.

    set SCALA_HOME=C:\Progra~2\scala
    set SPARK_HOME=C:\spark
    set HADOOP_HOME=C:\hadoop
    set MAHOUT_HOME=C:\mahout
    set SPARK_SCALA_VERSION=2.10
    set MASTER=local[2]
    set MAHOUT_LOCAL=true
    set path=%SCALA_HOME%\bin;%SPARK_HOME%\bin;%PATH%
    cd /D %SPARK_HOME%
    set SPARK_CP=%SPARK_HOME%\conf\;%SPARK_HOME%\lib\xxx.jar;...other jars...
    set MAHOUT_CP=%MAHOUT_HOME%\lib\xxx.jar;...other jars...;%MAHOUT_HOME%\xxx.jar;...other jars...;%SPARK_CP%;%MAHOUT_HOME%\lib\spark\xxx.jar;%MAHOUT_HOME%\lib\hadoop\xxx.jar;%MAHOUT_HOME%\src\conf;%JAVA_HOME%\lib\tools.jar
    start "master0" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip localhost --port 7077 --webui-port 8082 >>out-master0.log 2>>out-master0.err
    start "worker1" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://localhost:7077 --webui-port 8083 >>out-worker1.log 2>>out-worker1.err
    rem ...you may add more workers here...
    cd /D %MAHOUT_HOME%
    "%JAVA_HOME%\bin\java" -Xmx4g -classpath "%MAHOUT_CP%" "org.apache.mahout.sparkbindings.shell.Main"

The variable name MAHOUT_CP must not be changed, since it is referenced in the C++ code.

Of course, you can comment out the code that starts the Spark master and worker, since Mahout will run Spark as needed; I just put it in the batch job to show how to launch Spark if you want to use it without Mahout.

(5) The following tutorial is a good place to start:

 https://mahout.apache.org/users/sparkbindings/play-with-shell.html 

You can open the web UI of the Mahout Spark instance with:

 "C:\Program Files (x86)\Google\Chrome\Application\chrome" --disable-web-security http://localhost:4040 
+3
Nov 24 '15 at 14:20

Here are seven steps to install Spark on Windows 10 and run it from Python:

Step 1: Download the Spark 2.2.0 tar.gz (tape archive) file to any folder F from this link - https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark.

Let the path to the spark folder be C:\Users\Desktop\A\spark

Step 2: Download the Hadoop 2.7.3 tar.gz file to the same folder F from this link - https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder from hadoop-2.7.3.tar to hadoop. Let the path to the hadoop folder be C:\Users\Desktop\A\hadoop

Step 3: Create a new text file in Notepad. Save this empty Notepad file as winutils.exe (with Save as type: All Files). Copy this 0 KB winutils.exe file to the bin folder in spark - C:\Users\Desktop\A\spark\bin

Step 4: Now we need to add these folders to the system environment.

4a: Create a system variable (not a user variable, since a user variable inherits all the properties of a system variable). Variable name: SPARK_HOME. Variable value: C:\Users\Desktop\A\spark

Find the Path system variable and click Edit. You will see several paths. Do not delete any of them. Add this value: ;C:\Users\Desktop\A\spark\bin

4b: Create a system variable

Variable name: HADOOP_HOME. Variable value: C:\Users\Desktop\A\hadoop

Find the Path system variable and click Edit. Add this value: ;C:\Users\Desktop\A\hadoop\bin

4c: Create a system variable. Variable name: JAVA_HOME. Search for Java in Windows, right-click it, and click Open file location. You will have to right-click again on any of the Java files and click Open file location once more. You will use the path of this folder. Alternatively, you can look in C:\Program Files\Java. My Java version installed on the system is jre1.8.0_131. Variable value: C:\Program Files\Java\jre1.8.0_131\bin

Find the Path system variable and click Edit. Add this value: ;C:\Program Files\Java\jre1.8.0_131\bin
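A quick way to verify the variables from a fresh command prompt before moving on (where is a standard Windows command that shows which executable a name resolves to on PATH):

    echo %SPARK_HOME%
    echo %HADOOP_HOME%
    echo %JAVA_HOME%
    where spark-shell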

Step 5: Open a command prompt and go to the spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.

 C:\Users\Desktop\A\spark\bin>spark-shell 

This may take some time and print some warnings. Finally, it will show Welcome to Spark version 2.2.0.

Step 6: Enter exit () or restart the command line and again go to the intrinsic safety folder. Pyspark type:

 C:\Users\Desktop\A\spark\bin>pyspark 

It will show some warnings and errors, but ignore them. It works.

Step 7: Your setup is complete. If you want to launch Spark directly from the Python shell, go to Scripts in your Python folder and type

 pip install findspark 

on the command line.

In the Python shell:

    import findspark
    findspark.init()

Import the necessary modules:

    from pyspark import SparkContext
    from pyspark import SparkConf

If you want to skip the steps for importing findspark and initializing it, please follow the procedure given in import pyspark in Python shell.
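Putting the pieces together, here is a minimal end-to-end sketch. It assumes findspark is installed and SPARK_HOME is set as in step 4a; the app name and master setting are arbitrary choices for illustration:

    import findspark
    findspark.init()  # locate Spark via SPARK_HOME and add it to sys.path

    from pyspark import SparkConf, SparkContext

    # App name and local[2] master are illustrative, not required values
    conf = SparkConf().setMaster("local[2]").setAppName("WindowsTest")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(100)).sum())  # expect 4950
    sc.stop()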

+1
Jul 28 '17 at 18:40

Below is a simple minimal script to run from any Python console. It assumes you have extracted the downloaded Spark libraries to C:\Apache\spark-1.6.1.

This works on Windows without building anything and solves problems where Spark complains about recursive pickling.

    import sys
    import os

    spark_home = 'C:\\Apache\\spark-1.6.1'
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'pyspark.zip'))
    sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.9-src.zip'))

    import pyspark  # importable now that the paths above are on sys.path

    # Start a Spark context:
    sc = pyspark.SparkContext()

    lines = sc.textFile(os.path.join(spark_home, "README.md"))
    pythonLines = lines.filter(lambda line: "Python" in line)
    pythonLines.first()
0
Jun 28 '16 at 17:27

Ani Menon's guide (thanks!) almost worked for me on Windows 10; I just needed a newer winutils.exe from this git repo (currently hadoop-2.8.1): https://github.com/steveloughran/winutils

0
Oct 26 '17 at 8:57
