Python 3: read pickle files as Spark input

My data is available as sets of Python 3 pickle files. Most of them are serialized Pandas DataFrames.

I would like to start using Spark because I need more memory and processing power than a single computer can provide. I will also use HDFS for distributed storage.

As a beginner, I have not found any relevant information explaining how to use pickle files as input.

Does such a method exist? If not, is there a workaround?

Thank you so much


As far as I know, Spark does not provide a built-in reader for arbitrary pickle files, but you can combine the binaryFiles method with plain Python tools. Let's start with some dummy data:

import tempfile
import pandas as pd
import numpy as np

outdir = tempfile.mkdtemp()

# Write five small DataFrames as individual pickle files
for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])

Next, read them with binaryFiles:

rdd = sc.binaryFiles(outdir)

and unpickle the values:

import pickle
from io import BytesIO

dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]

##         foo       bar
## 0 -0.162584 -2.179106
## 1  0.269399 -0.433037
## 2 -0.295244  0.119195
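Since binaryFiles yields (path, bytes) pairs, the deserialization step can be sanity-checked locally without a cluster. This is a minimal sketch of my own (not part of the original answer) that simulates those pairs and then combines the per-file DataFrames:

```python
import os
import pickle
import tempfile
from io import BytesIO

import numpy as np
import pandas as pd

outdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(os.path.join(outdir, 'part-%d.pkl' % i))

# Simulate the (path, bytes) pairs that sc.binaryFiles() would produce
pairs = []
for name in sorted(os.listdir(outdir)):
    with open(os.path.join(outdir, name), 'rb') as fh:
        pairs.append((name, fh.read()))

# The same map as on the RDD: unpickle each value
dfs = [pickle.load(BytesIO(content)) for _, content in pairs]

# Combine the per-file DataFrames into one
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)  # (30, 2)
```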

Note that pickle files are binary, so they cannot be read line by line with textFile; binaryFiles also loads each file as a whole, which is practical only for reasonably small files.
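To see why text-based readers such as textFile are unsuitable, note that pickles written with protocol 2 or higher start with the opcode byte 0x80, which is not valid UTF-8. A quick check:

```python
import pickle

import pandas as pd

payload = pickle.dumps(pd.DataFrame({'foo': [1.0], 'bar': [2.0]}))

# Protocol 2+ pickles begin with the opcode byte 0x80
print(payload[0] == 0x80)  # True

# ...which is not a valid UTF-8 start byte, so decoding as text fails
try:
    payload.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```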

If the files live on HDFS, you can also access them directly from Python with a library such as hdfs3 and unpickle them without going through Spark at all.

How well either approach works in practice depends on the number and size of your files.

On a side note:

SparkContext does provide a pickleFile method, but it is not what you want here. It reads SequenceFiles produced by Spark's own saveAsPickleFile, not arbitrary pickle files.
