pyspark.SparkContext.pickleFile

SparkContext.pickleFile(name: str, minPartitions: Optional[int] = None) → pyspark.rdd.RDD[Any]

Load an RDD previously saved using the RDD.saveAsPickleFile() method.

New in version 1.1.0.

Parameters

name (str): directory containing the input data files; several directories can be passed as one comma-separated string
minPartitions (int, optional): suggested minimum number of partitions for the resulting RDD

Returns

RDD: RDD representing unpickled data from the file(s).

See also

RDD.saveAsPickleFile()

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary pickled file
...     path1 = os.path.join(d, "pickled1")
...     sc.parallelize(range(10)).saveAsPickleFile(path1, 3)
...
...     # Write another temporary pickled file
...     path2 = os.path.join(d, "pickled2")
...     sc.parallelize(range(-10, -5)).saveAsPickleFile(path2, 3)
...
...     # Load each pickled file
...     collected1 = sorted(sc.pickleFile(path1, 3).collect())
...     collected2 = sorted(sc.pickleFile(path2, 4).collect())
...
...     # Load two pickled files together
...     collected3 = sorted(sc.pickleFile('{},{}'.format(path1, path2), 5).collect())

>>> collected1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> collected2
[-10, -9, -8, -7, -6]
>>> collected3
[-10, -9, -8, -7, -6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
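
The comma-separated form of name used for collected3 can also be assembled from a list of paths. The following is a minimal sketch: join_input_paths is a hypothetical helper, not part of PySpark, and rejecting paths that contain a comma is a conservative assumption, made because such a path would be ambiguous inside a comma-separated list.

```python
import os


def join_input_paths(paths):
    """Join input directories into one comma-separated string.

    Hypothetical helper for illustration; not part of PySpark.
    """
    joined = []
    for p in paths:
        p = os.fspath(p)  # accepts str or os.PathLike
        if "," in p:
            # Assumption: a comma inside a path cannot be told apart
            # from the separator, so refuse it up front.
            raise ValueError(f"path contains a comma: {p!r}")
        joined.append(p)
    if not joined:
        raise ValueError("at least one input path is required")
    return ",".join(joined)


# Usage with an existing SparkContext `sc` (assumed, as in the doctest above):
# rdd = sc.pickleFile(join_input_paths([path1, path2]), 5)
```

The helper only builds the string; reading the files is still done by sc.pickleFile() itself.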
