A Flyte SDK (v2) version of this plugin is available as flyteplugins-spark.

Spark

flytekitplugins-spark

FlytekitML Trainingsparkpysparkdistributedbig-data

Flyte can execute Spark jobs natively on a Kubernetes Cluster, which manages a virtual cluster’s lifecycle, spin-up, and tear down. It leverages the open-sourced Spark On K8s Operator and can be enabled without signing up for any service. This is like running a transient spark cluster — a type of cluster spun up for a specific Spark job and torn down after completion.

Install

pip install flytekitplugins-spark

Quick Start(example, may need adjustment)

See full examples

pip install flytekitplugins-spark

from flytekit import task, workflow
from flytekitplugins.spark import DatabricksConnector, GenericSparkConf, PySparkPipelineModelTransformer, SparkDataFrameSchemaReader

@task
def my_task() -> None:
    new_spark_session(...)

@workflow
def my_workflow() -> None:
    my_task()

Available Imports (12)

connectorDatabricksConnector

Add DatabricksConnectorV2 to support running the k8s spark and databricks spark together in the same workflow.

from flytekitplugins.spark import DatabricksConnector

configGenericSparkConf

Configuration type for Spark.

extends dataclass — configuration or data structure for plugin setup

from flytekitplugins.spark import GenericSparkConf

transformerPySparkPipelineModelTransformer

Configuration type for Spark.

extends TypeTransformer — converts python types to/from flyte-native types

from flytekitplugins.spark import PySparkPipelineModelTransformer

typeSparkDataFrameSchemaReader

Implements how SparkDataFrame should be read using the ``open`` method of FlyteSchema.

from flytekitplugins.spark import SparkDataFrameSchemaReader

typeSparkDataFrameSchemaWriter

Implements how SparkDataFrame should be written to using ``open`` method of FlyteSchema.

from flytekitplugins.spark import SparkDataFrameSchemaWriter

transformerSparkDataFrameTransformer

Transforms Spark DataFrame's to and from a Schema (typed/untyped).

extends TypeTransformer — converts python types to/from flyte-native types

from flytekitplugins.spark import SparkDataFrameTransformer

handlerParquetToSparkDecodingHandler

Configuration type for Spark.

from flytekitplugins.spark import ParquetToSparkDecodingHandler

handlerSparkToParquetEncodingHandler

Configuration type for Spark.

from flytekitplugins.spark import SparkToParquetEncodingHandler

configDatabricks

Add DatabricksConnectorV2 to support running the k8s spark and databricks spark together in the same workflow.

extends dataclass — configuration or data structure for plugin setup

from flytekitplugins.spark import Databricks

configDatabricksV2

Use this to configure a Databricks task.

extends dataclass — configuration or data structure for plugin setup

from flytekitplugins.spark import DatabricksV2

configSpark

Implements how SparkDataFrame should be read using the ``open`` method of FlyteSchema.

extends dataclass — configuration or data structure for plugin setup

from flytekitplugins.spark import Spark

tasknew_spark_session

Task for Spark.

from flytekitplugins.spark import new_spark_session

Dependencies

pysparkaiohttppandas

Related Plugins

Spark

ML Training

Union can execute Spark jobs natively on a Kubernetes Cluster, which manages a virtual cluster’s lifecycle, spin-up, and tear down. It leverages the open-sourced Spark On K8s Operator and can be enabled without signing up for any service. This is like running a transient spark cluster — a type of cluster spun up for a specific Spark job and torn down after completion.