Set up Spark and PySpark in WSL Ubuntu 22.04
- [ ] install java,
sudo apt install openjdk-8-jre-headless
- [ ] check java version,
java -version
(OpenJDK 8 does not recognize the double-dash --version flag)
- [ ] download Spark,
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
- [ ] create the target directory,
mkdir -p ~/hadoop/spark-3.5.0
(the -p flag also creates the parent ~/hadoop if it does not exist yet)
- [ ] extract Spark into the created directory,
tar -xvzf spark-3.5.0-bin-hadoop3.tgz -C ~/hadoop/spark-3.5.0 --strip-components=1
- [ ] add the following lines to the end of ~/.bashrc (can use vi ~/.bashrc), then run source ~/.bashrc so they take effect in the current shell,
export SPARK_HOME=~/hadoop/spark-3.5.0
export PATH=$PATH:$SPARK_HOME/bin
- [ ] test by launching the pyspark shell,
$SPARK_HOME/bin/pyspark
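Once the shell starts, a quick sanity check inside it confirms the install; a minimal sketch, using the SparkSession the shell predefines as spark:

```python
# Run inside the PySpark shell, where a SparkSession named `spark` is predefined.
print(spark.version)   # expected to print 3.5.0 for this download
spark.range(5).show()  # runs a trivial local job and prints ids 0-4
quit()                 # exit the shell
```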
Use PySpark via Jupyter notebook
- [ ] create a conda environment (miniconda) or a python venv,
conda create --name pyspark_env python=3.10
- [ ] activate the environment,
conda activate pyspark_env
- [ ] install packages
    - [ ] pip install pyspark
    - [ ] pip install findspark (see the notebook sketch after this list)
- [ ] test by running pyspark in the activated environment, then exit the shell
- [ ] pip install jupyter
- [ ] optional
    - [ ] pip install mlflow
    - [ ] pip install polars
    - [ ] pip install "flaml[spark]" (quoted so the brackets are not expanded by the shell)
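To actually use PySpark from a notebook, findspark points Python at the Spark install before a session is created. Below is a minimal sketch, assuming SPARK_HOME is exported as in the first section; the app name pyspark-env-check is arbitrary:

```python
# Notebook/script sanity check: locate the Spark install, then run a trivial job.
import findspark

findspark.init()  # reads SPARK_HOME from the environment

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # run locally, using all available cores
    .appName("pyspark-env-check")  # arbitrary name, shows up in the Spark UI
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()  # prints the two-row DataFrame if everything is wired up correctly

spark.stop()
```

Launch the notebook server with jupyter notebook from the activated environment and run the cell above. If only the pip-installed pyspark is needed (rather than the SPARK_HOME copy), the two findspark lines can be omitted.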