Python+Spark beginner: What am I not learning when I practice on a local Spark cluster?
I’m starting to learn Python and Spark. I’m spending most of my time working through structured tutorials and exploring small local datasets, but I’m trying to pick up the tools and habits that will give me a smooth transition into doing truly distributed computation.
I have been practicing with PySpark by running a local cluster with `spark = SparkSession.builder.master("local[4]").appName("local").getOrCreate()` or similar, then calling methods on `spark`.
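For reference, here's a minimal runnable version of my typical workflow (the toy DataFrame and column names are just placeholders):

```python
from pyspark.sql import SparkSession

# Start a local Spark session backed by 4 worker threads
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("local")
    .getOrCreate()
)

# Placeholder data just to exercise the session
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    ["name", "age"],
)
df.filter(df.age > 40).show()

spark.stop()
```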
By doing things this way, I'm sure I'm missing out on experience I'll need if I ever run PySpark on a remote cluster, do more complicated work, etc. Can y'all recommend some tasks that would complement my practice on a local cluster?
I'm interested in beginner-level tasks that would prepare me both for the syntax and for the higher-level organization involved in using PySpark for truly distributed computation.
One idea I had was to spin up an HDInsight cluster on Azure and practice on there. Does that make sense?
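For context, here's my (possibly wrong) mental model of what would change on a cluster like HDInsight: I wouldn't hardcode the master in the script, and would submit the job with `spark-submit` instead. The file name and data are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Run with something like:
#   spark-submit --master yarn --deploy-mode cluster my_job.py
# ("my_job.py" is a hypothetical name; YARN is the resource
# manager on HDInsight Spark clusters)

# No .master("local[4]") here: spark-submit supplies the master,
# so the same script can run locally or on the cluster
spark = SparkSession.builder.appName("my_job").getOrCreate()

# Same toy data as my local example, just to have something to run
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    ["name", "age"],
)
df.filter(df.age > 40).show()

spark.stop()
```

Is that roughly the right picture, or is there more to the transition than that?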