As I was going through DEV 360, I noticed a number of errors, differences in output, etc.
For Lecture 2:
- It looks like the input CSV file used in the video lecture contained a slightly different dataset. For example, items_sold in the lecture is 628, while the same code on the provided data returns 627; the number of Xbox bids is 2784 in the data and 2811 in the lecture;
- The code for Xbox-only records could select more records than intended if any user has a name containing "xbox" - maybe this should be explicitly mentioned;
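A minimal sketch of a safer filter, matching only the item-type column rather than the whole line (the column index and field name here are assumptions, not the lab's actual layout - adjust to the real CSV schema):

```scala
// Assumed position of the item-type field in the split record (hypothetical).
val itemtype = 7

val xboxRDD = inputRDD
  .map(line => line.split(","))
  // Compare the item-type field only, so a bidder named e.g. "xbox_fan"
  // does not slip into the result set.
  .filter(fields => fields(itemtype) == "xbox")
```

A naive `line.contains("xbox")` filter, by contrast, matches the substring anywhere in the line, including bidder names.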
- When constructing the Auctions items, we can use already-defined names instead of numeric constants - the code becomes more readable:
val auctionsRDD = inputRDD.map(a => Auctions(a(auctionid), a(bid).toFloat, a(bidtime).toFloat,
a(price).toFloat, a(itemtype), a(daystolive).toInt))
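The snippet above assumes the column indices have been bound to names beforehand; a sketch of such definitions (the concrete values are assumptions - verify them against the actual CSV layout):

```scala
// Hypothetical column positions in the auction CSV; adjust to the real schema.
val auctionid  = 0
val bid        = 1
val bidtime    = 2
val price      = 6
val itemtype   = 7
val daystolive = 8
```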
- In DEV360_LabGuide, on page 10, we define the bidsAuctionRDD variable to hold data about auctions, but in the next question this RDD is referred to as bidsItemRDD. Both names also differ from the name in the provided solution: bids_auctionRDD. It would be nice to unify the names;
- In lab 2.2.2, I needed to explicitly import the SQL functions to make max/min/avg work in the groupBy.agg call, although this could be specific to my setup (I used a local Spark 1.6.1 instead of the Sandbox);
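The import in question: in Spark 1.6 the aggregate helpers live in `org.apache.spark.sql.functions`. The column names below are placeholders, not the lab's actual ones:

```scala
// Brings max/min/avg (and the other SQL helpers) into scope.
import org.apache.spark.sql.functions.{max, min, avg}

// With the import in place, a call of this shape compiles:
// df.groupBy("item").agg(max("price"), min("price"), avg("price"))
```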
For Lecture 3:
- When talking about dependencies, mention that the version of the Spark dependencies should match the version running on the cluster, and that the same Scala version should be used;
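A hypothetical build.sbt fragment illustrating the point (the versions here are examples; pin them to whatever the cluster actually runs):

```scala
// The prebuilt Spark 1.6.1 binaries are compiled against Scala 2.10,
// so the application should use the same Scala minor version.
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the assembly jar,
  // since the cluster supplies it at runtime.
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)
```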
- It would be nice to add some hints about performance optimization, the memory model, etc. (maybe in Lecture 2);
- It would be nice to mention tools for interactive development: IPython + PySpark, Spark Notebook, Apache Zeppelin.
Thank you for a good intro to Spark!