Tutorial 5 - Dataframes
6 important questions on Tutorial 5 - Dataframes
What is a DataFrame?
- DataFrames are built on top of RDDs and, like RDDs, are immutable: once a DataFrame is created it cannot be changed, and each transformation creates a new DataFrame.
- A DataFrame must have a schema, meaning it consists of columns that each have a name and a type.
What is a defining feature of Spark vs e.g. Hadoop?
What is the code to create a DataFrame?
What is the code to check the number of partitions?
What is a better method to visualize data than collect()?
What does filter() do? Give an example of code for filtering age less than 10.
- It only keeps values that match the filter expression.
- Each task creates a new partition containing only the entries from the original partition that match the expression, e.g. those with an 'age' column value less than 10.
filteredDF = subDF.filter(subDF.age < 10)
In general: data.filter(<condition on data.column>)
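The partition-by-partition behaviour described above can be modelled in plain Python. This is a conceptual sketch, not Spark itself: partitions are represented as lists of row dictionaries, and `filter_partitions` is a hypothetical helper that stands in for one task per partition.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Partition = List[Row]


def filter_partitions(
    partitions: List[Partition],
    predicate: Callable[[Row], bool],
) -> List[Partition]:
    """Build a new partition from each original one, keeping only matching rows."""
    return [[row for row in part if predicate(row)] for part in partitions]


# Hypothetical data split across two partitions.
original = [
    [{"name": "Alice", "age": 7}, {"name": "Bob", "age": 34}],
    [{"name": "Carol", "age": 9}, {"name": "Dave", "age": 51}],
]

# Equivalent in spirit to subDF.filter(subDF.age < 10).
filtered = filter_partitions(original, lambda row: row["age"] < 10)
print(filtered)
```

The original partitions are left untouched and the result is a new structure, which mirrors the immutability of DataFrames: a filter yields a new DataFrame rather than modifying the existing one.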