Tutorial 5 - Dataframes

6 important questions on Tutorial 5 - Dataframes

What is a DataFrame?

  • DataFrames are built on top of RDDs and are also immutable: once a DataFrame is created it cannot be changed, and each transformation produces a new DataFrame.
  • A DataFrame must have a schema, meaning it consists of columns that each have a name and a type (see the sketch below).
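A minimal sketch of both points, assuming the sqlContext object from the tutorial notebook is available (the data and column names are made up for illustration):

# Create a DataFrame with a schema of two named, typed columns
peopleDF = sqlContext.createDataFrame([('Alice', 9), ('Bob', 42)], ('name', 'age'))

# Transformations never modify peopleDF; they return a new DataFrame
olderDF = peopleDF.select(peopleDF.name, peopleDF.age + 1)

peopleDF.printSchema()   # the original schema and data are unchanged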

What is a defining feature of Spark vs e.g. Hadoop?

Spark keeps its data in memory rather than on disk, which lets applications run faster because they are not slowed down by repeatedly reading data from disk.
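As a hedged illustration of the in-memory idea (cache() is an addition here, not part of the tutorial text), caching asks Spark to keep a DataFrame in memory so later actions can reuse it without rereading the source:

dataDF.cache()    # mark the (hypothetical) dataDF to be kept in memory
dataDF.count()    # first action: computes the DataFrame and fills the cache
dataDF.count()    # later actions are served from memory, not from disk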

What is the code to create a DataFrame?

dataDF = sqlContext.createDataFrame(data, ('column_1', 'column_2'))
where data is e.g. a list of tuples (or an RDD) and the second argument is the tuple of column names.
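A complete, hedged sketch with made-up sample data (data, dataDF and the column names are illustrative, not from the tutorial):

# Sample data as a list of (name, age) tuples
data = [('Alice', 9), ('Bob', 42), ('Carol', 7)]

# Build the DataFrame; the second argument supplies the column names
dataDF = sqlContext.createDataFrame(data, ('name', 'age'))

dataDF.printSchema()   # name: string, age: long (the types are inferred from the data)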

What is the code to check the number of partitions?

dataDF.rdd.getNumPartitions()
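For context (an addition, not from the tutorial text): the call returns a plain integer, and repartition() returns a new DataFrame with a different partition count:

print(dataDF.rdd.getNumPartitions())           # e.g. 8, depending on how the data was created

repartitionedDF = dataDF.repartition(4)        # transformation: dataDF itself is unchanged
print(repartitionedDF.rdd.getNumPartitions())  # 4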

What is a better method to visualize data than collect()?

show(): it displays the first 20 rows if no row count is given and prints them in a nicely formatted table, instead of pulling every row back to the driver the way collect() does.
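A small hedged example, reusing the dataDF from the sketch above:

dataDF.show()      # table of the first 20 rows (fewer if the DataFrame is smaller)
dataDF.show(5)     # only the first 5 rows

rows = dataDF.collect()   # by contrast, collect() returns all rows as a list of Row
                          # objects on the driver, which can be slow for large data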

What does filter() do? Give an example of code for filtering age less than 10.

  • filter() keeps only the rows that match the filter expression.
  • Each task builds a new partition containing the entries from its original partition that satisfy the expression, e.g. the rows whose 'age' column value is less than 10.
Code example:
filteredDF = subDF.filter(subDF.age < 10)
The general pattern is dataDF.filter(<boolean expression over the DataFrame's columns>), as in the fuller sketch below.
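A fuller hedged sketch (subDF is hypothetical here; any DataFrame with an 'age' column works the same way):

subDF = sqlContext.createDataFrame([('Alice', 9), ('Bob', 42), ('Carol', 7)], ('name', 'age'))

# filter() is a transformation: it returns a new DataFrame and leaves subDF untouched
filteredDF = subDF.filter(subDF.age < 10)
filteredDF.show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  9|
# |Carol|  7|
# +-----+---+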
