Tutorial 5 - Dataframes

6 important questions on Tutorial 5 - Dataframes

What is a DataFrame?

  • DataFrames are built on top of RDDs and are also immutable: once a DataFrame is created it cannot be changed, and each transformation produces a new DataFrame.
  • A DataFrame must have a schema, meaning it consists of columns that each have a name and a type (see the sketch below).
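A minimal sketch of both points, assuming the sqlContext object from the tutorial notebook is available (the data and column names are made up for illustration):

# Create a DataFrame with a schema of two named, typed columns
peopleDF = sqlContext.createDataFrame([('Alice', 9), ('Bob', 42)], ('name', 'age'))

# Transformations never modify peopleDF; they return a new DataFrame
olderDF = peopleDF.select(peopleDF.name, peopleDF.age + 1)

peopleDF.printSchema()   # the original schema and data are unchanged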

What is a defining feature of Spark vs e.g. Hadoop?

Spark keeps its data in memory rather than on disk, which lets applications run faster because they are not slowed down by repeatedly reading data from disk.
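As a hedged illustration of the in-memory idea (cache() is an addition here, not part of the tutorial text), caching asks Spark to keep a DataFrame in memory so later actions can reuse it without rereading the source:

dataDF.cache()    # mark the (hypothetical) dataDF to be kept in memory
dataDF.count()    # first action: computes the DataFrame and fills the cache
dataDF.count()    # later actions are served from memory, not from disk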

What is the code to create a DataFrame?

dataDF = sqlContext.createDataFrame(data, ('column_1', 'column_2'))
where data is e.g. a list of tuples (or an RDD) and the second argument is the tuple of column names.
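A complete, hedged sketch with made-up sample data (data, dataDF and the column names are illustrative, not from the tutorial):

# Sample data as a list of (name, age) tuples
data = [('Alice', 9), ('Bob', 42), ('Carol', 7)]

# Build the DataFrame; the second argument supplies the column names
dataDF = sqlContext.createDataFrame(data, ('name', 'age'))

dataDF.printSchema()   # name: string, age: long (the types are inferred from the data)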

What is the code to check the number of partitions?

dataDF.rdd.getNumPartitions()
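For context (an addition, not from the tutorial text): the call returns a plain integer, and repartition() returns a new DataFrame with a different partition count:

print(dataDF.rdd.getNumPartitions())           # e.g. 8, depending on how the data was created

repartitionedDF = dataDF.repartition(4)        # transformation: dataDF itself is unchanged
print(repartitionedDF.rdd.getNumPartitions())  # 4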

What is a better method to visualize data than collect()?

show(): it displays the first 20 rows if no row count is given and prints them in a nicely formatted table, instead of pulling every row back to the driver the way collect() does.
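A small hedged example, reusing the dataDF from the sketch above:

dataDF.show()      # table of the first 20 rows (fewer if the DataFrame is smaller)
dataDF.show(5)     # only the first 5 rows

rows = dataDF.collect()   # by contrast, collect() returns all rows as a list of Row
                          # objects on the driver, which can be slow for large data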

What does filter() do? Give an example of code for filtering age less than 10.

  • filter() keeps only the rows that match the filter expression.
  • Each task builds a new partition containing the entries from its original partition that satisfy the expression, e.g. the rows whose 'age' column value is less than 10.
Code example:
filteredDF = subDF.filter(subDF.age < 10)
The general pattern is dataDF.filter(<boolean expression over the DataFrame's columns>), as in the fuller sketch below.
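A fuller hedged sketch (subDF is hypothetical here; any DataFrame with an 'age' column works the same way):

subDF = sqlContext.createDataFrame([('Alice', 9), ('Bob', 42), ('Carol', 7)], ('name', 'age'))

# filter() is a transformation: it returns a new DataFrame and leaves subDF untouched
filteredDF = subDF.filter(subDF.age < 10)
filteredDF.show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  9|
# |Carol|  7|
# +-----+---+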
