Tutorial 7 - Pipelines and regression
What are the 5 parts of big data analysis?
- Understanding the business/the problem
- Loading data
- Exploring data
- Visualizing data
- Preparing data
How can you read a bunch of files into a single dataframe?
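- No answer is recorded here; a minimal sketch, assuming the files are CSVs with headers sitting in one directory (the path and options are illustrative):
- # Spark's readers accept glob patterns, so every matching file is loaded into one DataFrame
- data = spark.read.csv("/path/to/files/*.csv", header=True, inferSchema=True)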
How can you check the types of the columns?
- data.dtypes
- data.printSchema()
How can you get a basic statistical summary of your data? What does it show?
- Use data.describe().show()
- Shows count, mean, stddev, min and max per column.
How can you easily visualize the dataframe?
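- No answer is recorded here. If the tutorial runs in a Databricks notebook (an assumption about the environment), display(data) renders the dataframe as an interactive table with built-in plotting. A portable sketch is to sample into pandas and plot (column names are placeholders):
- # Pull a small sample to the driver and use pandas plotting
- pdf = data.sample(fraction=0.1, seed=42).toPandas()
- pdf.plot.scatter(x='column1', y='column2')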
How can you prepare the data for machine learning?
- Use a VectorAssembler (import it from pyspark.ml.feature) to combine the feature columns into a single vector column.
- Then type the code:
- from pyspark.ml.feature import VectorAssembler
- columns = ['column1', 'column2', 'column3', 'column4']
- vectorizer = VectorAssembler(inputCols=columns, outputCol="features")
- dataset = vectorizer.transform(data)
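- transform() adds the assembled features column next to the original columns rather than replacing them, which is why the assembler can later be reused as a pipeline stage; a quick sanity check (the call below is illustrative):
- # Inspect the first few assembled feature vectors
- dataset.select('features').show(3, truncate=False)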
How do you prepare the data and evaluate how well your linear regression model predicts power output?
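No answer is recorded here; a minimal sketch, assuming an 80/20 random split, RMSE as the metric, and the predicted_PE/PE column names set further down (the split ratio, the seed, and lrModel from the fit step shown below are assumptions):

from pyspark.ml.evaluation import RegressionEvaluator

# Hold out part of the raw data; the pipeline's vectorizer assembles features during fit/transform
trainingSetDF, testSetDF = data.randomSplit([0.8, 0.2], seed=42)

# Score the held-out set with the fitted pipeline model (created below)
predictionsDF = lrModel.transform(testSetDF)

# RMSE in the units of PE: lower means better power-output predictions
evaluator = RegressionEvaluator(predictionCol='predicted_PE', labelCol='PE', metricName='rmse')
rmse = evaluator.evaluate(predictionsDF)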
How do you create a linear regression model?
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml import Pipeline

lr = LinearRegression()
Which 2 parameters are not optional in a linear regression model, and how do you set them?
- The name of the label column, which holds the values to learn from.
- The name of the prediction column, where the predicted values should be stored.
lr.setPredictionCol('predicted_PE').setLabelCol('PE')
What is a pipeline?
- A pipeline chains a sequence of stages (transformers and estimators, here the vectorizer and the linear regression) so they can be fit and applied as a single unit:
lrPipeline = Pipeline()
lrPipeline.setStages([vectorizer, lr])
How do you create a linear regression model that has been trained on the training dataset?
lrModel = lrPipeline.fit(trainingSetDF)
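fit() returns a PipelineModel whose transform() runs every stage in order; a usage sketch, assuming the testSetDF held out earlier:
# The vectorizer assembles features, then the regression model adds predicted_PE
resultsDF = lrModel.transform(testSetDF)
resultsDF.select('PE', 'predicted_PE').show(5)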