Tutorial 7 - Pipelines and regression

What are the 5 parts of big data analysis?

  1. Understanding the business/the problem
  2. Loading data
  3. Exploring data
  4. Visualizing data
  5. Preparing data

How can you read a bunch of files into a single dataframe?

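Point spark.read.csv at the folder instead of a single file; Spark then reads every file in that folder into one dataframe:

data = spark.read.csv('.../foldername/', header=True, inferSchema=True)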

How can you check the types of the columns?

  • data.dtypes
  • data.printSchema()

How can you get a basic statistical summary of your data? What does it show?

  • Use data.describe().show()
  • Shows count, mean, stddev, min and max per column.
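To make the five statistics concrete, here is a minimal plain-Python sketch that computes the same kind of summary for one numeric column. It is not Spark code: the list `temperatures` and its values are made up for illustration, and it assumes the reported stddev is the sample standard deviation (dividing by n - 1), which is my understanding of Spark's behaviour.

```python
import statistics

# Hypothetical sample values for a single numeric column.
temperatures = [14.0, 20.5, 9.0, 25.5, 11.0]

summary = {
    "count": len(temperatures),
    "mean": statistics.mean(temperatures),
    # Sample standard deviation (divides by n - 1); assumed to match Spark's stddev.
    "stddev": statistics.stdev(temperatures),
    "min": min(temperatures),
    "max": max(temperatures),
}

# Print one line per statistic, similar in spirit to the describe() table.
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

In Spark the same five rows appear once per column of the dataframe.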

How can you easily visualize the dataframe?

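Use display(data). This function is provided by Databricks notebooks; outside Databricks, data.show() prints the first rows as a text table instead.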

How can you prepare the data for machine learning?

  1. Import VectorAssembler (from pyspark.ml.feature import VectorAssembler) and use it to combine the input columns into a single vector column of features.
  2. Then type the code:
  • columns = ['column1', 'column2', 'column3', 'column4']
  • vectorizer = VectorAssembler(inputCols=columns, outputCol="features")
  • dataset = vectorizer.transform(data)

How do you prepare the data and evaluate how well your linear regression model predicts power output?

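Use the randomSplit() function to divide the data into a training set and a test set. Fit the model on the training set only, then compare its predictions on the test set with the actual values.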

How do you create a linear regression model?

from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml import Pipeline

lr = LinearRegression()

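Which 2 parameters are not optional in the linear regression model and how do you set them?

  • Name of the label column, which holds the values to learn.
  • Name of the prediction column, where the predicted values should be stored.

Set them with lr.setLabelCol() and lr.setPredictionCol(), passing the column name as a string. Spark falls back to "label" and "prediction" when they are not set, so they must be set whenever the data uses other column names.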

What is a pipeline?

A pipeline chains several processing stages into one unit that runs them in order. Here it puts together the vectorization and the linear regression learner:

lrPipeline = Pipeline()
lrPipeline.setStages([vectorizer, lr])  

How do you create a linear regression model that has been trained on the training dataset?

By fitting lrPipeline to the training dataset:
LrModel = lrPipeline.fit(trainingSetDF)

