Tutorial 7 - Pipelines and regression
What are the 5 parts of big data analysis?
- Understanding the business/the problem
- Loading data
- Exploring data
- Visualizing data
- Preparing data
How can you read a bunch of files into a single dataframe?
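- No answer is recorded here; a minimal sketch, assuming the files are CSVs with headers sitting in one directory (the path and options are illustrative):
- # Spark's readers accept glob patterns, so every matching file is loaded into one DataFrame
- data = spark.read.csv("/path/to/files/*.csv", header=True, inferSchema=True)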
How can you check the types of the columns?
- data.dtypes
- data.printSchema()
How can you get a basic statistical summary of your data? What does it show?
- Use data.describe().show()
- Shows count, mean, stddev, min and max per column.
How can you easily visualize the dataframe?
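- No answer is recorded here. If the tutorial runs in a Databricks notebook (an assumption about the environment), display(data) renders the dataframe as an interactive table with built-in plotting. A portable sketch is to sample into pandas and plot (column names are placeholders):
- # Pull a small sample to the driver and use pandas plotting
- pdf = data.sample(fraction=0.1, seed=42).toPandas()
- pdf.plot.scatter(x='column1', y='column2')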
How can you prepare the data for machine learning?
- Use a VectorAssembler (import it from pyspark.ml.feature) to combine the feature columns into a single vector column.
- Then type the code:
- from pyspark.ml.feature import VectorAssembler
- columns = ['column1', 'column2', 'column3', 'column4']
- vectorizer = VectorAssembler(inputCols=columns, outputCol="features")
- dataset = vectorizer.transform(data)
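- transform() adds the assembled features column next to the original columns rather than replacing them, which is why the assembler can later be reused as a pipeline stage; a quick sanity check (the call below is illustrative):
- # Inspect the first few assembled feature vectors
- dataset.select('features').show(3, truncate=False)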
How do you prepare the data and evaluate how well your linear regression model predicts power output?
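No answer is recorded here; a minimal sketch, assuming an 80/20 random split, RMSE as the metric, and the predicted_PE/PE column names set further down (the split ratio, the seed, and lrModel from the fit step shown below are assumptions):

from pyspark.ml.evaluation import RegressionEvaluator

# Hold out part of the raw data; the pipeline's vectorizer assembles features during fit/transform
trainingSetDF, testSetDF = data.randomSplit([0.8, 0.2], seed=42)

# Score the held-out set with the fitted pipeline model (created below)
predictionsDF = lrModel.transform(testSetDF)

# RMSE in the units of PE: lower means better power-output predictions
evaluator = RegressionEvaluator(predictionCol='predicted_PE', labelCol='PE', metricName='rmse')
rmse = evaluator.evaluate(predictionsDF)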
How do you create a linear regression model?
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import LinearRegressionModel
from pyspark.ml import Pipeline

lr = LinearRegression()
Which 2 parameters are not optional in a linear regression model, and how do you set them?
- The name of the label column, which holds the values to learn from.
- The name of the prediction column, where the predicted values should be stored.
lr.setPredictionCol('predicted_PE').setLabelCol('PE')
What is a pipeline?
- A pipeline chains a sequence of stages (transformers and estimators, here the vectorizer and the linear regression) so they can be fit and applied as a single unit:
lrPipeline = Pipeline()
lrPipeline.setStages([vectorizer, lr])
How do you create a linear regression model that has been trained on the training dataset?
lrModel = lrPipeline.fit(trainingSetDF)
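fit() returns a PipelineModel whose transform() runs every stage in order; a usage sketch, assuming the testSetDF held out earlier:
# The vectorizer assembles features, then the regression model adds predicted_PE
resultsDF = lrModel.transform(testSetDF)
resultsDF.select('PE', 'predicted_PE').show(5)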