
Summary: Data Science & Big Data Analytics | 9781118876138 | EMC education services

Name: Data Science & Big Data Analytics
ISBN: 9781118876138



PLEASE KNOW!!! There are just 47 flashcards and notes available for this material. This summary might not be complete. Please search similar or other summaries.


5 Text mining

This is a preview. There is 1 more flashcard available for chapter 5.
term frequency (tf)

measure of the density of a term within a document
inverse document frequency (idf)

measure of the discriminating power of a term:
its rarity across the whole corpus;

•Rarer term -> more important -> a low document frequency should result in a high score
•More common term -> less important -> a high document frequency should result in a low score
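
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming one common variant of the formulas (tf as the share of a document's tokens, idf as log(N / number of documents containing the term)); the toy corpus and the function names are made up for the example and do not come from the book.

```python
import math
from collections import Counter

def tf(term, doc):
    """Term frequency: share of the tokens in `doc` that equal `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing `term`)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: a term that occurs nowhere gets score 0
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    """Combined score: dense in this document and rare in the corpus."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["hadoop", "stores", "big", "data"],
    ["data", "mining", "finds", "patterns"],
    ["data", "science", "uses", "data"],
]

for term in ("data", "hadoop"):
    print(term, "idf =", round(idf(term, corpus), 3))
```

Here "data" occurs in every document, so its idf is 0, while "hadoop" occurs in only one document and gets the higher score.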
6 Practice exams

This is a preview. There are 13 more flashcards available for chapter 6
K-means clustering is an unsupervised data mining algorithm. What is meant by this statement?

You do not have to designate an attribute in advance as the label of interest. So no human intervention is needed to separate the observed attributes into two groups: those that predict and those that are to be predicted. 1pt
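
To make the "no label needed" point concrete, here is a minimal k-means sketch in plain Python. It is an illustration under stated assumptions, not the book's implementation: the first k points serve as initial centroids (real implementations usually initialise randomly), the number of iterations is fixed, and the sample points are invented.

```python
def kmeans(points, k, iterations=10):
    """Group `points` into k clusters using only the observed attributes."""
    # Assumption: take the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

# Hypothetical data: note that no attribute is marked as "the label".
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The input is just a list of observations; the grouping emerges from the distances between them, which is what separates this from a supervised method such as regression.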
Give an example of a supervised data mining algorithm.

Linear regression, … (several answers correct) 1pt
Consider the following bags of words taken from a large set of internet pages:
A={“iot”, “datamining”}
B={“datamining”, “hadoop”}
C={“hadoop”}
D={“R”, “datamining”}
E={“hadoop”, “datamining”}
F={“iot”, “datamining”, “R”, “hadoop”}
Minimum support=0.5
For the frequent subsets with a length of 2, list all possible association rules.

{“hadoop”} -> {“datamining”}
{“datamining”} -> {“hadoop”}
Consider the following bags of words taken from a large set of internet pages:
A={“iot”, “datamining”}
B={“datamining”, “hadoop”}
C={“hadoop”}
D={“R”, “datamining”}
E={“hadoop”, “datamining”}
F={“iot”, “datamining”, “R”, “hadoop”}
Minimum support=0.5
Calculate for each rule the confidence and the lift. Which rule do you consider the most interesting?

1pt
Conf({“hadoop”} -> {“datamining”}) = (3/6) / (4/6) = 0.75
Lift({“hadoop”} -> {“datamining”}) = (3/4) / (5/6) = 0.90
1pt
Conf({“datamining”} -> {“hadoop”}) = (3/6) / (5/6) = 0.60
Lift({“datamining”} -> {“hadoop”}) = (3/5) / (4/6) = 0.90

Most interesting: rule 1, {“hadoop”} -> {“datamining”}. It has the highest confidence (0.75 versus 0.60); the lift is the same for both rules (0.90). 2pt
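
The support, confidence and lift figures for these six bags can be checked with a short script. This is a sketch using the standard definitions (support = share of bags containing the itemset, confidence = support(X ∪ Y) / support(X), lift = confidence / support(Y)); the function and variable names are chosen for the example.

```python
from itertools import combinations

# The six bags of words from the question.
bags = {
    "A": {"iot", "datamining"},
    "B": {"datamining", "hadoop"},
    "C": {"hadoop"},
    "D": {"R", "datamining"},
    "E": {"hadoop", "datamining"},
    "F": {"iot", "datamining", "R", "hadoop"},
}
transactions = list(bags.values())
MIN_SUPPORT = 0.5

def support(itemset):
    """Share of bags that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """How often rhs occurs in the bags that contain lhs."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to how common rhs is anyway."""
    return confidence(lhs, rhs) / support(rhs)

# Frequent itemsets of length 2.
items = sorted(set().union(*transactions))
frequent_pairs = [set(pair) for pair in combinations(items, 2)
                  if support(set(pair)) >= MIN_SUPPORT]

# Every frequent pair yields two candidate rules.
for pair in frequent_pairs:
    for item in sorted(pair):
        lhs, rhs = {item}, pair - {item}
        print(lhs, "->", rhs,
              "confidence =", round(confidence(lhs, rhs), 2),
              "lift =", round(lift(lhs, rhs), 2))
```

Only the pair {“datamining”, “hadoop”} reaches the minimum support, which is why only two rules are listed.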
The R output below comes from a model that uses a person’s height to predict their weight. Explain how this output could be used to predict the weight of a person of height 1.80.

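Y = a*X + b, with X the height, Y the predicted weight, a the slope and b the intercept
Y = 106*1.80 - 114.3
Y = 76.5 (using the rounded coefficients above)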
2pt
Explanation only, no calculation: 1pt
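
Applying the fitted line is a single multiplication and addition. The sketch below assumes the slope (106) and intercept (-114.3) quoted in the answer, with height in metres and weight in kilograms; the units and the function name are assumptions, since the R output itself is not reproduced here.

```python
def predict_weight(height_m, slope=106.0, intercept=-114.3):
    """Linear model Y = a*X + b with a = slope and b = intercept.

    The default coefficients are the rounded values quoted in the answer;
    the units (metres, kilograms) are an assumption.
    """
    return slope * height_m + intercept

print(round(predict_weight(1.80), 1))
```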
How could you use the chart below to explain the quality of the model?

R-squared ≈ 49.7%: not that good, not that bad. The graph shows that the variance in weight increases with height. I would say the model is moderately OK.
The well-known data from titanic.csv can be used to build a logistic regression model for the attribute Survived (Yes/No). The outcome in R is shown below.
Sample of the data:
Outcome of logistic regression in R:
What is your conclusion on the model when you compare the attributes Pclass, Sexmale, Age and Fare?

You should leave out Fare as an explanatory variable. 1pt
There is a large set of labeled pictures, and you want to classify a new picture.
a) Pick an algorithm and motivate your choice.
b) Name one of the main parameters involved in this algorithm.
c) How does it influence the outcome of the algorithm?

a) Neural network, because of the large number of variables involved 3pt
b) Number of layers, size of layers, learning rate, … (several answers correct) 1pt
c) The more layers, the more computing power is needed, but generally the higher the accuracy 1pt
