Summary: Data Science & Big Data Analytics | 9781118876138 | EMC Education Services
5 Text Mining
term frequency (tf)
A measure of the density of a term within a single document.
inverse document frequency (idf)
A measure of the discriminating power of a term: its rarity across the whole corpus.
- Rarer term -> more important -> a low document count should yield a higher score
- More common term -> less important -> a high document count should yield a lower score
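The two definitions above can be combined into the usual tf-idf weight. A minimal sketch in Python (the course's own examples use R; the corpus here is made up for illustration):

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: density of the term within one document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: rarity of the term across the corpus.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [
    ["data", "science", "is", "fun"],
    ["big", "data", "needs", "hadoop"],
    ["science", "needs", "rigour"],
]
# "data" occurs in 2 of 3 docs (common, low idf);
# "hadoop" occurs in 1 of 3 (rare, higher idf), so it scores higher.
print(tf_idf("hadoop", corpus[1], corpus))
print(tf_idf("data", corpus[1], corpus))
```

Note that libraries differ in the exact idf formula (smoothing, adding 1); this sketch uses the plain log ratio matching the flashcard definition.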
6 Practice Exams
K-means clustering is a form of unsupervised data mining algorithm. What is meant by this statement?
You do not have to designate an attribute of interest as a label in advance. So, no human intervention is needed to separate the observed attributes into two groups: those that predict and those that are to be predicted. (1 pt)
Give an example of a supervised data mining algorithm.
Linear regression, … (several answers correct) (1 pt)
Consider the following bags of words taken from a large set of internet pages:
A = {"iot", "datamining"}
B = {"datamining", "hadoop"}
C = {"hadoop"}
D = {"R", "datamining"}
E = {"hadoop", "datamining"}
F = {"iot", "datamining", "R", "hadoop"}
Minimum support = 0.5. For the frequent subsets with a length of 2, list all possible association rules.
{"hadoop"} -> {"datamining"}
{"datamining"} -> {"hadoop"}
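The frequent pairs behind these two rules can be found by brute force. A short sketch in Python, using the six bags from the question:

```python
from itertools import combinations

# The six bags of words from the question.
bags = [
    {"iot", "datamining"},
    {"datamining", "hadoop"},
    {"hadoop"},
    {"R", "datamining"},
    {"hadoop", "datamining"},
    {"iot", "datamining", "R", "hadoop"},
]
min_support = 0.5

def support(itemset):
    # Fraction of bags that contain every item in the itemset.
    return sum(1 for b in bags if itemset <= b) / len(bags)

items = set().union(*bags)
frequent_pairs = [p for p in map(frozenset, combinations(items, 2))
                  if support(p) >= min_support]
print(frequent_pairs)  # only {"hadoop", "datamining"}, support 3/6 = 0.5
```

Only {"hadoop", "datamining"} reaches the 0.5 threshold (it occurs in bags B, E and F), which is why exactly these two rules are possible.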
Consider the following bags of words taken from a large set of internet pages:
A = {"iot", "datamining"}
B = {"datamining", "hadoop"}
C = {"hadoop"}
D = {"R", "datamining"}
E = {"hadoop", "datamining"}
F = {"iot", "datamining", "R", "hadoop"}
Minimum support = 0.5
1) Calculate for each rule the confidence and the lift. Which rule do you consider the most interesting?
Conf({"hadoop"} -> {"datamining"}) = (3/6) / (4/6) = 0.75
Lift({"hadoop"} -> {"datamining"}) = (3/4) / (5/6) = 0.90 (1 pt)
Conf({"datamining"} -> {"hadoop"}) = (3/6) / (5/6) = 0.60
Lift({"datamining"} -> {"hadoop"}) = (3/5) / (4/6) = 0.90 (1 pt)
The most interesting is the rule with the highest confidence (the lifts are equal): rule 1, {"hadoop"} -> {"datamining"}. (2 pt)
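These confidence and lift figures can be verified with a small Python sketch over the same six bags (the exam itself works these out by hand):

```python
# The six bags of words from the question.
bags = [
    {"iot", "datamining"},
    {"datamining", "hadoop"},
    {"hadoop"},
    {"R", "datamining"},
    {"hadoop", "datamining"},
    {"iot", "datamining", "R", "hadoop"},
]

def support(itemset):
    # Fraction of bags containing the whole itemset.
    return sum(1 for b in bags if set(itemset) <= b) / len(bags)

def confidence(lhs, rhs):
    # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # Confidence corrected for how common rhs already is.
    return confidence(lhs, rhs) / support(rhs)

h, d = {"hadoop"}, {"datamining"}
print(confidence(h, d), lift(h, d))  # approx. 0.75 and 0.90
print(confidence(d, h), lift(d, h))  # approx. 0.60 and 0.90
```

Both rules share the same lift (lift is symmetric in lhs and rhs), so confidence is what separates them here.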
1) The R output below comes from modelling a person's height to predict their weight. Explain how this outcome could be used to predict the weight of a person of height 1.80 m.
Y = a*X + b
Y = 106 * 1.80 - 114.3
Y = 76.5
(2 pt; explanation only, no calculation: 1 pt)
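The prediction step is just plugging the height into the fitted line. A minimal sketch in Python, assuming the slope 106 and intercept -114.3 stated in the answer (the actual R `lm()` output referenced by the question is not reproduced here):

```python
# Coefficients as stated in the model answer; in practice they would
# be read from the R lm() summary shown in the original exam.
slope, intercept = 106.0, -114.3

def predict_weight(height_m):
    # Simple linear model: weight = slope * height + intercept
    return slope * height_m + intercept

print(predict_weight(1.80))  # approx. 76.5 (kg)
```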
1) How could you use the chart below to explain the quality of the model?
R-squared ≈ 0.497 (49.7%): not that good, not that bad. In the graph you can see that the variance in weight increases with larger heights. I would say the model is moderately OK.
1) The well-known data from titanic.csv can be used to build a logistic regression model for the attribute Survived (Yes/No). (A sample of the data and the R logistic regression output were shown here in the original exam.) What is your conclusion on the model when you compare the attributes Pclass, Sexmale, Age and Fare?
You should leave out Fare as an explanatory variable. (1 pt)
1) There is a large set of labelled pictures, and you want to classify a new picture. a) Pick an algorithm and motivate your choice. b) Name one of the main parameters involved in this algorithm. c) How does it influence the outcome of the algorithm?
a) A neural network, because of the number of variables involved. (3 pt)
b) Number of layers, size of the layers, learning rate, … (several answers correct) (1 pt)
c) The more layers, the more computing power is needed, but the higher the accuracy. (1 pt)