Summary: Knowledge Management And Business Intelligence
1 Data Science Foundations
1.1 Preprocessing
What are the steps in preprocessing?
Identify data sources
Select data
Clean data
Transform data
What types of sample bias are there?
Sample selection bias: consider the selection mechanism
Seasonality effects: consider the handling of time
How to treat missing values?
1. Remove:
Eliminate rows or columns. But this could mean deleting useful information.
2. Replace missing values:
- acquire true values: contact, purchase
- imputation techniques: replace by mean, prediction
3. Keep:
- add a variable called "missing", or introduce a dummy
4. Weight-of-evidence
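A minimal sketch of options 1 to 3 in plain Python. The `incomes` column and the function names are hypothetical, chosen only for illustration; `None` stands in for a missing value.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
incomes = [2500.0, None, 3100.0, 2800.0, None]

def drop_missing(values):
    """Option 1 (remove): drop the observations with a missing value."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Option 2 (replace): fill gaps with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]

def missing_dummy(values):
    """Option 3 (keep): a 0/1 indicator that flags where a value was missing."""
    return [1 if v is None else 0 for v in values]

print(drop_missing(incomes))
print(impute_mean(incomes))
print(missing_dummy(incomes))
```

The dummy from option 3 would normally be added next to the original column, so that a model can pick up on the fact that a value was missing.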
How to detect and treat outliers?
Z = (x - mean) / st.dev
If |Z| > 3, the value could be an outlier
Reduce the impact by capping values at z = 3, or by replacing them with the 99th percentile
Multivariate outliers: only visible when multiple dimensions are considered simultaneously. Often they are just ignored
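The z-score rule and the capping at z = 3 could be sketched as follows. The data and function names are made up for illustration, and the sample standard deviation is an assumption; the notes do not say which variant is meant.

```python
from statistics import mean, stdev

# Hypothetical data: 19 ordinary values and one extreme value.
# Note: with very few observations no z-score can exceed 3,
# so the rule only makes sense on reasonably sized samples.
values = [10.0] * 19 + [100.0]

def z_scores(data):
    """Z = (x - mean) / st.dev for every observation."""
    m, s = mean(data), stdev(data)  # stdev = sample standard deviation
    return [(x - m) / s for x in data]

def cap_at_z(data, z_max=3.0):
    """Pull values beyond mean +/- z_max * st.dev back to that limit."""
    m, s = mean(data), stdev(data)
    lower, upper = m - z_max * s, m + z_max * s
    return [min(max(x, lower), upper) for x in data]

flagged = [x for x, z in zip(values, z_scores(values)) if abs(z) > 3]
capped = cap_at_z(values)
print(flagged)
print(max(capped))
```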
What is feature engineering?
Enriching the data set so as to increase predictive performance.
For instance, time-flattening: removing the time dimension by defining features that summarize the performance period.
Or transforming unstructured data into structured data.
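As an illustration of time-flattening, the sketch below collapses a transaction log into one row of summary features per customer. The log, the column meanings and the chosen features are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, month, amount).
transactions = [
    ("c1", 1, 20.0), ("c1", 2, 35.0), ("c1", 3, 5.0),
    ("c2", 1, 100.0), ("c2", 3, 80.0),
]

def flatten_time(rows):
    """Summarize each customer's history into time-free features."""
    history = defaultdict(list)
    for customer, month, amount in rows:
        history[customer].append((month, amount))

    features = {}
    for customer, events in history.items():
        amounts = [amount for _, amount in events]
        features[customer] = {
            "n_transactions": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
            "last_month_active": max(month for month, _ in events),
        }
    return features

features = flatten_time(transactions)
print(features["c1"])
```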
What is variable transformation?
Normalization: rescale variables, typically to [0,1]
Standardisation: rescale data to have a mean of zero and a st.dev of one
Transformation: to a normal distribution
Advanced transformations: Box-Cox, Yeo-Johnson, Principal Component Analysis
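The first two transformations are simple enough to write out by hand. A minimal sketch, assuming the values are not all identical (otherwise both functions would divide by zero) and using the population standard deviation:

```python
from statistics import mean, pstdev

def min_max(values):
    """Normalization: rescale linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardise(values):
    """Standardisation: subtract the mean, divide by the st.dev."""
    m, s = mean(values), pstdev(values)  # pstdev = population st.dev
    return [(x - m) / s for x in values]

data = [2.0, 4.0, 6.0, 10.0]  # hypothetical values
normalized = min_max(data)
standardized = standardise(data)
print(normalized)
print(standardized)
```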
How to handle coarse classification?
Pivoting tables and regrouping categories in order to create more distinction between the groups. Candidate groupings are compared via the Chi-squared test: the bigger its value, the better.
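To make the comparison concrete, the sketch below computes the Pearson chi-squared statistic for two candidate regroupings of the same variable. The categories and the good/bad counts are invented for illustration, and the comparison is only fair when both options use the same number of groups.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a list of (good, bad) counts per group."""
    total_good = sum(good for good, _ in table)
    total_bad = sum(bad for _, bad in table)
    total = total_good + total_bad
    stat = 0.0
    for good, bad in table:
        n = good + bad
        exp_good = n * total_good / total  # expected count if group and outcome were independent
        exp_bad = n * total_bad / total
        stat += (good - exp_good) ** 2 / exp_good + (bad - exp_bad) ** 2 / exp_bad
    return stat

def regroup(counts, grouping):
    """Merge the original categories into coarser groups."""
    return [
        (sum(counts[c][0] for c in group), sum(counts[c][1] for c in group))
        for group in grouping
    ]

# Hypothetical (good, bad) counts per original category.
counts = {"owner": (600, 30), "rent": (300, 60), "parents": (80, 20), "other": (20, 10)}

option_a = regroup(counts, [["owner"], ["rent", "parents", "other"]])
option_b = regroup(counts, [["owner", "rent"], ["parents", "other"]])
chi_a, chi_b = chi_squared(option_a), chi_squared(option_b)
print(chi_a, chi_b)
```

With these made-up counts option A scores higher, so it separates goods from bads better than option B.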
Why change a continuous variable to categorical?
Interpretability: some prefer age segments
It allows non-linear relations to be incorporated within a linear model, and can thus improve performance
Sometimes for anonymization, or for different applications.
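A small sketch of the idea: age is cut into segments and each segment becomes its own 0/1 variable, so a linear model can give every segment a separate coefficient. The bin edges and labels are arbitrary assumptions.

```python
from bisect import bisect_right

# Hypothetical segment boundaries and labels.
AGE_EDGES = [25, 40, 60]
AGE_LABELS = ["<25", "25-39", "40-59", "60+"]

def age_segment(age):
    """Map a continuous age to its segment label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

def one_hot(label):
    """Turn a segment label into 0/1 dummy variables, one per segment."""
    return [1 if label == candidate else 0 for candidate in AGE_LABELS]

for age in (24, 25, 59, 60):
    print(age, age_segment(age), one_hot(age_segment(age)))
```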
How does weight-of-evidence work?
WOE = ln(Distr. Good / Distr. Bad) * 100
Why take the ln of the "relative odds" and not the absolute odds? This way WOE is independent of the class distribution and permits easy interpretation.
Information Value: IV = Sum((distr.good.cat - distr.bad.cat) * woe.cat)
Category boundaries can be chosen so as to maximise the predictive power in terms of IV
# of categories is a trade-off: fewer is simpler, more keeps predictive power.
Binning: the question is whether it is done with or without interaction
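The two formulas can be worked through on a toy example. The categories and counts below are invented. The sketch assumes every category contains at least one good and one bad (otherwise the logarithm is undefined), and it leaves out the factor 100, which only rescales WOE for readability.

```python
import math

def woe_iv(counts):
    """counts maps category -> (n_good, n_bad); returns (WOE per category, IV)."""
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    woe = {}
    iv = 0.0
    for category, (good, bad) in counts.items():
        dist_good = good / total_good  # share of all goods in this category
        dist_bad = bad / total_bad     # share of all bads in this category
        woe[category] = math.log(dist_good / dist_bad)  # unscaled: no factor 100
        iv += (dist_good - dist_bad) * woe[category]
    return woe, iv

# Hypothetical (good, bad) counts per age category.
counts = {"young": (100, 50), "middle": (300, 30), "old": (100, 20)}
woe, iv = woe_iv(counts)
print(woe)
print(iv)
```

A positive WOE means the category holds relatively more goods than bads, a negative WOE the opposite, and a WOE near zero means the category does not discriminate.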
What are the pros and cons of WOE?
All-in-one solution:
- categorical to continuous
- continuous to categorical
- missing values
- outliers
- assessment of predictive strength
- nonlinear relations in a linear, interpretable model
Drawbacks: some loss of predictive power?