Summary: Knowledge Management And Business Intelligence


1 Data Science Foundations
1.1 Preprocessing

What are the steps in preprocessing?

Identify data sources
Select data
Clean data
Transform data
What types of sample bias are there?

Sample selection bias: consider the selection mechanism
Seasonality effects: consider the handling of time
How to treat missing values?

1. Remove:
Eliminate rows or columns, but this could mean deleting useful information.

2. Replace missing values
- acquire true values: contact, purchase
- imputation techniques: replace by mean, prediction

3. Keep
- the fact that a value is missing can itself be informative: add a variable called missing, or introduce a dummy

4. Weight-of-evidence
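As a rough illustration of options 1 to 3, here is a minimal sketch using only the Python standard library. The `incomes` column and its values are made up for the example, and mean imputation is just one of the imputation techniques mentioned above.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
incomes = [2500.0, None, 3100.0, 1800.0, None, 2600.0]

# Option 1 (remove): drop the observations with a missing value.
removed = [x for x in incomes if x is not None]

# Option 2 (replace): impute with the mean of the observed values.
observed_mean = mean(removed)
imputed = [observed_mean if x is None else x for x in incomes]

# Option 3 (keep): add a dummy that flags where the value was missing.
is_missing = [1 if x is None else 0 for x in incomes]

print(removed)
print(imputed)
print(is_missing)
```

In practice the dummy from option 3 is often combined with an imputed value, so that a model can use both the filled-in number and the fact that it was missing.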
How to detect and treat outliers?

Z = (x - mean) / st.dev
If |Z| > 3, the value could be an outlier.

Reduce the impact by capping values at z = 3 (truncation), or by replacing them with the 99th percentile.

Multivariate outliers: observations that only stand out when multiple dimensions are considered simultaneously. These are often simply ignored.
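The univariate rule above could be sketched as follows. The function names `z_scores` and `cap_at_z` and the sample data are hypothetical, and the sample standard deviation is assumed for "st.dev".

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / st.dev for every value."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

def cap_at_z(values, z_max=3.0):
    """Truncate values to the band mean +/- z_max * st.dev."""
    m, s = mean(values), stdev(values)
    lower, upper = m - z_max * s, m + z_max * s
    return [min(max(x, lower), upper) for x in values]

# Hypothetical data: nineteen ordinary values and one extreme value.
values = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
          10, 10, 9, 11, 10, 10, 9, 11, 10, 100]

# Flag values whose absolute z-score exceeds 3.
outliers = [x for x, z in zip(values, z_scores(values)) if abs(z) > 3]

# Reduce their impact by capping at z = 3 instead of deleting them.
capped = cap_at_z(values)

print(outliers)
print(max(capped))
```

Note that the extreme value itself inflates the mean and standard deviation, so in very small samples a z-score above 3 may never occur.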
What is feature engineering?

Enrich the data set so as to increase predictive performance.

For instance, time-flattening: removing the time dimension by defining features that summarize the performance period.
Another example is transforming unstructured data into structured data.
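A minimal sketch of time-flattening, assuming a hypothetical transaction log of (customer, month, amount) records. The chosen summary features (count, total, mean, last amount) are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction log: (customer_id, month, amount).
transactions = [
    ("A", 1, 100.0), ("A", 2, 150.0), ("A", 3, 50.0),
    ("B", 1, 20.0), ("B", 3, 40.0),
]

# Group the time series per customer.
by_customer = defaultdict(list)
for customer_id, month, amount in transactions:
    by_customer[customer_id].append((month, amount))

# Flatten: one row of summary features per customer, no time dimension left.
features = {}
for customer_id, rows in by_customer.items():
    rows.sort()  # order by month so "last" is well defined
    amounts = [amount for _, amount in rows]
    features[customer_id] = {
        "n_transactions": len(amounts),
        "total_amount": sum(amounts),
        "mean_amount": mean(amounts),
        "last_amount": amounts[-1],
    }

print(features)
```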
What is variable transformation?

Normalization: rescale variables to a fixed range, typically [0,1]
Standardisation: rescale data to have a mean of zero and a st.dev of one.
Transformation: reshape the variable towards a normal distribution

Advanced transformations: Box-Cox, Yeo-Johnson, Principal Component Analysis
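The first two rescalings could be sketched like this. The helper names `min_max` and `standardize` and the `ages` data are hypothetical, the population standard deviation is assumed, and the variable is assumed not to be constant.

```python
from statistics import mean, pstdev

def min_max(values):
    """Normalization: rescale to [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardisation: rescale to mean zero and st.dev one."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]

ages = [20, 30, 40, 50, 60]  # hypothetical variable

normalized = min_max(ages)
standardized = standardize(ages)

print(normalized)
print(standardized)
```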
How to handle coarse classification?

Pivoting tables and regrouping categories in order to create more distinction between the groups. Candidate groupings are compared via the Chi-squared test: the bigger its value, the better.
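To make the comparison concrete, the sketch below computes the Pearson Chi-squared statistic for two candidate groupings of the same variable. The category names and the good/bad counts are invented for illustration, and both options use three groups so that their statistics are comparable.

```python
def chi_squared(table):
    """Pearson Chi-squared statistic for rows of (goods, bads) counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat

def regroup(counts, grouping):
    """Sum (goods, bads) over the categories in each group."""
    return [
        tuple(sum(counts[cat][i] for cat in group) for i in range(2))
        for group in grouping
    ]

# Hypothetical (goods, bads) counts per original category.
counts = {
    "owner": (600, 30),
    "rent_unfurnished": (160, 40),
    "rent_furnished": (95, 25),
    "with_parents": (95, 55),
}

option_a = [["owner"], ["rent_unfurnished", "rent_furnished"], ["with_parents"]]
option_b = [["owner", "with_parents"], ["rent_unfurnished"], ["rent_furnished"]]

chi_a = chi_squared(regroup(counts, option_a))
chi_b = chi_squared(regroup(counts, option_b))

# The grouping with the larger statistic separates goods and bads better.
print(round(chi_a, 1), round(chi_b, 1))
```

With these made-up counts, option A scores higher, which matches the intuition that merging owners with people living with their parents blurs two groups with very different bad rates.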
Why change a continuous variable to categorical?

Interpretability: some prefer age segments

It allows non-linear relations to be incorporated within a linear model, and can thus improve performance.

Sometimes for anonymization, or different applications.
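A small sketch of turning a continuous age into segments and then into dummies that a linear model can use. The bin edges, labels and helper names are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges and segment labels.
edges = [25, 40, 60]
labels = ["<25", "25-39", "40-59", "60+"]

def age_segment(age):
    """Map a continuous age onto its segment label."""
    return labels[bisect_right(edges, age)]

def segment_dummies(age):
    """Dummy-code the segment, using the first label as reference category."""
    segment = age_segment(age)
    return [1 if segment == label else 0 for label in labels[1:]]

for age in (24, 25, 45, 60):
    print(age, age_segment(age), segment_dummies(age))
```

Each dummy gets its own coefficient in a linear model, so the effect of age no longer has to be a straight line.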
How does weight-of-evidence work?

WOE = ln(Distr. Good / Distr. Bad) * 100
Why take the ln of the "relative odds" and not the absolute odds? This way WOE is independent of the class distribution and permits easy interpretation.

Information Value: IV = Sum((distr.good.cat - distr.bad.cat) * woe.cat)

Category boundaries can be chosen so as to maximise the predictive power in terms of IV.

The number of categories is a trade-off: fewer categories is simpler, more categories keeps more predictive power.

Binning: the question is whether it is done with or without interaction.
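The WOE and IV formulas above can be worked through on a small example. The categories and good/bad counts are invented, every category is assumed to contain at least one good and one bad, and, as an assumption following common practice, the IV is computed with the unscaled ln value rather than the value multiplied by 100.

```python
from math import log

# Hypothetical (goods, bads) counts per category.
counts = {"young": (100, 50), "middle": (300, 40), "senior": (200, 10)}

total_good = sum(goods for goods, _ in counts.values())
total_bad = sum(bads for _, bads in counts.values())

woe = {}      # unscaled: ln(distr. good / distr. bad)
woe_pct = {}  # scaled by 100, as in the formula above
iv = 0.0
for category, (goods, bads) in counts.items():
    distr_good = goods / total_good  # share of all goods in this category
    distr_bad = bads / total_bad     # share of all bads in this category
    woe[category] = log(distr_good / distr_bad)
    woe_pct[category] = woe[category] * 100
    # Assumption: IV uses the unscaled WOE.
    iv += (distr_good - distr_bad) * woe[category]

print({category: round(value, 1) for category, value in woe_pct.items()})
print(round(iv, 4))
```

A negative WOE marks a category with relatively more bads than goods, a positive WOE the opposite.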
What are the pros and cons of WOE?

All-in-one solution:
- categorical to continuous
- continuous to categorical
- missing values
- outliers
- assessment of predictive strength
- nonlinear relations in a linear, interpretable model

Drawbacks: some loss of predictive power?