
Top-down systems biology

20 important questions on Top-down systems biology

What is a multivariate dataset? Give an example.

A dataset representing measurements of multiple variables.
E.g. an RNA-seq measurement:

RNA-seq measures the number of mRNA molecules present in a cell
Humans: approx. 20,000 genes --> one measurement generates 20,000 integers
Each gene is represented by its own variable, so 20,000 variables per sample
If you have 20 samples, from 20 tissues or 20 persons: a 20 x 20,000 dataset
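
A minimal sketch of the shape of such a dataset. The sample and gene numbers come from the example above; the counts themselves are random placeholders (an assumption for illustration), not real RNA-seq values.

```python
import random

N_SAMPLES = 20      # e.g. 20 tissues or 20 persons
N_GENES = 20_000    # approximate number of human genes

random.seed(0)  # fixed seed so the placeholder data is reproducible

# One row per sample, one column (variable) per gene.
# The counts are random placeholders, not real expression values.
dataset = [[random.randint(0, 1000) for _ in range(N_GENES)]
           for _ in range(N_SAMPLES)]

# One sample is one tuple of 20,000 integers.
sample_0 = tuple(dataset[0])

print(len(dataset), "samples x", len(dataset[0]), "variables")
```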

How can we map multivariate data in space? (Invented by Descartes!)

Multivariate data sets can be considered as sets of n-tuples, where every one of the n variables that you measure on one sample occupies a position in the tuple.
We let every position in the tuple, i.e. every variable, correspond to one axis in an n-dimensional space. Each tuple then maps as one point in this space.

Why is PCA suitable for dimension reduction?

An interesting phenomenon is that, when PCA is properly applied, the higher principal components, those with less variance, contain mostly measurement noise.
This implies that you can “throw away” these axes, i.e. the information about the position of each data point on these axes, without losing useful information about the data.
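
As an illustration, the sketch below finds the first principal component of a small made-up 2-D dataset and reports which fraction of the total variance it carries. It uses power iteration on the covariance matrix, which is one simple way to obtain the leading axis and is an assumption here, not necessarily how a PCA package does it. The data points and the function name `first_principal_component` are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (unit direction, fraction of total variance) of PC1.

    rows: list of equal-length tuples (samples x variables).
    Assumes the data are not constant and that the start vector is not
    orthogonal to PC1 (both hold for the toy data below).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    pc_variance = sum(v[i] * cov[i][j] * v[j]
                      for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, pc_variance / total_variance

# Made-up points lying close to the line y = x, plus a little "noise".
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
direction, explained = first_principal_component(points)
print(direction, explained)
```

In this toy case nearly all variance lies along the first axis, so the position of each point on the second axis could be discarded with little loss.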

How do newer dimension reduction techniques compare to PCA?

They display the data much more efficiently, i.e. they need far fewer dimensions than PCA to summarize all the relevant variation in the data.
They derive distance from the distribution of the data.
Disadvantage: the new axes can no longer be related to the original variables.

Why is t-SNE currently a popular dimension reduction technique? Also name its disadvantages.

t-SNE = t-distributed Stochastic Neighbour Embedding
Advantage: it shows the grouping of samples much better than most other dimension reduction techniques
Disadvantage 1: it is very computation-intensive for large numbers of measurements
Disadvantage 2: it is a non-deterministic procedure. Repeating the calculation yields different results.

What is meant by “t-SNE derives distance from the distribution of the data”?

Figure: e.g. measurements on differentiating cells.

Interpret the clusters as fully differentiated cell types, and the points between them as cells still in the process of differentiating.
Two clusters may be close together in the plane, but the data suggest a longer path between them!
The only path between them is a detour via the circle
A t-SNE analysis picks up on this contextually derived distance (distance derived from the way the data are distributed)

What is unsupervised learning?

learning from data without using any additional knowledge about the samples to which the data apply.
aim: discover interesting aspects of the data, in particular:

existence of groups of samples
principal axes of variance in the data

Name 3 techniques often used in unsupervised learning.

Principal component analysis
hierarchical clustering
k-means clustering

Explain the concept of Euclidean distance.

It corresponds to the distance measured along a straight line between two objects.
2D space: $D_e(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
It is often used because it is the most intuitive distance, but it is not always useful: if you travel by car or bike, you cannot follow a straight line.
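
A small sketch of the Euclidean distance for points with any number of coordinates. The Manhattan (city-block) distance is not named in the source; it is included here only as an assumed contrast for the car-or-bike case, where travel follows a grid of streets.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Distance when moving only parallel to the axes (contrast example)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(manhattan_distance((0, 0), (3, 4)))  # 7
```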

Base pair difference is an important distance metric in biology. Which methods are used for this calculation? (2)

Hamming distance
Jukes-Cantor distance
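
A sketch of both distances for two aligned DNA sequences of equal length. The Hamming distance counts the differing positions; the Jukes-Cantor distance corrects the fraction of differing positions p for multiple substitutions at the same site, using d = -3/4 ln(1 - 4p/3). The example sequences are made up.

```python
import math

def hamming_distance(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def jukes_cantor_distance(seq_a, seq_b):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    p = hamming_distance(seq_a, seq_b) / len(seq_a)
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

seq_1 = "ACGTACGTAC"  # hypothetical sequences
seq_2 = "ACGTTCGTAA"
print(hamming_distance(seq_1, seq_2))
print(jukes_cantor_distance(seq_1, seq_2))
```

The Jukes-Cantor value is always at least as large as the raw fraction of differing sites, because some substitutions are hidden by later substitutions at the same position.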

Distance between communities of species is an important distance metric in biology. Which two methods are used for this calculation?

Jaccard distance
Bray-Curtis distance
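
A sketch of both distances using their common definitions: the Jaccard distance only looks at which species are present (1 minus shared species over all species), while the Bray-Curtis distance also uses abundances (sum of absolute count differences over the sum of all counts). The two sites and their counts are made up.

```python
def jaccard_distance(species_a, species_b):
    """1 - |A intersect B| / |A union B|, based on presence/absence only."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def bray_curtis_distance(counts_a, counts_b):
    """Sum of |a_i - b_i| divided by sum of (a_i + b_i) over all species."""
    species = set(counts_a) | set(counts_b)
    diff = sum(abs(counts_a.get(s, 0) - counts_b.get(s, 0)) for s in species)
    total = sum(counts_a.get(s, 0) + counts_b.get(s, 0) for s in species)
    return diff / total if total else 0.0

# Hypothetical species counts at two sites.
site_1 = {"oak": 6, "beech": 4}
site_2 = {"oak": 2, "beech": 4, "birch": 4}

print(jaccard_distance(site_1, site_2))      # 2 of 3 species shared
print(bray_curtis_distance(site_1, site_2))  # 8 / 20
```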

What do clustering algorithms do?

They group together objects (samples, for example) based on distances derived from multivariate observations on these objects. These groups are the “structure” in the data.
They also bring a huge reduction in information: instead of by the original 20,000 variables, samples may now be characterized by their membership of a limited number of groups.
What clustering algorithms have in common is that they need a distance concept both for two objects and for an object and a group of objects.

Explain the concept of K-means clustering.

K cluster centres are determined and each sample is assigned to one cluster centre.
The objective is to search for those K cluster centres (called centroids) and cluster assignments of all data points, so that the total distance of all data points to their cluster centre is minimized.
Based on Euclidean distance

Explain the algorithm of K-means clustering.

1. The number of clusters, K, is chosen
2. The initial cluster centres are K different, randomly chosen points
3. For each point the distance to each cluster centre is calculated
4. Each point is assigned to the nearest cluster centre
5. Cluster centres are re-calculated as the centre of assigned points, i.e. the mean of their positions
6. Steps 3–5 are iterated until cluster assignments do not change any more
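
The procedure above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation; the function name `k_means`, the `seed` and `max_iter` parameters, and the example points are assumptions made for the sketch.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centres, assignment) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Initial centres: K different, randomly chosen data points.
    centres = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Compute distances and assign each point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Stop when the cluster assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Re-calculate each centre as the mean position of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centres, assignment

# Made-up 2-D points forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centres, assignment = k_means(points, k=2)
print(centres)
print(assignment)
```

Because the starting centres are random, a different `seed` can give a different numbering of the clusters, and on less clear-cut data also a different grouping.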

Explain the concept of hierarchical clustering (in contrast to K-means clustering).

In contrast to K-means clustering: you don't have to define a number of clusters (K) in advance.
--> Disadvantage: outcome is more complex:

not just exclusive groups of samples
but hierarchical structure of sample groups
--> visualized in dendrograms

Hierarchical clustering algorithms can be classified by the way they calculate distances between groups of objects. Name three subgroups.

Single linkage: distance between groups = distance between nearest objects from both groups
Complete linkage: distance between groups = distance between farthest objects from both groups
Average linkage: distance between groups = average of distances of all pairs of objects from both groups (UPGMA)

*this classification is independent of the distance metric that is used!
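
A sketch of the three group distances. The object-to-object metric is passed in as the `dist` parameter (defaulting to Euclidean distance via `math.dist`), which reflects that the linkage type and the distance metric are chosen independently. The function names and example groups are hypothetical.

```python
import math
from itertools import product

def pairwise_distances(group_a, group_b, dist=math.dist):
    """Distances between every object in group_a and every object in group_b."""
    return [dist(a, b) for a, b in product(group_a, group_b)]

def single_linkage(group_a, group_b, dist=math.dist):
    """Distance between the nearest objects of the two groups."""
    return min(pairwise_distances(group_a, group_b, dist))

def complete_linkage(group_a, group_b, dist=math.dist):
    """Distance between the farthest objects of the two groups."""
    return max(pairwise_distances(group_a, group_b, dist))

def average_linkage(group_a, group_b, dist=math.dist):
    """Average over all pairs of objects from both groups (UPGMA)."""
    d = pairwise_distances(group_a, group_b, dist)
    return sum(d) / len(d)

group_a = [(0, 0), (1, 0)]
group_b = [(3, 0), (5, 0)]
print(single_linkage(group_a, group_b))    # 2.0
print(complete_linkage(group_a, group_b))  # 5.0
print(average_linkage(group_a, group_b))   # 3.5
```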

Describe a simple procedure to identify biomarkers among RNA expression levels.

Identify whether expression values are statistically significantly correlated with the disease status of interest
If there are two classes: t-test
Multiple classes (benign, stage 1, stage 2) --> ANOVA
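
For the two-class case, a sketch of ranking genes by a t statistic. The source only says "t-test"; Welch's unequal-variance form is an assumption made here. The gene names and expression values are made up, and turning the statistic into a p-value (which needs the t distribution) is left out.

```python
import math
from statistics import mean, variance

def welch_t_statistic(group_a, group_b):
    """Welch's t statistic for two independent groups of measurements."""
    var_a, var_b = variance(group_a), variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical expression values: (healthy samples, diseased samples).
expression = {
    "gene_1": ([10, 12, 11, 13], [20, 22, 21, 23]),
    "gene_2": ([10, 12, 11, 13], [11, 10, 13, 12]),
}

t_values = {gene: welch_t_statistic(healthy, diseased)
            for gene, (healthy, diseased) in expression.items()}

# Genes with the largest absolute t value are the strongest candidates.
ranked = sorted(t_values, key=lambda g: abs(t_values[g]), reverse=True)
print(ranked, t_values)
```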

Describe a method to build a classifier.

Generate a decision tree
Each decision point in a decision tree is constructed from a split-point based on a variable
Still impure? --> add more split points (e.g. start with gene 1, then further down look at gene 2)

What does the Gini impurity describe?

It can be interpreted as the average fraction of wrongly labelled objects if the class labels present in the original set were randomly distributed over the objects.
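
Under this interpretation, the Gini impurity of a set equals 1 minus the sum of the squared class fractions. A minimal sketch, with made-up class labels:

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class fractions; 0.0 for a pure or empty set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["sick", "sick", "healthy", "healthy"]))  # 0.5
print(gini_impurity(["sick", "sick", "sick", "sick"]))        # 0.0
```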

How is the Gini impurity of two sets calculated? And the decrease of Gini impurity after introducing a new split point?

The Gini impurity of two sets A and B of objects, G(A, B) is defined as the weighted sum of Gini impurities of the individual sets, as defined above. The weight is proportional to the fraction of objects in a set relative to all sets.

(See screenshot)
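
The screenshot is not available here. As a sketch under the definition above, the weighted Gini impurity of two sets can be computed as below. The decrease is taken to be the impurity of the unsplit set minus the weighted impurity after the split, which is the usual formulation but an assumption with respect to the source. The labels are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class fractions; 0.0 for a pure or empty set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(set_a, set_b):
    """G(A, B): Gini impurities weighted by each set's share of all objects."""
    n = len(set_a) + len(set_b)
    return (len(set_a) / n) * gini_impurity(set_a) \
         + (len(set_b) / n) * gini_impurity(set_b)

def gini_decrease(set_a, set_b):
    """Impurity before the split minus weighted impurity after it (assumed)."""
    return gini_impurity(set_a + set_b) - gini_of_split(set_a, set_b)

# Hypothetical labels on either side of a split point.
set_a = ["sick", "sick", "sick", "healthy"]
set_b = ["healthy", "healthy", "healthy", "sick"]
print(gini_of_split(set_a, set_b))  # 0.375
print(gini_decrease(set_a, set_b))  # 0.125
```

A split point that separates the classes better gives a larger decrease; a perfect split of this example would reduce the impurity from 0.5 to 0.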

The questions on this page originate from the summary of the following study material:

Syllabus Introduction to systems biology
