Top-down systems biology

20 important questions on Top-down systems biology

What is a multivariate dataset? Give an example.

A dataset representing measurements of multiple variables.
E.g. an RNA-seq measurement:
  • RNA-seq measures the number of mRNA molecules present in a cell
  • humans: approx. 20,000 genes --> one measurement generates 20,000 integers
  • each gene is represented by its own variable, so 20,000 variables per sample
  • If you have 20 samples, from 20 tissues or 20 persons: a 20 x 20,000 dataset
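
As a rough illustration of the shape of such a dataset, here is a minimal sketch in plain Python. The counts are random placeholders, not real RNA-seq data, and the names `counts`, `shape` and `sample_0` are made up for this example.

```python
import random

# Sizes taken from the example above: 20 samples, 20,000 genes.
N_SAMPLES = 20
N_GENES = 20_000

rng = random.Random(0)  # fixed seed so the placeholder counts are reproducible

# One row per sample, one column (variable) per gene.
# The values are random placeholders, NOT real expression counts.
counts = [[rng.randint(0, 1000) for _ in range(N_GENES)]
          for _ in range(N_SAMPLES)]


def shape(matrix):
    """Return (number of rows, number of columns) of a list-of-lists matrix."""
    return len(matrix), len(matrix[0])


# All measurements on one sample, collected into a single tuple.
sample_0 = tuple(counts[0])

print(shape(counts))   # rows = samples, columns = genes
print(len(sample_0))   # number of variables measured on one sample
```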

How can we map multivariate data in space? (Invented by Descartes!)

  • Multivariate data sets can be considered as sets of n-tuples, where every one of the n variables that you measure on one sample occupies a position in the tuple.
  • We let every position in the tuple, i.e. every variable, correspond to one axis in an n-dimensional space. Each tuple then maps as one point in this space.

Why is PCA suitable for dimension reduction?

  • An interesting phenomenon is that, when PCA is properly applied, the higher-numbered principal components, i.e. those that capture the least variance, contain mostly measurement noise.
  • This implies that you can “throw away” these axes, i.e. the information about the position of each data point on these axes, without losing useful information about the data.
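
The idea can be tried out on a toy example. The sketch below is not a full PCA implementation: it assumes made-up data with only two variables, and it uses power iteration to approximate the first principal axis, which is one simple method chosen here for illustration. All function and variable names are hypothetical.

```python
import math
import random


def center(data):
    """Subtract the per-variable mean from every sample."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centred data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration.

    Assumes the start vector is not orthogonal to that eigenvector.
    """
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, axis):
    """Coordinate of every sample along the given axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]


def variance(values):
    """Sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(1)
# Made-up data: one shared signal measured twice, each time with small noise.
signal = [rng.gauss(0, 5) for _ in range(200)]
data = [[s + rng.gauss(0, 0.5), s + rng.gauss(0, 0.5)] for s in signal]

centered = center(data)
pc1 = first_component(covariance(centered))
pc2 = [-pc1[1], pc1[0]]  # in 2D the second axis is perpendicular to the first

var_pc1 = variance(project(centered, pc1))
var_pc2 = variance(project(centered, pc2))
explained = var_pc1 / (var_pc1 + var_pc2)
print(f"fraction of variance on PC1: {explained:.3f}")
```

In this toy data almost all variance lies on the first axis, so dropping the second axis mostly discards the added noise.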

How do newer dimension reduction techniques compare to PCA?

  • They are much more efficient in displaying the data, i.e. they need far fewer dimensions than PCA to summarize all the relevant variation in the data.
  • They derive distance from the distribution of the data.
  • Disadvantage: the new axes can no longer be related to the original variables.

Why is t-SNE currently a popular dimension reduction technique? Also name its disadvantages.

  • t-SNE = t-distributed Stochastic Neighbour Embedding
  • Advantage: it shows grouping of samples much better than most other dimension reduction techniques
  • Disadvantage 1: it is very calculation-intensive for large numbers of measurements
  • Disadvantage 2: it is a non-deterministic procedure. Repeating the calculation yields different results.

What is meant by “t-SNE derives distance from the distribution of the data”?

Figure: e.g. Measurements on differentiating cells.
  • Interpret the clusters as fully differentiated cell types, and the points in between as cells that are still in the process of differentiating.
  • Clusters are close together in the plane, but the data suggests a longer path between them!
  • The only path between them is the detour via the circle
  • A t-SNE analysis picks up on this contextually derived distance (distance derived from the way the data is distributed)

What is unsupervised learning?

  • learning from data without using any additional knowledge about the samples to which the data apply.
  • aim: discover interesting aspects of the data, in particular:
    • existence of groups of samples
    • principal axes of variance in the data

Name 3 techniques often used in unsupervised learning.

  • Principal component analysis
  • hierarchical clustering
  • k-means clustering

Explain the concept of Euclidean distance.

  • It corresponds to the distance measured along a straight line between two objects.
  • 2D space: $D_e(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
  • Often used because it is the most intuitive, but it is not useful if you travel by car or bike, since you cannot follow a straight line.
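
A minimal sketch of this distance in Python, written for any number of dimensions rather than only 2D. The function name `euclidean_distance` is chosen for this example.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


d = euclidean_distance((0, 0), (3, 4))
print(d)  # the classic 3-4-5 right triangle
```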

Base pair difference is an important distance metric in biology. Which methods are used for this calculation? (2)

  • Hamming distance
  • Jukes-Cantor distance
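
Both distances can be sketched in a few lines, assuming two sequences that are already aligned and of equal length. The Jukes-Cantor correction used here is the standard formula d = -3/4 ln(1 - 4/3 p), with p the fraction of differing sites. The example sequences are made up.

```python
import math


def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def jukes_cantor_distance(seq_a, seq_b):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    p = hamming_distance(seq_a, seq_b) / len(seq_a)
    if p >= 0.75:
        raise ValueError("distance is undefined when 75% or more sites differ")
    return -0.75 * math.log(1 - (4 / 3) * p)


# Made-up example sequences that differ at two of eight positions.
seq_1 = "ACGTACGT"
seq_2 = "ACGTTCGA"

n_diff = hamming_distance(seq_1, seq_2)
jc = jukes_cantor_distance(seq_1, seq_2)
print(n_diff, round(jc, 4))
```

The Jukes-Cantor distance is larger than the raw fraction of differing sites because it corrects for multiple substitutions at the same position.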

Distance between communities of species is an important distance metric in biology. Which two methods are used for this calculation?

  • Jaccard distance
  • Bray-Curtis distance
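
A minimal sketch of both distances, using their usual definitions: the Jaccard distance only looks at which species are present, while the Bray-Curtis distance also uses their abundances. The two sites and their counts are made up for illustration.

```python
def jaccard_distance(species_a, species_b):
    """1 minus (shared species / all species found in either community)."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def bray_curtis_distance(counts_a, counts_b):
    """Sum of absolute abundance differences divided by total abundance.

    counts_a and counts_b map species name -> abundance.
    """
    species = set(counts_a) | set(counts_b)
    diff = sum(abs(counts_a.get(s, 0) - counts_b.get(s, 0)) for s in species)
    total = sum(counts_a.get(s, 0) + counts_b.get(s, 0) for s in species)
    return diff / total if total else 0.0


# Hypothetical communities: species name -> number of individuals counted.
site_1 = {"species_A": 6, "species_B": 4}
site_2 = {"species_B": 4, "species_C": 6}

jaccard = jaccard_distance(site_1, site_2)     # 1 shared out of 3 species
bray_curtis = bray_curtis_distance(site_1, site_2)
print(round(jaccard, 3), round(bray_curtis, 3))
```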

What do clustering algorithms do?

  • They group together objects (samples, for example) based on distances derived from multivariate observations on these objects. These groups are the “structure” in the data.
  • They also achieve a huge reduction in information: instead of by the original 20,000 variables, samples may now be characterized by their membership of a limited number of groups
  • What clustering algorithms have in common is that they need a distance concept both for two objects as well as for an object and a group of objects

Explain the concept of K-means clustering.

  • K cluster centres are determined and each sample is assigned to one cluster centre.
  • The objective is to search for those K cluster centres (called centroids) and cluster assignments of all data points, so that the total distance of all data points to their cluster centre is minimized.
  • Based on Euclidean distance

Explain the algorithm of K-means clustering.

  1. The number of clusters, K, is chosen
  2. The initial cluster centres are K different, randomly chosen points
  3. For each point the distance to each cluster centre is calculated
  4. Each point is assigned to the nearest cluster centre
  5. Cluster centres are re-calculated as the centre of assigned points, i.e. the mean of their positions
  6. Steps 3–5 are iterated until cluster assignments do not change any more
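
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation. The `seed` and `max_iterations` parameters, the handling of empty clusters and the example points are assumptions added for this sketch.

```python
import math
import random


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Step 2: the initial centres are k different, randomly chosen points.
    centres = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Steps 3 and 4: assign every point to its nearest centre.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 6: stop when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 5: move each centre to the mean position of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centres, assignments


# Made-up 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = k_means(points, k=2)  # Step 1: K is chosen by the user
print(centres)
print(labels)
```

Because the starting centres are random, the cluster numbers (0 or 1) can swap between runs with different seeds, and on less clearly separated data the final grouping can differ as well.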

Explain the concept of hierarchical clustering (in contrast to K-means clustering).

  • In contrast to K-means clustering: you don't have to define a number of clusters (K) in advance.
  • --> Disadvantage: outcome is more complex:
    • not just exclusive groups of samples
    • but hierarchical structure of sample groups
    • --> visualized in dendrograms

Hierarchical clustering algorithms can be classified by the way they calculate distances between groups of objects. Name three subgroups.

  • Single linkage: distance between groups = distance between nearest objects from both groups
  • Complete linkage: distance between groups = distance between furthest objects from both groups
  • Average linkage: distance between groups = average of distances of all pairs of objects from both groups (UPGMA)


*this classification is independent of the distance metric that is used!
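
The three linkage rules can be written as small functions that take the distance metric as a parameter, which makes their independence from the metric explicit. This is a sketch with made-up groups. `manhattan` is included only as an example of an alternative metric.

```python
import math
from itertools import product


def pairwise_distances(group_a, group_b, metric):
    """Distances between every object in group_a and every object in group_b."""
    return [metric(a, b) for a, b in product(group_a, group_b)]


def single_linkage(group_a, group_b, metric=math.dist):
    """Distance between the nearest objects from both groups."""
    return min(pairwise_distances(group_a, group_b, metric))


def complete_linkage(group_a, group_b, metric=math.dist):
    """Distance between the furthest objects from both groups."""
    return max(pairwise_distances(group_a, group_b, metric))


def average_linkage(group_a, group_b, metric=math.dist):
    """Average distance over all pairs of objects from both groups."""
    distances = pairwise_distances(group_a, group_b, metric)
    return sum(distances) / len(distances)


def manhattan(a, b):
    """Alternative metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Made-up groups of points on a line, so the distances are easy to check.
group_a = [(0, 0), (1, 0)]
group_b = [(3, 0), (5, 0)]

print(single_linkage(group_a, group_b))     # nearest pair: (1,0) and (3,0)
print(complete_linkage(group_a, group_b))   # furthest pair: (0,0) and (5,0)
print(average_linkage(group_a, group_b))    # mean of 3, 5, 2 and 4
```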

Describe a simple procedure to identify biomarkers among RNA expression levels.

  • Identify whether expression values are statistically significantly correlated with the disease status of interest
  • If there are two classes: t-test
  • Multiple classes (benign, stage 1, stage 2) --> ANOVA
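
For the two-class case, the sketch below ranks genes by the size of their t statistic. It uses Welch's unequal-variance form of the t statistic, which is an assumption of this example and not necessarily the variant meant above. It also stops at the statistic itself: deciding on significance additionally requires a p-value from the t distribution, and with many genes a correction for multiple testing. The gene names and expression values are made up.

```python
import math
import statistics


def welch_t_statistic(group_a, group_b):
    """t statistic for the difference in means, allowing unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical expression values: gene -> (healthy samples, diseased samples).
expression = {
    "gene_1": ([5.1, 4.9, 5.0, 5.2], [8.0, 7.8, 8.3, 8.1]),
    "gene_2": ([3.0, 3.4, 2.9, 3.1], [3.1, 3.0, 3.3, 2.9]),
}

t_values = {gene: welch_t_statistic(healthy, diseased)
            for gene, (healthy, diseased) in expression.items()}

# Genes with the largest absolute t statistic are the strongest candidates.
ranked = sorted(t_values, key=lambda gene: abs(t_values[gene]), reverse=True)
print(ranked)
```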

Describe a method to build a classifier.

  • Generate a decision tree
  • Each decision point in a decision tree is constructed from a split-point based on a variable
  • Impure? --> add more split points (e.g. start with gene 1, then further down look at gene 2)

What does the Gini impurity describe?

It can be interpreted as the average fraction of wrongly labelled objects if the class labels present in the original set were randomly distributed over the objects.
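
Written as a formula, this interpretation corresponds to the usual definition of the Gini impurity. The notation is an assumption of this sketch: $S$ is the set of objects, $k$ the number of classes and $p_i$ the fraction of objects in $S$ that belong to class $i$.

```latex
G(S) = \sum_{i=1}^{k} p_i \, (1 - p_i) = 1 - \sum_{i=1}^{k} p_i^2
```

For example, a set with two equally frequent classes has $G = 1 - (0.5^2 + 0.5^2) = 0.5$, while a set containing a single class has $G = 0$.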

How is the Gini impurity of two sets calculated? And the decrease of Gini impurity after introducing a new split point?

The Gini impurity of two sets A and B of objects, G(A, B) is defined as the weighted sum of Gini impurities of the individual sets, as defined above. The weight is proportional to the fraction of objects in a set relative to all sets.
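
A minimal sketch of this calculation in Python. The impurity of a single set is computed as 1 minus the sum of squared class fractions. The decrease is taken here as the impurity of the unsplit set minus the weighted impurity after the split, which is the usual definition but an assumption with respect to this page. The labels are made up.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one set: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(set_a, set_b):
    """Weighted sum of the impurities of two sets, weighted by set size."""
    total = len(set_a) + len(set_b)
    return (len(set_a) / total * gini_impurity(set_a)
            + len(set_b) / total * gini_impurity(set_b))


def gini_decrease(parent, set_a, set_b):
    """Impurity before the split minus weighted impurity after the split."""
    return gini_impurity(parent) - gini_of_split(set_a, set_b)


# Made-up class labels for eight samples and one candidate split.
parent = ["sick"] * 4 + ["healthy"] * 4
left = ["sick"] * 3 + ["healthy"] * 1
right = ["sick"] * 1 + ["healthy"] * 3

print(gini_impurity(parent))
print(gini_of_split(left, right))
print(gini_decrease(parent, left, right))
```

A split point that separates the classes perfectly would bring the weighted impurity down to 0, so the decrease would equal the impurity of the unsplit set.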

(See screenshot)
