NLP lecture
17 important questions on NLP lecture
What are drawbacks of using tf-idf values to encode documents?
- The vocabulary size becomes very large: every word gets its own column, resulting in approximately 30,000 columns
- the documents often have very sparse vectors: there are a lot of zeros in the vectors because the majority of the words in the vocabulary do not occur in the document.
- it cannot be used to recognize synonyms, and the positional information of the words is lost. The two sentences below mean nearly the same thing yet share almost no words (see the sketch after the examples):
- “The dog ate its food”
- “The canine consumed the meal served to it”
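A minimal sketch of this synonym problem, assuming scikit-learn's TfidfVectorizer (the library choice and code are illustrative, not from the lecture):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The dog ate its food",
    "The canine consumed the meal served to it",
]

# Each column is one vocabulary word; each row is one (mostly sparse) document vector.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)                        # (2, number of distinct words)
print(cosine_similarity(X[0], X[1]))  # low: the only word the two sentences share is "the"
```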
Next to tf-idf values, how can words and documents be numerically encoded?
- Dummy coding: the dictionary is represented as a vector with one position per word. A word is then encoded by a vector that has a 1 at the position of that word and 0's for all other words; it encodes a position in the dictionary (see the sketch after this list)
- this is also referred to as one-hot encoding
- the vector for one word can look like: (0,0,0,0,0,0,1,0,0,0)
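A minimal one-hot encoding sketch for a small toy vocabulary (the vocabulary and the one_hot helper are illustrative assumptions):

```python
import numpy as np

vocabulary = ["the", "dog", "ate", "its", "food", "canine", "consumed", "meal", "served", "to"]

def one_hot(word: str) -> np.ndarray:
    # One position per dictionary word: 1 at the word's index, 0 everywhere else.
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(word)] = 1.0
    return vec

print(one_hot("consumed"))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
```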
What are some drawbacks of encoding words through dummy coding/one-hot encoding?
- Again, each word is represented by a very long vector, and most of its entries are zeros
- this encoding is arbitrary: the position of the 1 in the vector just depends on where the word happens to sit in the dictionary. The position of the 1 thus doesn't convey any information about the word itself.
- a similarity measure on these vectors is useless: the similarity between two different words is always 0, and it only becomes 1 when a word is compared with itself (see the sketch below).
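To make the last point concrete, a tiny sketch (with toy one-hot vectors, not from the lecture) showing that the similarity between one-hot vectors is either 0 or 1:

```python
import numpy as np

# Two one-hot vectors for different words in a 10-word dictionary (toy example).
dog    = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
canine = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)

cosine = dog @ canine / (np.linalg.norm(dog) * np.linalg.norm(canine))
print(cosine)     # 0.0 -- even synonyms look completely unrelated
print(dog @ dog)  # 1.0 -- similarity is only nonzero for the word itself
```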
What is another alternative method of encoding documents?
- Through the bag-of-words model: this is just the term-count matrix on which you would perform the tf-idf calculation.
- the vector that represents a word is now its column of values in that matrix.
- the documents are then encoded as a vector of vectors (see the sketch below).
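A minimal bag-of-words sketch, assuming scikit-learn's CountVectorizer (the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the dog ate its food",
    "the dog chased the cat",
    "the cat ate its meal",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()  # rows = documents, columns = words

# A word's vector is its column: how often the word occurs in each document.
vocab = vectorizer.vocabulary_                     # word -> column index
print(counts[:, vocab["cat"]])                     # [0 1 1]
```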
What is an advantage and disadvantage of the bag of words model?
- This method encodes similarity between words through their co-occurrence in documents.
- however, you need many documents to obtain good representations
A problem of these word vector encodings is that the dimensionality becomes very large. What is an alternative way of encoding words that overcomes this shortcoming?
- Latent semantic analysis (LSA): this technique works by performing a PCA on the document-term matrix: the matrix with words in the columns and documents in the rows (this can also be done on the tf-idf matrix).
- the number of principal components you select is the number of dimensions the word vectors will have.
- each word's loadings on the principal components become its word vector; such a vector is called a word embedding (see the sketch below).
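A minimal LSA sketch, assuming scikit-learn's TruncatedSVD as the PCA-like decomposition of the tf-idf matrix (the corpus and the choice of 2 components are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog ate its food",
    "the canine consumed the meal",
    "stocks fell on the market today",
    "the market rallied and stocks rose",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # documents x words (tf-idf matrix)

svd = TruncatedSVD(n_components=2)   # keep 2 latent dimensions
svd.fit(X)

# Each column of components_ corresponds to one word; its loadings are the embedding.
word_embeddings = svd.components_.T  # words x 2, dense instead of sparse
print(word_embeddings[tfidf.vocabulary_["dog"]])
```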
How do word embeddings solve problems of the earlier discussed ways of encoding words numerically?
- Word embeddings of similar words are very similar, so they capture the semantic meaning of the words much better.
- for bag of words, for example, the word vectors of similar words are often not similar at all.
- the word vectors are lower-dimensional and dense: there are only as many dimensions as you picked, and the values are mostly non-zero instead of predominantly zero.
What is an important property of word2vec word embeddings?
- Analogies can be calculated using word embeddings.
- e.g. king : man ≈ ? : woman can be calculated by:
- w_king − w_man + w_woman ≈ w_queen
- here w_x is the word embedding of the word x
- the relationships that words have with each other are encoded in the relationships between their word embeddings (see the sketch below).
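A toy sketch of the analogy arithmetic. The 2-dimensional vectors below are made up so that one axis roughly encodes "royalty" and the other "gender"; real word2vec embeddings have hundreds of dimensions, but the arithmetic works the same way:

```python
import numpy as np

# Toy embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender" (illustrative only).
embeddings = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the word whose embedding is closest to the result vector.
closest = max(embeddings, key=lambda w: cosine(embeddings[w], result))
print(closest)  # "queen"
```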
Why are recurrent neural networks used for language processing and not regular artificial neural networks?
- Artificial neural networks are unable to encode the sequence of inputs.
- for language, the sequence of the sentence is essential.
- the recurrent neural network is able to encode sequential information
- it is thus used for other sequential modelling tasks as well
- also, recurrent neural networks can handle inputs of different lengths, whereas regular artificial neural networks always require the same input size. This is why RNNs can handle sentences of different lengths (see the sketch below).
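A minimal sketch of the recurrent computation in plain NumPy (all sizes and weight names are illustrative). Because the same weights are reused at every position, the loop handles sentences of any length:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 4, 3

# One set of weights, shared across all positions in the sentence.
W_x = rng.normal(size=(hidden_dim, embedding_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def run_rnn(sentence_embeddings):
    h = np.zeros(hidden_dim)                       # hidden state carries sequence information
    for x_t in sentence_embeddings:                # one step per word, in order
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

short = rng.normal(size=(3, embedding_dim))        # a 3-word sentence
long_ = rng.normal(size=(8, embedding_dim))        # an 8-word sentence
print(run_rnn(short).shape, run_rnn(long_).shape)  # same-sized output for both lengths
```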
What is one extension of the recurrent neural network?
- The output of the initial processing of a word can be processed further to obtain a more refined output O_l.
- At any stage, the intermediate output can be passed on to the next stage of processing.
Now that you understand how a single instance of processing is done in the LSTM, how does it process an entire sentence?
- The LSTM, similarly to the RNN, performs the same calculations over and over again and feeds information from previous steps forward.
- The long-term memory (LTM, the cell state) and the short-term memory (STM, the hidden state) are continuously passed forward, and the input is the word currently being processed; the input thus moves along with the sentence.
- the updated LTM and updated STM serve as the starting LTM and STM for the processing of the next word (see the sketch below).
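A minimal sketch of this loop, assuming PyTorch's LSTMCell, where the hidden state h plays the role of the STM and the cell state c the role of the LTM (sizes are illustrative):

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 4, 3
cell = nn.LSTMCell(embedding_dim, hidden_dim)

sentence = torch.randn(5, embedding_dim)  # 5 word embeddings, one per position

h = torch.zeros(1, hidden_dim)            # short-term memory (hidden state)
c = torch.zeros(1, hidden_dim)            # long-term memory (cell state)

for word in sentence:
    # The updated (h, c) from this word become the starting (h, c) for the next word.
    h, c = cell(word.unsqueeze(0), (h, c))

print(h.shape, c.shape)                   # torch.Size([1, 3]) torch.Size([1, 3])
```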
How is the exploding/vanishing gradient problem avoided in the LSTM?
- The long-term memory (cell state) is updated mostly additively, with the forget and input gates controlling what is kept and what is added, instead of being multiplied by the same weights at every step; gradients flowing back along the cell state therefore do not vanish or explode as quickly as in a plain RNN.
How does a gated recurrent unit (GRU) model work?
- The GRU is very similar to the LSTM but has fewer gates, making it easier to train but also less powerful (see the sketch below).
- the GRU has a reset gate and an update gate.
- the update gate determines how much newly computed information is written into the memory and how much of the old memory is kept.
- the reset gate determines how much of the past information is forgotten when the new information is computed.
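A small sketch comparing the parameter counts of PyTorch's GRUCell and LSTMCell, illustrating that fewer gates means fewer weights to train (sizes are illustrative):

```python
import torch.nn as nn

embedding_dim, hidden_dim = 128, 256

lstm = nn.LSTMCell(embedding_dim, hidden_dim)  # 4 blocks of gate weights
gru = nn.GRUCell(embedding_dim, hidden_dim)    # 3 blocks of gate weights

def count(module):
    return sum(p.numel() for p in module.parameters())

print(count(lstm), count(gru))  # the GRU has roughly 3/4 as many parameters
```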
How many sets of weights are there to calculate the self-attention values?
- the weights for queries, keys and values are shared between words. We thus have three sets of weights; one for queries, one for keys and one for values.
- no matter how long the sentence is, the number of weights stays the same (see the sketch below).
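A minimal self-attention sketch in NumPy. The three weight matrices W_q, W_k and W_v are the only learned parameters; they are applied to every word, so their size does not depend on the sentence length (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                              # embedding size per word

# Three sets of weights: queries, keys, values. Shared by all words.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # one query/key/value per word
    scores = Q @ K.T / np.sqrt(d_model)  # how much each word attends to every other word
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                   # context-aware encoding per word

short_sentence = rng.normal(size=(3, d_model))
long_sentence = rng.normal(size=(10, d_model))
print(self_attention(short_sentence).shape)  # (3, 4)
print(self_attention(long_sentence).shape)   # (10, 4) -- same weights, longer sentence
```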
What is multihead attention?
- The calculation that produces the self-attention values for each word is carried out several times in parallel, each time in a slightly different manner. This gives the encoding of the words in context to each other a higher dimensionality.
How is the final input for a transformer created?
- The position-plus-semantic embedding values are added to the self-attention values to obtain the residual connection values.
- these embeddings provide the finalized input for the transformer.
- these embeddings contain information from three categories: the semantics of the word, the position of the word, and the relationship of the word to all other words (see the sketch below).
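A tiny sketch of this residual connection (all arrays are random stand-ins; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model = 3, 4

word_embeddings = rng.normal(size=(n_words, d_model))      # semantic information
position_encodings = rng.normal(size=(n_words, d_model))   # positional information

position_semantic = word_embeddings + position_encodings   # input to self-attention
attention_output = rng.normal(size=(n_words, d_model))     # stand-in for the self-attention values

# Residual connection: add the position-semantic embeddings back onto the attention output.
residual_values = attention_output + position_semantic
print(residual_values.shape)  # (3, 4): semantics + position + word-to-word relationships
```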
Summarize how transformers work.
- Transformers use word embeddings to convert words into numbers.
- positional encoding keeps track of order
- self-attention is used to keep track of word relationships within the input and within the output
- encoder-decoder-attention is used to keep track of word relationships between input and output.