Natural Language Processing

Q & A

What is bag of words? How can we use it for text classification?

Bag of words is a representation of text that records how often each word occurs in a document. The order and structure of the words are not considered. For text classification, we compute the histogram of the words in the text and treat each word count as a feature.
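A minimal sketch of this idea, assuming whitespace tokenization and lowercasing (real pipelines usually also handle punctuation and stop words). The helper names `tokenize`, `build_vocabulary`, and `bag_of_words` are illustrative and not taken from any particular library:

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()

def build_vocabulary(documents):
    # Sorted list of the unique tokens seen across all documents.
    return sorted({token for doc in documents for token in tokenize(doc)})

def bag_of_words(text, vocabulary):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

docs = ["the movie was good", "the movie was bad bad"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # ['bad', 'good', 'movie', 'the', 'was']
print(vectors)  # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Each row of `vectors` is the feature vector for one document and could be passed to any standard classifier.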

What are the advantages and disadvantages of bag of words?


Advantages:

  1. Simple to understand and implement.

Disadvantages:

  1. The vocabulary requires careful design, particularly to manage its size, which affects the sparsity of the document representations.
  2. Sparse representations are harder to model, both for computational reasons (space and time complexity) and for informational reasons (the model must learn from very little information spread over a very large feature space).
  3. Discarding word order ignores context, and in turn the meaning of words in the document. Context and meaning can offer a lot to the model: if modeled, they could distinguish the same words arranged differently (“this is interesting” vs. “is this interesting”) and recognize synonyms (“old bike” vs. “used bike”).
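The loss of word order can be seen directly in a small sketch. It assumes the same simple whitespace tokenization as before, and `word_counts` is a hypothetical helper name:

```python
from collections import Counter

def word_counts(text):
    # Bag-of-words counts under simple lowercase, whitespace tokenization.
    return Counter(text.lower().split())

statement = "this is interesting"
question = "is this interesting"

# Both sentences contain the same words, so their bags are identical.
same_representation = word_counts(statement) == word_counts(question)
print(same_representation)  # True

# Synonyms are treated as unrelated features: only "bike" is shared.
shared_words = set(word_counts("old bike")) & set(word_counts("used bike"))
print(shared_words)  # {'bike'}
```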

What are N-grams? How can we use them?

An n-gram is a consecutive sequence of n words in a text; splitting a text into such sequences is a form of tokenization. N-grams can be used to find the most frequently co-occurring words in a given text (for example, how often word X is followed by word Y).
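As a rough illustration, n-grams can be produced by sliding a window of length n over the token list. The function name `ngrams` and the example sentence are assumptions made for this sketch:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the tokens; returns [] if n > len(tokens).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

# The most frequent bigram shows which word most often follows another.
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Counting bigrams this way answers the "how often is word X followed by word Y" question; larger values of n capture longer phrases at the cost of a bigger, sparser vocabulary.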

What is TF-IDF? How is it useful for text classification?

Term Frequency (TF) scores how frequently a word occurs in the current document. Inverse Document Frequency (IDF) scores how rare the word is across documents. TF-IDF is used in scenarios where highly recurring words may not carry as much information as domain-specific words. For example, words like “the” that are frequent across all documents need to be weighted less. The TF-IDF score highlights words that are distinctive (contain useful information) in a given document.
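One common textbook formulation multiplies the relative frequency of a word in a document by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes that unsmoothed variant, whitespace tokenization, and non-empty documents; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ. The function name `tf_idf` is illustrative:

```python
import math
from collections import Counter

def tf_idf(documents):
    # Assumption: every document is non-empty after tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF (relative frequency) times IDF (log of inverse document share).
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus, “the” appears in every document and scores 0, while “sat” appears in only one document and scores higher than “cat”, which appears in two.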