TF-IDF keyword extraction

For my master's thesis I researched content-based recommendation in detail, with keyword extraction as one part of it. One interesting aspect of keyword extraction is the “quality” of a keyword, meaning how well it describes a document.

TF-IDF Definition

Let's say you have a set of documents and want to measure how “similar” they are to each other. You could, for example, take a (sub)set of each document's keywords and count the number of overlapping keywords. But how do you decide whether a keyword is descriptive or not? Well, by using TF-IDF. TF-IDF keyword extraction needs a large “item base” that contains many keywords: the more documents your item base contains, the more reliable the quality statement for a single keyword.
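The overlap count mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `keyword_overlap` is made up for this post, and it assumes that keywords are compared case-insensitively and that duplicates count once.

```python
def keyword_overlap(doc_a, doc_b):
    """Count keywords shared by two documents, ignoring case and duplicates."""
    # Assumption: "This" and "this" are the same keyword.
    set_a = {kw.lower() for kw in doc_a}
    set_b = {kw.lower() for kw in doc_b}
    return len(set_a & set_b)


first = ["This", "is", "an", "example"]
second = ["The", "weather", "is", "rainy", "today"]
shared = keyword_overlap(first, second)  # the only shared keyword is "is"
```

A raw count like this treats every keyword as equally important, which is exactly the weakness that a per-keyword quality score is meant to address.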

TF-IDF is defined as follows:


TF = ni ⁄ n

IDF = log10 ( I ⁄ m )

TF-IDF = TF × IDF

where ni is the number of occurrences of keyword i within a document, n is the total number of keywords in that document, I is the total number of documents in your item base, and m is the number of documents that contain i at least once.
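As a minimal sketch, the definitions above translate to Python roughly as follows. The function names are hypothetical, documents are assumed to be non-empty lists of keywords, matching is assumed to be case-insensitive, and the final score is taken to be the product of the two factors, TF × IDF.

```python
import math


def term_frequency(keyword, document):
    """TF = ni / n for one document given as a non-empty list of keywords."""
    tokens = [kw.lower() for kw in document]
    return tokens.count(keyword.lower()) / len(tokens)


def inverse_document_frequency(keyword, item_base):
    """IDF = log10(I / m) for an item base given as a list of documents."""
    keyword = keyword.lower()
    # m: number of documents that contain the keyword at least once.
    m = sum(1 for doc in item_base if keyword in {kw.lower() for kw in doc})
    if m == 0:
        # The formula is undefined for keywords that never occur.
        raise ValueError("keyword does not occur in the item base")
    return math.log10(len(item_base) / m)


def tf_idf(keyword, document, item_base):
    """TF-IDF score of a keyword for one document: TF multiplied by IDF."""
    return term_frequency(keyword, document) * inverse_document_frequency(keyword, item_base)
```

Note that a keyword occurring in every document gets IDF = log10(1) = 0, so its score is 0 regardless of how often it appears.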

TF-IDF Example

Let's demonstrate this with an example, using the following keyword vectors:

D = {This, is, an, example}, P = {The, weather, is, rainy, today}, Q = {Hello, World}

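  • keyword “this” occurs once in D: ni = 1
  • number of keywords in D: n = 4
  • the item base {D, P, Q} contains three documents: I = 3
  • number of documents that contain “this” at least once: m = 1 (only D)

Multiplying TF by IDF then gives:

TF-IDF (“this”) = 1 ⁄ 4 × log10 ( 3 ⁄ 1 ) = 0.25 × 0.477 = 0.119

For comparison, “is” occurs in both D and P (m = 2), so in D we get TF-IDF (“is”) = 0.25 × log10 ( 3 ⁄ 2 ) = 0.25 × 0.176 = 0.044, and the more widespread keyword scores lower.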

The higher the resulting value, the more descriptive the keyword is for its document. Applying this calculation to all keywords of all documents results in a large list of scored keywords associated with items. You can then define a threshold that a score has to exceed for the term to be considered a “keyword”.
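Scoring every keyword of every document and then applying a threshold could look roughly like this. It is a sketch under assumptions: the function name `extract_keywords` and the threshold of 0.05 are picked purely for illustration, keywords are lower-cased before counting, and the score is TF multiplied by IDF.

```python
import math
from collections import Counter


def extract_keywords(item_base, threshold):
    """Return, per document name, the keywords whose TF-IDF exceeds threshold."""
    # Lower-case every keyword so that "This" and "this" are treated alike.
    docs = {name: [kw.lower() for kw in kws] for name, kws in item_base.items()}
    # m per keyword: number of documents that contain it at least once.
    doc_freq = Counter(kw for kws in docs.values() for kw in set(kws))
    total_docs = len(docs)

    result = {}
    for name, kws in docs.items():
        counts = Counter(kws)  # ni per keyword in this document
        scores = {
            kw: (n_i / len(kws)) * math.log10(total_docs / doc_freq[kw])
            for kw, n_i in counts.items()
        }
        # Keep only the keywords that exceed the threshold.
        result[name] = {kw: s for kw, s in scores.items() if s > threshold}
    return result


item_base = {
    "D": ["This", "is", "an", "example"],
    "P": ["The", "weather", "is", "rainy", "today"],
    "Q": ["Hello", "World"],
}
selected = extract_keywords(item_base, threshold=0.05)
```

With the three example vectors and this arbitrary threshold, “is” is dropped from both D and P because it occurs in two of the three documents, while all other keywords are kept. A sensible threshold depends on the size and nature of your own item base.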