TF-IDF keyword extraction

February 13, 2018 (Updated December 21, 2024) 5 min read · Data Structures, Machine Learning, Php, Softwaredevelopment

For my master's thesis I researched content-based recommendation in detail, and keyword extraction, specifically TF-IDF keyword extraction, as a part of it. One interesting aspect of keyword extraction is the “quality” of a keyword, that is, whether it is descriptive of a document or not.

TF-IDF Definition

TF-IDF keyword extraction requires a large “item base” that contains many keywords. Let's say you have a set of documents and want to measure how “similar” those documents are to each other. You could, for example, take a (sub)set of each document's keywords and count the number of overlapping keywords. But how do you decide whether a keyword is descriptive or not? That is what TF-IDF is for. The more documents your item base contains, the more reliable the quality statement for a single keyword becomes.
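The overlap count mentioned above could be sketched as follows. This is a minimal illustration, not a full similarity measure: the function name `overlap`, the sample keyword lists, and the lowercasing step are assumptions made for the example.

```python
def overlap(keywords_a, keywords_b):
    """Return the number of distinct keywords two documents have in common."""
    # Assumption: keywords are compared case-insensitively.
    a = {k.lower() for k in keywords_a}
    b = {k.lower() for k in keywords_b}
    return len(a & b)


# Hypothetical keyword lists, for illustration only.
doc_a = ["rainy", "weather", "today"]
doc_b = ["sunny", "weather", "today"]

print(overlap(doc_a, doc_b))  # 2
```

A raw count like this treats every keyword as equally important, which is why a weighting that reflects how descriptive a keyword is becomes useful.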

TF-IDF is defined as follows:

TF-IDF = TF * IDF

TF = n_i ⁄ n

IDF = log₁₀ ( I ⁄ m )

where n_i is the number of occurrences of keyword i within a document, n the total number of keywords in that document, I the total number of documents in your item base, and m the number of documents that contain i at least once.
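To make the definition concrete, here is a minimal Python sketch of the three formulas. The function names (`tf`, `idf`, `tf_idf`) and the sample item base are assumptions for illustration; documents are assumed to be plain lists of already extracted keywords.

```python
import math


def tf(keyword, doc):
    """TF = n_i / n: occurrences of the keyword divided by the document's keyword count."""
    return doc.count(keyword) / len(doc)


def idf(keyword, docs):
    """IDF = log10(I / m): I documents in total, m of them contain the keyword."""
    m = sum(1 for d in docs if keyword in d)
    if m == 0:
        # The formula is undefined for a keyword that occurs in no document.
        raise ValueError(f"keyword {keyword!r} does not occur in the item base")
    return math.log10(len(docs) / m)


def tf_idf(keyword, doc, docs):
    """TF-IDF = TF * IDF for one keyword of one document."""
    return tf(keyword, doc) * idf(keyword, docs)


# Hypothetical item base, for illustration only.
docs = [
    ["apple", "banana", "apple", "cherry"],
    ["banana", "cherry"],
    ["cherry", "date"],
]

print(round(tf_idf("apple", docs[0], docs), 3))   # 0.239
print(round(tf_idf("cherry", docs[0], docs), 3))  # 0.0
```

Note that a keyword contained in every document ("cherry" here) gets an IDF of log₁₀(1) = 0 and therefore a TF-IDF of 0, no matter how often it occurs.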

TF-IDF Example

Let's demonstrate this with an example, using the following keyword vectors:

D = {This, is, an, example}, P = {The, weather, is, rainy, today}, Q = {Hello, World}

keyword “this” occurs once in D: n_i = 1
number of keywords in D: n = 4
total number of documents in the item base {D, P, Q}: I = 3
number of documents that contain “this” at least once: m = 1 (only D)

Then the calculation is as follows:

TF-IDF (“this”) = 1 ⁄ 4 * log₁₀ ( 3 ⁄ 1 ) = 0.25 * 0.477 = 0.119

For comparison, “is” occurs in both D and P (m = 2), so for D: TF-IDF (“is”) = 1 ⁄ 4 * log₁₀ ( 3 ⁄ 2 ) = 0.25 * 0.176 = 0.044

The higher the resulting value, the more descriptive the keyword is for its document. Applying this calculation to all keywords of all documents results in a large list of scored keywords associated with items. You can then define a threshold that a score has to exceed for the term to be considered a “keyword”.
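The thresholding step could look roughly like the sketch below, applied to the documents D, P and Q from the example. The helper names `keyword_scores` and `select_keywords`, the lowercasing of keywords, and the threshold value of 0.05 are assumptions chosen for illustration; a sensible threshold depends on your item base.

```python
import math


def keyword_scores(docs):
    """Map each document name to a {keyword: TF-IDF score} dictionary."""
    # Assumption: keywords are compared case-insensitively.
    tokenized = {name: [w.lower() for w in words] for name, words in docs.items()}
    total = len(tokenized)  # I: number of documents in the item base
    scores = {}
    for name, words in tokenized.items():
        scores[name] = {}
        for word in set(words):
            # m: number of documents that contain the word at least once
            m = sum(1 for other in tokenized.values() if word in other)
            tf = words.count(word) / len(words)
            scores[name][word] = tf * math.log10(total / m)
    return scores


def select_keywords(docs, threshold):
    """Keep only the words whose TF-IDF score exceeds the threshold."""
    return {
        name: sorted(w for w, s in per_doc.items() if s > threshold)
        for name, per_doc in keyword_scores(docs).items()
    }


docs = {
    "D": ["This", "is", "an", "example"],
    "P": ["The", "weather", "is", "rainy", "today"],
    "Q": ["Hello", "World"],
}

THRESHOLD = 0.05  # hypothetical value, chosen only for this tiny example

print(select_keywords(docs, THRESHOLD))
```

With this threshold, “is” is dropped from both D and P because it occurs in two of the three documents, while the words that are unique to a single document are kept.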