For my master's thesis I researched content-based recommendation in detail, including keyword extraction as a part of it. One interesting point regarding keyword extraction is the “quality” of a keyword, meaning whether or not the keyword is descriptive for a document.
TF-IDF Definition
Let's say you have a set of documents and want to measure how “similar” those documents are to each other. You could, for example, take a (sub)set of each document's keywords and count the number of overlapping keywords. But how do you decide whether a keyword is descriptive or not? Well, by using TF-IDF. TF-IDF keyword extraction requires a large “item base” that contains many keywords: the more documents your item base contains, the more reliable the quality statement for a single keyword is.
TF-IDF is defined as follows:
TF-IDF = TF * IDF
TF = ni ⁄ n
IDF = log10 ( I ⁄ m )
where ni is the number of occurrences of keyword i within a document, n the total number of keywords in that document, I the total number of documents in your item base and m the number of documents that contain i at least once.
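As a minimal sketch, the three formulas above can be written as small Python functions. The function names and the representation of a document as a plain list of keyword strings are assumptions made for illustration; they are not part of the definition itself.

```python
import math


def tf(keyword, document):
    """Term frequency: occurrences of the keyword (ni) divided by the
    total number of keywords in the document (n)."""
    if not document:
        return 0.0  # assumption: an empty document scores 0 instead of dividing by zero
    return document.count(keyword) / len(document)


def idf(keyword, item_base):
    """Inverse document frequency: log10(I / m)."""
    total_documents = len(item_base)  # I
    containing = sum(1 for document in item_base if keyword in document)  # m
    if containing == 0:
        # m = 0 would mean dividing by zero; the definition does not cover this case
        raise ValueError(f"keyword {keyword!r} does not occur in the item base")
    return math.log10(total_documents / containing)


def tf_idf(keyword, document, item_base):
    """TF-IDF = TF * IDF for one keyword of one document."""
    return tf(keyword, document) * idf(keyword, item_base)
```

Note that a keyword occurring in every document gets IDF = log10(1) = 0, so its TF-IDF is 0 no matter how often it appears.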
TF-IDF Example
Let's demonstrate this with an example. Let's define the following keyword vectors:
D = {This, is, an, example}, P = {The, weather, is, rainy, today}, Q = {Hello, World}
- keyword “is” exists once in D: ni = 1
- number of keywords in D: n = 4
- the item base {D, P, Q} contains three documents: I = 3
- number of documents that contain “is” at least once (D and P): m = 2
Then, the following calculation results:
TF-IDF (“is”) = 1 / 4 * log10 ( 3 ⁄ 2 ) = 0.25 * 0.176 = 0.044
The higher the resulting value, the more descriptive the keyword is for its document. Applying this calculation to all keywords of all documents results in a large list of scored keywords associated with items. You can then define a threshold that a score has to exceed for the term to be considered a “keyword”.
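The scoring and thresholding step can be sketched in Python as follows. This is only an illustration under a few assumptions: the keywords of D, P and Q are lowercased before counting, the helper names `score_item_base` and `select_keywords` are made up for this sketch, and the threshold of 0.05 is a hypothetical value chosen for this tiny item base.

```python
import math

# Keyword vectors from the example, lowercased (lowercasing is an assumption).
item_base = {
    "D": ["this", "is", "an", "example"],
    "P": ["the", "weather", "is", "rainy", "today"],
    "Q": ["hello", "world"],
}


def tf_idf(keyword, document, documents):
    tf = document.count(keyword) / len(document)            # ni / n
    m = sum(1 for other in documents if keyword in other)   # documents containing the keyword
    return tf * math.log10(len(documents) / m)              # TF * log10(I / m)


def score_item_base(item_base):
    """Compute TF-IDF for every distinct keyword of every document."""
    documents = list(item_base.values())
    return {
        name: {kw: tf_idf(kw, doc, documents) for kw in set(doc)}
        for name, doc in item_base.items()
    }


def select_keywords(scores, threshold):
    """Keep only the terms whose score exceeds the threshold."""
    return {
        name: sorted(kw for kw, score in kw_scores.items() if score > threshold)
        for name, kw_scores in scores.items()
    }


THRESHOLD = 0.05  # hypothetical value, not a recommendation
scores = score_item_base(item_base)
selected = select_keywords(scores, THRESHOLD)
for name, keywords in selected.items():
    print(name, keywords)
```

With this threshold the sketch keeps every term except “is”, which occurs in two of the three documents and therefore scores lowest. In a real item base the threshold would have to be tuned against the actual score distribution.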