When we write a post on WordPress.com (e.g. this one), we may classify it into one or more user defined categories, which are organized into a tree structure. The user defined categories illustrate a (small) taxonomy on the things that a blogger is interested in. Large taxonomies such as Wikipedia categories and WordNet provide us a more complete classification of things or concepts, as well as to the principles underlying such a classification. WordPress also allows us to attach any tags to posts while tags don’t expose a hierarchical structure. The categories and tags help us to grasp the topics of documents quickly. They also help us to find information. For instance, we can search posts by tags in WordPress.com. Because the categories and tags are machine accessible, they are also of great use in information retrieval, knowledge management, natural language processing, etc. Of the massive amount of text available on the internet, unfortunately, not many documents are classifies or tagged. Clearly, the automatic learning of taxonomy and document topics has huge value in the era of cloud.

Although it is possible to learn a taxonomy from the internet by automatic means, it is extremely challenging. On the other hand, tagging or topic detection is more computationally tractable. Today, we will look into several methods for topic detection from the point view of keyword extraction. Before diving into keyword extraction, it is worth of mentioning that other approaches are also available for topic detection. For example, document classification, assigning a document to one or more classes, can serve the purpose too. However, this approach has many limitations. First of all, it is not flexible as the classes are predefined. Given the quick evolving pace of our world, new classes have to been added frequently and the models have to be trained again and again. Second, current machine learning techniques are not effective or efficient to handle a very large number of classes, which is a usual situation in topic detection. Third, document classification, a supervised learning approach, requires training datasets. The larger the training data, the better. But labeling large training data is labor intensive and of high cost. Besides, human labeling is biased and subjective. It is hard for people to agree upon all the labels/tags of training data.

The goal of keyword extraction is to automatically extract relevant terms from a document. Since it is about relevance, you will probably think of TF–IDF (term frequency–inverse document frequency) immediately. Yes, TF-IDF has been used for keyword extraction as it is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. There are many variations of TF-IDF functions, e.g. BM25F, a state-of-the-art one, widely used in document retrieval such as web search.

When a TF-IDF function is run against all words in a document, the words can be ranked by their scores. A higher score indicates that a word is more relevant to the document and thus provides a good heuristic to determine likely keyword candidates. Due to its simplicity, it is widely used for keyword extraction today. However, the approach of TF-IDF requires a corpus, which may be too small, or be biased, or even doesn’t exist in real world applications. Meanwhile, social media, a majority body of text in the cloud ear, are often short or very noisy. These also present challenges to the TF-IDF approach.

We are particularly interested in domain-independent keyword extraction, which does not require a large corpus. This is a more natural approach as human doesn’t need a corpus to determine the keywords of documents. Matsuo and Ishizuka proposed such a method based on word co-occurrence statistical information where two terms in a sentence are considered to co-occur once. Their method is based on the observation that a general term is used relatively impartially with each frequent term, while significant terms show co-occurrence with particular terms. Therefore, the degree of bias of co-occurrence can be used as an indicator of term importance. The degree of co-occurence bias is measured by the χ2 statistic. In short, the algorithm selects frequent terms as a “standard” and then extracts terms with high deviation from them as keywords. To make the algorithm more stable and effective, the authors also proposed a refined χ2 value and clustering terms based on Jensen-Shannon divergence or mutual information. Overall, the algorithm consists of six steps:

  1. Stem words by Porter algorithm and extract phrases based APRIORI algorithm. Discard stop words.
  2. Select the top frequent terms up to 30% of running terms.
  3. Clustering frequent terms. Two terms are in the same cluster if either their Jensen-Shannon divergence or mutual information is above the threshold.
  4. Calculate the expected co-occurrence probability
  5. Calculate the refined χ2 values that removes the maximal term.
  6. Output a given number of terms of largest refined χ2 values.

It was shown that this method achieves comparable performance to TD-IDF.