Wiki mining: An idea for an automatic topic identification system for Dokuwiki
I have always wanted a system that would automatically identify important words in a Dokuwiki wiki article and add tags to the page with these words. My latest idea is to create a system of plugins that can do this for me. The system would consist of an action plugin that hooks into the indexing run triggered by the indexer web bug and stores the keywords as page metadata, and a render plugin that converts the stored keywords into tags and adds them to the page (perhaps with an existing plugin for manually adding tags to articles). I am not sure how to implement this myself, so I’m asking for help from anyone who can work with me to develop the design for this system and create the plugins to implement it.
The problem: It’s easier to add to a wiki than to consolidate and extend what’s there
I have been using Dokuwiki (DW) to run a small wiki for my class notes, lecture outlines, and other presentations for several years. It’s great to be able to go back to old lecture notes and reuse them in other lectures or expand them into longer guides to a topic. I find it very easy to add new articles, but harder to pull together what I have accumulated. DW provides many different ways to do this, for instance, by adding tags to articles so that articles on the same topic can be grouped together.
The problem with this and similar methods of organizing material in a wiki is that they require one to first decide how to organize the material and then manually add information (e.g. tags, wiki links) to the pages. It would be great if the software itself could analyze the information and automatically suggest categories or other relationships among articles. These might need to be edited later, but if a system automatically found what the topics of an article were, and then what other articles cover those topics, that would lower the barrier to making edits in the wiki that consolidate the materials. It would be like a prompt to action. Once you finish one article, the software would immediately suggest another edit to make to connect that article to other articles.
To make this happen, one would need to add a plugin to DW to do some kind of text mining on articles to find their topics. DW already has a full text index for its search engine, so a lot of the raw data that one needs for text mining is already stored for each article. The plugin I’d like to create would extend some of the existing indexing functions.
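Very roughly, I imagine the action plugin hooking into the indexer event something like this. This is only a sketch: the plugin name wikimining and the handler name are placeholders, and I am assuming the event data carries the ID of the page being indexed.

<code php>
<?php
// lib/plugins/wikimining/action.php -- "wikimining" is a placeholder name, not an existing plugin
if (!defined('DOKU_INC')) die();

class action_plugin_wikimining extends DokuWiki_Action_Plugin {

    public function register(Doku_Event_Handler $controller) {
        // Run after the indexer has added a page's entries to the full text index
        $controller->register_hook('INDEXER_PAGE_ADD', 'AFTER', $this, 'compute_keywords');
    }

    public function compute_keywords(Doku_Event $event, $param) {
        // Assumption: the event data includes the ID of the page being indexed
        $page = $event->data['page'];
        // ... tokenize the page, calculate TF*IDF, and store it as metadata (see Part 1)
    }
}
</code>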
Part 1: Finding keywords by generating a TF*IDF vector for a page during indexing
DW pages include a web bug that triggers several background jobs when a page is loaded, including updating the full text index. When the bug adds new index entries, it triggers the event INDEXER_PAGE_ADD. At this point, I would like a plugin to tokenize the page content, calculate the TF*IDF value for each word, and then store the term-to-TF*IDF associative array as a metadata entry. Based on my limited understanding of DW, this might go like this (a rough code sketch follows the list):
- Get the total number of articles (or, documents) in the wiki (TD).
- Get a count of the total words (or, terms) in the article (TW).
- Invoke the existing Indexer->getPageWords($text) function, which returns an array of terms and their frequencies in the article.
- For each term in this array,
  - take the term’s frequency in the article; this is the raw term frequency (RTF) for the term.
  - calculate the relative term frequency (TF) for the term, which is RTF/TW.
  - look up the term in the existing full text index pages (using existing functions that do this) and retrieve the term’s index line.
  - explode the line into PID*<frequency> pairs.
  - count the number of article IDs referenced in the line. The number of articles containing the term at least once is the document frequency (DF).
  - calculate the natural log of TD/DF. This is the inverse document frequency (IDF).
  - calculate the TF*IDF for the term (TF multiplied by IDF).
- Save the new associative array of terms and their TF*IDF values in the metadata of the article (using p_set_metadata [I think] and creating a new key in the metadata data structure).
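To make the list above concrete, here is a rough sketch of the calculation itself, leaving aside how the raw counts and document frequencies are actually retrieved from DW. The function name, the lookup callable, and the metadata key names are made up for illustration, not existing DW APIs.

<code php>
// Sketch of the calculation only. $termCounts maps term => raw count (RTF) for the
// article, $totalDocs is TD, and $docFreqLookup stands in for whatever existing
// index functions would return the document frequency of a term.
function compute_tfidf(array $termCounts, int $totalDocs, callable $docFreqLookup): array {
    $totalWords = array_sum($termCounts);       // TW: total words in the article
    if ($totalWords == 0) return [];
    $tfidf = [];
    foreach ($termCounts as $term => $count) {
        $tf  = $count / $totalWords;            // TF = RTF / TW
        $df  = max(1, $docFreqLookup($term));   // DF: number of articles containing the term
        $idf = log($totalDocs / $df);           // IDF = ln(TD / DF)
        $tfidf[$term] = $tf * $idf;
    }
    arsort($tfidf);                             // highest-weighted terms first
    return $tfidf;
}

// Storing the result might then look something like this
// ('wikimining' and 'tfidf' are made-up metadata key names):
// p_set_metadata($page, ['wikimining' => ['tfidf' => $tfidf]]);
</code>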
The terms in this array with the highest TF*IDF values are, we hope, the most important words in the article and can be used as key words. I believe that the existing tokenize functions used by the DW indexer filter out stop words, but in fact stop words don’t need to be filtered. These words are common both in the document and across all documents, so even if the TF is high, the TF*IDF value will always be low; in the limiting case of a word that appears in every article, DF = TD and so IDF = ln(1) = 0.
I have not really given much thought to efficiency. For instance, when INDEXER_PAGE_ADD is triggered, the page is already going to be tokenized, and the term-to-raw-frequency mappings are stored temporarily. I don’t know how to make use of that, so I foresee doing two tokenizations. This is an area where I could use help on design.
Part 2: Render a new markup element as a list of links to the terms
Now when the full text index is updated, each page will also have an array of its terms’ TF*IDF values, which can potentially be used in a variety of ways. One way to use this would be to sort this term-to-TF*IDF array, find a certain number of the highest-weighted terms (say, the top five), and list these as wikilinks to pages where one could gather a list of other pages with inbound links on the same term. I have thought less about this part of the system. I think one could create a render plugin for a new markup element, e.g. ~~KEYWORDS~~, which would return the key words with links, much like tag plugins create links for manually entered tags on a page.
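In DW terms I suspect this would actually be implemented as a syntax plugin rather than a render plugin. A minimal sketch of what I have in mind, assuming the TF*IDF array was stored in Part 1 under a made-up metadata key, might look like this:

<code php>
<?php
// lib/plugins/wikimining/syntax.php -- renders ~~KEYWORDS~~ as links built from
// the TF*IDF array stored in Part 1 ('wikimining tfidf' is a made-up metadata key)
if (!defined('DOKU_INC')) die();

class syntax_plugin_wikimining extends DokuWiki_Syntax_Plugin {

    public function getType() { return 'substition'; }
    public function getSort() { return 155; }

    public function connectTo($mode) {
        $this->Lexer->addSpecialPattern('~~KEYWORDS~~', $mode, 'plugin_wikimining');
    }

    public function handle($match, $state, $pos, Doku_Handler $handler) {
        return []; // nothing to parse; all the work happens at render time
    }

    public function render($format, Doku_Renderer $renderer, $data) {
        global $ID;
        if ($format !== 'xhtml') return false;
        $tfidf = p_get_metadata($ID, 'wikimining tfidf');
        if (!$tfidf) return true;
        arsort($tfidf);
        foreach (array_slice(array_keys($tfidf), 0, 5) as $term) {
            $renderer->internallink($term, $term); // link each of the top five terms
        }
        return true;
    }
}
</code>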
Another application of TF*IDF vectors
The new term-to-TF*IDF metadata array could also be used to find “similar” articles. In text mining, a document can be represented as a vector of its terms’ TF*IDF values. Each TF*IDF value is a coordinate in a multidimensional space where each term is a dimension running from low importance to high importance. Documents whose common terms have the same TF*IDF weight are located at the same point on those dimensions. Hence, the angle between two document vectors is also a measure of how close they are, that is, how many important words they have in common, or perhaps more precisely how close in importance their common terms are. This can be represented as the cosine of the angle between the two document vectors.
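As a sketch, the cosine similarity of two pages could be computed directly from their stored term-to-TF*IDF arrays:

<code php>
// Cosine similarity between two term => TF*IDF arrays. A term missing from one
// document is treated as having zero weight in that document.
function cosine_similarity(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $normA = sqrt(array_sum(array_map(function ($w) { return $w * $w; }, $a)));
    $normB = sqrt(array_sum(array_map(function ($w) { return $w * $w; }, $b)));
    if ($normA == 0.0 || $normB == 0.0) return 0.0;
    return $dot / ($normA * $normB);
}
</code>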
If each page has a TF*IDF vector stored in metadata, then another plugin could calculate its cosine similarity with other articles’ vectors, rank them by their similarity, and store all or part of this sorted list of articles. This could then be accessed by another render plugin to display a list of similar articles in the wiki.
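A very rough sketch of that ranking step, assuming the same made-up metadata key as above and a list of page IDs gathered from the existing page index:

<code php>
// Rank other pages by cosine similarity to the current page, reusing the
// made-up 'wikimining tfidf' metadata key from Part 1. $allPages would be a
// list of page IDs gathered from the existing page index.
function rank_similar_pages(string $id, array $allPages, int $limit = 5): array {
    $here = p_get_metadata($id, 'wikimining tfidf');
    if (!$here) return [];
    $scores = [];
    foreach ($allPages as $other) {
        if ($other === $id) continue;
        $vec = p_get_metadata($other, 'wikimining tfidf');
        if ($vec) {
            $scores[$other] = cosine_similarity($here, $vec);
        }
    }
    arsort($scores);                            // most similar pages first
    return array_slice($scores, 0, $limit, true);
}
</code>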
If you have any suggestions about how I should pursue this, or people I should ask for help, please let me know.