Saturday, February 11, 2006

Building a dictionary from the web

Initially a small collection of "seed" texts is fed to the crawler (a few hundred words of running text have been sufficient in practice). Queries combining words from these texts are generated and passed to the Google API, which returns a list of documents potentially written in the target language. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques bootstrapped from the initial seed texts (and refined as more texts are added to the database) is used to determine which documents (or sections thereof) are written in the target language. The crawler then recursively follows links contained within documents that are in the target language. When these run out, the entire process is repeated, with a new set of Google queries generated from the new, larger corpus.
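To make the loop concrete, here is a minimal, hypothetical Python sketch of that bootstrapping cycle. It is not the Crúbadán code: the search, fetch and extract_links callables stand in for the Google API query, the download/plain-text step and the link extraction, and a simple character-trigram overlap stands in for the project's statistical language filter.

import random
import re
from collections import Counter, deque

def char_trigrams(text):
    # Character trigram counts; a crude stand-in for the statistical model
    # bootstrapped from the seed texts.
    text = re.sub(r"\s+", " ", text.lower())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def looks_like_target(text, profile, threshold=0.3):
    # Cosine overlap between the page's trigram counts and the corpus profile.
    # The 0.3 cutoff is an arbitrary illustrative value.
    doc = char_trigrams(text)
    num = sum(doc[g] * profile[g] for g in doc.keys() & profile.keys())
    den = (sum(v * v for v in doc.values()) ** 0.5) * \
          (sum(v * v for v in profile.values()) ** 0.5)
    return den > 0 and num / den >= threshold

def make_queries(words, n_queries=10, words_per_query=3):
    # Random word combinations drawn from the corpus, used as search queries.
    return [" ".join(random.sample(words, words_per_query))
            for _ in range(n_queries)]

def crawl(seed_texts, search, fetch, extract_links, rounds=3):
    # search(query) -> list of URLs, fetch(url) -> plain text or None,
    # extract_links(url) -> list of URLs: all supplied by the caller.
    corpus = list(seed_texts)
    profile = char_trigrams(" ".join(corpus))
    for _ in range(rounds):
        words = sorted({w for t in corpus for w in t.split()})
        frontier = deque(url for q in make_queries(words) for url in search(q))
        seen = set()
        while frontier:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            text = fetch(url)
            if text and looks_like_target(text, profile):
                corpus.append(text)
                frontier.extend(extract_links(url))  # recurse only into in-language pages
        profile = char_trigrams(" ".join(corpus))    # refine the model for the next round
    return corpus

Each round refines the language model from the growing corpus, so later queries and filtering decisions get better as more in-language text is collected.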

ref:
http://borel.slu.edu/crubadan/

4 comments:

Nitin Reddy Katkam said...

Interesting....

BTW, what do search engines use to determine if a web page is of a relevant subject?

Nazeer said...

subject?? I think you meant language.

Well, I don't know how Google determines the language of a particular page, but I am using TextCat to determine the language.

Right now, TextCat doesn't work for all the languages and all the encodings but I think it can be tweaked.
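For what it's worth, TextCat is based on Cavnar & Trenkle's "out-of-place" n-gram ranking method. Below is a rough, self-contained Python sketch of that idea; the profile size, n-gram lengths and penalty are illustrative defaults, not TextCat's actual settings.

from collections import Counter

def ngram_profile(text, max_n=5, top=300):
    # Ranked list of the most frequent character n-grams (1..max_n);
    # 300 and 5 are illustrative, not TextCat's exact settings.
    text = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Distance = sum of rank differences; n-grams missing from the
    # language profile get the maximum penalty.
    ranks = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - ranks[g]) if g in ranks else penalty
               for i, g in enumerate(doc_profile))

def guess_language(text, lang_profiles):
    # lang_profiles: dict mapping a language name to a profile built
    # once from sample text with ngram_profile().
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

Supporting a new language or encoding then mostly comes down to building a profile from a decent sample of text in that language.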

Nitin Reddy Katkam said...

No, I meant subject. There are many subject-specific search engines out there... there's a new search engine called Kosmix that lets you search for pages under a particular category (Health, Politics and Travel). I was just wondering how they would crawl and index their pages.

Nazeer said...

There are tools that do that. We will soon be adding a categorization feature to the autovita project. You will be able to see it when that goes online.