Sunday, February 12, 2006

Why is there a size limit on file uploads?

A potential problem with form processing scripts is that, by default, they attempt to process form POSTs no matter how large they are. An attacker could target your site by sending a huge POST of many megabytes. The script will attempt to read the entire POST, growing in size until it runs out of memory. While the script attempts to allocate the memory, the system may slow down dramatically. This is a form of denial of service attack.

Another possible attack is for the remote user to force the script to accept a huge file upload. The script will accept the upload and store it in a temporary directory even if your script doesn't expect to receive an uploaded file. The file will be deleted automatically when the script terminates, but in the meantime the remote user may have filled up the server's disk space, causing problems for other programs.

The best way to avoid denial of service attacks is to limit the amount of memory, CPU time and disk space that the scripts can use. Some Web servers come with built-in facilities to accomplish this. In other cases, you can use operating system commands such as ulimit to put ceilings on resource usage.
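As a rough illustration of putting ceilings on resource usage from inside a script, here is a minimal sketch using Python's standard resource module (Unix only). The function names capped and apply_ceilings and the example values are assumptions made for this illustration, not part of any server's configuration. Note that RLIMIT_FSIZE caps the size of any single file the process writes, not total disk usage.

```python
import resource


def capped(current_soft, current_hard, ceiling):
    """Return a (soft, hard) pair whose soft limit is at most `ceiling`.

    The soft limit is only ever lowered, and it never exceeds the hard limit.
    """
    inf = resource.RLIM_INFINITY
    if current_hard != inf:
        ceiling = min(ceiling, current_hard)
    if current_soft != inf:
        ceiling = min(ceiling, current_soft)
    return (ceiling, current_hard)


def apply_ceilings(cpu_seconds, memory_bytes, file_bytes):
    """Lower the soft limits of the current process (hypothetical helper)."""
    wanted = (
        (resource.RLIMIT_CPU, cpu_seconds),    # CPU time in seconds
        (resource.RLIMIT_AS, memory_bytes),    # address space in bytes
        (resource.RLIMIT_FSIZE, file_bytes),   # largest file the process may write
    )
    for which, ceiling in wanted:
        soft, hard = resource.getrlimit(which)
        resource.setrlimit(which, capped(soft, hard, ceiling))
```

A script might call apply_ceilings(10, 256 * 1024 * 1024, 10 * 1024 * 1024) before it starts reading the request; those numbers are placeholders and should be chosen for the site in question.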

Most servers try to avoid denial of service attacks by limiting resource usage and so there is a size limit on file uploads.
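A size limit of this kind can be sketched in a few lines. The example below assumes a script that is handed the request body as a byte stream together with the declared Content-Length. The names MAX_POST_BYTES, PostTooLarge and read_post_body are hypothetical, and the 1 MB figure is only a placeholder.

```python
import io

MAX_POST_BYTES = 1024 * 1024  # hypothetical ceiling: 1 MB


class PostTooLarge(Exception):
    """Raised when a request body exceeds the configured ceiling."""


def read_post_body(stream, content_length, limit=MAX_POST_BYTES):
    """Read a POST body, refusing anything larger than `limit` bytes."""
    if content_length is not None:
        # Reject before reading anything if the declared size is too big.
        if content_length > limit:
            raise PostTooLarge(f"declared {content_length} bytes, limit is {limit}")
        return stream.read(content_length)
    # No declared length: read at most one byte past the limit to detect overflow.
    body = stream.read(limit + 1)
    if len(body) > limit:
        raise PostTooLarge(f"body exceeds limit of {limit} bytes")
    return body


# Usage: a small form POST is accepted as-is.
small = read_post_body(io.BytesIO(b"name=value"), 10)
```

Because an oversized request is rejected before the body is read, the script never grows to hold it in memory and never writes it to a temporary file.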

Saturday, February 11, 2006

Building dictionary from the web

Initially a small collection of "seed" texts is fed to the crawler (a few hundred words of running text have been sufficient in practice). Queries combining words from these texts are generated and passed to the Google API, which returns a list of documents potentially written in the target language. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques, bootstrapped from the initial seed texts and refined as more texts are added to the database, is used to determine which documents (or sections thereof) are written in the target language. The crawler then recursively follows links contained within documents that are in the target language. When these run out, the entire process is repeated, with a new set of Google queries generated from the new, larger corpus.
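The loop described above can be sketched roughly as follows. This is not the crawler's actual code: search, fetch and is_target are hypothetical callables standing in for the search API, the download-and-convert step, and the statistical language test. The random word pairing in make_queries is likewise an assumption about how queries might be formed, and link following is done with a queue rather than true recursion.

```python
import random
from collections import deque


def make_queries(corpus, n_queries=3, words_per_query=2):
    """Build search queries by combining words drawn from the current corpus."""
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    words = sorted({w for text in corpus for w in text.lower().split()})
    if not words:
        return []
    k = min(words_per_query, len(words))
    return [" ".join(rng.sample(words, k)) for _ in range(n_queries)]


def build_corpus(seed_texts, search, fetch, is_target, rounds=2):
    """Grow a corpus from seed texts.

    search(query) returns a list of URLs, fetch(url) returns (text, links),
    and is_target(text, corpus) decides whether text is in the target language.
    """
    corpus = list(seed_texts)
    seen = set()
    for _ in range(rounds):
        queue = deque()
        for query in make_queries(corpus):
            queue.extend(search(query))
        while queue:  # follow links until they run out
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            text, links = fetch(url)
            if is_target(text, corpus):
                corpus.append(text)
                queue.extend(links)  # only follow links from accepted pages
    return corpus


# Usage with a tiny in-memory "web" instead of real network calls.
web = {
    "a": ("dia duit", ["b"]),
    "b": ("conas ata tu", ["c"]),
    "c": ("hello world", ["d"]),
    "d": ("slan leat", []),
}
result = build_corpus(
    ["dia is muire"],
    search=lambda query: ["a"],
    fetch=lambda url: web[url],
    is_target=lambda text, corpus: "hello" not in text,
)
```

In this toy run page "c" is rejected, so its link to "d" is never followed, which mirrors the rule that only documents in the target language have their links crawled.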