Wednesday, June 28, 2006

Search for Selected Text

Highlight text in any web page and search for it on Google with the click of a button.

Start by creating a new button on your browser's links bar and naming it "Search Google" or something similar.

Next, if you’re using Mozilla, Firefox, or Netscape, place this JavaScript code (all on one line) into the button’s location field:
javascript: s = document.getSelection();
for (i = 0; i < frames.length && !s; i++) {
s = frames[i].document.getSelection();
}
location.href = 'http://www.google.com/search?q=' + escape(s);
If you’re using Internet Explorer, you’ll need this code:
javascript: s = (document.frames.length ? '' :
document.selection.createRange().text);
for (i = 0; i < document.frames.length && !s; i++) {
s = document.frames[i].document.selection.createRange().text;
}
location.href = 'http://www.google.com/search?q=' + escape(s);
Next, open any web page, highlight some text, and click your new search button. The next thing you’ll see is a search results page with items matching your selected text.
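
For example, if the highlighted text is foo bar, the button sends the browser to http://www.google.com/search?q=foo%20bar, since escape( ) turns the space into %20.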

Bug: this can still fail when the page contains frames from a different domain, since the browser will not let the script read the selection inside a cross-domain frame.

Monday, May 01, 2006

Converting FAT volumes to NTFS

The convert command converts FAT volumes to NTFS. This command is run from the command prompt.

Example: convert D: /FS:NTFS will convert drive D: from FAT to NTFS.

Saturday, April 22, 2006

Getting rid of java.lang.OutOfMemoryError

Most JVMs allocate memory on the heap for almost everything, except for reflective data such as class metadata. That is put in a separate location, a section of the heap reserved for the permanent generation, which easily fills up when you dynamically load and unload classes, or simply have a large number of classes. Simply add the following options to the java executable and all worries will be gone:
-XX:PermSize=256M -XX:MaxPermSize=256M
The entire command looks like this:
java -server -Djava.awt.headless=true -Xms1024M -Xmx1024M -XX:PermSize=256M -XX:MaxPermSize=256M
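
If the JVM is launched through a wrapper script rather than invoked directly, the same flags can usually be passed through whatever environment variable the script reads. A minimal sketch, assuming a launcher that honours JAVA_OPTS (the variable name is an assumption, not something every launcher uses):

export JAVA_OPTS="-Xms1024M -Xmx1024M -XX:PermSize=256M -XX:MaxPermSize=256M"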

Wednesday, March 29, 2006

Print system information under Linux

'uname' prints information about the machine and operating system it is run on.
% uname -a

Linux localhost.localdomain 2.4.20-8 #1 Thu Mar 13 17:18:24 EST 2003 i686 athlon i386 GNU/Linux

Use this command if you want to know any of the following information (each item also has its own option; see the examples after the list):
kernel name
network node hostname
kernel release
kernel version
machine hardware name
processor type
hardware platform
operating system
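
Each of these can also be printed on its own. The option letters below are the standard GNU uname options; the sample output simply repeats the fields from the uname -a line above:

% uname -s    # kernel name
Linux
% uname -n    # network node hostname
localhost.localdomain
% uname -r    # kernel release
2.4.20-8
% uname -v    # kernel version
#1 Thu Mar 13 17:18:24 EST 2003
% uname -m    # machine hardware name
i686
% uname -p    # processor type
athlon
% uname -i    # hardware platform
i386
% uname -o    # operating system
GNU/Linux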

If you are interested in which distribution is installed, use `cat /etc/issue`. The output will be something like the following (on Red Hat, of course):
Red Hat Linux release 9 (Shrike)
Kernel \r on an \m

Thursday, March 09, 2006

Proxy Settings - Environment Variables

Proxy environment variables can be set automatically for all users by creating the following file:

/etc/profile.d/proxyenv.sh
http_proxy="192.168.36.204:8080"
https_proxy="192.168.36.204:8080"
ftp_proxy="192.168.36.204:8080"

export http_proxy https_proxy ftp_proxy
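
To check the settings without logging out and back in, the file can be sourced by hand; a quick sanity check for the bash variant:

. /etc/profile.d/proxyenv.sh
echo $http_proxy        # should print 192.168.36.204:8080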

If you are using csh instead of bash, then create the following file:

/etc/profile.d/proxyenv.csh
setenv http_proxy "192.168.36.204:8080"
setenv https_proxy "192.168.36.204:8080"
setenv ftp_proxy "192.168.36.204:8080"

Reference: Unofficial ISS Linux Web Pages

Converting HTML to text

I have used the following two scripts and found neither of them very impressive.

html2text (Python script)
Html2text (Perl script)

The Python script converts an HTML page into Markdown (a text-to-HTML format), which I don't want. I want plain text only.

The Perl script requires the input to be "normalized" by a program such as sgmlnorm before it can process it. Apart from this, the script doesn't work well for all documents: it is limited to certain tags and has to be modified to handle others, and the text between unhandled tags simply vanishes from the output. Let's see if I can modify it to work for my documents at least.

Sunday, February 12, 2006

Why is there a size limit on file uploads?

A potential problem with form processing scripts is that, by default, they attempt to process form POSTings no matter how large they are. A wily hacker could attack your site by sending a huge POST of many megabytes. The script will attempt to read the entire POST, growing hugely in size until it runs out of memory. While the script attempts to allocate the memory, the system may slow down dramatically. This is a form of denial of service attack.

Another possible attack is for the remote user to force the script to accept a huge file upload. The script will accept the upload and store it in a temporary directory even if your script doesn't expect to receive an uploaded file. The file will be deleted automatically when the script terminates, but in the meantime the remote user may have filled up the server's disk space, causing problems for other programs.

The best way to avoid denial of service attacks is to limit the amount of memory, CPU time and disk space that the scripts can use. Some Web servers come with built-in facilities to accomplish this. In other cases, you can use operating system commands to put ceilings on resource usage.
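
For example, on a Unix box one way to impose such ceilings is a small wrapper script that sets limits with the shell's ulimit builtin before starting the real program; the numbers and the path below are only placeholders:

#!/bin/bash
ulimit -t 30        # at most 30 seconds of CPU time
ulimit -v 102400    # at most ~100 MB of virtual memory (value in 1024-byte units)
ulimit -f 10240     # files it writes capped at ~10 MB (value in 1024-byte units)
exec /path/to/upload-script    # placeholder path to the real CGI program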

Most servers try to avoid denial of service attacks by limiting resource usage and so there is a size limit on file uploads.

Saturday, February 11, 2006

Building dictionary from the web

Initially a small collection of "seed" texts is fed to the crawler (a few hundred words of running text have been sufficient in practice). Queries combining words from these texts are generated and passed to the Google API, which returns a list of documents potentially written in the target language. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques bootstrapped from the initial seed texts (and refined as more texts are added to the database) is used to determine which documents (or sections thereof) are written in the target language. The crawler then recursively follows links contained within documents that are in the target language. When these run out, the entire process is repeated, with a new set of Google queries generated from the new, larger corpus.

ref:
http://borel.slu.edu/crubadan/