Thursday, March 09, 2006

Converting HTML to text

I have tried the following two scripts and found neither of them very impressive.

html2text (Python script)
Html2text (Perl script)

The Python script converts an HTML page into Markdown (a text-to-HTML format), which I don't want. I want text only.

The Perl script requires the input to be "normalized" by a program such as sgmlnorm before it can process it. Apart from this, the script doesn't work well for all documents. It handles only certain tags and has to be modified to work with others. The text between tags that are not handled simply vanishes from the output. Let's see if I can modify it to work for my documents at least.
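
For comparison, here is a rough sketch of the "text only" behaviour I am after, written with Python's standard html.parser module. It is not taken from either script above; the names TextExtractor and html_to_text are made up for illustration. The idea is to keep the text inside every tag, known or unknown, so nothing vanishes, and to drop only the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text of every element, whatever its tag."""

    SKIP = {"script", "style"}  # their contents are code, not readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text is kept no matter which tag surrounds it, unless we are
        # currently inside a script or style element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string as one whitespace-normalized line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b></p>\n<foo>unknown tag</foo>"))
```

This throws away all layout (paragraph breaks, lists, tables), so it is only a starting point; it also does not need the input to be normalized first.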


Nazeer said...

The HTML Tidy Library Project is an excellent tool for tidying up HTML.

Nazeer said...

On Linux, the best way is to use lynx or links.

lynx -dump <url>
links -dump <url>