How can I extract information from other web sites using Python?
Are there any tools to help me process HTML using Python?
Mar 27th, 2000 15:32
Steve Holden, Nathan Wallace, Siggy Brentrup
I can recommend a few resources. First, a Python tutorial by Josh Cogliati:
http://www.honors.montana.edu/~jjc/easytut/
Second, two short documents written by Magnus Lie Hetland: a tutorial
on Python and an introduction to programming. You can reach them through
http://www.idi.ntnu.no/~mlh/python/
Here is something I have used to do that sort of thing. Check out
the SGMLParser class in the sgmllib module:
http://www.python.org/doc/current/lib/module-sgmllib.html
The code below extracts plain text from a web page. It shows how
the HTMLParser class from htmllib (a subclass of SGMLParser) can be
used with a DumbWriter. This code is due to Siggy Brentrup:
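from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
# The parser sends text to the formatter, which passes it on to the writer.
parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(YOUR_DATA)
# Flush any text still buffered in the parser.
parser.close()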
Instantiated without arguments, DumbWriter writes to stdout.
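The htmllib module was removed in Python 3. As a rough equivalent there, the sketch below uses html.parser.HTMLParser from the standard library instead. This swaps in a different technique from Siggy Brentrup's code above: it collects the text in a list rather than writing it through a formatter. The names TextExtractor and html_to_text are made up for this illustration, and skipping script and style bodies is an assumption about what "plain text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping script and style bodies."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # text fragments in document order
        self._skip_depth = 0   # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script and style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered in the parser
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text("<h1>Hello</h1><p>World &amp; more</p>"))
```

As with the original parser, the HTML is handed over as a string through feed(), so the same data can be used with either approach.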
Here YOUR_DATA is a string of HTML. It can be the result of reading
a page with the urllib module, which allows you to read a remote HTML
page almost as easily as a local file.
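To make the urllib step concrete, here is a minimal sketch of fetching a page as a string. It assumes Python 3, where the function lives in urllib.request (in the Python of this entry's era it was urllib.urlopen). The helper name fetch_html, the UTF-8 fallback encoding, and the example URL are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Return the document at `url` as a string."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when one is given;
        # otherwise fall back to the assumed default.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Placeholder URL; substitute the page you want to read.
    page = fetch_html("http://www.example.com/")
    print(page[:200])
```

The returned string can be passed to parser.feed() in place of YOUR_DATA.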