How can I extract information from other web sites using Python?
Are there any tools to help me process HTML using Python?

Mar 27th, 2000 15:32
Steve Holden, Nathan Wallace, Siggy Brentrup


I can recommend a few. First, a tutorial by Josh Cogliati.
  http://www.honors.montana.edu/~jjc/easytut/
Second, two short documents written by Magnus Lie Hetland: a tutorial
on Python and an introduction to programming. You can reach them through
  http://www.idi.ntnu.no/~mlh/python/
Here is something I have used to do that sort of thing.  Check out
the sgmllib module and its SGMLParser class:
  http://www.python.org/doc/current/lib/module-sgmllib.html
Actual code to extract plain text from a web page is shown below.
It shows how htmllib's HTMLParser (a subclass of SGMLParser) can be
used with a DumbWriter.  This code is due to Siggy Brentrup:
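
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

# The parser sends text to the formatter, which hands it to the writer.
parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(YOUR_DATA)  # YOUR_DATA is a string containing HTML
parser.close()          # process any data still buffered
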
Instantiated without arguments, DumbWriter writes to stdout.
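
On a current Python 3 installation, htmllib, sgmllib and formatter are
no longer in the standard library.  A minimal sketch of the same idea
using html.parser might look like the following.  The names
TextExtractor and extract_text are made up for this illustration, and
skipping the contents of script and style elements is my own assumption
about what counts as plain text:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page (hypothetical helper, not a stdlib class)."""

    # Assumption: the contents of these elements are not wanted as text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return "\n".join(parser.chunks)
```

Calling print(extract_text(YOUR_DATA)) then plays roughly the role of
the DumbWriter above, although it does no line wrapping or other
formatting.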
Here YOUR_DATA can be the text of a remote page fetched with the
urllib module, for example urllib.urlopen(url).read(), which lets
you read a remote HTML page almost as easily as a local file.
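
In Python 3 the URL opener lives in urllib.request and returns bytes
rather than a string.  Here is a hedged sketch of fetching a page as
text.  fetch_html and decode_page are names invented for this example,
and falling back to UTF-8 when the server declares no charset is an
assumption, not something every site honours:

```python
from urllib.request import urlopen


def decode_page(raw, charset=None):
    """Decode response bytes to str.

    Assumption: when no charset is declared, treat the page as UTF-8
    and replace undecodable bytes instead of raising an error.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Read a remote page and return it as a string."""
    with urlopen(url, timeout=timeout) as response:
        # Charset from the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
        return decode_page(response.read(), charset)


# Example use (needs network access, so it is left commented out):
# YOUR_DATA = fetch_html("http://www.python.org/")
```

The string it returns can be handed to a parser's feed() method in
place of YOUR_DATA.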