
Getting tables from HTML

Jan 2nd, 2003 16:37
Brian Thomas, Nathan Wallace, unknown unknown, Hans Nowak, Snippet 135, Magnus L. Hetland


"""
Packages: text.html
"""
"""
> I would like to write a Python script that would read an HTML document and
> extract table contents from it. Eg. each table could be a list of tuples
> with data from the rows. I thought htmllib would provide the basic tools
> for this, but I can't find any example that would be of use. 
> 
> So - does anyone have a Python snippet that looks for tables and gets at
> the data?

I know there have been several responses -- but as a compulsive
minimalist, I just couldn't resist trying to make a small solution...
"""
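# ------ start table parser ------
import re

# re.I matches tags case-insensitively; re.S lets "." span newlines.
flags = re.I | re.S
# The matches are non-greedy, so nested tables are not handled.
tpat = re.compile(r"<table[^>]*>.*?</table>", flags)
rpat = re.compile(r"<tr[^>]*>.*?</tr>", flags)
# Only <td> cells are captured; <th> header cells are skipped.
dpat = re.compile(r"<td[^>]*>(.*?)</td>", flags)

# Read the whole document; the file is closed when the block exits.
with open("data.html") as f:
    data = f.read()

# result is a list of tables; each table is a list of row tuples.
result = []
for table in tpat.findall(data):
    rows = []
    for row in rpat.findall(table):
        rows.append(tuple(dpat.findall(row)))
    result.append(rows)
# ------- stop table parser -------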
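
To make the shape of the result concrete, here is a minimal sketch that wraps the same three regular expressions in a function and runs it on an inline sample instead of data.html. The function name `extract_tables` and the sample markup are made up for illustration; they are not part of the original snippet.

```python
import re

FLAGS = re.I | re.S
TABLE = re.compile(r"<table[^>]*>.*?</table>", FLAGS)
ROW = re.compile(r"<tr[^>]*>.*?</tr>", FLAGS)
CELL = re.compile(r"<td[^>]*>(.*?)</td>", FLAGS)


def extract_tables(html):
    """Return a list of tables; each table is a list of row tuples.

    Hypothetical helper: same regex approach as the parser above,
    so it shares its limits (no nested tables, no <th> cells).
    """
    return [
        [tuple(CELL.findall(row)) for row in ROW.findall(table)]
        for table in TABLE.findall(html)
    ]


# Made-up sample document with two tables and mixed-case tags.
sample = """
<html><body>
<table border="1">
  <tr><td>a</td><td>b</td></tr>
  <tr><TD>c</TD><td>d</td></tr>
</table>
<p>text between tables</p>
<TABLE><TR><TD>1</TD></TR></TABLE>
</body></html>
"""

tables = extract_tables(sample)
print(tables)  # [[('a', 'b'), ('c', 'd')], [('1',)]]
```

Cell contents come back as raw strings, so any markup inside a cell is kept as-is. A row that contains only `<th>` cells yields an empty tuple, because the cell pattern looks for `<td>` only.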