Getting tables from HTML

Jan 2nd, 2003 16:37
Brian Thomas, Nathan Wallace, unknown unknown, Hans Nowak, Snippet 135, Magnus L. Hetland

"""
Packages: text.html
"""
"""
> I would like to write a Python script that would read an HTML document and
> extract table contents from it. E.g. each table could be a list of tuples
> with data from the rows. I thought htmllib would provide the basic tools
> for this, but I can't find any example that would be of use. 
> 
> So - does anyone have a Python snippet that looks for tables and gets at
> the data?

I know there have been several responses -- but as a compulsive
minimalist, I just couldn't resist trying to make a small solution...
"""
# ------ start table parser ------
import re

flags = re.I | re.S  # ignore case; let "." match newlines too
# \b keeps <tr from also matching longer tag names such as <track>
tpat = re.compile(r"<table\b[^>]*>.*?</table>", flags)
rpat = re.compile(r"<tr\b[^>]*>.*?</tr>", flags)
dpat = re.compile(r"<td\b[^>]*>(.*?)</td>", flags)

with open("data.html") as f:
    data = f.read()

# result is a list of tables; each table is a list of row tuples.
# Known limits: <th> cells are skipped, nested tables are not handled,
# and any markup inside a cell is kept as-is.
result = []
for table in tpat.findall(data):
    rows = []
    for row in rpat.findall(table):
        rows.append(tuple(dpat.findall(row)))
    result.append(rows)
# ------- stop table parser -------
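"""
The regex version above is deliberately tiny, so it skips <th> cells and
trips over nested tables or missing closing tags. If that matters, here is
a rough sketch of a different technique: the standard library's
html.parser.HTMLParser (Python 3) instead of regular expressions. The names
TableExtractor and extract_tables are made up for this example, and unlike
the regex snippet it keeps only the text of each cell, not the markup
inside it. Nested tables come out before the table that contains them.
"""

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collects each table as a list of row tuples."""

    def __init__(self):
        super().__init__()
        self.tables = []  # finished tables, in the order they are closed
        self._open = []   # stack of tables still being parsed (for nesting)

    @staticmethod
    def _close_cell(state):
        # Finish the current cell, if any, and add its text to the row.
        if state["cell"] is not None:
            state["row"].append("".join(state["cell"]).strip())
            state["cell"] = None

    def _close_row(self, state):
        # Finish the current row, if any, and store it as a tuple.
        self._close_cell(state)
        if state["row"] is not None:
            state["rows"].append(tuple(state["row"]))
            state["row"] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open.append({"rows": [], "row": None, "cell": None})
        elif self._open:
            state = self._open[-1]
            if tag == "tr":
                self._close_row(state)   # copes with a missing </tr>
                state["row"] = []
            elif tag in ("td", "th") and state["row"] is not None:
                self._close_cell(state)  # copes with a missing </td>
                state["cell"] = []

    def handle_data(self, data):
        # Text only counts while a cell of the innermost table is open.
        if self._open and self._open[-1]["cell"] is not None:
            self._open[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        state = self._open[-1]
        if tag in ("td", "th"):
            self._close_cell(state)
        elif tag == "tr":
            self._close_row(state)
        elif tag == "table":
            self._close_row(state)
            self.tables.append(self._open.pop()["rows"])


def extract_tables(html):
    """Return every table in html as a list of row tuples."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


sample = ("<table><tr><th>Name</th><td>Qty</td></tr>"
          "<TR><td>apple</td><td> 3 </td></TR></table>")
print(extract_tables(sample))
# prints [[('Name', 'Qty'), ('apple', '3')]]
```

"""
To run it on the same file as the regex snippet, something like
extract_tables(open("data.html").read()) should do.
"""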