Entry
Getting tables from HTML
Jan 2nd, 2003 16:37
Brian Thomas, Nathan Wallace, unknown unknown, Hans Nowak, Snippet 135, Magnus L. Hetland
"""
Packages: text.html
"""
"""
> I would like to write a Python script that would read an HTML document and
> extract table contents from it. E.g., each table could be a list of tuples
> with data from the rows. I thought htmllib would provide the basic tools
> for this, but I can't find any example that would be of use.
>
> So - does anyone have a Python snippet that looks for tables and gets at
> the data?
I know there have been several responses -- but as a compulsive
minimalist, I just couldn't resist trying to make a small solution...
"""
# ------ start table parser ------
from re import compile, findall, I, S

# Ignore case, and let "." match newlines so elements may span lines.
flags = I | S
tpat = compile(r"<table[^>]*>.*?</table>", flags)
rpat = compile(r"<tr[^>]*>.*?</tr>", flags)
dpat = compile(r"<td[^>]*>(.*?)</td>", flags)

with open("data.html") as f:
    data = f.read()

# result is a list of tables; each table is a list of row tuples.
result = []
for table in findall(tpat, data):
    result.append([])
    for row in findall(rpat, table):
        result[-1].append([])
        for cell in findall(dpat, row):
            result[-1][-1].append(cell)
        # Freeze the finished row into a tuple.
        result[-1][-1] = tuple(result[-1][-1])
# ------- stop table parser -------
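"""
The regex version above keeps the raw inner HTML of each <td> and skips
<th> cells. For comparison, here is a rough sketch of the same idea built
on the standard library's html.parser module (the Python 3 successor to
the htmllib mentioned in the question). This is a different technique,
not the snippet author's method. The TableParser class and the sample
string are made up for illustration, and the sketch assumes tables are
not nested.
"""

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of row tuples (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of row tuples per <table> seen
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; markup inside the cell is dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(tuple(self._row))
            self._row = None


# Made-up sample input, standing in for the contents of data.html.
sample = """
<table border="1">
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample)
parser.close()
tables = parser.tables
print(tables)
```

"""
Unlike the regex version, this one returns only the text of each cell and
treats header cells like data cells. Rows or cells whose closing tags are
omitted are silently dropped, so real-world pages may need more care.
"""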