HTML Table Extractor
HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.
Installation
pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor
Usage
Example 1 - Simple
Input table:
1 | 2
3 | 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
return_list() returns the following (shown as displayed under Python 2; Python 3 omits the u prefix):
[[u'1', u'2'], [u'3', u'4']]
Example 2 - Transformer
Input table:
1 | 2
3 | 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()
return_list() returns:
[[1, 2], [3, 4]]
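Example 2 passes the built-in int as the transformer, which suggests the transformer is a callable applied to each cell's text. The sketch below assumes any one-argument callable is accepted; to_number is a hypothetical helper, not part of the library, and it falls back to the stripped text when a cell is not numeric.

```python
def to_number(text):
    """Hypothetical transformer: convert cell text to int or float if possible."""
    cleaned = text.strip().replace(",", "")  # drop padding and thousands separators
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    # Not numeric: keep the text, minus surrounding whitespace.
    return text.strip()

# Assumed usage, mirroring Example 2:
# extractor = Extractor(table_doc, transformer=to_number)
```

Unlike transformer=int, a helper like this does not raise on a non-numeric cell such as a header or a blank.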
Example 3 - Pass BS4 Tag
Input table:
1 | 2
3 | 4
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
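<table id="wanted">
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table>
<table>
  <tr><td>not wanted</td></tr>
</table>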
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()
return_list() returns:
[[u'1', u'2'], [u'3', u'4']]
Example 4 - Complex
Input table with merged cells: 1 (spans two rows), 2, 3, 4 (spans two columns), 5 (spans three columns).
from html_table_extractor.extractor import Extractor
table_doc = """
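<table>
  <tr><td rowspan="2">1</td><td>2</td><td>3</td></tr>
  <tr><td colspan="2">4</td></tr>
  <tr><td colspan="3">5</td></tr>
</table>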
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
return_list() returns:
[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
Example 5 - Conflicted
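Input table with overlapping merged cells: 1 (spans two rows), 2, 3 (spans three rows), 4 (spans two columns), 5 (spans three columns).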
from html_table_extractor.extractor import Extractor
table_doc = """
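<table>
  <tr><td rowspan="2">1</td><td>2</td><td rowspan="3">3</td></tr>
  <tr><td colspan="2">4</td></tr>
  <tr><td colspan="3">5</td></tr>
</table>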
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
return_list() returns:
[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
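The results in Examples 4 and 5 are consistent with a simple rule: each cell's value is copied into every grid position its rowspan and colspan cover, and when two spans overlap, the cell that appears first keeps the position. The sketch below illustrates that rule on pre-parsed cells. It is not the library's implementation; expand_spans and the (value, rowspan, colspan) tuple format are assumptions made for illustration.

```python
def expand_spans(rows):
    """Expand merged cells into a rectangular grid.

    rows: list of rows, each a list of (value, rowspan, colspan) tuples.
    """
    grid = {}  # (row, col) -> value; the first cell to claim a position wins
    for r, row in enumerate(rows):
        c = 0
        for value, rowspan, colspan in row:
            # Skip positions already claimed by a span from an earlier cell.
            while (r, c) in grid:
                c += 1
            for i in range(r, r + rowspan):
                for j in range(c, c + colspan):
                    grid.setdefault((i, j), value)  # never overwrite
            c += colspan
    n_rows = max(i for i, _ in grid) + 1
    n_cols = max(j for _, j in grid) + 1
    return [[grid.get((i, j)) for j in range(n_cols)] for i in range(n_rows)]

# Example 4: no overlapping spans.
complex_rows = [[("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
                [("4", 1, 2)],
                [("5", 1, 3)]]
# Example 5: the three-row span of "3" collides with the spans of "4" and "5".
conflicted_rows = [[("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
                   [("4", 1, 2)],
                   [("5", 1, 3)]]

complex_grid = expand_spans(complex_rows)
# [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
conflicted_grid = expand_spans(conflicted_rows)
# [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```

In the conflicted case, "3" already occupies the last column of every row, so "4" and "5" fill only the positions that are still free.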
Example 6 - Write to file
Input table:
1 | 2
3 | 4
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')
This creates a new CSV file named output.csv in the given directory, with the following contents:
1,2
3,4
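Example 6 always produces a file named output.csv. If a different file name is needed, one option is to take the list from return_list() and write it with Python's standard csv module. The sketch below assumes Python 3; write_rows_to_csv is a hypothetical helper, not part of the library.

```python
import csv
import os

def write_rows_to_csv(rows, directory, filename="output.csv"):
    """Hypothetical helper: write a list of rows to directory/filename."""
    target = os.path.join(directory, filename)
    # newline="" lets the csv module control line endings (Python 3).
    with open(target, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target

# Assumed usage with the list returned in Example 2:
# write_rows_to_csv([[1, 2], [3, 4]], ".", "table.csv")
```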
Errors / Bugs
If something is not working correctly, or if you have any suggestions for improvement, please report it on the project's issue tracker.
Copyright
Copyright (c) 2017 Justin Li. Released under the MIT License.
Third-party copyright in this distribution is noted where applicable.
Misc
How to upload the package to PyPI (for the owner's reference)
python setup.py bdist_wheel --universal
twine upload dist/* --verbose