Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the package is named after tag soup). It creates a parse tree for parsed pages that can be used to extract data from HTML,[1] which is useful for web scraping.[2]
It is available for Python 2.7 and Python 3.
Code example
<source lang="python">
#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every anchor, falling back to '/' when it is missing
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
</source>
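The example above needs network access. As a minimal offline sketch of the same idea, the snippet below parses a small, made-up HTML string with unclosed tags and walks the resulting parse tree. The markup and the variable names (`html`, `hrefs`, `texts`) are illustrative assumptions, not part of the library.

```python
from bs4 import BeautifulSoup

# Hypothetical "tag soup": the <p> and <li> tags are never closed
html = "<html><body><p>First<p>Second<ul><li><a href='/a'>A</a><li><a>B</a></ul>"

# The built-in parser still produces a navigable tree from this input
soup = BeautifulSoup(html, "html.parser")

# Collect the href of every anchor, using '/' when the attribute is absent
hrefs = [a.get("href", "/") for a in soup.find_all("a")]
# Collect the visible text of every anchor
texts = [a.get_text() for a in soup.find_all("a")]

print(hrefs)
print(texts)
```

Both anchors are found even though the surrounding list items are not closed, which is the situation the package is designed to tolerate.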
Advantages and Disadvantages
This table summarizes the advantages and disadvantages of each parser library.[2]
Parser | Typical usage | Advantages | Disadvantages |
---|---|---|---|
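Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed; lenient | Not as fast as lxml; less lenient than html5lib |
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") | Very fast; the only currently supported XML parser | External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |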
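Because lxml and html5lib are optional third-party packages, a script may want to fall back to the built-in parser when they are not installed. The following is a sketch under that assumption; `make_soup` and its `preferred` parameter are hypothetical names introduced here, while `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper: the order in `preferred` is an illustrative choice.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_name = make_soup("<p>Unclosed paragraph")
print(parser_name, soup.p.get_text())
```

Since different parsers can build different trees from the same malformed input, it may be safer to name one parser explicitly when results need to be reproducible across machines.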
Release
Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012.
The current release is Beautiful Soup 4.8.2 (December 24, 2019).[1]
You can install Beautiful Soup 4 with pip install beautifulsoup4.
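After installation, one quick way to confirm that the package can be imported is to print its version string. This is a small sketch; the variable name `installed_version` is an illustrative choice.

```python
import bs4

# bs4.__version__ holds the installed release as a string
installed_version = bs4.__version__
print("Beautiful Soup", installed_version)
```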
References
- ↑ Template:Citation
- ↑ 2.0 2.1 crummy.com