Beautiful Soup


Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (the name refers to "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML,^[1] which is useful for web scraping.^[2]
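
As a rough illustration of the parse tree described above, the following sketch feeds a deliberately malformed fragment to Beautiful Soup and reads text back out of the resulting tree. The sample markup and variable names are made up for this example, and it assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <p> nor <b> is ever closed.
broken_html = "<p>Hello <b>world"

# Build a parse tree using the parser from Python's standard library.
soup = BeautifulSoup(broken_html, "html.parser")

# Open tags are closed at the end of the document, so the tree can
# still be navigated by tag name.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
```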

It is available for Python 2.7 and Python 3.

Code example

#!/usr/bin/env python3
# Anchor extraction from HTML document

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse the response with the built-in HTML parser.
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every anchor, or '/' when the attribute is missing.
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
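
The example above needs network access. A possible offline variant, sketched here with made-up sample markup and an assumed base URL, extracts the same kind of anchors from an inline string, skips anchors without an href attribute, and resolves relative links with the standard library's urljoin.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup invented for this sketch; it is not taken from a real page.
html = """
<html><body>
<a href="/wiki/Python">Python</a>
<a href="https://example.org/">Example</a>
<a name="top">No href here</a>
</body></html>
"""

# Assumed base URL, used only to resolve the relative link above.
base_url = "https://en.wikipedia.org/"

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Filtering with href=True is one way to avoid the default value used in the original example; whether a default or a filter is preferable depends on what the caller does with the links.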

Advantages and Disadvantages

This table summarizes the advantages and disadvantages of each parser library.^[2]

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	BeautifulSoup(markup, "html.parser")	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)	Not as fast as lxml; less lenient than html5lib
lxml’s HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml’s XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only currently supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
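
Because lxml and html5lib are external dependencies, a script may want to fall back to the built-in parser when they are missing. The sketch below shows one way this could be done; make_soup and its preference order are hypothetical and not part of Beautiful Soup, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first available parser.

    Returns the soup together with the name of the parser that was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<ul><li>one</li><li>two</li></ul>")
items = [li.get_text() for li in soup.find_all("li")]

print(parser_used)
print(items)
```

The parsers can build different trees from the same malformed input, so a fallback like this is safest when the markup is reasonably well formed or when the result is checked afterwards.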

Release

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012.
The current release is Beautiful Soup 4.8.2^[1] (December 24, 2019).
You can install Beautiful Soup 4 with pip install beautifulsoup4.
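
After installation, the package is imported under the name bs4 rather than beautifulsoup4. A small check along these lines, assuming the install succeeded, prints the version that Python actually picks up:

```python
import bs4

# The distribution is named beautifulsoup4, but the import name is bs4.
installed_version = bs4.__version__
print(installed_version)
```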

References


[1] Citation text not preserved in the source.
[2] Reference named "crummy.com"; no citation text was provided.
