Beautiful Soup

From Hidden Wiki


Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags; it is named after tag soup. It creates a parse tree for each parsed page that can be used to extract data from HTML,[1] which is useful for web scraping.[2]

It is available for Python 2.7 and Python 3.
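
As a rough illustration of the parse tree described above, the following minimal sketch feeds a deliberately malformed fragment to Beautiful Soup and reads values back out of the resulting tree. The markup string and the variable names are made up for this example; only the built-in html.parser backend is assumed.

```python
from bs4 import BeautifulSoup

# Deliberately malformed input: neither <p> nor <b> is ever closed.
markup = "<p class='lead'>Unclosed paragraph <b>bold text"

# Build the parse tree with Python's built-in parser backend.
soup = BeautifulSoup(markup, "html.parser")

# Navigate the tree: find the paragraph, then its nested <b> element.
paragraph = soup.find("p")
lead_class = paragraph["class"]       # "class" is multi-valued, so this is a list
bold_text = paragraph.b.get_text()    # text inside the unclosed <b>
full_text = paragraph.get_text()      # all text beneath the paragraph

print(lead_class, bold_text, full_text)
```

The open tags are closed implicitly at the end of the input, so the nested element is still reachable through ordinary attribute access.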

Code example

<source lang="python">

#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse the response with the built-in parser
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # Print the href of every anchor, defaulting to '/' when it is missing
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))

</source>
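
The same anchor extraction can be tried without network access. The sketch below is one possible variant, not part of the original example: it parses an inline HTML string, skips anchors that have no href attribute, and resolves relative links against a base URL. The helper name extract_links and the sample markup are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every anchor in html that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Hypothetical sample markup: one relative link, one anchor without href,
# and one absolute link.
sample = (
    '<a href="/wiki/Python">Python</a>'
    '<a name="top">no href</a>'
    '<a href="https://example.org/">external</a>'
)

links = extract_links(sample, "https://en.wikipedia.org/wiki/Main_Page")
for link in links:
    print(link)
```

Filtering with href=True avoids the '/' placeholder used in the example above, at the cost of silently dropping anchors that only define a name or id.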

Advantages and Disadvantages

The following summary lists the advantages and disadvantages of each parser library.[2]

Parser: Python’s html.parser
Typical usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Batteries included
  • Decent speed
  • Lenient (as of Python 2.7.3 and 3.2)
Disadvantages:
  • Not as fast as lxml, less lenient than html5lib

Parser: lxml’s HTML parser
Typical usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Very fast
  • Lenient
Disadvantages:
  • External C dependency

Parser: lxml’s XML parser
Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages:
  • Very fast
  • The only currently supported XML parser
Disadvantages:
  • External C dependency

Parser: html5lib
Typical usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5
Disadvantages:
  • Very slow
  • External Python dependency
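
Because lxml and html5lib are external dependencies, a script may not know which parsers are installed. One way to handle this, sketched below under the assumption that falling back to a slower or less lenient parser is acceptable, is to try the preferred parsers in order and catch the FeatureNotFound exception that Beautiful Soup raises when a requested parser is unavailable. The helper name make_soup and the preference order are illustrative choices, not a recommendation from the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in preferred that is installed.

    Returns a (soup, parser_name) pair so the caller can see which
    parser was actually used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hi")
print(parser_used, soup.p.get_text())
```

Different parsers can build different trees from the same malformed input, so code that relies on exact tree structure is usually better off naming a single parser explicitly.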

Release

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012.
The current release is Beautiful Soup 4.8.2,[1] released on December 24, 2019.
You can install Beautiful Soup 4 with pip install beautifulsoup4.
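
Note that the package is installed under the name beautifulsoup4 but imported as bs4, as in the code example above. A small sketch for checking whether the import name is available in the current environment follows; the helper has_module is hypothetical and uses only the standard library.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called name can be imported."""
    return importlib.util.find_spec(name) is not None


if has_module("bs4"):
    print("Beautiful Soup 4 is installed")
else:
    print("Beautiful Soup 4 is missing; try: pip install beautifulsoup4")
```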


References

  1. Template:Citation
  2. crummy.com

