HTML Parsing in Python

Parsing HTML is a common task in Python, whether you are scraping a web page or pulling values out of existing markup. The examples below show how to do it with BeautifulSoup and PyQuery.

Using CSS selectors in BeautifulSoup

BeautifulSoup has limited support for CSS selectors, but it covers the most commonly used ones. Use the select() method to find multiple elements and select_one() to find a single element.

Basic example:

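from bs4 import BeautifulSoup

data = """
<ul>
  <li class="item">item1</li>
  <li class="item">item2</li>
  <li class="item">item3</li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")
# select() returns every element that matches the CSS selector
for item in soup.select("li.item"):
    print(item.get_text())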

Prints:

item1
item2
item3
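
The difference between select() and select_one() can be sketched with a slightly extended version of the same list. The id "menu" and the class "active" below are assumptions added for illustration; they are not part of the original markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "menu" id and "active" class are illustrative only
data = """
<ul id="menu">
  <li class="item">item1</li>
  <li class="item active">item2</li>
  <li class="item">item3</li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")

# select_one() returns only the first match
first = soup.select_one("li.item")
# Selectors can combine ids, child combinators and classes
active = soup.select_one("#menu > li.active")
# When nothing matches, select_one() returns None instead of raising
missing = soup.select_one("li.missing")

print(first.get_text())   # item1
print(active.get_text())  # item2
print(missing)            # None
```

Because select_one() returns None when there is no match, it is worth checking the result before calling get_text() on it.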

PyQuery

pyquery is a jQuery-like library for Python. It has very good support for CSS selectors.

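from pyquery import PyQuery

html = """
<h1>Sales</h1>
<table id="table">
<tr><td>Lorem</td><td>46</td></tr>
<tr><td>Ipsum</td><td>12</td></tr>
<tr><td>Dolor</td><td>27</td></tr>
<tr><td>Sit</td><td>90</td></tr>
</table>
"""

doc = PyQuery(html)

title = doc('h1').text()
print(title)

table_data = []
rows = doc('#table > tr')
for row in rows:
    # Iterating yields plain lxml elements, so wrap each row in PyQuery again
    name = PyQuery(row).find('td').eq(0).text()
    value = PyQuery(row).find('td').eq(1).text()
    table_data.append((name, value))
    print("%s\t %s" % (name, value))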

Locate text after an element in BeautifulSoup

Imagine you have the following HTML:

<div><label>Name:</label> John Smith</div>

And you need to locate the text “John Smith” after the label element.

In this case, you can locate the label element by its text and then use the .next_sibling property:

from bs4 import BeautifulSoup

data = """
<div><label>Name:</label> John Smith</div>
"""

soup = BeautifulSoup(data, "html.parser")
# string= is the current name for the older text= argument
label = soup.find("label", string="Name:")
print(label.next_sibling.strip())

Prints John Smith.
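
The next sibling of a label is not always a bare string: it can be another tag, or it can be missing entirely. One possible way to handle these cases is sketched below. The helper name text_after_label and the "Email" row are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the second row is invented to show a tag sibling
data = """
<div><label>Name:</label> John Smith</div>
<div><label>Email:</label><span>john@example.com</span></div>
"""


def text_after_label(soup, label_text):
    """Return the text that directly follows a label, or None if absent."""
    label = soup.find("label", string=label_text)
    if label is None:
        return None
    sibling = label.next_sibling
    if sibling is None:
        return None
    if isinstance(sibling, NavigableString):
        # Plain text node directly after the label
        return str(sibling).strip()
    # The sibling is a tag, so take its text content
    return sibling.get_text(strip=True)


soup = BeautifulSoup(data, "html.parser")
name = text_after_label(soup, "Name:")
email = text_after_label(soup, "Email:")
phone = text_after_label(soup, "Phone:")

print(name)   # John Smith
print(email)  # john@example.com
print(phone)  # None
```

Returning None for a missing label keeps the caller in control of error handling, rather than failing with an AttributeError inside the lookup.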
