Web scraping with Python 2023

Web scraping with Python is an automated, programmatic process for repeatedly extracting (‘scraping’) data from webpages. Also known as screen scraping or web harvesting, it can pull data on demand from any publicly accessible webpage. On some websites, web scraping may be illegal, so check a site's terms before scraping it.

Web scraping with Python: Scraping using the Scrapy framework

First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject projectName

To scrape, we need a spider. Spiders define how a certain site will be scraped. Here’s the code for a spider that follows the links to the top-voted questions on Stack Overflow and scrapes some data from each page:

import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider has a unique name
    # the parsing starts from a specific set of urls
    start_urls = ['https://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # for each request this generator yields, its response is sent to parse_question
        # do some scraping stuff using css selectors to find question urls
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
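
The response.urljoin() call in parse() is needed because the href values found on a listing page are usually relative. As a rough illustration of how such links resolve against the page URL, the sketch below uses the standard library's urllib.parse.urljoin, which behaves much like Scrapy's helper in the common case. The href values are made up for the example.

```python
from urllib.parse import urljoin

# URL of the listing page the links were found on (taken from start_urls above)
page_url = 'https://stackoverflow.com/questions?sort=votes'

# Hypothetical href values, as they might appear in the page's <a> tags
hrefs = [
    '/questions/111/example-question',  # root-relative path
    'https://example.com/elsewhere',    # already absolute
]

# Resolve each href against the page URL
full_urls = [urljoin(page_url, href) for href in hrefs]

for full_url in full_urls:
    print(full_url)
```

Root-relative paths pick up the scheme and host of the page, while absolute URLs pass through unchanged.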

Save your spider classes in the projectName\spiders directory. In this case – projectName\spiders\stackoverflow_spider.py.

Now you can use your spider. For example, try running (in the project’s directory):

scrapy crawl stackoverflow

Scraping using Selenium WebDriver

Some websites don’t like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # launch Firefox browser
browser.get('https://stackoverflow.com/questions?sort=votes')  # load url

# Recent Selenium 4 releases dropped the find_element_by_* helpers;
# use find_element(By.CSS_SELECTOR, ...) instead
title = browser.find_element(By.CSS_SELECTOR, 'h1').text  # page title (first h1 element)
questions = browser.find_elements(By.CSS_SELECTOR, '.question-summary')  # question list
for question in questions:  # iterate over questions
    question_title = question.find_element(By.CSS_SELECTOR, '.summary h3 a').text
    question_excerpt = question.find_element(By.CSS_SELECTOR, '.summary .excerpt').text
    question_vote = question.find_element(By.CSS_SELECTOR, '.stats .vote .votes .vote-count-post').text
    print("%s\n%s\n%s votes\n-----------\n" % (question_title, question_excerpt, question_vote))

browser.quit()  # close the browser when done

Selenium can do much more. It can modify the browser’s cookies, fill in forms, simulate mouse clicks, take screenshots of web pages, and run custom JavaScript.
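
A few of these features can be sketched in one small helper. The function below expects an already-created WebDriver object that has loaded a page. The function name, cookie values, and screenshot path are illustrative assumptions, while add_cookie, execute_script, and save_screenshot are standard WebDriver methods.

```python
def capture_page_state(browser, screenshot_path):
    """Exercise a few WebDriver features on an already-loaded page.

    `browser` is expected to be a Selenium WebDriver instance; the cookie
    name and value used here are placeholders.
    """
    # Cookies can only be added for the domain of the currently loaded page
    browser.add_cookie({'name': 'example_cookie', 'value': '1'})

    # Run custom JavaScript in the page and get its return value back
    page_title = browser.execute_script('return document.title;')

    # Save a PNG screenshot of the current window
    saved = browser.save_screenshot(screenshot_path)

    return {'title': page_title, 'screenshot_saved': saved}
```

With a browser object like the one created earlier, this could be called as capture_page_state(browser, 'page.png') once browser.get(...) has finished loading the page.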

Web scraping with Python: Basic example of using requests and lxml to scrape some data

# For Python 2 compatibility.
from __future__ import print_function

import lxml.html
import requests


def main():
    r = requests.get("https://httpbin.org")
    html_source = r.text
    root_element = lxml.html.fromstring(html_source)
    # Note root_element.xpath() gives a *list* of results.
    # XPath specifies a path to the element we want.
    page_title = root_element.xpath('/html/head/title/text()')[0]
    print(page_title)


if __name__ == '__main__':
    main()
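
Because xpath() returns a list, indexing it with [0] raises IndexError when nothing matches, for instance on a page without a title element. One way to guard against that is a small helper like the one below. Note that first_or_default is a hypothetical name for this sketch, not part of lxml.

```python
def first_or_default(results, default=None):
    """Return the first item of a result list, or `default` if it is empty."""
    return results[0] if results else default


# Plain lists stand in for what root_element.xpath(...) would return
print(first_or_default(['httpbin.org']))       # a match was found
print(first_or_default([], default='(none)'))  # nothing matched
```

In the example above, the lookup could then be written as first_or_default(root_element.xpath('/html/head/title/text()'), default='').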

Maintaining web-scraping session with requests

It is a good idea to maintain a web-scraping session to persist cookies and other parameters across requests. It can also improve performance, because requests.Session reuses the underlying TCP connection to a host:

import requests

with requests.Session() as session:
    # all requests through session now have User-Agent header set
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

    # set cookies
    session.get('https://httpbin.org/cookies/set?key=value')

    # get cookies
    response = session.get('https://httpbin.org/cookies')
    print(response.text)
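
Note that assigning a new dict to session.headers, as in the example above, replaces the default headers that requests sets (such as Accept-Encoding), whereas session.headers.update() overrides only the keys it is given. The following is a minimal sketch, assuming requests is installed, that inspects the headers a session would send without contacting any server. The short User-Agent string is a placeholder.

```python
import requests

with requests.Session() as session:
    # update() keeps the library's default headers and overrides only User-Agent
    session.headers.update({'User-Agent': 'example-scraper/0.1'})

    # Build the request the session would send, without actually sending it
    request = requests.Request('GET', 'https://httpbin.org/cookies')
    prepared = session.prepare_request(request)

    # Copy the merged headers into a plain dict for inspection
    sent_headers = dict(prepared.headers)

print(sent_headers['User-Agent'])
print('Accept-Encoding' in sent_headers)
```

Preparing a request this way is also a convenient check that session-wide settings are applied as intended before a long scraping run starts.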

Web scraping with Python: Scraping using BeautifulSoup4

from bs4 import BeautifulSoup
import requests

# Use the requests module to obtain a page
res = requests.get('https://www.codechef.com/problems/easy')

# Create a BeautifulSoup object
page = BeautifulSoup(res.text, 'lxml')  # the text field contains the source of the page

# Now use a CSS selector in order to get the table containing the list of problems
# The problems are in the <table> tag with class "dataTable"
datatable_tags = page.select('table.dataTable')

# We extract the first tag from the list, since that's what we desire
datatable = datatable_tags[0]

# Now since we want problem names, they are contained in <b> tags, which are
# directly nested under <a> tags
prob_tags = datatable.select('a > b')
prob_names = [tag.getText().strip() for tag in prob_tags]
print(prob_names)

Simple web content download with urllib.request

The standard library module urllib.request can be used to download web content:

from urllib.request import urlopen

response = urlopen('https://stackoverflow.com/questions?sort=votes')
data = response.read()

# The received bytes should usually be decoded according to the response's character set
# (fall back to UTF-8 when the server does not declare one)
encoding = response.info().get_content_charset() or 'utf-8'
html = data.decode(encoding)
A similar module (urllib2) is also available in Python 2.

Modify Scrapy user agent

Sometimes the default Scrapy user agent ("Scrapy/VERSION (+https://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, then uncomment and edit the following line:

USER_AGENT = 'projectName (+https://www.yourdomain.com)'

For example:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'

Scraping with curl

Imports:

from subprocess import Popen, PIPE
from lxml import etree
from io import StringIO

Downloading:

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'https://stackoverflow.com'
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

-s: silent download
-A: user agent flag

Parsing:

tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')
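
Reading from Popen's stdout works, but it does not report a failed download. As a sketch of one alternative, subprocess.run can wait for curl, enforce a timeout, and raise an exception when curl exits with a non-zero status (for example, when the host cannot be resolved). The helper names and the 30-second timeout are assumptions for illustration, and curl must be installed for fetch_with_curl to work.

```python
import subprocess


def build_curl_command(url, user_agent):
    """Build the argument list for a silent curl download with a custom user agent."""
    # -s: silent download, -A: user agent flag (the same flags as above)
    return ['curl', '-s', '-A', user_agent, url]


def fetch_with_curl(url, user_agent, timeout=30):
    """Run curl and return the response body decoded as UTF-8."""
    completed = subprocess.run(
        build_curl_command(url, user_agent),
        stdout=subprocess.PIPE,
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # raise TimeoutExpired if curl takes too long
    )
    return completed.stdout.decode('utf8')
```

The returned string can be passed to etree.parse(StringIO(...), etree.HTMLParser()) exactly as in the parsing step above.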
