Web Scraping with Python

Richard Lawson



Successfully scrape data from any website with the power of Python

About This Book

  • A hands-on guide to web scraping with real-life problems and solutions
  • Techniques to download and extract data from complex websites
  • Create a number of different web scrapers to extract information

Who This Book Is For

This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python is useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.

What You Will Learn

  • Extract data from web pages with simple Python programming
  • Build a threaded crawler to process web pages in parallel
  • Follow links to crawl a website
  • Cache downloads to reduce bandwidth use
  • Use multiple threads and processes to scrape faster
  • Learn how to parse JavaScript-dependent websites
  • Interact with forms and sessions
  • Solve CAPTCHAs on protected web pages
  • Discover how to track the state of a crawl

In Detail

The Internet contains the most useful set of data ever assembled, much of it publicly accessible for free. However, this data is not easily reusable: it is embedded within the structure and style of websites and must be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. With a simple language like Python, you can extract information from complex websites using straightforward programs.

This book is the ultimate guide to using Python to scrape data from websites. In the early chapters, it covers how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we'll get our hands dirty building a more sophisticated crawler with threads and exploring more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and indirectly scrape data. Discover more of the nitty-gritty of scraping, such as using a browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up with how to create high-level scrapers with the Scrapy library and apply what has been learned to real websites.

Style and approach

This book is a hands-on guide with real-life examples and solutions, starting simple and becoming progressively more complex. Each chapter introduces a problem and then provides one or more possible solutions.



