Instant Web Scraping with Java

Ryan Mitchell

Product Description

Build simple scrapers or vast armies of Java-based bots to untangle and capture the Web

Overview

  • Learn something new in an Instant! A short, fast, focused guide delivering immediate results
  • Get your Java environment set up and running
  • Gather clean, formatted web data into your own database
  • Learn how to work around crawler-resistant websites and legally subvert security measures
  • Use built-in Java features to perform parallel processing and distributed scraping
  • Build test cases for your own websites using JUnit

In Detail

Java is often thought of as a stuffy enterprise language, while web scraping is the often-murky domain of scripting languages. By combining the robustness and extensibility of Java with the flexibility and power of web scraping, we can create immensely useful tools that can solve very difficult problems.

Instant Web Scraping with Java will guide you, step by step, through setting up your Java environment. You will also learn how to write simple web scrapers and distributed networks of crawlers. Throughout the book, we will provide useful tips, out-of-the-box working code, and additional resources to build expert knowledge.

Instant Web Scraping with Java will teach you how to build your own web scrapers using real-world examples that collect and store data from Wikipedia, public records data sites, IP address geolocation services, and more. You will learn how to run scrapers across multiple servers, run them in parallel, and subvert common methods of anti-scraper security used on modern websites.

Instant Web Scraping with Java will show you how to view and collect any Internet data at the speed of your processor!

What you will learn from this book

  • Set up your Java environment and work with the Eclipse IDE
  • Execute complicated web crawlers that run without intervention
  • Handle errors, write documentation, and produce robust code
  • Log scraped data for later retrieval and analysis
  • Write code to test website content and functionality with the JUnit framework
  • Learn techniques for getting around website security measures designed to prevent automated scraping
  • Fill and submit forms automatically
  • Use threading to run scrapers in parallel
  • Use Java’s Remote Method Invocation (RMI) to create multi-server distributed scrapers
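
As a rough illustration of the threading item in the list above, the following is a minimal sketch of running several scraping tasks in parallel with a fixed-size thread pool from `java.util.concurrent`. It is not taken from the book: the class name `ParallelScraperSketch`, the method `scrapeAll`, and the stand-in fetcher are all hypothetical, and the fetcher returns a fake page so the sketch runs offline. A real scraper would replace it with code that downloads and parses each URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelScraperSketch {

    // Applies `fetcher` to every URL on a fixed-size thread pool.
    // Results come back in the same order as the input URLs.
    // (Hypothetical helper, not an API from the book.)
    public static List<String> scrapeAll(List<String> urls,
                                         Function<String, String> fetcher,
                                         int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per URL; each task runs the fetcher on its URL.
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            // invokeAll blocks until every task has finished and returns
            // the futures in the same order as the submitted tasks.
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher (assumption): returns a fake page instead of
        // making a network request.
        Function<String, String> fakeFetcher = url -> "<html>" + url + "</html>";
        List<String> pages = scrapeAll(
                Arrays.asList("http://example.com/a", "http://example.com/b"),
                fakeFetcher, 2);
        for (String page : pages) {
            System.out.println(page);
        }
    }
}
```

Passing the fetcher in as a `Function` keeps the threading logic separate from the download logic, so the same pool code can be exercised without network access.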

Approach

Filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks, this book offers concise recipes that teach a variety of useful web scraping techniques using Java. You will start with a basic recipe for setting up your Java environment and gradually move on to more advanced recipes, such as using complex scrapers.

Who this book is written for

Instant Web Scraping with Java is aimed at developers who, while not necessarily familiar with Java, are at least ready to dive into the complexities of the language with simple, step-by-step instructions leading the way. It is assumed that you have at least an intermediate knowledge of HTML, some knowledge of MySQL, and access to an Internet-connected computer for most of the exercises (after all, scraping the Web is difficult if your code can’t get online!).
