
Python Web Scraping in Practice: A Step-by-Step Guide to Extracting Web Page Data

news 2025/10/18 22:41:00

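Web crawlers have become an important tool for data collection. With a crawler, we can gather large amounts of useful information from the internet for data analysis, research, or other purposes. This article walks through a simple hands-on example that shows how to scrape web page data with Python.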

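1. Preparation

Before you start, make sure Python is installed along with the following libraries:

requests: sends HTTP requests.
beautifulsoup4: parses HTML content.

If these libraries are not installed yet, install them with the following command: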

pip install requests beautifulsoup4

2. Hands-On Example: Scraping a Page Title

(1) Sending the HTTP request

First, use the requests library to send an HTTP request to the target page and retrieve its content.

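import requests

# URL of the target page
url = "https://example.com"

# Send a GET request (the timeout keeps the script from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check whether the request succeeded (status code 200 means success)
if response.status_code == 200:
    print("Request succeeded!")
    html_content = response.text  # HTML content of the page
else:
    print(f"Request failed, status code: {response.status_code}")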
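
Checking the status code by hand does not cover network failures such as timeouts or refused connections, which raise exceptions instead of returning a response. The following is a minimal sketch of a more defensive fetch helper. The function name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original script.

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page's HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP error statuses.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

A call such as `fetch_html("https://example.com")` then yields either the HTML text or `None`, so the calling code needs only one check before parsing.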

(2) Parsing the HTML content

Next, use the BeautifulSoup library to parse the HTML that was retrieved.

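from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page title (soup.title is None if the page has no <title> tag)
title = soup.title.string if soup.title else None
print(f"Page title: {title}")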
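
The same parsed object can be queried for other elements as well. As a rough sketch, the example below pulls the title and every link out of a small HTML string. The sample markup and the helper name `extract_title_and_links` are hypothetical and exist only so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, inlined so the example runs without a network request.
SAMPLE_HTML = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <h1>Example Domain</h1>
    <a href="https://example.com/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No href here</a>
  </body>
</html>
"""


def extract_title_and_links(html):
    """Return (title, [(link text, href), ...]) for the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag.
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True keeps only <a> tags that actually carry an href attribute.
    links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]
    return title, links


title, links = extract_title_and_links(SAMPLE_HTML)
print(title)
for text, href in links:
    print(text, href)
```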

(3) Complete code

Putting the steps above together gives a complete scraper script.

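import requests
from bs4 import BeautifulSoup

# URL of the target page
url = "https://example.com"

# Send a GET request
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the page title (soup.title is None if the page has no <title> tag)
    title = soup.title.string if soup.title else None
    print(f"Page title: {title}")
else:
    print(f"Request failed, status code: {response.status_code}")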

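3. Advanced Techniques

(1) Handling pagination

If the target site spreads its content across multiple pages, you can loop over the pages and extract data from each one.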

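import requests
from bs4 import BeautifulSoup

# Assume each page's URL follows the pattern https://example.com/page/1
total_pages = 5  # placeholder value (an assumption): set this to the real page count

for page in range(1, total_pages + 1):
    page_url = f"https://example.com/page/{page}"
    response = requests.get(page_url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract data...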
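
A pagination loop usually benefits from two additions: a pause between requests so the target server is not flooded, and an early exit when a page cannot be fetched. The sketch below illustrates both under stated assumptions. The generator name `crawl_pages`, its `fetch` callback, and the `delay` parameter are hypothetical, and `fake_fetch` is a stand-in for a real HTTP call so the example runs offline.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


def crawl_pages(
    fetch: Callable[[str], Optional[str]],
    total_pages: int,
    delay: float = 1.0,
) -> Iterator[Tuple[int, str]]:
    """Yield (page number, html) pairs, stopping at the first page that fails."""
    for page in range(1, total_pages + 1):
        page_url = f"https://example.com/page/{page}"
        html = fetch(page_url)
        if html is None:
            # Stop at the first failed page instead of requesting the rest.
            break
        yield page, html
        if page < total_pages:
            # Pause between requests to avoid overloading the server.
            time.sleep(delay)


def fake_fetch(url: str) -> Optional[str]:
    # Stand-in for a real HTTP call: pretend page 3 does not exist.
    return None if url.endswith("/3") else f"<html>{url}</html>"


for page, html in crawl_pages(fake_fetch, total_pages=5, delay=0):
    print(page, len(html))
```

Passing the fetch function in as a parameter keeps the loop independent of any one HTTP library. In a real crawl, `fetch` would wrap `requests.get` and return `None` on failure.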

(2) Scraping dynamic content

If the page content is loaded dynamically by JavaScript, you can use a tool such as Selenium or Playwright to drive a real browser.

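from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")
# Wait for the page to finish loading (here, as an example: up to 10 seconds for <body> to appear)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
# Extract data...
driver.quit()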

(3) Respecting `robots.txt`

Before scraping, always check the target site's robots.txt file to see which content may be crawled.

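import requests

# Fetch the robots.txt file
robots_url = "https://example.com/robots.txt"
response = requests.get(robots_url, timeout=10)

# A bare "Disallow: /" rule blocks the whole site. Compare whole lines, because a
# substring test would also match partial rules such as "Disallow: /admin".
lines = [line.strip() for line in response.text.splitlines()]
if "Disallow: /" in lines:
    print("The site blocks at least one crawler from all of its pages")
else:
    print("The site does not block crawlers from the whole site")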
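
A string comparison cannot account for per-agent rules or path prefixes. Python's standard library includes `urllib.robotparser`, which applies those rules for a specific URL and user agent. The sketch below parses an inlined robots.txt so that it runs offline. The sample rules and the helper name `is_allowed` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; use set_url() plus read() to download it instead.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "https://example.com/private/data.html"))
```

Against a live site, `RobotFileParser("https://example.com/robots.txt")` followed by `read()` downloads and parses the file in one step.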