First, the overall approach

Download the page's HTML with the requests library -> pick out the useful information with BeautifulSoup -> save it

The requests library

Sending a GET request

r = requests.get('https://www.baidu.com')
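The returned Response object carries more than just the page text. A quick sketch of a few commonly used attributes (all of them are part of the requests API):

print(r.status_code)              # HTTP status code, e.g. 200
print(r.encoding)                 # encoding guessed from the response headers
print(r.headers['Content-Type'])  # response headers behave like a dict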

Adding parameters


data = {
    'name': '琪露诺',
    'age': 24
}

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.google.com",
    "Connection": "keep-alive"
}

r = requests.get('https://www.baidu.com',params=data,headers=header)
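If you want to confirm how params was attached, the Response object also records the final URL that was actually requested (standard requests behaviour):

print(r.url)  # the final URL, with the query string from params appended and URL-encoded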

Reading the content

r.encoding = 'utf-8'
print(r.text)
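To cover the "save" step from the outline at the top, the simplest option is to write the text straight to a local file. A minimal sketch (the file name page.html is just an example):

# write the downloaded HTML to disk (file name is arbitrary)
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(r.text)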

The BeautifulSoup library

Instantiating a BeautifulSoup object

soup = BeautifulSoup(response.text, 'lxml') 
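BeautifulSoup can also be instantiated straight from an HTML string, which is handy for testing selectors without sending a request. A small sketch (the markup below is made up):

from bs4 import BeautifulSoup

html = '<div class="item"><a href="https://example.com">link</a></div>'  # example markup
demo_soup = BeautifulSoup(html, 'lxml')  # 'html.parser' also works if lxml is not installed
print(demo_soup.a['href'])               # -> https://example.com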

Extracting the information you need

items1 = soup.select("body>div.layout.layout--728-250>div.layout-left>div.cc-content.service-area>div.list.clearfix>a")
# select() returns a list of Tag objects
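The same elements can often be reached with find_all() instead of a long CSS selector path. A small sketch (the class name "list" is taken from the selector above):

# find_all() matches on tag name and class, no full CSS path required
divs = soup.find_all('div', class_='list')              # every <div> whose classes include "list"
links = [a for div in divs for a in div.find_all('a')]  # the <a> tags inside those divs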

A few ways to print a tag's attributes
for i in items1:
    print(i.get('href'))
    print(i.get_text())
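Besides get(), a Tag also supports dictionary-style indexing, an attrs dict, and options on get_text() (all standard BeautifulSoup behaviour):

for i in items1:
    print(i['href'])               # dictionary-style access, raises KeyError if the attribute is missing
    print(i.attrs)                 # dict of every attribute on the tag
    print(i.get_text(strip=True))  # text content with surrounding whitespace stripped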

A simple example

from bs4 import BeautifulSoup as bs
import requests

# the Top 250 list spans 10 pages of 25 movies each
for page in range(0, 250, 25):
    url = "https://movie.douban.com/top250?start={0}".format(page)
    #print(url)

    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}
    # send the GET request
    response = requests.get(url, headers=ua)
    html_text = response.text
    # instantiate the BeautifulSoup object
    soup = bs(html_text, 'lxml')
    soup_lists = soup.find_all('div', class_='item')
    # rank - Chinese title - rating - link
    for item in soup_lists:
        rank = item.find_all('em')[0].get_text()
        link = item.select('div')[0].select('a')[0].get("href")
        title = item.select('div')[0].select('a')[0].select('img')[0].get("alt")
        score = item.select('div')[1].select('div')[1].select('div')[0].select('span')[1].get_text()
        print(rank, title, score, link)

Running it prints the rank, Chinese title, rating, and link for every movie in the Top 250.
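To also finish the "save" step from the outline, the rows could be collected and written out with the standard csv module. A minimal sketch (the file name douban_top250.csv is arbitrary):

import csv

rows = []  # inside the scraping loop, do rows.append([rank, title, score, link]) instead of print()

with open('douban_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:  # utf-8-sig keeps Chinese titles readable in Excel
    writer = csv.writer(f)
    writer.writerow(['rank', 'title', 'score', 'link'])  # header row
    writer.writerows(rows)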