
Crawlers

A crawler simulates a browser to fetch data from the web.

A general-purpose crawler is a core component of a crawling system and fetches entire pages of data; a focused crawler builds on top of a general-purpose crawler and extracts specific parts of a page; an incremental crawler monitors a site for updates and only fetches the newly added data.

Anti-crawling mechanisms are policies or technical measures that websites adopt to stop crawlers from scraping their data; counter-anti-crawling strategies are what a crawler uses to defeat those mechanisms and obtain the data anyway.

The robots.txt protocol declares which parts of a site may be crawled.
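
A minimal sketch of checking robots.txt with the standard-library urllib.robotparser (the site URL and user-agent name below are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # hypothetical site
rp.read()
# True if this user agent is allowed to fetch the given path
print(rp.can_fetch('my-crawler', 'https://www.example.com/some/page'))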

The HTTP protocol is a form of data exchange between server and client.

Request headers:

  • User-Agent: identifies the client sending the request
  • Connection: whether to close the connection after the request completes (keep-alive or close)

Response headers:

  • Content-Type: the type of data the server returns to the client
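
As a quick illustration (a sketch, using httpbin.org as a throwaway test endpoint), these headers can be set on a request and read back from the response with requests:

import requests

headers = {'User-Agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}
r = requests.get('https://httpbin.org/get', headers=headers)
print(r.request.headers['User-Agent'])  # headers that were actually sent
print(r.headers['Content-Type'])        # Content-Type of the response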

The HTTPS protocol is the secure version of HTTP, with encryption built in.

HTTPS uses certificate-based key encryption.
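
In requests, server certificates are verified by default; a small sketch of the related verify option (the URL and CA-bundle path are placeholders):

import requests

r = requests.get('https://example.com')                            # certificate verified by default
r = requests.get('https://example.com', verify='/path/to/ca.pem')  # verify against a custom CA bundle
r = requests.get('https://example.com', verify=False)              # skip verification (not recommended)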

The requests module

requests is a Python module (installed via pip) for sending network requests; it is used to simulate a browser making requests.

import requests

# specify the URL
url = 'https://dict.baidu.com/'
# send the request
r = requests.get(url=url)
# get the response data
page_text = r.text
# persist it to disk (the filename is arbitrary)
with open('page.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

Specifying parameters:

url = 'https://www.sogou.com/web'
params = {'query':'特朗普'}
r = requests.get(url=url, params=params)

UA spoofing: websites may check the request's User-Agent to decide whether it comes from a crawler, so a crawler should disguise itself as a browser.

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
r = requests.get(url=url, params=params, headers=headers)

Persistent storage:

import json

url = 'https://movie.douban.com/j/chart/top_list' # Douban movie rankings
params = {'type': '17', # sci-fi
          'interval_id': '100:90',
          'action': '',
          'start': 0, # starting index
          'limit': 20} # number of items
response = requests.get(url=url, params=params, headers=headers)
list_data = response.json()
with open('douban.json', 'w', encoding='utf-8') as fp:
    json.dump(list_data, fp=fp, ensure_ascii=False)

Saving an image:

url = 'https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&h=920'
img_data = requests.get(url=url, headers=headers).content
with open('xiaomi.jpg', 'wb') as fp:
    fp.write(img_data)

Sometimes the text encoding is wrong:

# option 1: decode the raw bytes yourself
page_text = requests.get(url=url, headers=headers).content.decode('gbk')

# option 2: set the response encoding before reading .text
page = requests.get(url=url, headers=headers)
page.encoding = 'gbk'
page_text = page.text

# option 3: re-encode text that was decoded with the wrong charset
text = text.encode('iso-8859-1').decode('utf-8')

One more note: if you run into garbled text, you can:

response = requests.get(url)
print(response.encoding) # check the page encoding; here it turns out to be ISO-8859-1

# save the page using that same encoding
with open(page_path, 'w', encoding='ISO-8859-1') as fp:
    fp.write(page_text)

response.encoding = 'gbk' # or override the encoding directly
print(response.encoding)

POST requests:

post_url = 'https://fanyi.baidu.com/sug'
data = {'kw': 'chaos'}
response = requests.post(url=post_url, data=data, headers=headers)
dic = response.json() # Baidu Translate returns JSON
# {'errno': 0, 'data': [{'k': 'chaos', 'v': 'n. 混乱; 杂乱; 紊乱;'}, {'k': 'chaos theory', 'v': 'n. 混沌理论;'}]}
# 'http://scxk.nmpa.gov.cn:81/xk/' is the NMPA cosmetics production licence information platform
post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
data = {'on': 'true',
        'page': '1',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''}
json_ids = requests.post(url=post_url, data=data, headers=headers).json()
# collect the company IDs; the detail data can only be requested by ID
id_list = []
for dic in json_ids['list']:
    id_list.append(dic['ID'])

# http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=ff83aff95c5541cdab5ca6e847514f88 is one company's detail page, but its content is also loaded dynamically

post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
all_detail_list = []
for id in id_list:
    data = {'id': id}
    detail_json = requests.post(
        url=post_url, data=data, headers=headers).json()
    all_detail_list.append(detail_json)

Data Parsing

Regular-expression matching: the goal is to extract the images from the HTML block below.

<div class="swiper-slide ">
    <a target="_blank" href="https://www.mi.com/a/h/18087.html" data-log_code="31pchomepagegallery000001#t=ad&amp;act=webview&amp;page=homepage&amp;page_id=10530&amp;bid=3480927.1&amp;adp=3131&amp;adm=25177">
        <img class="swiper-lazy" src="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&amp;h=920" alt="" key="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/537e0430d5c1b77f0d5123d6bcfc25db.jpg?w=2452&amp;h=920">
    </a>
</div>
<div class="swiper-slide ">
    <a target="_blank" href="https://www.mi.com/buy/detail?product_id=10000204" data-log_code="31pchomepagegallery000001#t=ad&amp;act=webview&amp;page=homepage&amp;page_id=10530&amp;bid=3480927.2&amp;adp=3132&amp;adm=25133">
        <img class="swiper-lazy" data-src="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/cf6ba4d372b80e939104cf369f14139a.jpg?w=2452&amp;h=920" alt="" key="https://cdn.cnbj1.fds.api.mi-img.com/mi-mall/cf6ba4d372b80e939104cf369f14139a.jpg?w=2452&amp;h=920">
    </a>
</div>
import re

url = 'https://www.mi.com/'
page_text = requests.get(url=url, headers=headers).text
# the non-greedy .*? before src= matches both src="..." and data-src="..."
pattern = '<img class="swiper-lazy".*?src="(.*?)".*?</div>'
img_src_list = re.findall(pattern, page_text, re.S)

pattern='mi-mall/(.*?jpg)'
for src in img_src_list:
    img_data = requests.get(url=src, headers=headers).content
    img_name = re.search(pattern, src).group(1)
    with open(img_name, 'wb') as fp:
        fp.write(img_data)

bs4 (BeautifulSoup) is a parsing approach specific to Python.

Building the soup, from a local file or from a fetched page:

from bs4 import BeautifulSoup
# from a local HTML file
with open(path, 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# from a fetched web page
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')

Finding elements:

soup.img  # first <img>
soup.find('img')  # first <img>
soup.find_all('img')  # all matching tags
soup.find_all('img', class_='lazyload')  # filter by attribute; class_ needs the underscore to avoid clashing with the keyword
soup.select('div.charimg > img')  # CSS selector, returns all matches

Extracting data from elements:

from urllib.parse import unquote

d.text  # all text under the tag
d.get_text()  # all text under the tag
d.string  # only the text directly contained by the tag

imgs = soup.select('div.charimg > img')
for img in imgs:
    src = img['src'] # tag attribute
    img_data = requests.get(url=src, headers=headers).content
    img_name = unquote(src.split('/')[-1]) # URL-decode the filename back to Chinese
    with open(img_name, 'wb') as fp:
        fp.write(img_data)
url = 'http://www.biquge.info/24_24159/'
page_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')

titles = soup.select('div.box_con dl a')
# iso-8859-1 here mirrors the encoding note above: the text was decoded with that charset, so it is written back out with the same one
with open('titles.txt', 'w', encoding='iso-8859-1') as fp:
    for title in titles:
        fp.write(title.string+"\n")

with open('contents.txt', 'w', encoding='iso-8859-1') as fp:
    for title in titles:
        new_url = url+title['href']
        content_text = requests.get(url=new_url, headers=headers).text
        fp.write(title.string)
        pattern = '<div id="content"><!--go-->(.*?)<!--over-->'
        contents = re.search(pattern, content_text, re.S).group(1)
        contents = contents.replace('&nbsp;', ' ')
        contents = contents.replace('<br/>', '\n')
        fp.write(contents)
        fp.write('\n\n\n')
        fp.flush()

XPath is the most commonly used, most convenient, and most efficient parsing approach.

from lxml import etree

# local HTML file (pass etree.HTMLParser() as the second argument if the file is not well-formed XML)
tree = etree.parse(path)

# fetched web page
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

Locating tags:

r = tree.xpath('/html/head/title') # returns a list of Elements; a leading / starts from the root tag
r = tree.xpath('/html//title') # a double slash // spans multiple levels
r = tree.xpath('//title') # all title tags
r = tree.xpath('//div[@class="classname"]') # locate by attribute
r = tree.xpath('//div/p[3]') # locate by index, counting from 1
r = tree.xpath('//div/a | //div/p') # locate two kinds of tags at once

Extracting text:

r = tree.xpath('//a/text()') # direct text, still returned in a list
r = tree.xpath('//a//text()') # all text under the tag
r = tree.xpath('//img/@src') # attribute values

Examples

url = 'https://www.58.com/ershoufang/' # 58.com second-hand housing listings
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
trs = tree.xpath('//div[@id="global"]//tr')
for tr in trs:
    title = tr.xpath('./td[2]/a/text()')[0] # ./ starts from the current element
    price = tr.xpath('./td[3]//text()')[0]
    print(title)
    print(price)
import os
from urllib.parse import unquote

url = 'http://prts.wiki/w/干员一览' # PRTS wiki operator list
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
divs = tree.xpath('//div[@class="smwdata"]')
for div in divs:
    img_url = div.xpath('./@data-icon')[0]
    name = unquote(img_url.split('/')[-1])
    img_data = requests.get(url=img_url, headers=headers).content
    with open(os.path.join('icon', name), 'wb') as fp:
        fp.write(img_data)
url = 'https://pvp.qq.com/web201605/js/herolist.json' # Honor of Kings hero list
hero_list = requests.get(url=url, headers=headers).json()
for hero in hero_list:
    ename = hero["ename"] # numeric hero id, used in the detail and skin URLs
    cname = hero["cname"] # Chinese hero name
    detail_url = "https://pvp.qq.com/web201605/herodetail/{}.shtml".format(ename) # hero detail page
    detail_page = requests.get(url=detail_url, headers=headers).content.decode('gbk')
    tree = etree.HTML(detail_page)
    pics = tree.xpath('//div[@class="pic-pf"]/ul/@data-imgname')[0]
    pic_name_list = pics.split('|')
    for i, pic_name in enumerate(pic_name_list):
        pic_url = 'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{}/{}-bigskin-{}.jpg'.format(ename, ename, i+1) # skin poster
        pic_name = pic_name.split('&')[0]
        img_data = requests.get(url=pic_url, headers=headers).content
        with open(os.path.join('skin', cname+'_'+pic_name+'.jpg'), 'wb') as fp:
            fp.write(img_data)

Simulated Login

When logging in, the login information is POSTed to the server. Captchas are an anti-crawling mechanism; the corresponding counter-strategy is captcha recognition.

HTTP/HTTPS is stateless: the server does not keep track of a user's request state on its own. Cookies are used to let the server record the relevant state of the client.

headers['cookie'] = cookie
page_text = requests.get(url, headers=headers).text
response = requests.post(url=url, data=data, headers=headers) # carries the login information
print(response.status_code) # response status code; 200 means success

A session object captures and carries cookies automatically.

session = requests.Session()
response = session.post(url=url, data=data, headers=headers)
page_text = session.get(url=url, headers=headers).text

The server may limit how many times an IP can access it; the corresponding counter-strategy is a proxy. A proxy gets around access limits on your own IP and also hides it.

proxies={'https':'121.230.55.133:9999'}
page_text = requests.get(url, headers=headers, proxies=proxies).text

Asynchronous Crawlers

Multithreading:

from multiprocessing.dummy import Pool # thread pool with the same API as multiprocessing.Pool
pool = Pool(4)
ret = pool.map(func, array)
def func(img_url):
    name = unquote(img_url.split('/')[-1])
    img_data = requests.get(url=img_url, headers=headers).content
    with open(os.path.join('icon', name), 'wb') as fp:
        fp.write(img_data)


if __name__ == "__main__":
    url = 'http://prts.wiki/w/干员一览'
    page_text = requests.get(url=url,   headers=headers).text
    tree = etree.HTML(page_text)
    img_urls = tree.xpath('//div[@class="smwdata"]/@data-icon')
    pool = Pool(10)
    pool.map(func, img_urls)
    pool.close()
    pool.join()

Single thread + asynchronous coroutines.

Using a task or a future:

import asyncio

async def func():
    pass

c = func()  # get a coroutine object
loop = asyncio.get_event_loop()  # create an event loop

# using a task
task = loop.create_task(c)  # get a task object
loop.run_until_complete(task)  # register it and start the event loop

# using a future
future = asyncio.ensure_future(c)  # get a future object
loop.run_until_complete(future)  # register it and start the event loop

Binding a callback function:

def callback_func(future):
    print(future.result()) # print the task's return value

# using a future
future = asyncio.ensure_future(c)  # get a future object
future.add_done_callback(callback_func)
loop.run_until_complete(future)  # register it and start the event loop
import aiohttp
import aiofiles

async def func(img_url):  # defining a coroutine function; calling it returns a coroutine object
    name = unquote(img_url.split('/')[-1])
    async with aiohttp.ClientSession() as session: # aiohttp is the asynchronous counterpart of requests
        # get()/post() accept headers/params/data and proxy='http://ip:port'
        async with await session.get(url=img_url, headers=headers) as response: # await suspends the coroutine here
            img_data = await response.read() # binary response data; use text() for strings and json() for JSON
            # async with aiofiles.open(os.path.join('icon', name), 'wb') as fp: # asynchronous file IO
            #     await fp.write(img_data)
            with open(os.path.join('icon', name), 'wb') as fp:
                fp.write(img_data)


if __name__ == "__main__":
    url = 'http://prts.wiki/w/干员一览'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    img_urls = tree.xpath('//div[@class="smwdata"]/@data-icon')
    tasks = [asyncio.ensure_future(func(img_url)) for img_url in img_urls] # wrap each coroutine object in a task
    loop = asyncio.get_event_loop()  # create an event loop
    loop.run_until_complete(asyncio.wait(tasks))

selenium

Selenium is a browser-automation module. It requires a driver program for the corresponding browser.

Selenium makes it easy to obtain dynamically loaded page data and to simulate logins.


Initializing the browser:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

# bro = webdriver.Chrome(executable_path='./chromedriver') # pass the path to the driver program
# driver = webdriver.Chrome() # works when the driver has been placed in Python's Scripts directory (on PATH)
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
url = "http://scxk.nmpa.gov.cn:81/xk/"
driver.get(url)
page_text = driver.page_source
tree = etree.HTML(page_text)
name_list = tree.xpath('//*[@id="gzlist"]/li/dl/@title')
[print(name) for name in name_list]
driver.quit()

Startup options:

chrome_options = webdriver.ChromeOptions() # same as Options from selenium.webdriver.chrome.options
# developer mode, to keep sites from detecting that selenium is in use
# chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])  # suppress log output
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])  # only one of the two excludeSwitches lines can be used
chrome_options.add_argument('--headless')  # headless mode
chrome_options.add_argument('--disable-gpu')  # together these keep Chrome from opening a window
chrome_options.add_argument('--start-maximized')  # maximize the window
chrome_options.add_argument('--incognito')  # incognito mode
chrome_options.add_argument("disable-cache")  # disable the cache
chrome_options.add_argument('disable-infobars')  # hide the automation infobar
chrome_options.add_argument('log-level=3') # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
browser = webdriver.Chrome(chrome_options=chrome_options)
url = "https://www.tmall.com/"
driver.get(url)
search_input = driver.find_element_by_id('mq')
search_input.send_keys('五年高考三年模拟')
btn = driver.find_element_by_xpath('//*[@id="mallSearch"]/form/fieldset/div/button') # find_elements_* returns a list instead
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)') # scroll to the bottom of the page
btn.click()
driver.back()    # go back
driver.forward() # go forward
driver.quit()

When a page embeds another page in an iframe, you must switch into that frame's scope.

url = "https://www.runoob.com/try/try-cdnjs.php?filename=jqueryui-api-droppable"
driver.get(url)
driver.switch_to.frame('iframeResult')
driver.maximize_window()
div = driver.find_element_by_id('draggable')

Action chains: calling methods on an ActionChains object queues the operations in order, and calling perform() executes the queued events one by one.

from selenium.webdriver import ActionChains
action = ActionChains(driver)
action.click_and_hold(div)
# action.move_by_offset(xoffset=250, yoffset=0)
action.drag_and_drop_by_offset(div, xoffset=250, yoffset=0).perform()
action.release()
action.perform() # executes the queued actions; fluent chaining is also supported
from PIL import Image
driver.save_screenshot(path) # screenshot of the whole page
location = div.location # coordinates of the element's top-left corner
size = div.size # width and height of the element
img = Image.open(path)
rangle = (location['x'], location['y'], location['x'] + size['width'], location['y'] + size['height']) # crop box built from location and size
img = img.crop(rangle) # crop to the element
img.save(path)

The scrapy framework

A framework is a project template that integrates many features and is highly general-purpose.

# create a scrapy project
scrapy startproject <name>
cd <name>
# create a spider file
scrapy genspider <spidername> <domain>
# run the spider
scrapy crawl <spidername>

settings.py configuration:

ROBOTSTXT_OBEY = False # whether to obey the robots.txt protocol
LOG_LEVEL = 'ERROR' # only log errors
USER_AGENT = 'First (+http://www.yourdomain.com)' # UA setting; can be used for UA spoofing

The spider file:

import scrapy

class ExampleSpider(scrapy.Spider):
    # <spidername>: the spider's name, the unique identifier of this spider file
    name = 'example'
    # <domain>: allowed domains, limiting which of the requests from the start URLs may be sent; comment it out to remove the restriction
    allowed_domains = ['example.com']
    # starting URL list; scrapy sends requests to these automatically
    start_urls = ['http://example.com/']

    # parses the response objects; parse() is called once for each entry in the starting URL list
    def parse(self, response):
        pass

Data parsing:

    start_urls = ['https://thwiki.cc/原曲列表']

    def parse(self, response):
        # xpath() always returns a list of Selectors
        h2s = response.xpath('//*[@id="mw-content-text"]/div/div/h2')
        uls = response.xpath('//*[@id="mw-content-text"]/div/div/ul')
        h2s.pop()
        for h2, ul in zip(h2s, uls):
            title = h2.xpath('./span[@class="mw-headline"]/text()')[0].extract() # extract the data; can also be called on the whole list
            musics = ul.xpath('./li/a/text()').extract()
            musics_jp = ul.xpath('./li/span/text()').extract()
            print(title)
            [print('-', music, '/', music_jp)
             for music, music_jp in zip(musics, musics_jp)]

Persistent storage:

Storing via a terminal command writes the return value of parse() to a file.

Only the json/jsonlines/jl/csv/xml/marshal/pickle formats are supported.

scrapy crawl <spidername> -o <path>