Python3 Crawler - Scrapy Notes 2

pipeline

What pipelines do

  1. There can be multiple pipelines, which process, transform, and clean the scraped item data.
  2. The smaller a pipeline's weight, the higher its priority, i.e. the earlier it runs (see the sketch right after this list).
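
As a quick sketch of point 2 (the two class names below are hypothetical), each pipeline is registered in settings.py with a weight, and the smaller weight runs first:

# settings.py -- hypothetical pipelines to illustrate weights
ITEM_PIPELINES = {
    'yangguang.pipelines.CleanPipeline': 100,  # weight 100: runs first
    'yangguang.pipelines.SavePipeline': 300,   # weight 300: runs on CleanPipeline's output
}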

Defining a pipeline

class MyscrapyPipeline(object):
    def process_item(self, item, spider):
        # item: the scraped data itself
        # spider: the spider that produced the item; check its name
        # attribute to tell which spider the data came from
        return item  # must return the item, or later pipelines get no data

    def open_spider(self, spider):  # runs only once, when the spider opens
        pass

    def close_spider(self, spider):  # runs only once, when the spider closes
        pass
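
To make the hooks concrete, here is a sketch of a pipeline that writes every item to a JSON Lines file; the file name and the spider name 'yangguang' are assumptions for illustration, not part of the notes above.

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # spider opening: acquire the resource once
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'yangguang':  # use spider.name to react to one spider only
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # always hand the item on to later pipelines

    def close_spider(self, spider):
        # spider closing: release the resource once
        self.file.close()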

Enabling ITEM_PIPELINES in settings.py

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipelines
ITEM_PIPELINES = {
    'yangguang.pipelines.MyscrapyPipeline': 300,
}

Using modules with Scrapy

The logging module

Set the LOG_LEVEL option in settings.py, then use the logging module in your code to emit log messages.

# settings.py: pick a log level
LOG_LEVEL = 'INFO'  # DEBUG, INFO, WARNING, ERROR
LOG_FILE = "path/to/log/file"  # also write the log to this file

# in code
import logging
logging.debug("a debug message")
logging.info("an info message")
...
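
Inside a spider you can also log through self.logger, the logger Scrapy names after the spider; a minimal sketch (the spider name and URL are placeholders):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        # goes through Scrapy's logging setup, so LOG_LEVEL and LOG_FILE apply
        self.logger.info('parsed %s', response.url)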

Spider settings

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for yangguang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = 'yangguang'

# Where the spider modules live
SPIDER_MODULES = ['yangguang.spiders']
# Where newly generated spiders are placed
NEWSPIDER_MODULE = 'yangguang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# The USER_AGENT header
#USER_AGENT = 'yangguang (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Obey the robots.txt protocol
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Delay between downloads
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Maximum concurrent requests per domain
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum concurrent requests per IP
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether cookies are enabled
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# Whether the Telnet console is enabled (lets you inspect the running crawler over telnet)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers
# DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Spider middlewares
#SPIDER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares
#DOWNLOADER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# Extensions
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipelines
#ITEM_PIPELINES = {
#    'yangguang.pipelines.YangguangPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

# Enable AutoThrottle (automatic rate limiting)
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
# Initial (minimum) AutoThrottle delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# Maximum AutoThrottle delay
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# Enable HTTP caching
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
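
A single spider can override any of these project-wide values through the custom_settings class attribute; a minimal sketch (the spider name and the chosen values are made up for illustration):

import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow'
    # overrides settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }

    def parse(self, response):
        pass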