Scrapy Crawler Framework: Understanding the Shell and Debug Log Output

By 鳄鱼君

Understanding the Scrapy Shell

The Scrapy shell is an interactive console that lets you try out and debug code without starting a spider. It is also a convenient place to test whether an XPath expression is correct, so you can be sure you are extracting exactly the data you want.

response.url               # URL of the current response
response.request.url       # URL of the request that produced this response
response.headers           # response headers
response.body              # response body (the HTML) as bytes; call decode() to get a str
response.request.headers   # headers of the request that produced this response
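As a rough illustration outside a real Scrapy session, the bytes-to-text step for `response.body` can be sketched with the standard library. The HTML below is a made-up stand-in for a real response body (the Scrapy shell itself uses parsel selectors, which support full XPath):

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, in the form Scrapy exposes it: raw bytes.
body = b"<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# response.body is bytes; decode() turns it into a str you can inspect.
html = body.decode("utf-8")

# Quick structural check using the stdlib's limited XPath support.
root = ET.fromstring(html)
title = root.find(".//title").text
print(title)  # -> Demo
```

In the actual shell you would simply type `response.body.decode()` or `response.xpath("//title/text()").get()` and read the result interactively.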

Understanding Scrapy's DEBUG Output

Normally, starting a spider prints output like the following (timestamps removed):

[scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: myspider)
[scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3, Platform Windows-7-6.1.7601-SP1
[scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'myspider', 'NEWSPIDER_MODULE': 'myspider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myspider.spiders']} # overridden settings, i.e. the values configured in settings.py
[scrapy.middleware] INFO: Enabled extensions: # enabled extension plugins
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole', # telnet console, for debugging the program
 'scrapy.extensions.logstats.LogStats']
[scrapy.middleware] INFO: Enabled downloader middlewares: # enabled downloader middlewares
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[scrapy.middleware] INFO: Enabled spider middlewares: # enabled spider middlewares
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[scrapy.middleware] INFO: Enabled item pipelines:
......
[scrapy.core.engine] INFO: Spider opened
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET xxx.com/robots.txt> from <GET xxx.com/robots.txt>
[scrapy.core.engine] DEBUG: Crawled (200) <GET xxx.com/robots.txt> (referer: None)
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET ...> from <GET ...>
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://sz.esf.fang.com/> (referer: None)
[scrapy.core.engine] INFO: Closing spider (finished)
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 876,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 52832,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 3, 8, 5, 37, 12, 513124),
 'log_count/DEBUG': 4,
 'log_count/INFO': 9,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 3, 8, 5, 37, 10, 613015)}
[scrapy.core.engine] INFO: Spider closed (finished)
 

What mainly deserves attention is the DEBUG output, because that is where problems show up: for example, requests blocked by robots.txt (the site's robots rules forbid crawling), or "Filtered offsite request" messages (usually because the domain being crawled is not in the allowed range). So when you are just starting out, you may want to set ROBOTSTXT_OBEY = False in settings.py so robots.txt does not block your requests.
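For early experiments, a minimal settings.py fragment might look like the following (the values are illustrative, not prescriptive):

```python
# settings.py (fragment) -- illustrative values, adjust for your project
BOT_NAME = "myspider"

# Disable robots.txt checking while experimenting; consider re-enabling it
# for production crawls that should respect site policy.
ROBOTSTXT_OBEY = False

# Raise the log level to hide DEBUG noise once the spider works.
LOG_LEVEL = "INFO"
```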

Inside a spider, the settings can be accessed via self.settings:

import scrapy

class MySpider(scrapy.Spider):
    ...
    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
 

In a pipeline, settings can be read through spider.settings.get():

class MyspiderPipeline(object):
    def open_spider(self, spider):
        print(spider.settings.get("MONGO_TABLE", None))
 

That raises a new question: what is open_spider for? See: https://www.e1yu.com/7174.html.
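In short, open_spider is a pipeline hook that Scrapy calls once when the spider starts (and close_spider once when it finishes). The sketch below imitates that lifecycle so it runs without a Scrapy project; DummySettings and DummySpider are stand-ins invented for this example, and in a real project Scrapy supplies the spider object itself:

```python
# Minimal, runnable sketch of the pipeline lifecycle hooks.
class DummySettings:
    """Stand-in for scrapy.settings.Settings (illustrative only)."""
    def __init__(self, values):
        self._values = values

    def get(self, name, default=None):
        return self._values.get(name, default)

class DummySpider:
    """Stand-in for a real spider; MONGO_TABLE is a made-up setting."""
    settings = DummySettings({"MONGO_TABLE": "houses"})

class MyspiderPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: a good place to open
        # database connections or files.
        self.table = spider.settings.get("MONGO_TABLE", None)

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.table = None

pipeline = MyspiderPipeline()
pipeline.open_spider(DummySpider())
print(pipeline.table)  # -> houses
```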

Original: 《Scrapy爬虫框架 深入理解Shell 和日志Debug信息》 by 鳄鱼君, published 2020-05-08.
