Scrapy's commonly used commands fall into two groups: global commands and project commands. Global commands do not depend on a Scrapy project and can be run anywhere, while project commands only work inside a Scrapy project.

I. Global commands
##Run scrapy -h to see the common global commands:

[root@aliyun ~]# scrapy -h
Scrapy 1.5.0 - no active project

Usage
=====
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

The commands shown here are the global ones. bench is the special case: even though it appears under "Available commands", it really belongs with the project commands.

1. The fetch command
##fetch is mainly used to show how a spider fetches a page: scrapy fetch <URL>

[root@aliyun ~]# scrapy fetch http://www.baidu.com
2018-03-15 10:50:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-15 10:50:02 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2k-fips 26 Jan 2017), cryptography 2.1.4, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-03-15 10:50:02 [scrapy.crawler] INFO: Overridden settings: {}
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 10:50:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Spider opened
2018-03-15 10:50:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 10:50:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 10:50:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
<!DOCTYPE html>
<!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title>百度一下，你就知道</title>...</head><body>...(rest of the Baidu homepage markup omitted)...</body></html>
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 10:50:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 2, 50, 2, 425038),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 44892160,
 'memusage/startup': 44892160,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 3, 15, 2, 50, 2, 241466)}
2018-03-15 10:50:02 [scrapy.core.engine] INFO: Spider closed (finished)

##Running this command I hit an error: ImportError: No module named _sqlite3
##The fix is to install sqlite-devel with yum, then recompile and reinstall Python:

yum install -y sqlite-devel
cd /usr/local/src/Python-3.4.2
./configure --prefix=/usr/local/python3
make && make install
ln -fs /usr/local/python3/bin/python3 /usr/bin/python

##Note: if you run fetch outside a Scrapy project directory, Scrapy uses its default spider to do the fetching; run inside a project directory, it calls that project's spider to fetch the page.
##The command's options can be listed with scrapy fetch -h:

[root@aliyun ~]# scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h            show this help message and exit
--spider=SPIDER       use this spider
--headers             print response HTTP headers instead of body
--no-redirect         do not handle HTTP 3xx status codes and print response
                      as-is

Global Options
--------------
--logfile=FILE        log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                      log level (default: DEBUG)
--nolog               disable logging completely
--profile=FILE        write python cProfile stats to FILE
--pidfile=FILE        write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                      set/override setting (may be repeated)
--pdb                 enable pdb on failure

--headers retrieves the page's header information; --logfile stores the log in a given file; --nolog suppresses the crawl log; --spider picks which spider to use; --loglevel sets the log level.

##Get the page's header information with --headers; the --nolog flag hides the crawl log:

[root@aliyun ~]# scrapy fetch --headers --nolog http://www.baidu.com
> User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> Accept-Encoding: gzip,deflate
>
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:28 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Server: bfe/1.0.8.18
< Date: Thu, 15 Mar 2018 03:15:23 GMT
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

Using fetch makes it easy to watch how a page is fetched.

2. The runspider command
With runspider, Scrapy can run a single spider file directly, without needing a Scrapy project; see the sketch below.
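To illustrate, here is a minimal self-contained spider of my own (the file name test_one.py and its contents are hypothetical, not from the original post). Save it anywhere and run scrapy runspider test_one.py:

# test_one.py -- a self-contained spider; no Scrapy project required.
import scrapy

class TestOneSpider(scrapy.Spider):
    name = 'test_one'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Log the page title -- the same page the fetch examples above used.
        self.logger.info('title: %s',
                         response.xpath('//title/text()').extract_first())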
3. The settings command
settings shows Scrapy's configuration. Run inside a Scrapy project directory, it shows that project's configuration; run globally, it shows the default configuration.
##For example, query BOT_NAME with --get BOT_NAME, first inside the project directory and then globally:

[root@aliyun test_scrapy]# cd /python/test_scrapy/myfirstpjt/
[root@aliyun myfirstpjt]# scrapy settings --get BOT_NAME
myfirstpjt
[root@aliyun myfirstpjt]# cd
[root@aliyun ~]# scrapy settings --get BOT_NAME
scrapybot

4. The shell command
shell starts Scrapy's interactive console; scrapy shell is often used during development and testing.
##Run it globally; a sketch of a session follows.
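For example (a hypothetical session of my own; the values shown simply mirror the Baidu fetch output earlier), scrapy shell http://www.baidu.com drops you into a Python console with response already populated:

# Started with: scrapy shell http://www.baidu.com   (hypothetical session)
>>> response.status                                   # HTTP status of the fetched page
200
>>> response.xpath('//title/text()').extract_first()  # the page <title>
'百度一下，你就知道'
>>> exit()                                            # leave the shell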
5. The startproject command
Used to create a Scrapy project: scrapy startproject <projectname>

6. The version command
version shows the Scrapy version:

[root@aliyun ~]# scrapy version
Scrapy 1.5.0
##Other related version information:
[root@aliyun ~]# scrapy version -v
Scrapy : 1.5.0
lxml : 4.1.1.0
libxml2 : 2.9.1
cssselect : 1.0.3
parsel : 1.4.0
w3lib : 1.19.0
Twisted : 17.9.0
Python : 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
pyOpenSSL : 17.5.0 (OpenSSL 1.0.2k-fips 26 Jan 2017)
cryptography : 2.1.4
Platform : Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core

7. The view command
view downloads a page and opens it directly in the browser for inspection: scrapy view <url>

II. Project commands
##Project commands must be run inside a project directory.

1. The bench command
bench benchmarks the performance of the local hardware:

[root@aliyun myfirstpjt]# scrapy bench
……
2018-03-16 14:56:22 [scrapy.extensions.logstats] INFO: Crawled 255 pages (at 1500 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 14:56:23 [scrapy.extensions.logstats] INFO: Crawled 279 pages (at 1440 pages/min), scraped 0 items (at 0 items/min)
2018-03-16 14:56:24 [scrapy.extensions.logstats] INFO: Crawled 303 pages (at 1440 pages/min), scraped 0 items (at 0 items/min)
……
##The output shows this machine can crawl roughly 1440 pages per minute.

2. The genspider command
genspider creates Scrapy spider files; it is a quick way to generate them.
##List the spider templates currently available:

[root@aliyun myfirstpjt]# scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

##Create a spider file based on one of the templates: scrapy genspider -t <template> <new-spider-name> <domain-to-crawl>

[root@aliyun myfirstpjt]# scrapy genspider -t basic test www.baidu.com
Created spider 'test' using template 'basic' in module:
  myfirstpjt.spiders.test

##Inside the project directory you can see the generated test.py, with the domain already filled in:

[root@aliyun myfirstpjt]# cd myfirstpjt/
[root@aliyun myfirstpjt]# ls
__init__.py items.py middlewares.py pipelines.py __pycache__ settings.py spiders
[root@aliyun myfirstpjt]# cd spiders/
[root@aliyun spiders]# ls
__init__.py  __pycache__  test.py
[root@aliyun spiders]# cat test.py
# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass
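The generated parse() is only a stub. As a sketch of my own (illustrative, not part of the original post), it could be filled in to yield the page title, so that running the spider actually scrapes an item:

# test.py with parse() filled in (a hypothetical sketch):
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Yield the page <title> as a one-field item; a plain dict works.
        yield {'title': response.xpath('//title/text()').extract_first()}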
3. The check command
check performs an interactive-style (contract) check of a spider file: scrapy check <spider-name>
##Check the spider file; it passes:

[root@aliyun myfirstpjt]# scrapy check test
----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK

4. The crawl command
crawl starts a given spider: scrapy crawl <spider-name>

[root@aliyun myfirstpjt]# scrapy crawl test --loglevel=INFO
2018-03-16 18:35:39 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: myfirstpjt)
2018-03-16 18:35:39 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.4.2 (default, Mar 15 2018, 10:26:10) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2k-fips 26 Jan 2017), cryptography 2.1.4, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-03-16 18:35:39 [scrapy.crawler] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['myfirstpjt.spiders'], 'BOT_NAME': 'myfirstpjt', 'NEWSPIDER_MODULE': 'myfirstpjt.spiders'}
…… 'start_time': datetime.datetime(2018, 3, 16, 10, 35, 39, 671815)}
2018-03-16 18:35:39 [scrapy.core.engine] INFO: Spider closed (finished)

5. The list command
list shows the spider files in the current project:

[root@aliyun myfirstpjt]# scrapy list
test

6. The edit command
edit opens a spider file directly in an editor, which is handiest on Linux:

[root@aliyun myfirstpjt]# scrapy edit test

7. The parse command
parse fetches the given URL and has the corresponding spider file process and analyse it:

[root@aliyun myfirstpjt]# scrapy parse http://www.baidu.com --nolog

>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[]

##Both lists are empty because the test spider's parse() does nothing yet.

Reposted from: https://blog.51cto.com/lsfandlinux/2087747