Scraping the Famous Quotes Website with Scrapy


  1. Create a project with scrapy startproject <project_name>; this produces the file structure shown below

    demo:.                        ————project root
    │  scrapy.cfg                 ————deployment configuration file
    │
    └─demo
        │  items.py               ————defines the Item data structures
        │  middlewares.py         ————defines the Spider Middlewares
        │  pipelines.py           ————defines the Item Pipelines
        │  settings.py            ————project-wide settings
        │  __init__.py
        │
        └─spiders                 ————contains all concrete Spider implementations
                __init__.py
  2. Use scrapy genspider <spider_name> <url> to create a Spider that subclasses scrapy.Spider and defines three key attributes:

  • name: uniquely identifies the Spider

  • start_urls: the list of URLs crawled when the Spider starts

  • parse() method: parses the returned data, extracts it into Items, and generates Request objects for URLs that need further processing

  • Example: running scrapy genspider quotes quotes.toscrape.com adds a quotes.py file to the spiders folder with the following content

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            pass
  3. Create the Item: add a QuoteItem class to items.py, defining the container for the fields to be scraped. items.py should now look like this

    import scrapy


    class DemoItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass


    class QuoteItem(scrapy.Item):
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()
  4. Import the new Item in the Spider and change the parse() method to yield one item per quote; quotes.py now looks like this (an alternative using Scrapy's own selectors is sketched after the code)

    import scrapy
    from bs4 import BeautifulSoup
    from demo.items import QuoteItem


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            soup = BeautifulSoup(response.text, 'lxml')
            quotes = soup.find_all('div', class_='quote')

            for quote in quotes:
                item = QuoteItem()
                quoteSoup = BeautifulSoup(str(quote), 'lxml')
                item['text'] = quoteSoup.find('span', class_='text').string
                item['author'] = quoteSoup.find('small').string
                item['tags'] = [tag.string for tag in quoteSoup.find_all('a', class_='tag')]
                yield item
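
  • For reference only, a minimal sketch of the same extraction written with Scrapy's built-in CSS selectors instead of BeautifulSoup (the selectors below are assumptions based on the page's markup):

    import scrapy
    from demo.items import QuoteItem


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # each quote sits in a <div class="quote"> block
            for quote in response.css('div.quote'):
                item = QuoteItem()
                item['text'] = quote.css('span.text::text').get()
                item['author'] = quote.css('small.author::text').get()
                item['tags'] = quote.css('a.tag::text').getall()
                yield item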
  5. Make parse() crawl the next page automatically

    import scrapy
    from bs4 import BeautifulSoup
    from demo.items import QuoteItem


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            soup = BeautifulSoup(response.text, 'lxml')
            quotes = soup.find_all('div', class_='quote')

            for quote in quotes:
                item = QuoteItem()
                quoteSoup = BeautifulSoup(str(quote), 'lxml')
                item['text'] = quoteSoup.find('span', class_='text').string
                item['author'] = quoteSoup.find('small').string
                item['tags'] = [tag.string for tag in quoteSoup.find_all('a', class_='tag')]
                yield item

            # the last <a> in the nav block is the Next link; on the final page it is the
            # Previous link, whose duplicate request Scrapy filters out, ending the crawl
            nextPage = soup.find('nav').find_all('a')[-1].get('href')
            nextUrl = response.urljoin(nextPage)
            yield scrapy.Request(url=nextUrl, callback=self.parse)
  6. To save the results to a JSON file, just run scrapy crawl <spider_name> -o items.json (an equivalent settings-based configuration is sketched below)
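
  • Equivalently (in Scrapy 2.1+), the export can be configured once in settings.py through the FEEDS setting; a minimal sketch, with the output file name chosen arbitrarily:

    FEEDS = {
        'quotes.json': {
            'format': 'json',
            'encoding': 'utf8',
        },
    }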

  7. Store the scraped data in MongoDB through an Item Pipeline

    1. Edit settings.py to register the pipeline (with its priority) and add the database settings

      BOT_NAME = 'demo'

      SPIDER_MODULES = ['demo.spiders']
      NEWSPIDER_MODULE = 'demo.spiders'

      ITEM_PIPELINES = {
          'demo.pipelines.quoteToMongo': 300,
      }
      MONGO_URI = 'localhost'
      MONGO_DB = 'study'

      ROBOTSTXT_OBEY = True
      • ITEM_PIPELINES defines each pipeline's priority; the smaller the number, the earlier the pipeline is called (see the sketch below)
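      • For illustration, if a second (hypothetical) pipeline were registered as well, the entry with the lower number would process every item first:

      ITEM_PIPELINES = {
          'demo.pipelines.quoteToMongo': 300,       # lower number, runs first
          'demo.pipelines.SomeOtherPipeline': 800,  # hypothetical, runs second
      }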
    2. Edit pipelines.py and add a quoteToMongo class that handles the interaction with MongoDB, as follows

      from itemadapter import ItemAdapter
      import pymongo


      class DemoPipeline:
          def process_item(self, item, spider):
              return item


      class quoteToMongo(object):
          def __init__(self, mongo_uri, mongo_db):
              self.mongo_uri = mongo_uri
              self.mongo_db = mongo_db

          @classmethod
          def from_crawler(cls, crawler):
              return cls(
                  mongo_uri=crawler.settings.get('MONGO_URI'),
                  mongo_db=crawler.settings.get('MONGO_DB'))

          def open_spider(self, spider):
              self.client = pymongo.MongoClient(self.mongo_uri)
              self.db = self.client[self.mongo_db]

          def process_item(self, item, spider):
              # insert_one() replaces the deprecated insert(), which was removed in pymongo 4
              self.db['quote'].insert_one(dict(item))
              return item

          def close_spider(self, spider):
              self.client.close()
      • from_crawler: a class method marked with @classmethod, which acts as a form of dependency injection. Its crawler parameter gives access to every global setting, so the MONGO_URI and MONGO_DB defined in settings.py can be read here to supply the MongoDB address and database name; the method then returns an instance of the class. In short, this method exists to pull configuration out of settings.py
      • open_spider: called when the Spider is opened; the initialization work (creating the client and selecting the database) is done here
      • close_spider: called when the Spider is closed; the database connection is closed here
    3. Run the crawl; the expected data then shows up in MongoDB (a quick verification sketch follows)
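      • A quick way to check the result with pymongo, assuming the MONGO_URI/MONGO_DB values configured above:

      import pymongo

      # connect with the same settings used by the pipeline
      client = pymongo.MongoClient('localhost')
      db = client['study']
      print(db['quote'].count_documents({}))  # number of stored quotes
      print(db['quote'].find_one())           # a sample document
      client.close()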