Introduction to Python crawlers (6) Introduction to the principle of the Scrapy framework

Scrapy framework

Introduction to Scrapy

  • Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses.
  • Thanks to the framework, users only need to customize and develop a few modules to implement a crawler that scrapes web content and images, which is very convenient.
  • Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (its main competitor is Tornado) to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and it provides various middleware interfaces that can flexibly meet all kinds of needs.

Scrapy architecture

  • Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
  • Scheduler: accepts the Requests sent over by the Engine, sorts and enqueues them in a certain way, and returns them to the Engine when the Engine asks for them.
  • Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which hands them over to the Spider for processing.
  • Spider: processes all Responses, analyzes and extracts data from them to fill the fields required by the Item, and submits the URLs that need follow-up to the Engine, which puts them back into the Scheduler.
  • Item Pipeline: post-processes the Items extracted by the Spider (detailed analysis, filtering, storage, etc.).
  • Downloader Middlewares: a component you can customize to extend the download functionality (see the sketch after this list).
  • Spider Middlewares: a component you can customize to extend and operate on the communication between the Engine and the Spider (for example, Responses going into the Spider and Requests coming out of the Spider).
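
As an illustration of the Downloader Middleware hook, here is a minimal sketch (not from the original article) of a middleware that sets a random User-Agent header on every outgoing request; the class name, the User-Agent strings, and the module path are made up for the example.

# middlewares.py (example sketch)
import random

class RandomUserAgentMiddleware(object):
    # Hypothetical list of User-Agent strings for the example
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Called for every Request that passes through the Downloader Middleware;
        # returning None lets Scrapy continue handling the request as usual
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

To take effect, a middleware like this would be enabled in settings.py, for example: DOWNLOADER_MIDDLEWARES = {'mySpider.middlewares.RandomUserAgentMiddleware': 543}.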

How Scrapy runs, explained in plain language

The code is written and the program starts to run...

  1. Engine: Hi! Spider, which website are you going to handle?
  2. Spider: The boss wants me to handle xxxx.com.
  3. Engine: Give me the first URL that needs to be processed.
  4. Spider: Here you are, the first URL is xxxxxxx.com.
  5. Engine: Hi! Scheduler, I have a request here; please sort it into the queue for me.
  6. Scheduler: OK, I'm on it, just a moment.
  7. Engine: Hi! Scheduler, give me the request you have processed.
  8. Scheduler: Here you are, this is the request I have processed.
  9. Engine: Hi! Downloader, please download this request for me according to the boss's Downloader Middleware settings.
  10. Downloader: OK! Here you are, this is the downloaded content. (If it fails: sorry, downloading this request failed. The Engine then tells the Scheduler: downloading this request failed, record it and we will download it again later.)
  11. Engine: Hi! Spider, here is the downloaded content; it has already been processed according to the boss's Downloader Middleware settings. Handle it yourself. (Note: the responses here are handled by the def parse() function by default.)
  12. Spider: (after processing the data and finding URLs that need follow-up) Hi! Engine, I have two results here: these are the URLs I need to follow up, and this is the Item data I extracted.
  13. Engine: Hi! Pipeline, I have an Item here, please handle it for me! Scheduler! Here are the follow-up URLs, please deal with them for me. Then the loop starts again from step 4, until all the information the boss needs has been obtained.
  14. Pipeline and Scheduler: OK, doing it now!

Steps to build a Scrapy crawler

1. Create a new project

scrapy startproject mySpider

The generated project structure:

mySpider/
    scrapy.cfg            # the project's configuration file
    mySpider/             # the project's Python module; the code is imported from here
        items.py          # the item definitions of the project
        pipelines.py      # the pipeline file of the project
        settings.py       # the settings file of the project
        spiders/          # directory that stores the spider code

2. Define the target (mySpider/items.py)

Decide what information you want to crawl, and define structured data fields in the Item to store the scraped data.

3. Write the spider (spiders/xxxxSpider.py)
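
You can write this file by hand, or generate a skeleton inside the project directory with Scrapy's genspider command, for example scrapy genspider itcast itcast.cn; the result looks roughly like the skeleton below.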

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass
  • name = "" : The identification name of this crawler must be unique, and different names must be defined for different crawlers.
  • allow_domains = [] It is the scope of the searched domain name, that is, the restricted area of ​​the crawler, which stipulates that the crawler only crawls the web pages under this domain name, and URLs that do not exist will be ignored.
  • start_urls = () : The ancestor/list of crawled URLs. The crawler starts to grab data from here, so the first download of data will start from these urls. Other sub-URLs will be inherited from these start URLs.
  • parse(self, response) : The method of parsing. It will be called after each initial URL is downloaded. When calling, the Response object returned from each URL is passed in as the only parameter. The main functions are as follows:
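
Here is a minimal sketch (not from the original article) of what parse() typically yields: extracted Items and follow-up Requests. It assumes the ItcastItem defined later in items.py; the XPath for the "next page" link is a placeholder.

import scrapy
from mySpider.items import ItcastItem

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/"]

    def parse(self, response):
        # 1. Extract data and hand Items to the Engine, which passes them to the Item Pipeline
        for node in response.xpath('//div[@class="li_txt"]'):
            item = ItcastItem()
            item['name'] = node.xpath('./h3/text()').extract_first()
            yield item

        # 2. Hand follow-up URLs to the Engine, which puts them back into the Scheduler
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)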

4. Save the data (pipelines.py)

Define how the scraped data is saved in the pipeline file; it can be written to a local file or to a database.

Reminder

When running a Scrapy project for the first time on Windows, a "DLL load failed" error message may appear; in that case, install the pypiwin32 module (pip install pypiwin32).

First, a simple example

(1) items.py

Define the information you want to crawl:

# -*- coding: utf-8 -*-

import scrapy

class ItcastItem(scrapy.Item):
    # Structured fields for the scraped data: the teacher's name, title, and description
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

(2) itcastspider.py

Write a crawler

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import scrapy
from mySpider.items import ItcastItem

# Create a crawler
class ItcastSpider(scrapy.Spider):
    # Crawler name
    name = "itcast"
    # The domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # The start URLs of the spider
    start_urls = [
        "http://www.itcast.cn/channel/teacher.shtml#",
    ]

    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # List that collects all the teacher items
        teacherItem = []
        # Traverse the root node collection

        for each in teacher_list:
            # Item object used to save data
            item = ItcastItem()
            # name: extract() converts the matched results into a list of Unicode strings
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            # title
            title = each.xpath('./h4/text()').extract()
            # info
            info = each.xpath('./p/text()').extract()

            item['name'] = name[0].encode("gbk")
            item['title'] = title[0].encode("gbk")
            item['info'] = info[0].encode("gbk")

            teacherItem.append(item)

        return teacherItem

Run the command scrapy crawl itcast -o itcast.csv to save the scraped items in ".csv" format.
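
The feed export also supports other formats based on the file extension; for example, scrapy crawl itcast -o itcast.json writes JSON and scrapy crawl itcast -o itcast.jl writes JSON lines.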

Using the pipeline file pipelines.py

(1) Modify settings.py

ITEM_PIPELINES = {
    # Register the pipeline class defined in pipelines.py;
    # the number (0-1000) is the pipeline's priority, and lower values run first
    'mySpider.pipelines.ItcastPipeline': 300,
}

(2) itcastspider.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import scrapy
from mySpider.items import ItcastItem

# Create a crawler
class ItcastSpider(scrapy.Spider):
    # Crawler name
    name = "itcast"
    # The domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # The start URLs of the spider
    start_urls = [
        "http://www.itcast.cn/channel/teacher.shtml#aandroid",

    ]

    def parse(self, response):
        #with open("teacher.html", "w") as f:
        # f.write(response.body)
        # Match the root node list set of all teachers through the xpath that comes with scrapy
        teacher_list = response.xpath('//div[@class="li_txt"]')

        # Traverse the root node collection
        for each in teacher_list:
            # Item object used to save data
            item = ItcastItem()
            # name: extract() converts the matched results into a list of Unicode strings
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            # title
            title = each.xpath('./h4/text()').extract()
            # info
            info = each.xpath('./p/text()').extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            yield item

(3)pipelines.py

Save the data locally

# -*- coding: utf-8 -*-
import json

class ItcastPipeline(object):
    # The __init__ method is optional; it initializes the pipeline
    def __init__(self):
        # Open the output file
        self.filename = open("teacher.json", "w")

    # The process_item method is required; it processes every item
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii = False) + "\n"
        self.filename.write(jsontext.encode("utf-8"))
        return item

    # The close_spider method is optional; it is called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
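
The pipeline can just as well write to a database instead of a local file. Below is a minimal sketch, not from the original article, that stores the same three fields in SQLite using Python's standard sqlite3 module; the file name teachers.db and the table name are made up for the example. Like ItcastPipeline, it would have to be registered in ITEM_PIPELINES in settings.py before Scrapy uses it.

import sqlite3

class ItcastSQLitePipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the database and create the table if needed
        self.conn = sqlite3.connect("teachers.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS teacher (name TEXT, title TEXT, info TEXT)"
        )

    def process_item(self, item, spider):
        # Insert one row per item, then pass the item on unchanged
        self.conn.execute(
            "INSERT INTO teacher (name, title, info) VALUES (?, ?, ?)",
            (item["name"], item["title"], item["info"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider closes: close the database connection
        self.conn.close()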

(4) items.py

# -*- coding: utf-8 -*-

import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

Reference: https://cloud.tencent.com/developer/article/1091732 "Getting started with Python crawlers (6): Introduction to the principles of the Scrapy framework", Tencent Cloud Community