Getting started with python crawler (7) Spider class of Scrapy framework

Spider class

The Spider class defines how to crawl a certain website (or group of websites), including the crawling actions (for example, whether to follow links) and how to extract structured data (Items) from the page content.

In other words, the Spider is where you define the crawling behaviour and how to parse a particular webpage (or set of webpages).

scrapy.Spider is the most basic Spider class; every crawler you write must inherit from it.

The main functions and calling sequence are as follows:

__init__(): initializes the spider name and the start_urls list.

start_requests(): calls make_requests_from_url() to generate Request objects, which are handed to Scrapy to download; the responses are then returned to the spider.

parse(): parses the response and returns Items or Requests (a callback function must be specified for Requests). Items are passed to the Item Pipeline for persistence, while Requests are downloaded by Scrapy and handled by the specified callback (parse() by default); the cycle continues until all data has been processed.
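As a minimal sketch of this call sequence (the site and XPath expressions here are hypothetical, not from the tutorial), a spider only has to declare a name, a start_urls list, and a parse() callback:

import scrapy

class ExampleSpider(scrapy.Spider):
    # name tells Scrapy how to locate (and initialize) this spider
    name = "example"
    # start_requests() turns each of these URLs into a Request
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        # The default callback: extract structured data from the response
        yield {"title": response.xpath("//title/text()").extract_first()}
        # Or yield further Requests; Scrapy downloads them and hands the
        # responses back to the specified callback
        next_page = response.xpath("//a[@class='next']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)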

Source reference

#The base class of all crawlers, user-defined crawlers must inherit from this class
class Spider(object_ref):

    #Define the string of the spider name (string). The name of the spider defines how Scrapy locates (and initializes) the spider, so it must be unique.
    #name is the most important attribute of spider, and it is required.
    #General practice is to name the spider by the website (domain) (with or without suffix). For example, if a spider crawls mywebsite.com, the spider will usually be named mywebsite
    name = None

    #Initialization, extract the crawler name, start_ruls
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # If the spider has no name, interrupt subsequent operations and raise an error
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)

        # Python objects or types store member information through the built-in member __dict__
        self.__dict__.update(kwargs)

        # List of start URLs. When no particular URLs are specified, the spider starts
        # crawling from this list, so the first page retrieved will be one of these;
        # subsequent URLs are extracted from the downloaded data.
        if not hasattr(self,'start_urls'):
            self.start_urls = []

    # Emit log messages through Scrapy's logging facility
    def log(self, message, level=log.DEBUG, **kw):
        log.msg(message, spider=self, level=level, **kw)

    # Bind the spider to a crawler; assert that it has not been bound to one already
    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "Spider already bounded to %s" % crawler
        self._crawler = crawler

    @property
    def crawler(self):
        assert hasattr(self,'_crawler'), "Spider not bounded to any crawler"
        return self._crawler

    @property
    def settings(self):
        return self.crawler.settings

    #This method will read the addresses in start_urls, and generate a Request object for each address, and hand it over to Scrapy to download and return the Response
    #The method is only called once
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # Called by start_requests() to actually generate the Request.
    # The Request object's default callback is parse(), and the request method is GET
    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    # The default Request callback, which handles the returned response
    # and generates Item or Request objects. The user must implement this method
    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    def __str__(self):
        return "<%s %r at 0x%0x>"% (type(self).__name__, self.name, id(self))

    __repr__ = __str__

Main attributes and methods

name

A string defining the name of the spider.

For example, if a spider crawls mywebsite.com, the spider will usually be named mywebsite

allowed_domains

Contains the list of domains allowed to be crawled by the spider, optional.

start_urls

The initial URL tuple/list. When no particular URLs are specified, the spider starts crawling from this list.

start_requests(self)

This method must return an iterable containing the first Requests the spider will use to crawl (the default implementation generates them from the URLs in start_urls).

It is called when the spider starts crawling and no particular URLs have been specified, and it is only called once.

parse(self, response)

The default callback for a Request that does not specify its own callback. It processes the response returned for the requested URL and generates Item or Request objects.
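When the entry requests need a custom callback or extra parameters, start_requests() can be overridden instead of relying on start_urls. The sketch below (with a hypothetical login URL and form fields) shows the idea:

import scrapy

class LoginFirstSpider(scrapy.Spider):
    name = "login_first"

    def start_requests(self):
        # Must return an iterable of Requests; here a single POST login
        # request whose response goes to a custom callback
        return [scrapy.FormRequest("http://example.com/login",
                                   formdata={"user": "john", "pass": "secret"},
                                   callback=self.after_login)]

    def after_login(self, response):
        # Only continue crawling once the login response has been handled
        yield scrapy.Request("http://example.com/members", callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}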

Example: using the Scrapy framework to crawl all job postings from Tencent Recruitment

1. First, analyze the URLs of the Tencent recruitment site

First page: https://hr.tencent.com/position.php?&start=0#a

Second page: https://hr.tencent.com/position.php?&start=10#a

Third page: https://hr.tencent.com/position.php?&start=20#a

The only thing that changes between pages is the start parameter, which increases by 10 per page.

It is also found that some job-category cells are empty. When extracting the job category, the empty cells must be matched as well by using the XPath union ./td[2]/text() | ./td[2]; otherwise nothing is extracted for those rows and the for loop aborts. A defensive alternative is sketched below.
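A defensive alternative (a sketch, not what this tutorial's spider uses) is to extract with a default value so that empty cells cannot raise an IndexError:

# extract_first() returns the given default instead of failing when the cell is empty
item['positionType'] = each.xpath("./td[2]/text()").extract_first(default="")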

2. Directory structure

3. items.py

# -*- coding: utf-8 -*-
import scrapy

class TencentItem(scrapy.Item):
    # Position name
    positionname = scrapy.Field()
    # Details link
    positionlink = scrapy.Field()
    # Job category
    positionType = scrapy.Field()
    # Number of recruits
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Release time
    publishTime = scrapy.Field()

4. tencentPosition.py

tencentPosition.py was created with the command scrapy genspider tencentPosition "tencent.com"; the full sequence of commands is shown below.
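Assuming the project itself is named tencent (as the import paths in the code suggest), the commands would be:

scrapy startproject tencent
cd tencent
scrapy genspider tencentPosition "tencent.com"

The generated tencentPosition.py is then edited as follows: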

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem

class TencentpositionSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]

    url = "http://hr.tencent.com/position.php?&start="
    offset = 0

    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] |//tr[@class='odd']"):
            # Initialize the model object
            item = TencentItem()
            # Position name
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # Details link
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # Job category
            item['positionType'] = each.xpath("./td[2]/text()|./td[2]").extract()[0]
            # Number of recruits
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            # Work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            # Release time
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

            yield item

        # After processing one page of data, send the request for the next page:
        # self.offset is incremented by 10 and appended to the url, and the callback
        # self.parse is used again to process the new response
        if self.offset < 3171:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

5. pipelines.py

# -*- coding: utf-8 -*-

import json

class TencentPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
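As an aside, for a flat JSON dump like this, Scrapy's built-in feed export can produce similar output without a custom pipeline, for example:

scrapy crawl tencent -o tencent.json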

6. Settings in settings.py

ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 4  # Throttle requests; crawling too fast can cause data to be missed


DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}


ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
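With the settings in place, the crawler is started from the project's root directory:

scrapy crawl tencent

Scrapy looks the spider up by its name attribute ("tencent"), and TencentPipeline writes each scraped item to tencent.json.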

The result of crawling

Reference: Getting started with python crawler (7) Spider class of the Scrapy framework, Tencent Cloud Developer Community: https://cloud.tencent.com/developer/article/1091737