Getting started with python crawler (two) Opener and Requests


Handler and Opener

Handler processor and custom Opener

An opener is an instance of urllib2.OpenerDirector. The urlopen we have been using so far is a special opener that the module builds for us.

However, the urlopen() method does not support advanced HTTP/HTTPS features such as proxies and cookies. To support these features:

  1. Use the relevant Handler processor class to create a handler object with the desired functionality;

  2. Then pass these handler objects to the urllib2.build_opener() method to create a custom opener object;

  3. Use the custom opener object and call its open() method to send requests.

If all requests in the program should use the custom opener, you can call urllib2.install_opener() to install it as the global opener, which means that later calls to urlopen() will also go through this opener (choose according to your own needs).
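
For example, a minimal sketch of installing a custom opener globally (the plain HTTPHandler used here is just a placeholder for whatever handlers you actually need):

# _*_ coding:utf-8 _*_
import urllib2

# Build an opener from an ordinary HTTPHandler (any handler objects could be passed here)
opener = urllib2.build_opener(urllib2.HTTPHandler())

# Install it as the global opener; from now on urlopen() also goes through it
urllib2.install_opener(opener)

response = urllib2.urlopen('http://www.baidu.com/')
print response.read()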

Simple custom opener()

# _*_ coding:utf-8 _*_
import urllib2

# Build an HTTPHandler processor object to support handling HTTP requests
http_handler = urllib2.HTTPHandler()
# Call the build_opener() method to build a custom opener object; the parameter is the handler object built above
opener = urllib2.build_opener(http_handler)

request = urllib2.Request('http://www.baidu.com/s')
# Send the request with the custom opener's open() method instead of urlopen()
response = opener.open(request)
print response.read()

Debug log mode

You can pass debuglevel=1 when building the HTTPHandler to turn on debug logging, which prints the details of each request and response:

# _*_ coding:utf-8 _*_
import urllib2

# Build an HTTPHandler processor object to support processing HTTP requests
# http_handler = urllib2.HTTPHandler()
# Mainly used for debugging
http_handler = urllib2.HTTPHandler(debuglevel=1)
# Call the build_opener() method to build a custom opener object, the parameter is the built processor object
opener = urllib2.build_opener(http_handler)

request = urllib2.Request('http://www.baidu.com/s')
response = opener.open(request)
# print response.read()

ProxyHandler processor (proxy settings)

Using proxy IPs is the second big weapon in the crawler/anti-crawler battle, and usually the most effective one.

Many websites detect how often a given IP visits within a certain period (through traffic statistics, system logs, etc.). If the visit pattern does not look like a normal user, the site will block that IP.

So we can set up several proxy servers and switch to a different proxy every so often; even if one IP gets banned, we can change the IP and keep crawling.

ProxyHandler is used to set a proxy server in urllib2; build a custom opener with it to send requests through the proxy:

Free proxy websites: http://www.xicidaili.com/; https://www.kuaidaili.com/free/inha/

# _*_ coding:utf-8 _*_
import urllib2

# Construct a Handler processor object, the parameter is a dictionary type, including proxy type and proxy server IP+Port
httpproxy_handler = urllib2.ProxyHandler({'http':'118.114.77.47:8080'})
#Use proxy
opener = urllib2.build_opener(httpproxy_handler)
request = urllib2.Request('http://www.baidu.com/s')

#1 Written this way, only requests sent with opener.open() use the custom proxy; urlopen() does not use it.
response = opener.open(request)

#2 Written this way, the opener is installed globally, and all subsequent requests, whether via opener.open() or urlopen(), use the custom proxy.
#urllib2.install_opener(opener)
#response = urllib2.urlopen(request)

print response.read()
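
To switch proxies every so often, as mentioned above, one approach is to keep a small pool and pick one at random for each request. A minimal sketch, assuming the listed IPs are replaced with proxies that actually work:

# _*_ coding:utf-8 _*_
import random
import urllib2

# A hypothetical pool of proxies; replace these with proxies you have verified
proxy_list = [
    {'http': '118.114.77.47:8080'},
    {'http': '119.28.152.208:80'},
]

# Pick a proxy at random and build an opener with it for this request
proxy_handler = urllib2.ProxyHandler(random.choice(proxy_list))
opener = urllib2.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com/')
print response.read()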

However, these free public proxies are used by many people and generally have short lifetimes, slow speeds, low anonymity, unstable HTTP/HTTPS support and other shortcomings (free things are rarely good), so professional crawler engineers usually use high-quality private proxies.

 Private proxy

(The proxy server requires a username and password; you must authenticate before you can use it.)

# _*_ coding:utf-8 _*_
import urllib2

#You must provide the username, password, IP and port
authproxy_handler = urllib2.ProxyHandler({'http':'username:pwd@ip:port'})
opener = urllib2.build_opener(authproxy_handler)
request = urllib2.Request('http://www.baidu.com/s')
response = opener.open(request)
print response.read()

For safety, it is common to store the private proxy's username and password in system environment variables and read them from there:

# _*_ coding:utf-8 _*_
import urllib2
import os 

# Get the user name and password from the environment variables
proxyuser = os.environ.get('proxuser')   
proxypasswd = os.environ.get('proxpasswd')
#You must provide the username, password, IP and port
authproxy_handler = urllib2.ProxyHandler({'http':proxyuser+':'+proxypasswd+'@ip:port'})
opener = urllib2.build_opener(authproxy_handler)
request = urllib2.Request('http://www.baidu.com/s')
response = opener.open(request)
print response.read()

 cookielib library and HTTPCookieProcessor handler

 Cookie: a piece of text that some website servers store in the user's browser in order to identify the user and perform session tracking. Cookies can keep login information until the user's next session with the server.

cookielib module: its main purpose is to provide objects for storing cookies.

HTTPCookieProcessor handler: its main purpose is to process these cookie objects and build a handler object from them.

cookielib library

The main objects of this module are CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar.

  • CookieJar: an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The entire cookie store lives in memory, and the cookies are lost once the CookieJar instance is garbage-collected.
  • FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar, used to create a FileCookieJar instance, retrieve cookie information and store cookies in a file. filename is the name of the file where the cookies are stored. When delayload is True, the file is accessed lazily, i.e. it is read or written only when needed.
  • MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
  • LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar, creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.

In fact, in most cases we only use CookieJar(); if we need to interact with local files, we use MozillaCookieJar() or LWPCookieJar().
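
For example, a minimal sketch of persisting cookies to a local file with MozillaCookieJar (the filename cookies.txt is only an example):

# _*_ coding:utf-8 _*_
import urllib2
import cookielib

# Build a MozillaCookieJar that can save cookies in the Mozilla cookies.txt format
cookie = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# After a request, any cookies set by the server are held in the jar; save them to disk
opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)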

Let's log in to Renren through an example to learn the usage of CookieJar().

Login Renren

# _*_ coding:utf-8 _*_
import urllib2
import urllib
import cookielib

#Build a cookieJar object through the CookieJar() class to save the value of the cookie
cookie = cookielib.CookieJar()
#Build a processor object through the HTTPCookieProcessor() processor class to process cookies
#The parameter is the constructed CookieJar() object
cookie_handler = urllib2.HTTPCookieProcessor(cookie)
#Build a custom opener
opener = urllib2.build_opener(cookie_handler)
# By customizing the opener's addheaders parameters, you can add HTTP header parameters
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36')]
#Renren's login endpoint
url ='http://www.renren.com/PLogin.do'
#The account and password required to log in
data = {'email':'15********','password':'py********'}
# Transform by urlencode() encoding
data = urllib.urlencode(data)
# The first time is a POST request, send the parameters required for login, and get the cookie
request = urllib2.Request(url,data = data)
response = opener.open(request)
print response.read()

With cookies, you can directly crawl other pages

# _*_ coding:utf-8 _*_
import urllib2
import urllib
import cookielib

#Build a cookieJar object through the CookieJar() class to save the value of the cookie
cookie = cookielib.CookieJar()
#Build a processor object through the HTTPCookieProcessor() processor class to process cookies
#The parameter is the constructed CookieJar() object
cookie_handler = urllib2.HTTPCookieProcessor(cookie)
#Build a custom opener
opener = urllib2.build_opener(cookie_handler)
# By customizing the opener's addheaders parameters, you can add HTTP header parameters
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36')]
#Renren's login endpoint
url ='http://www.renren.com/PLogin.do'
#The account and password required to log in
data = {'email':'15********','password':'python********'}
# Transform by urlencode() encoding
data = urllib.urlencode(data)

request = urllib2.Request(url,data = data)
response = opener.open(request)
# print response.read()

# You can directly crawl other pages after login
response_other = opener.open('http://friend.renren.com/managefriends')
print response_other.read()

Requests module

Installation: simply pip install requests

Requests provides all the functionality of urllib2. It supports HTTP keep-alive and connection pooling, using cookies to maintain sessions, file uploads, automatic detection of the response content's encoding, internationalized URLs, and automatic encoding of POST data.
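
As a quick illustration of the automatic encoding of POST data, a minimal sketch (httpbin.org is used here purely as a test endpoint):

# _*_ coding:utf-8 _*_
import requests

# requests url-encodes the dict and sets the form content type automatically
data = {'name': 'python', 'job': 'crawler'}
response = requests.post('http://httpbin.org/post', data = data)
print response.text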

Add headers and query parameters

# _*_ coding:utf-8 _*_

import requests

kw = {'wd':'python'}
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

# params accepts a dict or a string of query parameters; a dict is automatically url-encoded, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s?", params = kw, headers = headers)

# View the response content; response.text returns data as Unicode
print response.text

# View the response content; response.content returns data as a byte stream
print response.content

# View the complete url address
print response.url

# View the response's character encoding
print response.encoding

# View response code
print response.status_code

Use proxy

# _*_ coding:utf-8 _*_

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
# Choose a different proxy according to the protocol type
proxies = {
  "http": "http://119.28.152.208:80",
}

response = requests.get("http://www.baidu.com/", proxies = proxies,headers=headers)
print response.text

Private proxy verification

The urllib2 approach above is more complicated; with requests it takes only one step:

# _*_ coding:utf-8 _*_

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

proxy = {"http": "name:pwd@ip:port"}

response = requests.get("http://www.baidu.com/", proxies = proxy,headers=headers)

print response.text

web client authentication

import requests

auth=('test', '123456')

response = requests.get('http://192.168.xxx.xx', auth = auth)

print response.text

session

In requests, the Session object is very commonly used. It represents a user session: from the moment the client browser connects to the server until the client browser disconnects from the server.

Session allows us to keep certain parameters across requests, such as keeping cookies between all requests made by the same Session instance.
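
A minimal sketch of that cookie persistence (httpbin.org is used here purely as a test endpoint):

# _*_ coding:utf-8 _*_
import requests

ssion = requests.session()
# The first request sets a cookie, which the session object stores
ssion.get('http://httpbin.org/cookies/set?stick=around')
# The stored cookie is sent automatically with every later request from the same session
response = ssion.get('http://httpbin.org/cookies')
print response.text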

Login Renren

# _*_ coding:utf-8 _*_

import requests

# 1. Create a session object, you can save the cookie value
ssion = requests.session()

# 2. Processing headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

# 3. The username and password to log in
data = {"email":"158xxxxxxxx", "password":"pythonxxxxxxx"}

# 4. Send a request with username and password, and get the cookie value after login, and save it in ssion
ssion.post("http://www.renren.com/PLogin.do", data = data)

# 5. ssion contains the cookie value after the user logs in, you can directly visit those pages that can only be accessed after logging in
response = ssion.get("http://zhibo.renren.com/news/108")

# 6. Print response content
print response.text

Page analysis and data processing

 The crawler has four main steps:

  1. Define the goal (know which website or area you are going to crawl)
  2. Crawl (fetch all the content of the target pages)
  3. Extract (filter out the data that is useless to us)
  4. Process the data (store and use it in the way we want)

Generally speaking, what we need to crawl is the content of a certain website or certain application to extract useful value. The content is generally divided into two parts, unstructured data and structured data.

Unstructured data: there is data first, then structure

Structured data: structure first, data second

1. Unstructured data processing

1. Text, phone number, email address  

    -->Regular expression

2. HTML file   

     -->Regular expression, XPath, CSS selector

2. Structured data processing

1.JSON file 

    -->JSON Path

    -->Convert into Python types for processing (see the sketch after this list)

2.XML file

    -->Convert into python type (xmltodict)

    -->XPath

    -->CSS selector

    -->Regular expression
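
As referenced in the list above, structured JSON can be loaded into Python types with the standard json module and then handled like ordinary dicts and lists. A minimal sketch with made-up data:

# _*_ coding:utf-8 _*_
import json

# A made-up JSON string, e.g. the body of an API response
json_str = '{"city": "Beijing", "items": [{"name": "python", "count": 3}]}'

# json.loads() converts the JSON text into Python dicts/lists
data = json.loads(json_str)
print data['city']
print data['items'][0]['name']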

Regular expression

A brief review of some uses of python regular expressions

Regular expression testing website: http://tool.oschina.net/regex/#

The general steps to use the re module are as follows:

  1. Use the compile() function to compile the string form of the regular expression into a Pattern object.
  2. Use the series of matching and searching methods provided by the Pattern object on the text; the result is a Match object.
  3. Finally, use the properties and methods provided by the Match object to extract the information, and perform other operations as needed.

pattern = re.compile('\d') #Compile the regular expression into a pattern rule object

pattern.match() #Match from the given start position; returns the first match only (matches at most once)
pattern.search() #Search from any position; returns the first match only (matches at most once)
pattern.findall() #Find all matches; returns a list
pattern.finditer() #Find all matches; returns an iterator
pattern.split() #Split the string; returns a list
pattern.sub() #Replace matches

1.match()

import re

pattern = re.compile('\d+')

m = pattern.match('aaa123bbb456',3,5) #You can specify the start and end positions of match match (string, begin, end)
print m.group() #12

m = pattern.match('aaa123bbb456',3,6)
print m.group() #123

import re
#Match two groups; re.I ignores case
pattern = re.compile(r'([a-z]+) ([a-z]+)',re.I) #Two groups of letters separated by a space
m = pattern.match("Hello world and Python")

print m.group(0) #Hello world  group(0) returns the whole matched substring
print m.group(1) #Hello  group(1) returns the first group
print m.group(2) #world  group(2) returns the second group

2.search()

import re

pattern = re.compile(r'\d+')
m = pattern.search('aaa123bbb456')
print m.group() #123

m = pattern.search('aaa123bbb456',2,5)
print m.group() #12

3.findall()

import re

pattern = re.compile(r'\d+')
m = pattern.findall('hello 123456 789')
print m #['123456', '789']

m = pattern.findall('hello 123456 789',5,10)
print m #['1234']

4.split()

# _*_ coding:utf-8 _*_

import re

pattern = re.compile(r'[\s\d\\\;]+') #Split by spaces, numbers,'\',';'

m = pattern.split(r'a b22b\cc;d33d ee')

print m #['a','b','b','cc','d','d','ee']   

5.sub()

# _*_ coding:utf-8 _*_

import re

pattern = re.compile(r'(\w+) (\w+)')
str ='good 111,job 222'

m = pattern.sub('hello python',str)

print m #hello python,hello python

m = pattern.sub(r"'\1':'\2'",str)

print m #'good':'111','job':'222'

# _*_ coding:utf-8 _*_

import re

pattern = re.compile(r'\d+')
str ='a1b22c33d4e5f678'

m = pattern.sub('*',str) #a*b*c*d*e*f* Replace the number with'*'
print m

Example: crawling jokes (duanzi) from neihan8.com

 Crawl the site's pages and use regular expressions to extract all the jokes.

How the URL changes

  • First page url: http://www.neihan8.com/article/list_5_1.html
  • Second page url: http://www.neihan8.com/article/list_5_2.html
  • Third page url: http://www.neihan8.com/article/list_5_3.html
The content of each joke sits inside <dd class="content">......</dd>, so it can be extracted with the regular expression:

pattern = re.compile('<dd\sclass="content">(.*?)</dd>', re.S)
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
import re

class Spider:
    def __init__(self):
        # Initialize the starting page position
        self.page = 1
        # Crawl switch, if True, continue to crawl
        self.switch = True

    def loadPage(self):
        """
            Role: download page
        """
        print "Downloading data...."
        url = "http://www.neihan.net/index_" + str(self.page) + ".html"
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
        request = urllib2.Request(url, headers = headers)
        response = urllib2.urlopen(request)

        # Get the HTML source string of each page
        html = response.read()
        #print html

        # Create a regular expression rule object to match the paragraph content in each page, re.S means to match all the string content
        pattern = re.compile('<dd\sclass="content">(.*?)</dd>', re.S)

        # Apply the regular matching object to the html source code string, and return a list of all the paragraphs in this page
        content_list = pattern.findall(html)

        # Call dealPage() to deal with the miscellaneous in the paragraph
        self.dealPage(content_list)

    def dealPage(self, content_list):
        """
            Process the paragraphs of each page
            content_list: A collection of paragraph sublists for each page
        """
        for item in content_list:
            # Treat each paragraph in the collection one by one, replacing useless data
            item = item.replace("<p>","").replace("</p>", "").replace("<br/>", "")
            # After processing, call writePage() to write each paragraph into the file
            self.writePage(item)

    def writePage(self, item):
        """
            Write each paragraph into the file one by one
            item: each paragraph after processing
        """
        # Write to file
        print "Writing data..."
        with open("tieba.txt", "a") as f:
            f.write(item)

    def startWork(self):
        """
            Control crawler operation
        """
        # Execute in a loop until self.switch == False
        while self.switch:
            # The user decides whether to keep crawling
            self.loadPage()
            command = raw_input("Press Enter to crawl the next page (enter quit to exit): ")
            if command == "quit":
                # If you stop crawling, enter quit
                self.switch = False
            # Each cycle, the page number is incremented by 1
            self.page += 1
        print "Thank you for using!"


if __name__ == "__main__":
    duanziSpider = Spider()
    duanziSpider.startWork()

You can press Enter to crawl the next page, and enter quit to exit.

 Content after crawling:

Reference: https://cloud.tencent.com/developer/article/1091695 (Getting started with Python crawler (2): Opener and Requests, Tencent Cloud Community)