Getting started with Python crawlers (5): Selenium simulates user operations

Spider, Anti-Spider, Anti-Anti-Spider, a magnificent battle...

  • Xiao Mo wanted every movie listed on a certain site, so he wrote a standard crawler (based on the HttpClient library) that kept traversing the site's movie list pages, parsed the movie names out of the HTML, and stored them in his own database.
  • Xiao Li, who runs operations for the site, noticed that the number of requests spiked during a certain period. The logs showed that they all came from one IP (xxx.xxx.xxx.xxx) and that the user-agent was Python-urllib/2.7. On those two points he concluded the visitor was not human and blocked it directly on the server.
  • Xiao Mo had crawled only half of the movies, so he changed his strategy accordingly: 1. make the user-agent imitate Baidu ("Baiduspider..."), 2. switch to a new proxy IP every half hour.
  • Xiao Li noticed the change too, so he set a frequency limit on the server: any IP that exceeds 120 requests per minute is blocked. Because this might hit Baidu's real crawler by accident, and the marketing department spends hundreds of thousands every month on promotion, he also wrote a script that checks by hostname whether an IP really belongs to Baidu and put those IPs on a whitelist.
  • When Xiao Mo discovered the new restriction, he figured he was in no hurry for the data and could leave the crawl running slowly on a server. He modified the code to wait a random 1-3 seconds between requests, rest for 10 seconds after every 10 requests, crawl only during 8-12 and 18-20, and take a break every few days.
  • The new logs gave Xiao Li a headache: a careless rule would hurt real users, so he changed his approach. When an IP makes more than 50 requests within 3 hours, a verification code box pops up, and if the code is not entered correctly, the IP is added to the blacklist.
  • Xiao Mo was a little stunned when he saw the verification code, but it was not insurmountable. He studied image recognition (keywords: PIL, tesseract), then binarized the image, segmented the characters, trained a model, and finally recognized Xiao Li's verification codes (verification codes, their recognition, and anti-recognition measures are another magnificent history of struggle...). The crawler ran again.
  • Xiao Li is the persistent type. After seeing the verification code broken, he discussed a change of development model with the developers: the data is no longer rendered directly into the page but fetched asynchronously by the front end, with a dynamic token generated by a JavaScript encryption library, and that library is then obfuscated.
  • Is there no way around an obfuscated encryption library? Of course there is: you can debug slowly until you work out how the encryption works. But Xiao Mo did not want such a time-consuming, labor-intensive approach. He abandoned the HttpClient-based crawler and chose a crawler with a built-in browser engine (keywords: PhantomJS, Selenium), ran the page inside the browser engine, got the correct result directly, and once again obtained the other side's data.
  • Xiao Li: .....
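
The throttling schedule from the story (a random 1-3 second pause, a 10 second rest after every 10 requests, crawling only during 8-12 and 18-20) can be sketched in a few lines. This is only an illustration: the function names are invented here, and treating the end of each time window as exclusive is an assumption.

```python
import random

# Crawl windows from the story: 8-12 and 18-20 (end hour treated as exclusive, an assumption).
ALLOWED_WINDOWS = [(8, 12), (18, 20)]


def in_allowed_window(hour, windows=ALLOWED_WINDOWS):
    """Return True if `hour` (0-23) falls inside one of the crawl windows."""
    return any(start <= hour < end for start, end in windows)


def next_delay(requests_done, rng=random):
    """Seconds to wait before the next request.

    After every 10th request the crawler rests for 10 seconds;
    otherwise it waits a random 1-3 seconds.
    """
    if requests_done > 0 and requests_done % 10 == 0:
        return 10.0
    return rng.uniform(1.0, 3.0)
```

A crawler loop would check in_allowed_window() with the current hour before each request and call time.sleep(next_delay(n)) after the n-th request.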

Selenium

Selenium is a web automation testing tool, originally developed for automated testing of websites. It works much like the key-macro tools (such as Button Wizard) that people use to automate games: it carries out operations automatically according to the commands it is given. The difference is that Selenium runs directly in the browser, and it supports all major browsers (including headless browsers such as PhantomJS).

Selenium can make the browser load pages according to our instructions, fetch the data we need, take screenshots of a page, or check whether certain actions on the site have taken place.

Selenium does not ship with a browser and does not provide browser functionality itself; it has to be used together with a third-party browser.

First download geckodriver (geckodriver.exe on Windows), the WebDriver executable that Selenium uses to control Firefox, and put it in the Python directory.

The Firefox directory should also be added to the PATH environment variable.
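
Before starting a browser, it can help to confirm that both programs are actually reachable. The check below is a small sketch using the standard-library shutil.which, which searches the directories on PATH; the helper name locate_executable is made up for this example.

```python
import shutil


def locate_executable(name):
    """Return the full path of `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    for program in ("geckodriver", "firefox"):
        path = locate_executable(program)
        print("%s: %s" % (program, path if path else "not found on PATH"))
```

If either program is reported as not found, Selenium will most likely fail to start Firefox until the corresponding directory is added to PATH.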

The Selenium library has an API called WebDriver. WebDriver is a bit like a browser that can load websites, but it can also be used like BeautifulSoup or other selector objects to find page elements, interact with them (send text, click, and so on), and perform other actions that drive a web crawler.

Selenium quick start

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from selenium import webdriver
# find_element(By.ID, ...) replaces find_element_by_id(), which was removed in Selenium 4.3
from selenium.webdriver.common.by import By

# To send keyboard keys, import the Keys class
from selenium.webdriver.common.keys import Keys

# Create the browser object
driver = webdriver.Firefox()

driver.get("http://www.baidu.com")

# Print the page title ("百度一下，你就知道")
print(driver.title)

# Save a screenshot of the current page
driver.save_screenshot("baidu.png")

# id="kw" is the Baidu search box; type the string "weibo" into it
driver.find_element(By.ID, "kw").send_keys(u"weibo")

# id="su" is the Baidu search button; click() simulates a click
driver.find_element(By.ID, "su").click()

# Save a screenshot of the new page
driver.save_screenshot(u"微博.png")

# Print the source of the rendered page
print(driver.page_source)

# Get the cookies of the current page
print(driver.get_cookies())

# Ctrl+A: select everything in the input box
driver.find_element(By.ID, "kw").send_keys(Keys.CONTROL, 'a')

# Ctrl+X: cut the content of the input box
driver.find_element(By.ID, "kw").send_keys(Keys.CONTROL, 'x')

# Type new content into the input box
driver.find_element(By.ID, "kw").send_keys("test")

# Simulate the Enter key
driver.find_element(By.ID, "su").send_keys(Keys.RETURN)

# Clear the content of the input box
driver.find_element(By.ID, "kw").clear()

# Save another screenshot
driver.save_screenshot("test.png")

# Print the current URL
print(driver.current_url)

# Close the current page; if it is the only page, the browser closes too
# driver.close()

# Quit the browser
driver.quit()

1. Page operations

Suppose the page contains the following input box:

<input type="text" name="user-name" id="passwd-id"/>

There are several ways to locate it:

from selenium.webdriver.common.by import By

# Locate by the id attribute
element = driver.find_element(By.ID, "passwd-id")
# Locate by the name attribute
element = driver.find_element(By.NAME, "user-name")
# Locate by tag name (find_elements returns a list of all matches)
elements = driver.find_elements(By.TAG_NAME, "input")
# XPath matching also works
element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")

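2. Methods for locating elements

Methods named find_element_by_* return the first matching element, while the find_elements_by_* variants return a list of all matches; every strategy below exists in both forms. These helper methods were removed in Selenium 4.3. In newer versions, pass a By strategy to find_element() or find_elements() instead, for example driver.find_element(By.ID, "kw").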

find_element_by_id
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
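
On Selenium versions where these helpers no longer exist, each name above corresponds to a locator strategy passed to find_element() or find_elements(). The sketch below maps one to the other. The find() helper and the LEGACY_TO_STRATEGY table are invented for illustration; the strategy strings are the values behind By.ID, By.NAME, and so on, so the helper needs no Selenium import of its own.

```python
# Suffix of the legacy method name -> locator strategy string
# (these strings are the values of By.ID, By.NAME, By.XPATH, ...).
LEGACY_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find(driver, legacy_name, value):
    """Emulate driver.find_element_by_<x>(value) or driver.find_elements_by_<x>(value)."""
    prefix_one, prefix_many = "find_element_by_", "find_elements_by_"
    if legacy_name.startswith(prefix_many):
        strategy = LEGACY_TO_STRATEGY[legacy_name[len(prefix_many):]]
        return driver.find_elements(strategy, value)   # list of all matches
    if legacy_name.startswith(prefix_one):
        strategy = LEGACY_TO_STRATEGY[legacy_name[len(prefix_one):]]
        return driver.find_element(strategy, value)    # first match only
    raise ValueError("not a legacy locator method: %s" % legacy_name)
```

With a real driver, find(driver, "find_element_by_id", "kw") would then behave like driver.find_element(By.ID, "kw").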

3. Mouse actions

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Firefox()

driver.get("http://www.baidu.com")

# Move the mouse over an element
action1 = driver.find_element(By.ID, "su")
ActionChains(driver).move_to_element(action1).perform()

# Move to an element and click it
action2 = driver.find_element(By.ID, "su")
ActionChains(driver).move_to_element(action2).click(action2).perform()

# Move to an element and double-click it
action3 = driver.find_element(By.ID, "su")
ActionChains(driver).move_to_element(action3).double_click(action3).perform()

# Move to an element and right-click it
action4 = driver.find_element(By.ID, "su")
ActionChains(driver).move_to_element(action4).context_click(action4).perform()

4. Select forms

When you run into a drop-down box that needs a selection, Selenium provides the Select class to handle it:

# Import the Select class
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By

# Locate the <select> element by its name attribute
select = Select(driver.find_element(By.NAME, 'status'))

# Three ways to choose an option
select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"xxx")

These are the three ways to choose an option in a drop-down box: by index, by value, or by visible text. Note:

  • index starts from 0
  • value is an attribute value of the option tag, not the value displayed in the drop-down box
  • visible_text is the value of the text in the option tag, and is the value displayed in the drop-down box
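
To make the difference between the three concrete, the sketch below lists the index, value, and visible text of every option in a small drop-down. The sample markup is invented for illustration, and the parsing uses only the standard-library HTMLParser rather than Selenium, so it shows what each selector refers to without needing a browser.

```python
from html.parser import HTMLParser

# Hypothetical markup, not taken from any real page.
SAMPLE = """
<select name="status">
  <option value="">All</option>
  <option value="0">Disabled</option>
  <option value="1">Enabled</option>
</select>
"""


class OptionLister(HTMLParser):
    """Collect the value attribute and the displayed text of each <option>."""

    def __init__(self):
        super().__init__()
        self.options = []          # each entry is [value, visible_text]
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self.options.append([dict(attrs).get("value"), ""])

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1][1] += data.strip()


def describe_options(html):
    """Return a list of (index, value, visible_text) tuples, one per option."""
    parser = OptionLister()
    parser.feed(html)
    return [(i, value, text) for i, (value, text) in enumerate(parser.options)]


if __name__ == "__main__":
    for row in describe_options(SAMPLE):
        print(row)
```

With this markup, select_by_index(1) and select_by_value("0") would both choose "Disabled", while select_by_value("1") would choose "Enabled".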

To deselect everything (this only works on a multi-select element):

select.deselect_all()

5. Handling pop-ups

When an alert pops up on the page, switch to it:

alert = driver.switch_to.alert
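
Once switched to, the alert object exposes its message through the text attribute and can be closed with accept() (OK) or dismiss() (Cancel). A minimal sketch, assuming an alert is already open; the helper name read_and_close_alert is made up for this example:

```python
def read_and_close_alert(driver, accept=True):
    """Return the alert's message, then accept (OK) or dismiss (Cancel) it."""
    alert = driver.switch_to.alert
    message = alert.text
    if accept:
        alert.accept()
    else:
        alert.dismiss()
    return message
```

If no alert is open when the function is called, Selenium raises an exception at the switch, so callers may want to wrap the call in a try block.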

6. Page switching

A browser session can have several windows open, so we need a way to switch between them. Windows are switched as follows:

driver.switch_to.window("this is window name")
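
In practice the window name is often unknown, and the usual approach is to compare the driver's window_handles before and after the action that opens a new window. The helper below is a sketch of that idea; its name and its handles_before parameter are invented for illustration.

```python
def switch_to_new_window(driver, handles_before):
    """Switch to the first window handle that was not in `handles_before`.

    Returns the handle switched to, or None if no new window has appeared.
    """
    for handle in driver.window_handles:
        if handle not in handles_before:
            driver.switch_to.window(handle)
            return handle
    return None
```

A typical call sequence would be: save driver.window_handles, click the link that opens a new window, then call switch_to_new_window(driver, saved_handles).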

7. Page forward and backward

Moving forward and backward in the page history:

driver.forward()  # Forward
driver.back()     # Back

Example: simulating a login on the Douban website

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get("http://www.douban.com")

# Enter the account name and password
driver.find_element(By.NAME, "form_email").send_keys("158xxxxxxxx")
driver.find_element(By.NAME, "form_password").send_keys("zhxxxxxxxx")

# Simulate clicking the login button
driver.find_element(By.XPATH, "//input[@class='bn-submit']").click()

# Wait 3 seconds for the page to load
time.sleep(3)

# Save a screenshot taken after login
driver.save_screenshot(u"douban.png")

driver.quit()
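
A fixed time.sleep(3) either wastes time or is too short on a slow connection. One alternative is to poll for a condition until it holds. The sketch below shows the idea in plain Python; wait_until and its parameters are invented for this example, and Selenium's own WebDriverWait class offers the same pattern ready-made.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

In the login script, the sleep could then be replaced by something like wait_until(lambda: driver.current_url != login_url), where login_url is a hypothetical variable holding the URL before the click.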

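Example: simulating clicks on a dynamic page to crawl every room name and viewer count

(1) First analyze how the class of the "next page" button changes. When the current page is not the last one, the button carries the class shark-pager-next, which is the class the code in step (4) clicks.

(2) On the last page, the "next page" button is disabled and can no longer be clicked; its class then includes shark-pager-disable-next, which is the marker the code in step (4) checks for.

(3) Find the classes that hold the room name and the viewer count. In the code in step (4), the room name is read from an h3 tag with class ellipsis and the viewer count from a span tag with class "dy-num fr".

(4) Code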

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

class douyu(unittest.TestCase):
    # Initialization method; it must be named setUp()
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.num = 0
        self.count = 0

    # Test method names must start with "test"
    def testDouyu(self):
        self.driver.get("https://www.douyu.com/directory/all")

        while True:
            soup = bs(self.driver.page_source, "lxml")
            # Room names, returned as a list
            names = soup.find_all("h3", {"class": "ellipsis"})
            # Viewer counts, returned as a list
            numbers = soup.find_all("span", {"class": "dy-num fr"})

            # zip(names, numbers) pairs the two lists element by element:
            # [(name1, number1), (name2, number2), ...]
            for name, number in zip(names, numbers):
                print(u"Number of viewers: -" + number.get_text().strip() + u"-\t Room name: " + name.get_text().strip())
                self.num += 1
                # Disabled in the original, so self.count stays 0
                #self.count += int(number.get_text().strip())

            # If the page source contains the disabled "next page" marker,
            # this is the last page: exit the loop
            if self.driver.page_source.find("shark-pager-disable-next") != -1:
                break

            # Otherwise keep clicking "next page"
            self.driver.find_element(By.CLASS_NAME, "shark-pager-next").click()

    # Runs after the test method finishes
    def tearDown(self):
        print("The number of live broadcasts on the current website: " + str(self.num))
        print("The number of viewers on the current website: " + str(self.count))
        # Quit the Firefox browser
        self.driver.quit()

if __name__ == "__main__":
    # Run the tests in this module
    unittest.main()
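
The line that adds up the viewer count is commented out in the code above, so the total printed in tearDown() is always 0. One possible reason is that int() fails when the count is not a plain integer. The sketch below assumes, without confirmation from the page itself, that large counts are shown with a "万" (ten thousand) suffix such as "3.5万"; parse_viewers is a hypothetical helper for that case.

```python
# -*- coding:utf-8 -*-

def parse_viewers(text):
    """Convert a viewer-count string to an integer.

    Assumes two formats: a plain integer ("812") or a decimal number
    followed by the suffix "万", meaning ten thousand ("3.5万").
    """
    text = text.strip()
    if text.endswith(u"万"):
        return int(round(float(text[:-1]) * 10000))
    return int(text)
```

With such a helper, the disabled line could become self.count += parse_viewers(number.get_text()), provided the assumed formats match what the page actually shows.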

The result of crawling:

Reference: Python crawler introduction (5): Selenium simulates user operations, Tencent Cloud Community, https://cloud.tencent.com/developer/article/1091730