Getting started with python crawler (three) XPATH and BeautifulSoup4

 XML and XPATH

Processing HTML documents with regular expressions is very troublesome. Instead, we can first convert the HTML file into an XML document and then use XPath to find HTML nodes or elements.

  • XML stands for EXtensible Markup Language
  • XML is a markup language, very similar to HTML
  • XML was designed to transmit data, not to display data
  • XML tags are not predefined; we define our own tags
  • XML is designed to be self-descriptive
  • XML is a W3C recommended standard
<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

Difference between XML and HTML

XML was designed to transmit and store data, with the focus on the content of the data; HTML was designed to display data, with the focus on its appearance.

HTML DOM model example

The HTML DOM defines a standard way to access and manipulate HTML documents, and represents an HTML document as a tree structure.

XPATH

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML documents.

Chrome plugin: XPath Helper

Firefox add-on: XPath Checker

XPATH syntax

The most commonly used path expressions:

  • nodename : selects all child nodes of the named node
  • / : selects from the root node
  • // : selects matching nodes anywhere in the document, regardless of their position
  • . : selects the current node
  • .. : selects the parent of the current node
  • @ : selects attributes

predicate

The predicate is used to find a specific node or a node containing a specified value, and is embedded in square brackets.

Some path expressions with predicates, and the result of each expression (using the bookstore document above):

  • /bookstore/book[1] : selects the first book child of the bookstore element
  • /bookstore/book[last()] : selects the last book child of the bookstore element
  • /bookstore/book[position()<3] : selects the first two book children of the bookstore element
  • //title[@lang] : selects all title elements that have a lang attribute
  • //title[@lang='en'] : selects all title elements whose lang attribute value is 'en'
  • /bookstore/book[price>35.00] : selects all book elements of bookstore with a price greater than 35.00

Selecting unknown nodes (wildcards such as *, @* and node())

Selecting several paths (with the | union operator)
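
To make these expressions and predicates concrete, here is a minimal sketch (not from the original article) that evaluates a few of them with lxml against a trimmed copy of the bookstore document above:

from lxml import etree

# A trimmed copy of the bookstore document shown above
xml = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="web" cover="paperback">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)

# //title/text()       : the text of every <title> anywhere in the document
print(root.xpath('//title/text()'))                     # ['Everyday Italian', 'Learning XML']
# //title/@lang        : the lang attribute of every <title>
print(root.xpath('//title/@lang'))                      # ['en', 'en']
# //book[1]            : predicate by position, the first <book>
print(root.xpath('//book[1]/title/text()'))             # ['Everyday Italian']
# //book[price>35.00]  : predicate by value, books costing more than 35.00
print(root.xpath('//book[price>35.00]/title/text()'))   # ['Learning XML']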

 LXML library

Installation: pip install lxml

lxml is an HTML/XML parser; its main function is to parse and extract HTML/XML data.

Like regular expressions (the re module), lxml is implemented in C. It is a high-performance Python HTML/XML parser, and we can use XPath syntax to quickly locate specific elements and node information.

 Simple usage example

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from lxml import etree

text = '''
    <div>
        <li>11</li>
        <li>22</li>
        <li>33</li>
        <li>44</li>
    </div>
'''

#Using etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# Serialize HTML documents by string
result = etree.tostring(html)

print(result)

Result: the serialized HTML document. Note that lxml automatically repairs the HTML; here it wraps the fragment in the missing <html> and <body> tags.
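
As a small follow-up sketch (not part of the original article, reusing the same fragment), the parsed object can also be queried directly with XPath instead of being serialized:

from lxml import etree

text = '''
    <div>
        <li>11</li>
        <li>22</li>
        <li>33</li>
        <li>44</li>
    </div>
'''

# Parse the fragment and query the resulting tree directly with XPath
html = etree.HTML(text)
print(html.xpath('//li/text()'))   # ['11', '22', '33', '44']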

Crawling images from Baidu Tieba

 1. First, find the URL of every post on each post-list page

2. In each post, find the full URL of every image

3. Use the lxml module to parse the HTML

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib
import urllib2
from lxml import etree

def loadPage(url):
    """
        Function: send request according to url, get server response file
        url: URL address that needs to be crawled
    """
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    # Parse HTML document into HTML DOM model
    content = etree.HTML(html)
    # Return all successfully matched list collections
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
    for link in link_list:
        fulllink = "http://tieba.baidu.com" + link
        # Combine a link for each post
        #print link
        loadImage(fulllink)

# Take out each picture link in each post
def loadImage(link):
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
    request = urllib2.Request(link, headers = headers)
    html = urllib2.urlopen(request).read()
    # Parse the response into an HTML DOM
    content = etree.HTML(html)
    # Extract the set of image links posted in each floor of the thread
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    # Iterate over each image link
    for link in link_list:
        # print link
        writeImage(link)

def writeImage(link):
    """
        Role: write html content to the local
        link: picture link
    """
    #print "Saving" + filename
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    # File writing
    request = urllib2.Request(link, headers = headers)
    # Picture original data
    image = urllib2.urlopen(request).read()
    # Use the last 10 characters of the link as the file name
    filename = link[-10:]
    # Write to the local disk file
    with open(filename, "wb") as f:
        f.write(image)
    print "downloaded successfully" + filename

def tiebaSpider(url, beginPage, endPage):
    """
        Role: Tieba crawler scheduler, responsible for combining and processing the url of each page
        url: the first part of the Tieba URL (before the page number parameter)
        beginPage: start page
        endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        pn = (page-1) * 50
        #filename = "第" + str(page) + "页.html"
        fullurl = url + "&pn=" + str(pn)
        #print fullurl
        loadPage(fullurl)
        #print html

        print "Thank you for using"

if __name__ == "__main__":
    kw = raw_input("Please enter the name of the Tieba (forum) to crawl:")
    beginPage = int(raw_input("Please enter the starting page:"))
    endPage = int(raw_input("Please enter the end page:"))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)

4. All crawled images are saved to the local disk

 CSS selector: BeautifulSoup4

Like lxml, Beautiful Soup is also an HTML/XML parser, and its main function is likewise to parse and extract HTML/XML data.

lxml only performs local traversal, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the complete DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's. BeautifulSoup is relatively simple to use for parsing HTML, its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser. Development of Beautiful Soup 3 has stopped, and Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4
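
Before the full example, here is a minimal hedged sketch of the select() CSS-selector API against a made-up table fragment (the class names and URLs are illustrative only, not from the original article):

from bs4 import BeautifulSoup

html = """
<table>
  <tr class="even"><td><a href="/job/1">Engineer</a></td><td>Technology</td></tr>
  <tr class="odd"><td><a href="/job/2">Designer</a></td><td>Design</td></tr>
</table>
"""

# 'lxml' is the recommended parser; the standard library's 'html.parser' also works
soup = BeautifulSoup(html, 'lxml')

# select() takes a CSS selector and returns a list of matching tags
rows = soup.select('tr.even') + soup.select('tr.odd')
for row in rows:
    link = row.select('td a')[0]
    print(link.get_text() + " " + link.attrs['href'])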

Use Beautifulsoup4 to crawl Tencent job information

from bs4 import BeautifulSoup
import urllib2
import urllib
import json # Use json format storage

def tencent():
    url = 'http://hr.tencent.com/'
    request = urllib2.Request(url + 'position.php?&start=10#a')
    response = urllib2.urlopen(request)
    resHtml = response.read()

    output = open('tencent.json', 'w')

    html = BeautifulSoup(resHtml, 'lxml')

    # Use CSS selectors to pick out the even and odd table rows
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII encoding and encode as UTF-8 instead
    line = json.dumps(items,ensure_ascii=False)

    output.write(line.encode('utf-8'))
    output.close()

if __name__ == "__main__":
   tencent()

 JSON and JSONPath

JSON (JavaScript Object Notation) is a lightweight data exchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is well suited to data interaction scenarios, such as data exchange between the front end and the back end of a website.
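
A tiny sketch of the format and of Python's built-in json module (illustrative data only, not from the original article), showing the loads()/dumps() calls used in the examples below:

import json

# A JSON document is plain text: objects {...}, arrays [...], strings, numbers, true/false, null
raw = '{"name": "Beijing", "hot": true, "districts": ["Haidian", "Chaoyang"]}'

# json.loads(): JSON string -> Python objects (dict, list, str, int, float, bool, None)
city = json.loads(raw)
print(city["districts"][0])                    # Haidian

# json.dumps(): Python objects -> JSON string; ensure_ascii=False keeps non-ASCII text readable
print(json.dumps(city, ensure_ascii=False))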

JsonPath is an information extraction library. It is a tool for extracting specified information from JSON documents. It provides multiple language versions, including: Javascript, Python, PHP and Java.

JsonPath for JSON is equivalent to XPATH for XML.

Syntax comparison between JsonPath and XPath:

JSON has a clear structure, high readability, and low complexity, so it is very easy to match. Its JSONPath operators correspond closely to XPath: $ is the root (like / in XPath), @ is the current node (like .), and .. is recursive descent (like //).
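
A small sketch of the jsonpath module used in the next example (not from the original article); the nested structure here is made up and only loosely shaped like the real response:

import json
import jsonpath   # pip install jsonpath

# Made-up nested data, only loosely shaped like the city list parsed below
data = json.loads('''
{
    "content": {
        "data": {
            "allCitySearchLabels": {
                "A": [{"name": "Anshan"}, {"name": "Anyang"}],
                "B": [{"name": "Beijing"}]
            }
        }
    }
}
''')

# "$..name" : $ is the root (like / in XPath) and .. is recursive descent (like //)
print(jsonpath.jsonpath(data, "$..name"))   # e.g. ['Anshan', 'Anyang', 'Beijing']

# Explicit child steps plus an array index, comparable to stepping down element by element in XPath
print(jsonpath.jsonpath(data, "$.content.data.allCitySearchLabels.A[0].name"))   # ['Anshan']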

Use JSONPath to crawl all cities from Lagou.com

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
# json parsing library, corresponding to lxml
import json
# json's parsing syntax, corresponding to xpath
import jsonpath

url = "http://www.lagou.com/lbs/getAllCitySearchLabels.json"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
request = urllib2.Request(url, headers = headers)

response = urllib2.urlopen(request)
# Take out the content in the json file, the returned format is a string
html = response.read()

# Convert the JSON string into a Python object (a dict here)
unicodestr = json.loads(html)

# jsonpath() returns a Python list of every matched "name" value
city_list = jsonpath.jsonpath(unicodestr, "$..name")

#for item in city_list:
# print item

# dumps() encodes Chinese as ASCII escape sequences by default; ensure_ascii defaults to True
# Disable ASCII encoding so that a readable Unicode string is returned
array = json.dumps(city_list, ensure_ascii=False)
#json.dumps(city_list)
#array = json.dumps(city_list)

with open("lagoucity.json", "w") as f:
    f.write(array.encode("utf-8"))

Result: the list of city names is written to lagoucity.json.

 Crawling Qiushibaike

  1. Fuzzy query using XPath
  2. Extract the content
  3. Save to a JSON file

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
import json
from lxml import etree

url = "http://www.qiushibaike.com/8hr/page/2/"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
request = urllib2.Request(url, headers = headers)

html = urllib2.urlopen(request).read()
# The returned response is a string; parse it into an HTML DOM

text = etree.HTML(html)
# Return the nodes of all jokes; contains() does fuzzy matching: the first argument is the attribute to match, the second is the substring the attribute value should contain
node_list = text.xpath('//div[contains(@id, "qiushi_tag")]')

items = {}
for node in node_list:
    # xpath() returns a list; it has a single element here, so take it by index: the user name
    username = node.xpath('./div/a/@title')[0]
    # Extract the text under the tag: the joke content
    content = node.xpath('.//div[@class="content"]/span')[0].text
    # Extract the number of likes contained in the tag
    zan = node.xpath('.//i')[0].text
    # Number of comments
    comments = node.xpath('.//i')[1].text

    items = {
        "username": username,
        "content": content,
        "zan": zan,
        "comments": comments
    }

    with open("qiushi.json", "a") as f:
        f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")

Reference: https://cloud.tencent.com/developer/article/1091698 (Getting started with python crawler (3): XPATH and BeautifulSoup4, Tencent Cloud + Community)