Python Open Course - Page Parsing with Beautiful Soup

Preface


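Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because the API is simple, a complete application takes very little code.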

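It also automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only have to state the original encoding.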
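The encoding behavior described above can be sketched as follows. This is a minimal illustration, not part of the original course material: the Latin-1 input is a made-up sample, and the built-in html.parser is used here only so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no charset declaration.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string          # decoded into a Unicode str
utf8_bytes = soup.p.encode()  # encode() defaults to UTF-8

print(text)
print(utf8_bytes)
```

When the document declares its own encoding, or Beautiful Soup can detect it, the from_encoding argument can be left out.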

In short, Beautiful Soup spares you a lot of tedious extraction work and speeds up parsing tasks.

Basic Usage of Beautiful Soup

Let's start with an example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.xtuz.net/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://www.xtuz.net/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://www.xtuz.net/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://www.xtuz.net/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://www.xtuz.net/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://www.xtuz.net/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

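Here, prettify() prints the parsed string with standard indentation, and soup.title.string then prints the text content of the title node in the HTML. Note that the input omits the closing </body> and </html> tags. They appear in the output because the parser repairs the markup when the BeautifulSoup object is created.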

Searching

Selecting Elements Directly

print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

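The first line prints the result of selecting the title node: the title node together with the text inside it.

Next comes its type, bs4.element.Tag, an important data structure in Beautiful Soup. Every node selected this way is a Tag. A Tag has a number of attributes, such as string, which returns the node's text content, so the next line of output is exactly that text.

We then select the head node. The result is again the node plus everything inside it.

Finally, we select the p node. This case is a little special: the result is only the first p node, and the later p nodes are not selected. In other words, when several nodes match, this style of selection returns only the first match and ignores the rest.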
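To make the "first match only" behavior concrete, here is a small sketch. It uses an abbreviated version of the sample document and the built-in html.parser, both chosen here only to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

# Abbreviated sample document (an assumption for illustration).
html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>...</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Attribute-style selection returns the same node as find().
same_as_find = soup.p == soup.find("p")

# Selections can be chained to walk down the tree.
bold_text = soup.body.p.b.string

# find_all() is needed to reach the other p nodes.
p_count = len(soup.find_all("p"))

print(same_as_find, bold_text, p_count)
```

Chaining works because each step returns a Tag, which supports the same attribute-style selection as the BeautifulSoup object.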

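find_all() - Find All Matching Elements

Pass this method the attributes or text to match, and it returns every element that meets the conditions as a list. It is very powerful.

Searching by Node Name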

print(soup.find_all(name='p'))
print(type(soup.find_all(name='p')[0]))

Output:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://www.xtuz.net/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://www.xtuz.net/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://www.xtuz.net/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
<class 'bs4.element.Tag'>

Searching by Attributes

print(soup.find_all(attrs={'class':'story'}))

Output:
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://www.xtuz.net/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://www.xtuz.net/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://www.xtuz.net/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Searching by Text

import re

# string= replaces the older text= argument (renamed in Beautiful Soup 4.4.0)
print(soup.find_all(string=re.compile('Lacie')))

Output:
['Lacie']
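The three kinds of search above can also be combined or written with keyword arguments. The following is a sketch under a few assumptions: the markup is a trimmed copy of the sample, html.parser is used for portability, and the variable names are ours.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample links (an assumption for illustration).
html = """
<p class="story">
<a href="http://www.xtuz.net/elsie" class="sister" id="link1">Elsie</a>
<a href="http://www.xtuz.net/lacie" class="sister" id="link2">Lacie</a>
<a href="http://www.xtuz.net/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
by_class = [a["id"] for a in soup.find_all("a", class_="sister")]

# Any attribute can be matched directly by keyword.
by_id = [a.string for a in soup.find_all(id="link2")]

# Attribute values can be matched with a regular expression.
by_href = [a.string for a in soup.find_all(href=re.compile("lacie|tillie"))]

# limit stops the search after the given number of matches.
first_two = [a.string for a in soup.find_all("a", limit=2)]

print(by_class, by_id, by_href, first_two)
```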

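find() - Find the First Matching Element

find() returns a single element, the first match, or None when nothing matches. It takes the same arguments as find_all(), so it is not covered further here.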
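Because find() can come back empty, it is worth checking the result before using it. A minimal sketch, with a made-up two-link snippet and html.parser as the assumed parser:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration.
html = (
    '<p class="story">'
    '<a id="link1" href="/elsie">Elsie</a>'
    '<a id="link2" href="/lacie">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")   # first matching Tag
missing = soup.find("table")  # no match, so this is None

# Guard before touching attributes of a possibly missing node.
table_text = missing.get_text() if missing is not None else ""

print(first_link["id"], missing, repr(table_text))
```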

select() - Searching with CSS Selectors

Pass a string to the select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

print(soup.select('.story'))

Output:
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://www.xtuz.net/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://www.xtuz.net/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://www.xtuz.net/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

As you can see, select() is also very convenient.
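A few more selector forms, as a sketch rather than a full tour. The markup is a trimmed copy of the sample, html.parser is an assumption made for portability, and select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample links (an assumption for illustration).
html = """
<p class="story">
<a href="http://www.xtuz.net/elsie" class="sister" id="link1">Elsie</a>
<a href="http://www.xtuz.net/lacie" class="sister" id="link2">Lacie</a>
<a href="http://www.xtuz.net/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: a tags directly inside p.story.
children = [a.string for a in soup.select("p.story > a")]

# ID selector; select_one() returns the first match or None.
second = soup.select_one("#link2").string

# Attribute selector: href ending in "tillie".
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]

print(children, second, ends_with)
```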

Extracting Data

Extracting the Name

Use the name attribute to get a node's name. Select the title node, then read its name attribute:

print(soup.title.name) 

Output:
title

Extracting Attributes

A node may have several attributes, such as id and class. After selecting a node, use attrs to get all of them:

print(soup.p.attrs)
print(soup.p.attrs['class'])

Output:
{'class': ['title']}
['title']

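attrs returns a dictionary that maps each attribute name of the selected node to its value.

Note that some values come back as a string and others as a list of strings: class can hold several values, so it is returned as a list, while an attribute such as id is returned as a string. In practice, handle each return type accordingly.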
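The difference between the two return types can be seen on a single a node. This is a small sketch with one link copied from the sample and html.parser as the assumed parser; the variable names are ours.

```python
from bs4 import BeautifulSoup

# One link from the sample document.
html = '<a href="http://www.xtuz.net/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

link_id = link["id"]         # single-valued attribute: a str
classes = link["class"]      # multi-valued attribute: a list of str
title = link.get("title")    # get() returns None for a missing attribute

# Join the list when a plain string is needed.
class_text = " ".join(classes)

print(link_id, classes, title, class_text)
```

Indexing with link["title"] would raise a KeyError here, so get() is the safer choice when an attribute may be absent.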

Extracting Text Content

Use the string attribute to get the text contained in a node, for example the text of the first p node:

print(soup.p.string) 

Output:
The Dormouse's story
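The string attribute works here because the first p node wraps a single child. For a node with several children it returns None, and get_text() is the usual alternative. A minimal sketch with a shortened version of the sample text (an assumption for illustration) and html.parser:

```python
from bs4 import BeautifulSoup

# Shortened sample: one p with a single child, one with several.
html = (
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once upon a time there were "
    "<a>Elsie</a> and <a>Lacie</a>.</p>"
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string       # one child chain, so the text is returned
multiple = story_p.string     # several children, so this is None
full_text = story_p.get_text()  # concatenates all text inside the node

print(single)
print(multiple)
print(full_text)
```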

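Summary

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and offers simple, commonly needed search and extraction features, which saves a great deal of programming time.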
