Parsing web pages by CSS class with BeautifulSoup


 2018/08/24   Share

Parsing web pages by CSS with BeautifulSoup

This post is based on the tutorial "BeautifulSoup 解析网页: CSS" (BeautifulSoup: Parsing Web Pages with CSS).

Get the content using CSS class

In the HTML source of a web page, CSS is used to style the page. We can use the CSS classes, together with the tags that carry them, to pick out the content we want.

Demo

from bs4 import BeautifulSoup
from urllib.request import urlopen

# The page contains Chinese text, so decode the downloaded bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')

# Parse the HTML text into a 'soup' object using the 'lxml' parser
# (requires the lxml package). BeautifulSoup supports other parsers too;
# see its documentation for details.
soup = BeautifulSoup(html, features='lxml')

# Print the text of every <li> tag whose class includes 'month'
all_month = soup.find_all('li', {'class': 'month'})
for month in all_month:
    print(month.get_text())

# Print the text of every <ul> tag whose class includes 'jan'
all_jan = soup.find_all('ul', {'class': 'jan'})
for jan in all_jan:
    print(jan.get_text())

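The CSS classes used above appear directly in the page's HTML source. Note that a tag can carry several classes at once: the tag with class="feb month" still matches a search for the class 'month'.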

<ul>
        <li class="month">一月</li>
        <ul class="jan">
                <li>一月一号</li>
                <li>一月二号</li>
                <li>一月三号</li>
        </ul>
        <li class="feb month">二月</li>
        <li class="month">三月</li>
        <li class="month">四月</li>
        <li class="month">五月</li>
</ul>

</body>
</html>

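In the demo code, different combinations of tag and class match different content: 'li' with class 'month' selects the month entries, and 'ul' with class 'jan' selects the January list. Just remember this pattern. It's easy.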
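As a complement to the demo, here is a minimal sketch of two other ways to express the same class-based matching. It is not part of the original tutorial. It assumes a shortened inline copy of the list markup (so no network access is needed) and uses Python's built-in 'html.parser' instead of 'lxml'. The variable names HTML, months, jan_days and feb are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened inline copy of the list markup from the page source.
HTML = """
<ul>
    <li class="month">一月</li>
    <ul class="jan">
        <li>一月一号</li>
        <li>一月二号</li>
        <li>一月三号</li>
    </ul>
    <li class="feb month">二月</li>
    <li class="month">三月</li>
</ul>
"""

# 'html.parser' ships with Python; the demo above uses 'lxml' instead.
soup = BeautifulSoup(HTML, features="html.parser")

# The class_ keyword is equivalent to passing {'class': 'month'}.
# A tag with class="feb month" also matches, since 'month' is one of its classes.
months = [li.get_text() for li in soup.find_all("li", class_="month")]

# CSS selector: every <li> nested inside <ul class="jan">.
jan_days = [li.get_text() for li in soup.select("ul.jan li")]

# CSS selector requiring both classes on the same <li>.
feb = [li.get_text() for li in soup.select("li.feb.month")]

print(months)
print(jan_days)
print(feb)
```

Selecting "ul.jan li" returns each day as a separate item, whereas calling get_text() on the whole 'ul' tag, as in the demo, returns the days joined into one string together with the surrounding whitespace.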


