
Using the BeautifulSoup module from bs4

Published 2019-08-22 17:44


Importing the module

import requests
from bs4 import BeautifulSoup

html=requests.get('https://www.cnblogs.com/cate/python/')
soup=BeautifulSoup(html.text,'lxml')
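Once the soup object exists, the usual BeautifulSoup accessors are available on it. A minimal sketch of a few of them follows; the tag names are generic examples rather than anything specific to the cnblogs page:

import requests
from bs4 import BeautifulSoup

html=requests.get('https://www.cnblogs.com/cate/python/')
soup=BeautifulSoup(html.text,'lxml')

# the <title> tag and its text
print(soup.title)
print(soup.title.get_text())

# first match vs. all matches
first_link=soup.find('a')        # a single Tag, or None
all_links=soup.find_all('a')     # a list of Tags
print(first_link.get('href') if first_link else 'no links found')
print(len(all_links),'links in total')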

How to extract the post URLs

import requests
from bs4 import BeautifulSoup

headers={
    'User-Agent':'Mozilla/5.0',
    'Referer':'https://www.cnblogs.com/cate/python/',
    'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'cache-control':'max-age=0'
}
html=requests.get('https://www.cnblogs.com/cate/python/',headers=headers)
soup=BeautifulSoup(html.text,'lxml')
items=soup.select('div[class="post_item_body"]')
for item in items:
    title=item.select('h3 a[class="titlelnk"]')[0].get_text()
    href=item.select('h3 a[class="titlelnk"]')[0]['href']
    author=item.select('div a[class="lightblue"]')[0].get_text()
    author_home=item.select('div a[class="lightblue"]')[0]['href']
    infos=item.select('p[class="post_item_summary"]')[0].get_text().strip('\n').strip(' ')
    datas=item.select('div[class="post_item_foot"]')[0].get_text()
    datas=datas.split(' ')
    # example of datas after the split:
    # ['\n随风奔跑的少年', '\r\n', '', '', '', '发布于', '2019-07-31', '20:40', '\r\n', '', '', '', '\r\n', '', '', '', '', '', '', '', '评论(0)阅读(4)']
    # publish time
    post_time=datas[6]+' '+datas[7]
    # the last element looks like '评论(0)阅读(4)'; pull out the comment and read counts
    pinglun=datas[-1].lstrip('评论(').split(')')[0]
    read_num=datas[-1].lstrip('评论(').rstrip(')').split('(')[-1]
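One caveat on the two count-parsing lines above: str.lstrip('评论(') strips a set of characters rather than a literal prefix, so they only work because the string happens to start with exactly those characters. A regex is a less fragile sketch of the same idea, assuming the footer text keeps its '评论(..)阅读(..)' shape:

import re

footer='评论(0)阅读(4)'   # sample footer text, as produced by the code above
m=re.search(r'评论\((\d+)\)阅读\((\d+)\)',footer)
if m:
    pinglun,read_num=m.group(1),m.group(2)
    print(pinglun,read_num)   # -> 0 4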

Finding elements directly by their class attribute

for mulu in soup.find_all(class_='mulu'):
    print(mulu.get_text())
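find_all accepts class_ (the trailing underscore avoids clashing with the Python keyword) as well as a general attrs dict. A self-contained sketch against an inline HTML snippet, with made-up class names, so it can be run on its own:

from bs4 import BeautifulSoup

html='''
<div class="mulu"><a href="/ch1">Chapter 1</a></div>
<div class="mulu"><a href="/ch2">Chapter 2</a></div>
<div class="other">ignore me</div>
'''
soup=BeautifulSoup(html,'lxml')

# class_ with a trailing underscore, because "class" is a reserved word
for mulu in soup.find_all(class_='mulu'):
    print(mulu.a['href'],mulu.get_text(strip=True))

# the attrs dict form is equivalent
print(len(soup.find_all('div',attrs={'class':'mulu'})))   # -> 2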

Official requests documentation (Chinese quickstart): http://2.python-requests.org/zh_CN/latest/user/quickstart.html
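A few of the quickstart basics from that page, as a minimal sketch; the httpbin.org URL is just a common test endpoint, not something used in this post:

import requests

# GET with query parameters, custom headers and a timeout
r=requests.get('https://httpbin.org/get',
               params={'q':'beautifulsoup'},
               headers={'User-Agent':'Mozilla/5.0'},
               timeout=10)

print(r.status_code)    # 200 on success
print(r.url)            # final URL including the query string
print(r.encoding)       # guessed text encoding
print(r.json()['args']) # httpbin echoes the query parameters back as JSON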

import requests,re
from bs4 import BeautifulSoup
# div class="hide-featured-badge hide-favorite-badge"
headers={
    'user-agent':'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',

}

photos=[]
url="https://www.pexels.com/zh-tw/"

response=requests.get(url,headers=headers)
# print(response.text)
soup=BeautifulSoup(response.text,'lxml')
imgs=soup.select(' div.hide-featured-badge.hide-favorite-badge > article > a > img')# keep the spaces in the selector; an element with two classes needs a dot for each class
# print(imgs)
# print(len(imgs))
for img in imgs:
    photo=img.get('src')  # get the src attribute of the img tag
    # print(photo)
    if photo.endswith('500'):
        photos.append(photo)
        # print(photos)
path=r'C:\Users\hh\Desktop\爬虫代码\正式\img'
for item in photos:
    data=requests.get(item,headers=headers)
    # photo_name=item.split('/')[4]+'.jpg'
    photo_name=re.findall(r'\d+/(.*?)\?',item)  # raw string so the backslash escapes reach the regex engine
    # print(photo_name[0])
    if photo_name:
        with open(path+'/'+photo_name[0],'wb') as fp:
            fp.write(data.content)
        print(photo_name[0]+' -- saved')
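A slightly more defensive version of the download loop is sketched below. It assumes the same photos list and headers built above; os.makedirs, stream=True, raise_for_status and iter_content are ordinary stdlib/requests calls added here, not something from the original code:

import os
import requests

def download_photos(photos,path,headers):
    # create the target folder if it does not exist yet
    os.makedirs(path,exist_ok=True)
    for url in photos:
        name=url.split('/')[-1].split('?')[0]   # crude file name taken from the URL
        if not name:
            continue
        try:
            resp=requests.get(url,headers=headers,stream=True,timeout=15)
            resp.raise_for_status()
            with open(os.path.join(path,name),'wb') as fp:
                for chunk in resp.iter_content(chunk_size=8192):
                    fp.write(chunk)
            print(name,'-- saved')
        except requests.RequestException as e:
            print(name,'-- failed:',e)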
'''
Scrape artist listings from NetEase Cloud Music (网易云音乐)
url=https://music.163.com/discover/artist/cat?id=4001&initial=0
https://music.163.com/#/discover/artist/cat?id=4001&initial=0
id=4001   -> artist category id
initial=0 -> first letter of the artist name
initial takes the values [-1, 65-90 (A-Z), 0]
'''
import requests
from bs4 import BeautifulSoup
import csv


def get_artists(url):
    headers={
        'User-Agent':'Mozilla/5.0',
        'referer':'https://music.163.com/',

    }
    r=requests.get(url,headers=headers)
    r.encoding=r.apparent_encoding
    soup=BeautifulSoup(r.text,'lxml')

    for item in soup.find_all('a',attrs={'class':'nm nm-icn f-thide s-fc0'}):
        artist_name=item.string.strip()
        artist_id=item['href'].replace('/artist?id=','').strip()
        print(artist_id,artist_name)
        try:
            writer.writerow((artist_id,artist_name))
        except Exception as e:
            print('write failed')
            print(e)

if __name__ == '__main__':
    id_list = [1001, 1002, 1003, 2001, 2002, 2003, 6001, 6002, 6003, 7001, 7002, 7003, 4001, 4002, 4003]
    init_list = [-1, 0, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
                 89, 90]
    # a=[x for x in range(65,91)]
    # print(a)
    csvfile=open('music_163_artist.csv','a',encoding='utf-8',newline='')
    writer=csv.writer(csvfile)
    writer.writerow(('artist_id','artist_name'))


    for i in id_list:
        for j in init_list:
            url="https://music.163.com/discover/artist/cat?id={}&initial={}".format(i,j)
            # print(url)
            get_artists(url)

    csvfile.close()

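After the run, music_163_artist.csv can be read back to check how many artists were collected; a minimal sketch, assuming the script above has been run and the file exists:

import csv

with open('music_163_artist.csv',encoding='utf-8') as f:
    reader=csv.reader(f)
    header=next(reader)                    # ['artist_id', 'artist_name']
    rows=[row for row in reader if row]    # skip any blank lines
print(header)
print(len(rows),'artists collected')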
# '''
# Box office data scraping
# Install pandas and numpy
# - or simply install Anaconda (a bundled Python distribution)
# Use pandas to store the scraped data
#
# - scrape data from the China box office site
# - url = 'www.cbooo.cn'
# http://www.cbooo.cn/year?year=2019
# '''
import pandas as pd
import requests
from bs4 import BeautifulSoup
url='http://www.cbooo.cn/year?year=2019'
datas=requests.get(url).text
# print(datas)
# parse the page with bs4
soup=BeautifulSoup(datas,'lxml')

# collect the data rows
# movies_table=soup.find_all('table',{'id':'tbContent'})
# print(movies_table)
name=[]
# movies=movies_table.findAll('tr')  # get every tr row
for table in soup.find_all('table',{'id':'tbContent'}):
    name.extend([names for names in table.findAll('tr')])

    # print(name)
# movie titles
names=[tr.find_all('td')[0].a.get('title') for tr in name[1:]]
# print(names)
# movie detail-page URLs
hrefs=[tr.find_all('td')[0].a.get('href') for tr in name[1:]]
# print(hrefs)
# movie genres
types=[tr.find_all('td')[1].text for tr in name[1:]]
# print(types)
# total box office
boxoffice=[int(tr.find_all('td')[2].text) for tr in name[1:]]  # convert the string to an integer
# print(boxoffice)
# average ticket price
mean_price=[int(tr.find_all('td')[3].text) for tr in name[1:]]
# print(mean_price)
# average attendance per screening
mean_people=[int(tr.find_all('td')[4].text) for tr in name[1:]]
# print(mean_people)
# country / region
contries=[tr.find_all('td')[5].text for tr in name[1:]]
# print(contries)
# release date
times=[tr.find_all('td')[6].text for tr in name[1:]]
# print(times)
def getInfo(url):
    datas=requests.get(url)
    soup=BeautifulSoup(datas.text,'lxml')
    # get the director from the movie's detail page
    DaoYan=soup.select('dl.dltext dd')[0].get_text()
    return DaoYan
directors=[ getInfo(url).strip('\n') for url in hrefs]  # strip the trailing \n
# print(directors)


# show all columns
pd.set_option('display.max_columns', None)

# show all rows
pd.set_option('display.max_rows', None)

# set the display width per value to 100 (default is 50)
# pd.set_option('max_colwidth',100)
df=pd.DataFrame({
    'name':names,
    'href':hrefs,
    '票房':boxoffice,
    '总票价':mean_price,
    '场均人次':mean_people,
    '国家地区':contries,
    'time':times,
    '导演':directors
})
print(df)
# save the data to CSV
df.to_csv('movies.csv',encoding='utf-8')









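Once movies.csv has been written, it can be loaded back into pandas for further analysis; a minimal sketch, with column names matching the DataFrame built above:

import pandas as pd

df=pd.read_csv('movies.csv',encoding='utf-8',index_col=0)
print(df.shape)                                          # (rows, columns)
print(df.sort_values('票房',ascending=False).head(10))    # top 10 by box office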

