[python] Scraping Douban Top 250 (BeautifulSoup & lxml)

Posted on 2019-9-23 23:50:15
BeautifulSoup & lxml (ever since I started using these two libraries, mom never has to worry about me failing to scrape data again)
If you don't have lxml installed, change 'lxml' in get_soup to 'html.parser'.
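If you'd rather have the script fall back automatically, here is a minimal sketch (the PARSER constant is just an illustration, not part of the original script):

[Python]
from bs4 import BeautifulSoup

# Prefer lxml when it is installed, otherwise fall back to the standard-library parser.
try:
    import lxml  # noqa: F401 -- imported only to test availability
    PARSER = 'lxml'
except ImportError:
    PARSER = 'html.parser'

def get_soup(html):
    return BeautifulSoup(html, PARSER)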

The most painful part of writing this was getting halfway through and then thinking of a simpler approach; reworking it was really tiring.

[Python]
from bs4 import BeautifulSoup
import requests
import re

count = 0  # 'start=' offset of the current list page (0, 25, ..., 225)

def get_html(url):
    res = requests.get(url)
    return res.text

def get_soup(html):
    # If lxml is not installed, change 'lxml' here to 'html.parser'.
    return BeautifulSoup(html,'lxml')
def get_25(url):
    # Print the 25 movies of the current list page.
    soup = get_soup(get_html(url))
    # Each movie has a Chinese title plus an alias beginning with ' / '; keep only the former.
    name = [i for i in soup.body.find_all('span',attrs={'class':'title'}) if '/' not in i.string]
    actors = soup.body.find_all('p',attrs={'class':''})
    r_num = soup.body.find_all('span',attrs={'class':'rating_num'})
    inq = soup.body.find_all('span',attrs={'class':'inq'})
    # A few movies have no one-line quote; pad with the last quote so the index below
    # never runs out (quotes after a missing one may be shifted by this workaround).
    while len(inq) < 25:
        inq.append(inq[-1])
    for i in range(25):
        print('---%d---' % i + '<<' + name[i].string + '>>'+' 豆瓣Top %d'%(count+i+1))
        for c in actors[i].strings:
            print('*'+c.replace('\n','').replace(' ',''))
        print('*评分: ' + r_num[i].string + '分')
        print('*评价: "' + inq[i].string + ' "')
        print('*'*40)
    print('当前页数为第%d/10页' % ((count+25)//25))

def get_url(html):
    # NOTE: the regex in the original post was swallowed by the forum's renderer.
    # This pattern is a reconstruction: it grabs the detail-page link that follows
    # each movie's <div class="hd"> on the list page.
    return re.findall(r'<div class="hd">\s*<a href="(.*?)"',html)

def get_page(result):
    # Page flipping: L/l = previous page, N/n = next page; count is the 'start=' offset.
    global count
    if count > 24 and (result == 'L' or result == 'l'):
        count -= 25
        main()
    elif count < 24 and (result == 'L' or result == 'l'):
        print('当前页数为第一页,无法继续往前,')
        main()
    elif count < 224 and (result == 'N' or result == 'n'):
        count += 25
        main()
    elif count > 224 and (result == 'N' or result == 'n'):
        print('当前页数为最后一页,无法继续往后,')
        main()
    else:
        print('这不是有效输入,')
        main()

def get_message(url):
    # Fetch one movie's detail page and print director, cast, genres, dates, etc.
    global count
    soup = get_soup(get_html(url))
    title = soup.body.find('span',attrs={'property':'v:itemreviewed'})
    hidden = soup.body.find('span',{'class':'all hidden'})
    tp = soup.body.find_all('span',attrs={"property":"v:genre"})
    date = soup.body.find_all('span',attrs={"property":"v:initialReleaseDate"})
    lst = soup.body.find_all('span',attrs={'class':'pl'})
    print('*'*30+'\n\n' +title.string + '\n\n'+ '*'*30+'\n')
    print('导演 : ',end='')
    for i in lst[0].next_sibling.next_sibling.strings:
        print(i,end='')
    print('\n')
    print('编剧 : ',end='')
    for i in lst[1].next_sibling.next_sibling.strings:
        print(i,end='')
    print('\n')
    print('主演 : ',end='')
    for i in lst[2].next_sibling.next_sibling.strings:
        print(i,end='')
    print('\n')
    print('类型 : ',end='')
    for i in tp:
        print(i.string + ' / ',end='')
    print('\n')
    print('制片国家/地区 : '+str(lst[4].next_sibling)+'\n')
    print('语言 : '+str(lst[5].next_sibling)+'\n')
    print('上映日期 : ',end='')
    for i in date:
        print(i.string,end='')
    print('\n')
    print('片长 : '+ soup.body.find('span',attrs={'property':'v:runtime'}).string+'\n')
    print('又名 : '+str(lst[8].next_sibling)+'\n')
    print('---剧情简介---')
    # Long summaries are collapsed behind an "all hidden" span; fall back to the short one.
    if hidden is not None:
        for i in hidden.strings:
            print(i)
    else:
        for i in soup.body.find('span',attrs={'property':'v:summary'}):
            print(str(i).replace('\n','').replace(' ',''))
    r = input('输入Q返回列表或输入M返回第一页: ')
    if r == 'q' or r == 'Q':
        main()
    else:
        count = 0
        main()

def main():
    url = 'https://movie.douban.com/top250?start='+str(count)+'&filter='
    get_25(url)
    r = input('输入电影序号或上一页或下一页 L/N : ')
    if r.isdigit():
        get_message(get_url(get_html(url))[int(r)])
    else:
        get_page(r)

main()
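The regex inside get_url above had to be reconstructed because the forum's renderer swallowed the original pattern. If you'd rather not rely on a guessed regex, a BeautifulSoup-based sketch can collect the links instead (it assumes each title link on the list page sits inside a <div class="hd"> block, which is how the page looked when I checked):

[Python]
from bs4 import BeautifulSoup

def get_url_bs(html):
    # Collect the 25 detail-page links of one list page.
    # Assumes every movie's title link sits inside <div class="hd"><a href="...">.
    soup = BeautifulSoup(html,'lxml')
    return [div.a['href'] for div in soup.find_all('div',attrs={'class':'hd'})]

It plugs into main() the same way: get_message(get_url_bs(get_html(url))[int(r)]).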



