Trying out a web crawler, plus a question about Python

[Image 1]

# encoding: utf-8

Scraping a well-known joke site with BeautifulSoup

package main

My code:

from urllib.request import urlopen
from urllib.request import Request
from urllib.error import HTTPError
import re
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:68.0) Gecko/20100101 Firefox/68.0'}
html1 = Request('', headers=headers)  # the target URL was not preserved in the post
html = urlopen(html1)
bs = BeautifulSoup(html.read(), 'html.parser')
nameList3 = bs.findAll('div', {'id': 'BAIDU_DUP_fp_wrapper'})
for name3 in nameList3:
    print(name3)

The page content is shown in the screenshot. When I run this, nothing is printed at all. Why is that?
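
A likely explanation (not stated in the original thread, so treat it as an educated guess): BAIDU_DUP_fp_wrapper is the kind of ad container that Baidu's ad script injects with JavaScript after the page loads, so it never appears in the static HTML that urlopen downloads. find_all then returns an empty list, the loop body never runs, and nothing is printed. A quick way to confirm is to count the matches and check the raw HTML directly; the URL below is a placeholder, since the asker's real URL was not preserved:

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

target_url = 'http://example.com/'  # hypothetical placeholder for the asker's URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:68.0) Gecko/20100101 Firefox/68.0'}
raw = urlopen(Request(target_url, headers=headers)).read()

bs = BeautifulSoup(raw, 'html.parser')
hits = bs.find_all('div', {'id': 'BAIDU_DUP_fp_wrapper'})
print('matching divs:', len(hits))                       # 0 means the element is not in the static HTML
print('id present in raw bytes:', b'BAIDU_DUP_fp_wrapper' in raw)

If the id is absent from the raw bytes, no choice of parser will find it; you would need a tool that executes JavaScript (Selenium, for example) or a different element to target.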

import urllib2,requests  # the library names in import statements have to be spelled exactly right; if you are not sure, check your Python installation's package directory, e.g. C:\Python27\Lib\site-packages, such as C:\Python27\Lib\site-packages\bs4
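
If you would rather not dig through site-packages by hand, you can also ask Python itself where a package lives; a small sketch, not from the original post:

# Print the importable name and on-disk location of installed packages.
import bs4
import requests

print(bs4.__name__, '->', bs4.__file__)        # pip installs "beautifulsoup4", but the import name is bs4
print(requests.__name__, '->', requests.__file__)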

First, let's take a look at the site we need to scrape: http://xiaohua.zol.com.cn/

import (
    "fmt"
    "net/smtp"
    "encoding/base64"
)
// format is "html" or "plain"
func SendMail(title, user, pswd, smtpserver, port, from, to, subject, body, format string) error {
    bs64 := base64.NewEncoding("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
    header := make(map[string]string)
    header["From"] = title + "<" + from + ">"
    header["To"] = to
    header["Subject"] = fmt.Sprintf("=?UTF-8?B?%s?=", bs64.EncodeToString([]byte(subject)))
    header["MIME-Version"] = "1.0"
    header["Content-Type"] = "text/" + format + "; charset=UTF-8"
    header["Content-Transfer-Encoding"] = "base64"
    data := ""
    for k, v := range header {
        data += k + ": " + v + "\r\n"
    }
    data += "\r\n" + bs64.EncodeToString([]byte(body))

    err := smtp.SendMail(smtpserver+":"+port, smtp.PlainAuth("", user, pswd, smtpserver), from, []string{to}, []byte(data))
    return err
}


1. Preparation

1.1 Python 3. This post is written for Python 3; if it is not installed on your machine, install Python 3 first.

1.2 The requests library, a friendlier HTTP library that packages urllib's functionality behind a much simpler API. Install it with:

pip install requests

1.3 The BeautifulSoup library, a Python library for extracting data from HTML and XML files. It lets you navigate, search and modify the document tree through the parser of your choice. Install it with:

pip install beautifulsoup4
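
As a quick illustration of what BeautifulSoup gives you (a minimal sketch, not part of the original post):

from bs4 import BeautifulSoup

# Parse a tiny HTML fragment and print the text of the matching tags.
doc = '<ul><li class="joke">first joke</li><li class="joke">second joke</li></ul>'
soup = BeautifulSoup(doc, 'html.parser')   # 'lxml' also works once lxml is installed
for li in soup.find_all('li', class_='joke'):
    print(li.text)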

1.4 lxml, a parser backend used by BeautifulSoup to parse pages. (If you are not using Anaconda, you may find that pip fails to install this package on Windows.) Install it with:

pip install lxml

1.5 PyCharm, a powerful Python IDE. After downloading the official edition you can activate it through a license server for free (the same goes for other products in the series); see http://www.cnblogs.com/hanggegege/p/6763329.html for details.

func main() {
    title := "DeepData"
    from := "info@deepdata.cn"
    to := "123456@qq.com"
    subject := "TEST SUBJECT"
    body := "深数据 deepdata.cn"
    smtpserver := "smtp.mxhichina.com"
    pswd := "***password***"
    err := SendMail(title, from, pswd, smtpserver, "25", from, to, subject, body, "plain")
    fmt.Println(err)
}

# You need to pip install requests, beautifulsoup4 and lxml (urllib/urllib2 is part of the standard library and is not installed with pip). On Windows you can run pip list to see which libraries are already installed.

2. Walking through and analysing the scrape

from bs4 import BeautifulSoup

import os

import requests

Import the libraries we need; the os library will be used later to save the scraped content to disk.

Next, click into "Latest Jokes" (最新笑话) and you will notice an "All Jokes" (全部笑话) tab, which lets us crawl the entire joke archive as efficiently as possible.

[Image 3]

Let's use the requests library to look at the source code of this page:

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

headers = {'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

all_html=requests.get(all_url,headers = headers)

print(all_html.text)

headers is the HTTP request header; scraping most sites will fail without it.

Part of the output looks like this:

[Image 4]

Analysing the source shows that we still cannot get the full text of every joke directly from this page, so we look for an indirect route from here.

[Image 5]

Click on a joke to view its full text, and you will see that the address changes to a per-joke detail URL (the link itself was stripped from this republished copy).

Our goal is to collect every detail URL of that form.

Flip to any page of the "All Jokes" listing, press F12 to inspect its source, and the layout reveals the following:

[Image 6]

Each joke corresponds to one li element with class article-summary, and the URL that expands the joke to its full text is hidden in the href of the "read all" link inside it, so extracting that href gives us the joke's address.

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

headers = {'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

all_html=requests.get(all_url,headers = headers)

#print(all_html.text)

soup1 = BeautifulSoup(all_html.text,'lxml')

list1=soup1.find_all('li',class_ = 'article-summary')

for i in list1:

    #print(i)

    soup2 = BeautifulSoup(i.prettify(),'lxml')

    list2=soup2.find_all('a',target = '_blank',class_='all-read')

    for b in list2:

        href = b['href']

        print(href)

With the code above we successfully obtain the URL suffixes of all the jokes on the first page:

[Image 7]

In other words, once we loop over all the page numbers of the listing, we can reach every joke.
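
As a rough sketch of that idea (the real page loop appears at the end of the post; the prefix below is a placeholder, because the original URL literals were stripped from this republished copy):

# Hypothetical sketch: build the URL of every listing page, assuming the pages
# are numbered 1.html, 2.html, ... under some fixed prefix.
page_prefix = ''   # placeholder for the "All Jokes" listing prefix on xiaohua.zol.com.cn
page_urls = [page_prefix + str(num) + '.html' for num in range(1, 101)]
print(page_urls[:3])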

Here is the href-extraction code from above, tidied up into a function:

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

def Gethref(url):

    headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    html = requests.get(url,headers = headers)

    soup_first = BeautifulSoup(html.text,'lxml')

    list_first = soup_first.find_all('li',class_='article-summary')

    for i in list_first:

        soup_second = BeautifulSoup(i.prettify(),'lxml')

        list_second = soup_second.find_all('a',target = '_blank',class_='all-read')

        for b in list_second:

            href = b['href']

            print(href)

Gethref(all_url)

The following code turns those suffixes into complete joke URLs:

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

def Gethref(url):

    list_href = []

    headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    html = requests.get(url,headers = headers)

    soup_first = BeautifulSoup(html.text,'lxml')

    list_first = soup_first.find_all('li',class_='article-summary')

    for i in list_first:

        soup_second = BeautifulSoup(i.prettify(),'lxml')

        list_second = soup_second.find_all('a',target = '_blank',class_='all-read')

        for b in list_second:

            href = b['href']

            list_href.append(href)

    return list_href

def GetTrueUrl(liebiao):

    for i in liebiao:

        url = 'http://xiaohua.zol.com.cn' + str(i)  # the hrefs are site-relative, so prefix the domain (the original string literal was lost in republication)

        print(url)

GetTrueUrl(Gethref(all_url))

After a quick look at the HTML of an individual joke page, the next step is to fetch the content of every joke on one listing page:

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

def Gethref(url):

    list_href = []

    headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    html = requests.get(url,headers = headers)

    soup_first = BeautifulSoup(html.text,'lxml')

    list_first = soup_first.find_all('li',class_='article-summary')

    for i in list_first:

        soup_second = BeautifulSoup(i.prettify(),'lxml')

        list_second = soup_second.find_all('a',target = '_blank',class_='all-read')

        for b in list_second:

            href = b['href']

            list_href.append(href)

    return list_href

def GetTrueUrl(liebiao):

    list = []

    for i in liebiao:

        url = 'http://xiaohua.zol.com.cn' + str(i)  # the hrefs are site-relative, so prefix the domain (the original string literal was lost in republication)

        list.append(url)

    return list

def GetText(url):

    for i in url:

        html = requests.get(i)

        soup = BeautifulSoup(html.text,'lxml')

        content = soup.find('div',class_='article-text')

        print(content.text)

GetText(GetTrueUrl(Gethref(all_url)))

The output looks like this:

[Image 8]

Now let's start saving the jokes to disk. This is where the os library comes in.

The following code fetches and saves every joke from one listing page:

from bs4 import BeautifulSoup

import os

import requests

all_url = 'http://xiaohua.zol.com.cn/'  # base of the joke site; the exact "All Jokes" listing URL was lost when this post was republished

os.mkdir('/home/lei/zol')

def Gethref(url):

    list_href = []

    headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    html = requests.get(url,headers = headers)

    soup_first = BeautifulSoup(html.text,'lxml')

    list_first = soup_first.find_all('li',class_='article-summary')

    for i in list_first:

        soup_second = BeautifulSoup(i.prettify(),'lxml')

        list_second = soup_second.find_all('a',target = '_blank',class_='all-read')

        for b in list_second:

            href = b['href']

            list_href.append(href)

    return list_href

def GetTrueUrl(liebiao):

    list = []

    for i in liebiao:

        url = 'http://xiaohua.zol.com.cn' + str(i)  # the hrefs are site-relative, so prefix the domain (the original string literal was lost in republication)

        list.append(url)

    return list

def GetText(url):

    for i in url:

        html = requests.get(i)

        soup = BeautifulSoup(html.text,'lxml')

        content = soup.find('div',class_='article-text')

        title = soup.find('h1',class_ = 'article-title')

        SaveText(title.text,content.text)

def SaveText(TextTitle,text):

    os.chdir('/home/lei/zol/')

    f = open(str(TextTitle) + '.txt', 'w')

    f.write(text)

    f.close()

GetText(GetTrueUrl(Gethref(all_url)))

The result:

[Image 9]

(My system is Linux, so adjust the paths to match your own machine.)
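
If you are on Windows, or simply want the save step to be more forgiving, a portable variant of SaveText could look like this (a sketch with an assumed save_dir; it also strips characters that are illegal in Windows file names and does not fail when the directory already exists):

import os
import re

def SaveTextPortable(TextTitle, text, save_dir='zol_jokes'):
    # Create the output directory (relative to the working directory) if needed.
    os.makedirs(save_dir, exist_ok=True)
    # Replace characters that are not allowed in file names on Windows.
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', str(TextTitle)).strip()
    with open(os.path.join(save_dir, safe_title + '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)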

Our goal is not just to grab the jokes from a single page; the next step is to walk through every page we need.

By inspecting the listing, the "All Jokes" pages follow a fixed URL pattern that ends with the page number plus .html (the literal URL was stripped from this republished copy).

So we revise the code once more:

from bs4 import BeautifulSoup

import os

import requests

num = 1

url = 'http://xiaohua.zol.com.cn/' + str(num) + '.html'  # the exact listing-page prefix before the page number was lost in republication

os.mkdir('/home/lei/zol')

def Gethref(url):

    list_href = []

    headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

    html = requests.get(url,headers = headers)

    soup_first = BeautifulSoup(html.text,'lxml')

    list_first = soup_first.find_all('li',class_='article-summary')

    for i in list_first:

        soup_second = BeautifulSoup(i.prettify(),'lxml')

        list_second = soup_second.find_all('a',target = '_blank',class_='all-read')

        for b in list_second:

            href = b['href']

            list_href.append(href)

    return list_href

def GetTrueUrl(liebiao):

    list = []

    for i in liebiao:

        url = 'http://xiaohua.zol.com.cn' + str(i)  # the hrefs are site-relative, so prefix the domain (the original string literal was lost in republication)

        list.append(url)

    return list

def GetText(url):

    for i in url:

        html = requests.get(i)

        soup = BeautifulSoup(html.text,'lxml')

        content = soup.find('div',class_='article-text')

        title = soup.find('h1',class_ = 'article-title')

        SaveText(title.text,content.text)

def SaveText(TextTitle,text):

    os.chdir('/home/lei/zol/')

    f = open(str(TextTitle) + '.txt', 'w')

    f.write(text)

    f.close()

while num<=100:

    url = 'http://xiaohua.zol.com.cn/' + str(num) + '.html'  # listing-page prefix lost in republication; pages are numbered 1.html, 2.html, ...

    GetText(GetTrueUrl(Gethref(url)))

    num = num + 1

Done! All that is left is to wait for the files to finish downloading.

The result:

[Image 10]

Thanks for reading!

import os
import urllib.request
import urllib.error
from bs4 import BeautifulSoup

def download(url):  # downloader without any browser disguise
    print("Downloading: %s" % url)
    try:
        result = urllib.request.urlopen(url, timeout=2).read()
    except urllib.error.URLError as e:
        print("Downloading Error:", e.reason)
        result = None
    return result

def download_browser(url, headers):  # downloader that pretends to be a browser
    opener = urllib.request.build_opener()   # build an opener so browser-like headers can be attached
    opener.addheaders = headers              # disguise the request with browser headers
    print("Downloading: %s" % url)
    try:
        result = opener.open(url, timeout=2)
        result = result.read()
        print("Download OK!")
    except urllib.error.URLError as e:
        print("Downloading error:", e.reason)
        result = None
    return result

# Parse the front page and collect the article URLs
def bs_parser(html):
    tree = BeautifulSoup(html, 'lxml')
    # lxml is used as the parser, so the lxml package must be installed; it is one of the
    # fastest and most feature-rich libraries for working with HTML and XML.
    data = tree.find('div', class_='x-sidebar-left-content').find_all('a')
    # The class names used here come from inspecting the specific page being scraped.
    print(data[0].attrs['href'])
    urls = []
    titles = []
    grades = []
    for item in data:
        urls.append(item.attrs['href'])
        titles.append(item.get_text())
    return urls, titles

# Parse the content of an article page
def bs_parser_content(html):
    tree = BeautifulSoup(html, 'lxml')
    data = tree.find('div', class_='x-wiki-content')
    # print(data)
    result = data.get_text()
    return result

# Front-page URL and site root (the literals were stripped when this post was republished)
url = ''
root = ''

# headers must be a list of tuples
headers = [
    ('Connection', 'Keep-Alive'),
    ('Accept', 'text/html, application/xhtml+xml, */*'),
    ('Accept-Language', 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3'),
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko')
]

html = download_browser(url, headers)   # download the front-page HTML
urls, titles = bs_parser(html)          # parse it and return the URLs and titles
os.makedirs('Results', exist_ok=True)   # make sure the output directory exists
i = 0
for item, title in zip(urls, titles):
    if i == 5:
        break
    i += 1
    url = root + item
    html = download_browser(url, headers)   # download the article page
    result = bs_parser_content(html)        # parse it and extract the text
    # Build the path of the output text file
    fileName = str(i) + '_' + title.replace(r'/', ' ') + '.txt'
    fileName = os.path.join('Results/', fileName)
    print("fileName path is %s:" % fileName)
    # Write the text to the file
    with open(fileName, 'w', encoding='utf-8') as f:
        f.write(result.strip())
