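Methods in Python's urllib module

Published: 2019-08-25 09:32:48 | Editor: auto | Views: 2020

The urlopen function in Python's urllib.request

urllib is a high-level library built on HTTP. It has three main parts:

(1) request, which handles client requests

(2) response, which handles the server's response

(3) parse, which parses URLs

The discussion below covers request.

The urllib.request module defines functions and classes for opening URLs (mostly over HTTP), including the more involved cases such as basic and digest authentication, redirects and cookies. The module is modelled on the file interface, with a remote URL taking the place of a local file path, so its functions return file-like objects.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url can be a string or a Request object.

If data is given, the request is sent as a POST; otherwise it defaults to GET.

urllib.request uses HTTP/1.1 and sends a Connection: close header with every request, so connections are not kept alive.

urlopen() returns a file-like object with the following methods:

- read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object

- info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) holding the headers sent back by the remote server

- getcode(): returns the HTTP status code (see https://tools.ietf.org/html/rfc7231#section-6 for details)

For an HTTP request:

1xx (Informational): the request was received and is still being processed

2xx (Successful): the request was received, understood and accepted

3xx (Redirection): further action is needed to complete the request

4xx (Client Error): the problem is on the client side, for example a malformed request or a URL that was not found

5xx (Server Error): the problem is on the server side

- geturl(): returns the URL of the resource actually retrieved, which differs from the requested URL after a redirect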
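The methods listed above can be tried without touching a real site. The following is a minimal Python 3 sketch that starts a throwaway HTTP server on the loopback interface and then inspects the response object. The names HelloHandler, inspect_response and demo are made up for this illustration, and the opener is built with an empty ProxyHandler only so that a system proxy cannot interfere with the local demo; urllib.request.urlopen(url) returns the same kind of object.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for a remote site (hypothetical, for illustration only)."""

    def do_GET(self):
        body = "hello urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def inspect_response(url):
    """Open url and return (status, final_url, content_type, text)."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        status = resp.getcode()                        # HTTP status code
        final_url = resp.geturl()                      # URL actually retrieved
        content_type = resp.info().get("Content-Type") # one response header
        text = resp.read().decode("utf-8")             # body arrives as bytes
    return status, final_url, content_type, text


def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: the OS picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return inspect_response("http://127.0.0.1:%d/" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Using the response as a context manager closes the connection automatically, which replaces the explicit close() call.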

#!/opt/yrd_soft/bin/python
# Print the link text of every <a> tag whose href starts with "http:".
import re
import requests
from bs4 import BeautifulSoup   # the 'lxml' parser used below requires the lxml package

url = 'http://www.baidu.com'
# page = urllib2.urlopen(url)   # Python 2 standard-library alternative
page = requests.get(url).text
pagesoup = BeautifulSoup(page, 'lxml')
for link in pagesoup.find_all(name='a', attrs={"href": re.compile(r'^http:')}):
    print(link.get_text())

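Extracting the text of a page with BeautifulSoup's get_text method:

#!/usr/bin/env python
#coding=utf-8
# Extract the text content from HTML
import requests
from bs4 import BeautifulSoup

url = 'http://www.baidu.com'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')   # name the parser explicitly to avoid a warning
print(soup.get_text())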

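1 Introduction to urllib2

urllib2 is a library bundled with Python 2 for accessing web pages and local files.

Compared with urllib, the notable differences are:

1) urllib2 accepts a Request instance, which lets you set the headers of a URL request; urllib accepts only a URL. This means that with urllib you cannot, for example, fake the User-Agent string.

2) urllib provides urlencode for encoding the data to send, and urllib2 does not. This is why urllib and urllib2 are often used together.

2 Common urllib2 functions

2.1 urllib2.urlopen

urlopen() is the simplest way to make a request. It opens the URL and returns a file-like object from which the response can be read.

urllib2.urlopen(url[, data][, timeout])

Parameters:

url: either a string containing the URL or an instance of urllib2.Request.

data: the encoded POST data (usually encoded with urllib.urlencode()).

Without data the request is a GET; with data it is a POST.

timeout: optional timeout in seconds for blocking operations. If it is not given, the global default timeout is used. It only applies to HTTP, HTTPS and FTP connections.

Assuming urlopen() returns a file-like object u, it supports these commonly used methods:

u.read([nbytes]) reads nbytes of data as a byte string

u.readline() reads a single line as a byte string

u.readlines() reads all remaining lines and returns them as a list

u.close() closes the connection

u.getcode() returns the HTTP response code as an integer, for example 200 on success or 404 when the file is not found

u.geturl() returns the real URL of the data returned, taking any redirects into account

u.info() returns a mapping object with the metadata associated with the URL.

For HTTP, this is the set of HTTP headers in the server's response.

For FTP, the headers include 'content-length'.

For local files, the headers include 'content-length' and 'content-type'.

Note:

The file-like object u works in binary mode. To handle the response as text, decode the data with the codecs module or a similar approach.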
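The decoding step mentioned in the note can be sketched in Python 3 as follows. The helper decode_body is a made-up name for this illustration; it takes the raw bytes and the value of the Content-Type header, and falls back to UTF-8 when the header names no charset (that fallback is an assumption, not something HTTP guarantees).

```python
from email.message import Message


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode response bytes using the charset named in a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter, or None.
    charset = msg.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    gbk_bytes = "中文页面".encode("gbk")
    print(decode_body(gbk_bytes, "text/html; charset=GBK"))
    print(decode_body(b"plain ascii", "text/html"))
```

With a live Python 3 response object, resp.headers.get_content_charset() performs the same lookup, because the headers object is built on the same email.message.Message class.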

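Example:

>>> import urllib2
>>> res = urllib2.urlopen('http://www.51cto.com')
>>> res.read()
...... (the full page source)
>>> # read() consumes the whole body, so reopen the URL before reading line by line
>>> res = urllib2.urlopen('http://www.51cto.com')
>>> res.readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n'
>>> res.readlines()
... (the remaining source as a list of lines)
>>> res.info()
<httplib.HTTPMessage instance at 0x1a02638>
>>> res.getcode()
200
>>> res.geturl()
'http://www.51cto.com'
>>> # close the connection when done
>>> res.close()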

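2.2 urllib2.Request

Creating a Request instance

Request(url[, data][, headers][, origin_req_host][, unverifiable])

Description:

For simple requests, the url argument of urlopen() is just a string holding the URL. For anything more involved, such as modifying HTTP headers, create a Request instance and pass it as the url argument.

Parameters:

url: the URL string.

data: the data submitted along with the URL (for example the data to POST). Note that supplying data changes the HTTP request from 'GET' to 'POST'.

headers: a dictionary mapping HTTP header names to values (the headers to send with the request).

origin_req_host: usually the name of the host that originated the request. If the request is for an unverifiable URL (typically one the user did not enter directly, such as the URL of an image embedded in a page), set the following argument, unverifiable, to True.

Assuming a Request instance r, its most important methods are:

r.add_data(data) adds data to the request. For an HTTP request this changes the method to 'POST'.

data is what gets submitted to the URL. Note that this method does not append to data set earlier; the new data replaces it.

r.add_header(key, val) adds a header to the request. key is the header name and val is its value; both are strings.

r.add_unredirected_header(key, val) does much the same, but the header is not added to redirected requests.

r.set_proxy(host, type) prepares the request to go through a proxy server. host replaces the original host and type replaces the original request type.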
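For comparison, here is a small Python 3 sketch of the same Request mechanics using urllib.request.Request. Nothing is sent over the network; the request object is only built and inspected. The helper name build_login_request and the example.com URL are placeholders chosen for this illustration. In Python 3, add_data() no longer exists, so the body is passed through the data argument and must be bytes.

```python
import urllib.parse
import urllib.request


def build_login_request(url, fields, user_agent):
    """Build (but do not send) a form POST with a custom User-Agent header."""
    data = urllib.parse.urlencode(fields).encode("utf-8")  # the body must be bytes
    req = urllib.request.Request(url, data=data, headers={"User-Agent": user_agent})
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req


if __name__ == "__main__":
    req = build_login_request(
        "http://www.example.com/login",          # placeholder URL
        {"name": "51cto", "location": "51cto"},
        "Mozilla/5.0",
    )
    print(req.get_method())              # POST, because data is set
    print(req.data)
    # Request stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
    print(urllib.request.Request("http://www.example.com/").get_method())  # GET
```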

Examples:

1 Submitting data to a page:

>>> import urllib
>>> import urllib2
>>> url = 'http://www.51cto.com'
>>> info = {'name': "51cto", 'location': '51cto'}
>>> # info must be encoded into a form urllib2 understands; urllib does this
>>> data = urllib.urlencode(info)
>>> data
'name=51cto&location=51cto'
>>> request = urllib2.Request(url, data)
>>> response = urllib2.urlopen(request)
>>> the_page = response.read()

2 Modifying the request headers:

Sometimes the program is correct but the server still refuses the request. The cause is usually in the request headers: some servers are picky and reject clients that do not look like a browser. In that case the program has to present itself as a browser, and how a client identifies itself is carried in the headers.

When calling a REST interface, the server checks the Content-Type header to decide how to parse the HTTP body.

>>> import urllib
>>> import urllib2
>>> url = 'http://www.51cto.com'
>>> # put the user agent into the headers
>>> user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>>> values = {'name': '51cto', 'location': "51cto", 'language': 'Python'}
>>> headers = {'User-Agent': user_agent}
>>> data = urllib.urlencode(values)
>>> req = urllib2.Request(url, data, headers)
>>> response = urllib2.urlopen(req)
>>> the_page = response.read()

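2.3 Exception handling

When urlopen cannot handle a response, it raises URLError.

urllib2.URLError:

urllib2.HTTPError:

HTTPError is a subclass of URLError that is raised in the specific case of HTTP URLs.

URLError:

URLError is usually raised because there is no network connection (no route to the given server) or because the server does not exist. In that case the exception carries a reason attribute, typically a tuple of an error code and a text message.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2

# The host name has one m too many (comm), so the lookup fails
req = urllib2.Request('http://www.51cto.comm')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print(e)
    print(e.reason)

Result: the script prints the exception and then its reason attribute; the exact text depends on the platform's resolver.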
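The same pattern carries over to Python 3, where the exceptions live in urllib.error. The sketch below is illustrative only: describe_fetch is a made-up helper, and the demonstration uses a URL with an unregistered scheme because urllib rejects it with URLError before any network access takes place.

```python
import urllib.error
import urllib.request


def describe_fetch(url, timeout=3):
    """Fetch url and summarize the outcome as a short string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok %d" % resp.getcode()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return "http error %d" % e.code
    except urllib.error.URLError as e:
        # No connection, unknown host, unsupported scheme, and so on.
        return "url error: %s" % e.reason


if __name__ == "__main__":
    # No handler is registered for this scheme, so URLError is raised locally.
    print(describe_fetch("nosuchscheme://example/"))
```

Catching HTTPError before URLError matters: with the order reversed, the URLError branch would swallow HTTP errors and the status code would never be reported.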

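This series on Python's web modules covers the commonly used ones: urllib, urllib2, httplib, urlparse and requests. We start with the first module.

1 Introduction to urllib

Python's urllib module fetches page data from a given URL so that it can then be analysed and processed to extract the data we want.

2 Common functions

2.1 urlopen -- creates a file-like object for reading from the given URL

help(urllib.urlopen)

urlopen(url, data=None, proxies=None)

Create a file-like object for the specified URL to read from.

Parameters:

url: the location of the remote data, usually an http or ftp URL.

data: the data to submit to the URL; when it is given, the request is sent as a POST.

proxies: the proxy settings.

Python fetches HTML with urlopen, which returns a file-like object offering these commonly used methods:

1) read(), readline(), readlines(), fileno() and close(): these work exactly as they do on a file object.

2) info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server.

3) getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found.

4) geturl(): returns the URL that was retrieved.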

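Example:

>>> import urllib
>>> response = urllib.urlopen('http://www.51cto.com')
>>> response.read()
...... (the full page source)
>>> # read() consumes the whole body, so reopen the URL before reading line by line
>>> response = urllib.urlopen('http://www.51cto.com')
>>> response.readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n'
>>> response.readlines()
... (the remaining source as a list of lines)
>>> response.info()
<httplib.HTTPMessage instance at 0x1a02638>
>>> response.getcode()   # HTTP status code
200
>>> response.geturl()    # URL that was retrieved
'http://www.51cto.com'
>>> response.close()     # do not forget to close the connection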

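urllib also provides helper functions for encoding and decoding URLs. Some characters cannot appear in a URL as they are, because they have a special meaning. When data is submitted with GET, key=value pairs are appended to the URL, so a literal '=' inside a value would be ambiguous and has to be encoded; the server then decodes the parameters back into the original data. These helpers are useful for that:

Other functions (mainly URL encoding and decoding):

- urllib.quote(string[, safe]): encodes a string. The safe argument lists characters that should not be encoded

- urllib.unquote(string): decodes a string

- urllib.quote_plus(string[, safe]): like urllib.quote, but replaces ' ' with '+', whereas quote uses '%20'

- urllib.unquote_plus(string): decodes a string, turning '+' back into ' '

- urllib.urlencode(query[, doseq]): converts a dict, or a list of two-element tuples, into URL parameters. For example, {'name': 'wklken', 'pwd': '123'} becomes "name=wklken&pwd=123" (commonly used). Combined with urlopen, it covers both POST and GET requests

- urllib.pathname2url(path): converts a local path into a URL path

- urllib.url2pathname(path): converts a URL path into a local path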

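Example:

>>> import urllib
>>> res = urllib.quote('I am 51cto')
>>> res
'I%20am%2051cto'
>>> urllib.unquote(res)
'I am 51cto'
>>> res = urllib.quote_plus('I am 51cto')
>>> res
'I+am+51cto'
>>> urllib.unquote_plus(res)
'I am 51cto'
>>> params = {'name': '51cto', 'pwd': '51cto'}
>>> urllib.urlencode(params)
'pwd=51cto&name=51cto'
>>> # Use a raw string: in a normal string literal, '\51' is the octal escape for ')'
>>> l2u = urllib.pathname2url(r'E:\51cto')
>>> l2u   # output on a POSIX system, where the path is simply percent-encoded; Windows gives a different result
'E%3A%5C51cto'
>>> urllib.url2pathname(l2u)
'E:\\51cto'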
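In Python 3 these helpers moved to urllib.parse. The short sketch below shows why a value containing '=' or '&' has to be encoded before it goes into a query string; the sample value is made up for the illustration.

```python
from urllib.parse import quote, quote_plus, unquote_plus, urlencode

value = "a=b&c d"   # contains '=', '&' and a space, all special in a query string

print(quote(value))        # a%3Db%26c%20d
print(quote_plus(value))   # a%3Db%26c+d
print(unquote_plus(quote_plus(value)) == value)   # True: the round trip is lossless

# urlencode applies quote_plus to every key and value.
print(urlencode({"q": value, "page": 2}))   # q=a%3Db%26c+d&page=2
```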

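2.2 urlretrieve -- downloads remote data straight to a local file

help(urllib.urlretrieve)

urlretrieve(url, filename=None, reporthook=None, data=None)

Parameters:

url: the URL to download.

filename: the local path to save to. If it is omitted, urllib creates a temporary file for the data.

reporthook: a callback that is invoked once when the connection to the server is established and again after each block of data is transferred. It can be used to display download progress.

data: the data to POST to the server.

The function returns a two-element tuple (filename, headers).

Here is an example that downloads a file with urlretrieve and shows the progress: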

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import os

def schedule(a, b, c):
    ''' Progress callback
    @a: number of blocks downloaded so far
    @b: size of one block
    @c: size of the remote file
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print("%.2f%%" % per)

url = 'http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2'
# 'c:\\' rather than 'c:' so that the file lands in the root of the drive
local = os.path.join('c:\\', 'Python-2.7.5.tar.bz2')
urllib.urlretrieve(url, local, schedule)
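A Python 3 version of the progress callback might look like the sketch below, which uses urllib.request.urlretrieve. To stay runnable offline it copies a temporary file through a file:// URL instead of downloading from a server; percent_done, download_with_progress and demo are names invented for this illustration.

```python
import tempfile
import urllib.request
from pathlib import Path


def percent_done(block_num, block_size, total_size):
    """Convert the reporthook arguments into a percentage capped at 100."""
    if total_size <= 0:
        return None  # the server did not report a size
    return min(100.0, 100.0 * block_num * block_size / total_size)


def download_with_progress(url, dest):
    """Retrieve url into dest and return the list of progress values seen."""
    seen = []

    def hook(block_num, block_size, total_size):
        seen.append(percent_done(block_num, block_size, total_size))

    urllib.request.urlretrieve(url, dest, reporthook=hook)
    return seen


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.bin"
        src.write_bytes(b"x" * 20000)
        dest = Path(tmp) / "copy.bin"
        progress = download_with_progress(src.as_uri(), str(dest))
        return progress, dest.stat().st_size


if __name__ == "__main__":
    print(demo())
```

The cap at 100 is needed because the hook receives whole block counts, so the last block usually overshoots the real file size.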

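2.3 urlcleanup -- clears the cache left behind by urllib.urlretrieve()

As the exercises above show, urlopen makes it easy to fetch a remote HTML page. The data of interest can then be matched with regular expressions, and urlretrieve can download it to disk. For remote URLs with restricted access or a connection limit, connect through proxies. If the remote data is large and a single-threaded download is too slow, download with multiple threads. This is what is commonly called a crawler.

The first two functions described above are the most commonly used in urllib. Internally they use the URLopener or FancyURLopener class when fetching remote data. As users of urllib we rarely need these two classes directly, but they are worth studying if you are interested in how urllib is implemented or want it to support more protocols.

urllib2 is a module bundled with Python 2. It offers simple request functions as well as more involved HTTP authentication and HTTP proxy support. This part introduces a few basic ways of making HTTP requests.

urllib2's urlopen function only needs a URL to fetch it, but on its own it does not handle authentication or proxies, so the Request class and openers of urllib2 are introduced afterwards.

urllib2.urlopen

urllib2.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, cafile=None, capath=None, cadefault=False, context=None)

The following makes an HTTP request with urllib2 and reads the HTTP status code: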

In [1]: import urllib2

In [2]: url = 'http://test.nginxs.net/index.html'

In [3]: res = urllib2.urlopen(url)

# read the response body

In [5]: print res.read()

this is index.html

# get the HTTP status code

In [6]: print res.code

200

# get the HTTP response headers

In [7]: print res.headers

Server: nginxsweb

Date: Sat, 07 Jan 2017 02:42:10 GMT

Content-Type: text/html

Content-Length: 19

Last-Modified: Sat, 07 Jan 2017 01:04:12 GMT

Connection: close

ETag: "58703e8c-13"

Accept-Ranges: bytes

The urllib2 Request class

(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

Logging in to Zabbix by sending data in an HTTP request (not the API login)

import urllib
import urllib2

url = 'http://192.168.198.116/index.php'
# the Zabbix account, password and the button that is clicked
data = {'name': 'admin',
        'password': 'zabbix',
        'enter': 'Sign in'}
# encode the data above
data = urllib.urlencode(data)
# create the Request instance
req = urllib2.Request(url, data)
# send the request
response = urllib2.urlopen(req)
the_page = response.read()
print(the_page)

urllib2.HTTPBasicAuthHandler

Logging in as a user protected by nginx basic authentication

# create an auth_handler instance, the class that performs HTTP authentication
auth_handler = urllib2.HTTPBasicAuthHandler()
top_level_url = 'http://test.nginxs.net/limit/views.html'
username = 'admin'
password = 'nginxs.net'
# register the credentials with the auth_handler instance
auth_handler.add_password(realm='Nginxs.net login',uri='
# build an opener object; it keeps the three default handlers for http, https and ftp
opener = urllib2.build_opener(auth_handler)
# install the opener as the global object used by urlopen()
urllib2.install_opener(opener)
res = urllib2.urlopen(top_level_url)
print(res.read())

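urllib2.ProxyHandler

Accessing a site through a proxy

url = 'http://new.nginxs.net/ip.php'
# create a proxy handler instance named proxy
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:1080'})
# build an opener that can handle http, https, ftp and so on; a handler passed in here
# replaces the default handler of the same kind
opener = urllib2.build_opener(proxy)
# open the URL with the opener's open method
res = opener.open(url)
# show the result
print(res.read())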

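Python crawlers mainly use the urllib modules. In Python 2.x that means urllib2, and most blog examples use it. I am using Python 3.3.2, where the documentation has no urllib2 module and importing it fails with a module-not-found error, because the modules have been merged.

From Python 3 on, urllib2 no longer exists as a separate module (import urllib2 reports that there is no such module). It was merged into urllib as urllib.request and urllib.error.

The urllib package as a whole is divided into urllib.request, urllib.parse and urllib.error.

For example:

urllib2.urlopen() became urllib.request.urlopen()

urllib2.Request() became urllib.request.Request()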
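Code that has to run on both Python 2 and Python 3 often hides this rename behind a guarded import. The following is a minimal sketch of that idea; the selection of imported names is only an example.

```python
# Try the Python 3 locations first and fall back to the Python 2 modules.
try:
    from urllib.request import urlopen, Request
    from urllib.parse import urlencode, quote
    from urllib.error import URLError, HTTPError
except ImportError:  # Python 2
    from urllib2 import urlopen, Request, URLError, HTTPError
    from urllib import urlencode, quote

# From here on the rest of the program can use the names without caring
# which interpreter it runs under.
query = urlencode([("word", "python3")])
print(query)   # word=python3
```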

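Differences between the urllib and urllib2 modules

In Python 2, urllib and urllib2 are not interchangeable.

Overall, urllib2 is an enhancement of urllib, but urllib has functions that urllib2 lacks.

With urllib2 you can pass a Request object to urllib2.urlopen to modify the request headers. If you want to change the User-Agent when visiting a site (to pose as a browser), you need urllib2.

urllib has the encoding function urllib.urlencode. A simulated login usually has to POST encoded parameters, so to do it without third-party libraries you need urllib.

urllib and urllib2 are therefore generally used together.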

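1. urllib.urlopen(url[, data[, proxies]])

Opens a URL and returns a file-like object that supports the usual file operations. This example opens Baidu:

import urllib
f = urllib.urlopen('http://www.baidu.com')
firstLine = f.readline()   # read the first line of the HTML page
print(firstLine)

The object returned by urlopen provides:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object

info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server

getcode(): returns the HTTP status code. For an HTTP request, 200 means success and 404 means the URL was not found

geturl(): returns the URL that was retrieved

2. urllib.urlretrieve(url[, filename[, reporthook[, data]]])

urlretrieve downloads the HTML file that the URL points to onto your local disk. If filename is not given, the data is stored in a temporary file.

urlretrieve() returns a two-tuple (filename, mime_hdrs).

Saving to a temporary file:

# To choose the destination, use: result = urllib.urlretrieve('http://www.baidu.com', filename='/path/baidu.html')
import urllib
result = urllib.urlretrieve('http://www.baidu.com')
type(result)   # a tuple
result[0]      # path of the temporary file
result[1]      # response headers

3. urllib.urlcleanup()

Clears the cache left behind by urllib.urlretrieve().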

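4. urllib.quote(url) and urllib.quote_plus(url)

Percent-encode a string so that it can be used inside a URL, printed safely and accepted by a web server.

# Escape special characters (including Chinese text) as %xx sequences
urllib.quote('http://www.baidu.com')
urllib.quote_plus('http://www.baidu.com')

5. urllib.unquote(url) and urllib.unquote_plus(url)

The inverse of the functions in 4.

6. urllib.urlencode(query)

Joins key/value pairs with & into a query string.

Converts a dict, or a list of two-element tuples, into URL parameters. For example, {'name': 'dark-bull', 'age': 200} becomes "name=dark-bull&age=200".

Combined with urlopen, this covers both GET and POST:

GET (Python 2):

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
# params == 'eggs=2&bacon=0&spam=1'
f = urllib.urlopen("http://python.org/query?%s" % params)
print(f.read())

POST (Python 3):

import urllib.parse
import urllib.request
url = 'http://python.org/query'   # placeholder; the original snippet does not define url
params = {'spam': 1, 'eggs': 2, 'bacon': 0}
data = urllib.parse.urlencode(params).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Content-Type', "application/x-www-form-urlencoded")
response = urllib.request.urlopen(req)
the_page = response.read().decode('utf-8')
print(the_page)

Without the encode call, the request fails with: POST data should be bytes or an iterable of bytes. It cannot be of type str.

Without the decode call, the result is a bytes object, so non-ASCII characters show up as escape sequences instead of readable text.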
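The bytes requirement can be checked without a server, because urllib.request rejects a str body while preparing the request, before it opens any connection. The sketch below assumes Python 3; str_body_is_rejected is a made-up helper and http://localhost/ is only a placeholder URL that is never contacted in the failing case.

```python
import urllib.error
import urllib.parse
import urllib.request

params = {"spam": 1, "eggs": 2, "bacon": 0}
encoded = urllib.parse.urlencode(params)   # str: 'spam=1&eggs=2&bacon=0'
body = encoded.encode("utf-8")             # bytes: what a POST body must be


def str_body_is_rejected(url, text_body):
    """Return True if urlopen refuses a str body with TypeError."""
    try:
        urllib.request.urlopen(urllib.request.Request(url, data=text_body), timeout=3)
    except TypeError:
        return True    # raised while the request is prepared
    except urllib.error.URLError:
        return False   # the request got as far as the network
    return False


if __name__ == "__main__":
    print(type(encoded).__name__, type(body).__name__)          # str bytes
    print(str_body_is_rejected("http://localhost/", encoded))   # True
```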

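The urllib module:

urllib.urlopen(url[, data[, proxies]]) opens a URL and returns a file-like object that supports the usual file operations

The object returned by urlopen provides:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object

info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server

getcode(): returns the HTTP status code. For an HTTP request, 200 means success and 404 means the URL was not found

geturl(): returns the URL that was retrieved

urllib.urlencode() joins key/value pairs with &. urllib has no matching urldecode(). Note: the argument must be a dict or a sequence of two-element tuples

For example: urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})

Result: eggs=2&bacon=0&spam=1

urllib.quote(url) and urllib.quote_plus(url) percent-encode a string so that it can be used inside a URL, printed safely and accepted by a web server

For example:

print urllib.quote('http://www.baidu.com')

print urllib.quote_plus('http://www.baidu.com')

The results are, respectively:

http%3A//www.baidu.com

http%3A%2F%2Fwww.baidu.com

urllib.unquote(url) and urllib.unquote_plus(url) do the reverse

The urllib2 module:

Requesting a URL directly:

urllib2.urlopen(url, data=None) fetches data by sending a request to the given URL

Building a request object first and then sending it:

urllib2.Request(url, data=None, headers={}, origin_req_host=None) builds the request; the returned req is a fully constructed request

urllib2.urlopen(req, data=None) sends the request req built above and returns a file-like object, response, that contains everything sent back

response.read() reads the HTML in the response

response.info() reads the additional response headers

Main differences:

urllib2 accepts a Request instance, which lets you set the headers of a URL request; urllib accepts only a URL. This means you cannot fake your User-Agent string (pose as a browser) with the urllib module.

urllib provides urlencode for building GET query strings, and urllib2 does not. This is why urllib and urllib2 are often used together.

The advantage of urllib2 is that urllib2.urlopen accepts a Request object as its argument, which gives control over the headers of the HTTP request.

However, urllib.urlretrieve and the quote and unquote family such as urllib.quote were not added to urllib2, so urllib is still needed at times.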

Example:

import sys
import urllib
import urllib2

murl = "http://zhpfbk.blog.51cto.com/"
UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2896.3 Safari/537.36"
### request parameters, as a dict
value = {'value1': 'tkk', 'value2': 'abcd'}
### URL-encode the parameters
data = urllib.urlencode(value)
### HTTP headers, as a dict
header = {'User-Agent': UserAgent}
### build the request object
req = urllib2.Request(murl, data, header)
print(req.get_method())
### send the request and handle errors
try:
    ### send the request
    resp = urllib2.urlopen(req)
### HTTPError must be caught before URLError; it carries two attributes, code and reason
except urllib2.HTTPError as e:
    print("HTTPError Code: %s" % e.code)
    sys.exit(2)
### URLError carries one attribute: reason
except urllib2.URLError as e:
    print("Reason: %s" % e.reason)
    sys.exit(2)
else:
    ### URL that was retrieved
    print(resp.geturl())
    ### read 50 bytes of data
    print(resp.read(50))
    ### status code
    print(resp.getcode())
    ### response headers sent by the server
    print(resp.info())
    ### percent-encode a URL, giving http%3A%2F%2Fwww.baidu.com
    print(urllib.quote_plus('http://www.baidu.com'))
    print("No problem.")

import urllib2

response = urllib2.urlopen('http://python.org/')

html = response.read()

import urllib2

req = urllib2.Request('http://www.pythontab.com')

response = urllib2.urlopen(req)

the_page = response.read()

import urllib

import urllib2

url = 'http://www.pythontab.com'

values = {'name' : 'Michael Foord',

'location' : 'pythontab',

'language' : 'Python' }

data = urllib.urlencode(values)

req = urllib2.Request(url, data)

response = urllib2.urlopen(req)

the_page = response.read()

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'pythontab'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=pythontab
>>> url = 'http://www.pythontab.com'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)

import urllib

import urllib2

url = 'http://www.pythontab.com'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

values = {'name' : 'Michael Foord',

'location' : 'pythontab',

'language' : 'Python' }

headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)

req = urllib2.Request(url, data, headers)

response = urllib2.urlopen(req)

the_page = response.read()

1. Getting URL parameters.

>>> from urllib import parse

>>> url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'

>>> parseResult = parse.urlparse(url)

>>> parseResult

ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')

>>> param_dict = parse.parse_qs(parseResult.query)

>>> param_dict

{'q': ['parse'], 'check_keywords': ['yes'], 'area': ['default']}

>>> q = param_dict['q'][0]

>>> q

'parse'

# Note: a plus sign is decoded to a space, which is not always what we want

>>> parse.parse_qs('proxy=183.222.102.178:8080&task=XXXXX|5-3+2')

{'proxy': ['183.222.102.178:8080'], 'task': ['XXXXX|5-3 2']}

2. parse_qs/parse_qsl

>>> from urllib import parse
>>> parse.parse_qs('action=addblog&job=modify&tid=1766670')
{'tid': ['1766670'], 'action': ['addblog'], 'job': ['modify']}   # values are lists, unlike the third call
>>> parse.parse_qsl('action=addblog&job=modify&tid=1766670')
[('action', 'addblog'), ('job', 'modify'), ('tid', '1766670')]
>>> dict(parse.parse_qsl('action=addblog&job=modify&tid=1766670'))   # values are plain strings, unlike the first call
{'tid': '1766670', 'action': 'addblog', 'job': 'modify'}

3. urlencode

>>> from urllib import parse
>>> query = {
...     'name': 'walker',
...     'age': 99,
... }
>>> parse.urlencode(query)
'name=walker&age=99'

4. quote/quote_plus

>>> from urllib import parse
>>> parse.quote('a&b/c')        # the slash is not encoded
'a%26b/c'
>>> parse.quote_plus('a&b/c')   # the slash is encoded
'a%26b%2Fc'

5. unquote/unquote_plus

>>> from urllib import parse
>>> parse.unquote('1+2')        # the plus sign is not decoded
'1+2'
>>> parse.unquote_plus('1+2')   # the plus sign is decoded to a space
'1 2'

Here is a simple code example:

#encoding:UTF-8
import urllib.request

def getdata():
    url = "http://www.baidu.com"
    data = urllib.request.urlopen(url).read()
    print(data)

getdata()

To display Chinese text correctly, decode the bytes:

#encoding:UTF-8
import urllib.request

def getdata():
    url = "http://www.baidu.com"
    data = urllib.request.urlopen(url).read()
    z_data = data.decode('UTF-8')
    print(z_data)

getdata()

Fetching data with GET using urllib in Python 3

GET example (Baidu search)

#encoding:UTF-8
import urllib.parse
import urllib.request

# query parameters
data = {}
data['word'] = 'python3'
# Python 2.x used urllib.urlencode here
url_values = urllib.parse.urlencode(data)
print(url_values)
url = "http://www.baidu.com/s?"
full_url = url + url_values
data = urllib.request.urlopen(full_url).read()
z_data = data.decode('UTF-8')
print(z_data)

Network requests with urllib.request in Python 3

Basic request examples

import urllib.request

# request the Baidu home page
resu = urllib.request.urlopen('http://www.baidu.com', data=None, timeout=10)
print(resu.read(300))

# decode with an explicit encoding; it must match the encoding the page is actually served in
with urllib.request.urlopen('http://www.baidu.com') as resu:
    print(resu.read(300).decode('GBK'))

# decode with an explicit encoding
f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(100).decode('utf-8'))

Sending data to be processed by a CGI program

>>> import urllib.request
>>> req = urllib.request.Request(url='https://localhost/cgi-bin/test.cgi',
...                              data=b'This data is passed to stdin of the CGI')
>>> f = urllib.request.urlopen(req)
>>> print(f.read().decode('utf-8'))
Got Data: "This data is passed to stdin of the CGI"

PUT request

import urllib.request

DATA=b'some data'

req = urllib.request.Request(url='http://localhost:8080', data=DATA,method='PUT')

f = urllib.request.urlopen(req)

print(f.status)

print(f.reason)

Basic HTTP authentication for a login request

import urllib.request

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')

opener = urllib.request.build_opener(auth_handler)

# ...and install it globally so it can be used with urlopen.

urllib.request.install_opener(opener)

urllib.request.urlopen('http://www.example.com/login.html')

Requests through a proxy that requires authentication

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})

proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()

proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)

# This time, rather than install the OpenerDirector, we use it directly:

opener.open('http://www.example.com/login.html')

Adding HTTP headers

import urllib.request

req = urllib.request.Request('http://www.example.com/')

req.add_header('Referer', 'http://www.python.org/')

r = urllib.request.urlopen(req)

Setting the User-Agent

import urllib.request

opener = urllib.request.build_opener()

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

opener.open('http://www.example.com/')

GET request with parameters

import urllib.request

import urllib.parse

params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})

f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)

print(f.read().decode('utf-8'))

POST request with parameters

import urllib.request

import urllib.parse

data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})

data = data.encode('utf-8')

request = urllib.request.Request("http://requestb.in/xrbl82xr")

# adding charset parameter to the Content-Type header.

request.add_header("Content-Type","application/x-www-form-urlencoded;charset=utf-8")

f = urllib.request.urlopen(request, data)

print(f.read().decode('utf-8'))

Request through an explicitly specified proxy

import urllib.request

# FancyURLopener has been deprecated since Python 3.3; ProxyHandler with build_opener is the current approach
proxies = {'http': 'http://proxy.example.com:8080/'}
opener = urllib.request.FancyURLopener(proxies)
f = opener.open("http://www.python.org")
f.read().decode('utf-8')

Request without any proxy

import urllib.request

opener = urllib.request.FancyURLopener({})
f = opener.open("http://www.python.org/")
f.read().decode('utf-8')

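Using urllib in Python 3 in detail (headers, proxies, timeouts, authentication, exception handling)

urllib is Python's module for fetching URLs (Uniform Resource Locators). It can be used to fetch remote data and save it. This part collects ways of handling headers, proxies, timeouts, authentication and exceptions with urllib.

N ways of fetching web resources in Python 3

1. The simplest way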

import urllib.request

response = urllib.request.urlopen('http://python.org/')

html = response.read()

2. Using Request

import urllib.request

req = urllib.request.Request('http://python.org/')

response = urllib.request.urlopen(req)

the_page = response.read()

3. Sending data

#! /usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}
# In Python 3 the POST body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

4. Sending data and headers

#! /usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}
headers = {'User-Agent': user_agent}
# In Python 3 the POST body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

5. HTTP errors

#! /usr/bin/env python3
import urllib.error
import urllib.request

req = urllib.request.Request('http://www.111cn.net')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode("utf8"))

8. HTTP authentication

#! /usr/bin/env python3
import urllib.request

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://www.111cn.net/"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = "https://www.111cn.net/"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)

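9. Using a proxy

#! /usr/bin/env python3
import urllib.request

# ProxyHandler maps URL schemes to HTTP proxies. The original key 'sock5' matches no
# URL scheme and urllib has no SOCKS support, so an HTTP proxy on the same port is assumed here.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("http://www.111cn.net").read().decode("utf8")
print(a)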

10. Timeouts

#! /usr/bin/env python3
import socket
import urllib.request

# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.111cn.net/')
a = urllib.request.urlopen(req).read()
print(a)

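① Downloading something over HTTP in Python is very simple; in fact it takes one line of code. The urllib.request module has a convenient function, urlopen(), which takes the address of the page you want and returns a file-like object. Calling its read() method gives you the full content of the page.

② urlopen().read() always returns a bytes object, not a string. Bytes are just bytes, while characters are an abstraction, and HTTP servers do not deal in abstractions. If you request a resource, you get bytes. If you need a string, you have to determine the character encoding and convert the bytes to a string explicitly.

With BeautifulSoup's find_all method, find every a tag whose href attribute contains http and return the content of those href attributes. These are the first-level links of the page (links are not followed any deeper here).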

1.1 Importing the module

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup        # process html
from BeautifulSoup import BeautifulStoneSoup   # process xml
import BeautifulSoup                           # all

1.2 Creating the object

doc = ['<html><head>hello<title>PythonClub.org</title></head>',
       '<body>This is paragraph one of pythonclub.org.',
       'This is paragraph two of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

Specifying the encoding:

htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)

2. Getting a tag (the following three ways are equivalent)

head = soup.find('head')
head = soup.head
head = soup.contents[0].contents[0]
print head

3. Getting related nodes

3.1 Using parent to get the parent node

body = soup.body
html = body.parent   # html is the parent of body

3.2 Using nextSibling and previousSibling to get the neighbouring siblings

head = body.previousSibling   # head and body are on the same level; head is the previous sibling of body
p1 = body.contents[0]         # p1 and p2 are children of body; contents[0] gives p1
p2 = p1.nextSibling           # p2 is on the same level as p1 and is its next sibling; body.contents[1] gives it too

4. find/findAll in detail

Signature: find(name=None, attrs={}, recursive=True, text=None, **kwargs). findAll returns all matching results as a list.

The first argument is the tag name and the second is the attributes. The third chooses whether to search recursively, and text matches on content. limit restricts the number of results. **kwargs passes attribute filters as keyword arguments.

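4.1 Searching by tag

find(tagname)          # search for the tag named tagname, e.g. find('head')
find(list)             # search for tags in the list, e.g. find(['head', 'body'])
find(dict)             # search for tags in the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))   # search for tags matching a regex, e.g. find(re.compile('^p')) finds tags starting with p
find(lambda)           # search for tags for which the function returns true, e.g. find(lambda tag: len(tag.name) == 1) finds tags with one-letter names
find(True)             # search for all tags; string nodes are not returned

findAll(name, attrs, recursive, text, limit, **kwargs)

Example:

a = urllib2.urlopen('http://www.baidu.com')
b = a.read().decode('utf-8')
soup = BeautifulSoup.BeautifulStoneSoup(b, convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES).html.head.findAll('link')
for i in soup:
    print i['href']

4.2 Searching by attrs

find(id='xxx')                                               # find the tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})        # find the tag whose id matches the regex and whose align is xxx
find(attrs={'id': True, 'align': None})                      # find the tag that has an id attribute but no align attribute

4.3 Searching by text

Searching by text makes the other search values, such as tag and attrs, ineffective. It is used the same way as the tag search.

print p1.text
# u'This is paragraphone.'
print p2.text
# u'This is paragraphtwo.'
# Note: 1. the text of a tag includes its own text and that of its descendants. 2. all text has already been converted to unicode; encode it yourself with encode(xxx) if needed

4.4 The recursive and limit arguments

recursive=False searches only direct children; otherwise the whole subtree is searched. The default is True.

With findAll, or any similar method that returns a list, limit restricts the number of results. For example findAll('p', limit=2) returns the first two tags found.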

#!/opt/yrd_soft/bin/python
# Print the href of every <a> tag whose href starts with "http:".
import re
import requests
from bs4 import BeautifulSoup   # the 'lxml' parser used below requires the lxml package

url = 'http://www.baidu.com'
# page = urllib2.urlopen(url)   # Python 2 standard-library alternative
page = requests.get(url).text
pagesoup = BeautifulSoup(page, 'lxml')
for link in pagesoup.find_all(name='a', attrs={"href": re.compile(r'^http:')}):
    # print(type(link))
    print(link.get('href'))

#encoding=utf-8
#author: walker
#date: 2014-11-26
#summary: use BeautifulSoup to get URLs and their text (Python 2)
import sys, re, requests, urllib
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf8')

# Given a keyword, get the Baidu search results
def GetList(keyword):
    keyword = unicode(keyword, 'gb18030')
    dic = {'wd': keyword}
    urlwd = urllib.urlencode(dic)
    print(urlwd)
    sn = requests.Session()
    url = 'http://www.baidu.com/s?ie=utf-8&csq=1&pstg=22&mod=2&isbd=1&cqid=9c0f47b700036f17&istc=8560&ver=0ApvSgUI_ODaje7cp4DVye9X2LZqWiCPEIS&chk=54753dd5&isid=BD651248E4C31919&'
    url += urlwd
    url += '&ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&rsv_pq=b05765d70003b6c0&rsv_t=ce54Z5LOdER%2Fagxs%2FORKVsCT6cE0zvMTaYpqpgprhExMhsqDACiVefXOze4&_ck=145469.1.129.57.22.735.37'
    r = sn.get(url=url)
    soup = BeautifulSoup(r.content, 'html.parser')   # r.text is likely to garble Chinese text
    rtn = soup.find('div', id='content_left').find_all(name='a', href=re.compile('baidu.com'))
    for item in rtn:
        print(item.getText().encode('gb18030'))
        print(item['href'])

if __name__ == '__main__':
    keyword = '正则表达式'
    GetList(keyword)
