mechanize (1)
Published: 2019-06-09


While reading up on web crawlers and simulated logins recently, I came across this package.

 

mechanize ['mekə.naɪz] literally means "to mechanize", and the name fits: the package is all about automation.

mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

  • any URL can be opened, not just http:

  • mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().

  • Easy HTML form filling.

  • Convenient link parsing and following.

  • Browser history (.back() and .reload() methods).

  • The Referer HTTP header is added properly (optional).

  • Automatic observance of robots.txt.

  • Automatic handling of HTTP-Equiv and Refresh.

In other words, mechanize.Browser and mechanize.UserAgentBase implement the urllib2.OpenerDirector interface, so they can open URLs of any supported scheme, not only HTTP.
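As a rough illustration of the "any URL, not just http" point, the sketch below uses Python 3's urllib.request (the successor to urllib2) rather than mechanize itself. The data: URL is only an assumed example, chosen so that nothing touches the network.

```python
import urllib.request

# build_opener() returns an OpenerDirector with the default handlers
# (http, ftp, file, data, ...) already installed.
opener = urllib.request.build_opener()

# A data: URL is resolved entirely in-process; no network access is needed.
response = opener.open("data:text/plain,hello%20mechanize")
body = response.read()
print(body)
```

The same open() call accepts http:, ftp: or file: URLs; the opener dispatches on the scheme to whichever handler supports it.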

They also offer a simpler way to configure things, with no need to create a new OpenerDirector every time.
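To see what this saves, here is a minimal sketch of the plain urllib way, again using Python 3's urllib.request in place of urllib2. Each change of policy (cookies on, proxies off) means assembling another opener. The helper handler_names is hypothetical and exists only to inspect the result.

```python
import http.cookiejar
import urllib.request

# Policy 1: keep cookies. This needs an opener built with a cookie processor.
cookie_jar = http.cookiejar.CookieJar()
with_cookies = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Policy 2: no proxies (an empty mapping disables proxying).
# This needs a second, separate opener.
no_proxy = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def handler_names(opener):
    """Hypothetical helper: class names of the handlers installed on an opener."""
    return sorted(type(h).__name__ for h in opener.handlers)


print(handler_names(with_cookies))
print(handler_names(no_proxy))
```

With mechanize, the corresponding changes are made on one existing Browser object through setters such as set_cookiejar() and set_proxies().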

On top of that come form handling, link handling, browsing history and reload, Refresh handling, observance of robots.txt, and so on.

    import re
    import mechanize

    # (1) Instantiate a browser object
    br = mechanize.Browser()
    # (2) Open a URL
    br.open("http://www.example.com/")
    # (3) Follow the second link on the page whose text matches text_regex
    # follow second link with element text matching regular expression
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    assert br.viewing_html()
    # (4) Page title
    print(br.title())
    # (5) Print the URL of the page
    print(response1.geturl())
    # (6) Response headers
    print(response1.info())  # headers
    # (7) Response body
    print(response1.read())  # body
    # (8) Select the form whose name is "order"
    br.select_form(name="order")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm.
    # (9) Assign a value to the control named "cheeses" in the selected form
    br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
    # Submit current form.  Browser calls .close() on the current response on
    # navigation, so this closes response1
    # (10) Submit
    response2 = br.submit()
    # print currently selected form (don't call .submit() on this, use br.submit())
    print(br.form)
    # (11) Go back
    response3 = br.back()  # back to cheese shop (same data as response1)
    # the history mechanism returns cached response objects
    # we can still use the response, even though it was .close()d
    response3.get_data()  # like .seek(0) followed by .read()
    # (12) Reload the page
    response4 = br.reload()  # fetches from server
    # (13) List all forms on the page
    for form in br.forms():
        print(form)
    # .links() optionally accepts the keyword args of .follow_/.find_link()
    for link in br.links(url_regex="python.org"):
        print(link)
        br.follow_link(link)  # takes EITHER Link instance OR keyword args
        br.back()

This is an example from the documentation; the basic explanations are given alongside the code.
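The nr=1 argument in step (3) is zero-based, which is why it selects the second matching link. The sketch below imitates that selection rule with the standard library only. It is not mechanize's implementation: LinkCollector, pick_link and the SAMPLE markup are all hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in document order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def pick_link(html, text_regex, nr=0):
    """Return the nr-th (0-based) link whose text matches text_regex."""
    parser = LinkCollector()
    parser.feed(html)
    matches = [link for link in parser.links if re.search(text_regex, link[1])]
    return matches[nr]


SAMPLE = (
    '<a href="/a">cheese shop</a>'
    '<a href="/b">spam</a>'
    '<a href="/c">cheese  shop annex</a>'
)
second = pick_link(SAMPLE, r"cheese\s*shop", nr=1)
print(second)
```

Here nr=0 would pick the link to /a, and nr=1 skips the non-matching /b and lands on /c.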

You may control the browser’s policy by using the methods of mechanize.Browser’s base class, mechanize.UserAgent. For example:

Through the mechanize.UserAgent class we can control the browser's policy. The code below is again an example taken from the documentation:

    import logging
    import sys
    import mechanize

    br = mechanize.Browser()
    # Explicitly configure proxies (Browser will attempt to set good defaults).
    # Note the userinfo ("joe:password@") and port number (":3128") are optional.
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
                    "ftp": "proxy.example.com",
                    })
    # Add HTTP Basic/Digest auth username and password for HTTP proxy access.
    # (equivalent to using "joe:password@..." form above)
    br.add_proxy_password("joe", "password")
    # Add HTTP Basic/Digest auth username and password for website access.
    br.add_password("http://example.com/protected/", "joe", "password")
    # Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
    br.set_handle_equiv(False)
    # Ignore robots.txt.  Do not do this without thought and consideration.
    br.set_handle_robots(False)
    # Don't add Referer (sic) header
    br.set_handle_referer(False)
    # Don't handle Refresh redirections
    br.set_handle_refresh(False)
    # Don't handle cookies (set_cookiejar takes one argument; None turns handling off)
    br.set_cookiejar(None)
    # Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
    # default: no need to do this unless you have some reason to use a
    # particular cookiejar)
    cj = mechanize.CookieJar()
    br.set_cookiejar(cj)
    # Log information about HTTP redirects and Refreshes.
    br.set_debug_redirects(True)
    # Log HTTP response bodies (ie. the HTML, most of the time).
    br.set_debug_responses(True)
    # Print HTTP headers.
    br.set_debug_http(True)
    # To make sure you're seeing all debug output:
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)
    # Sometimes it's useful to process bad headers or bad HTML:
    response = br.response()  # this is a copy of response
    headers = response.info()  # currently, this is a mimetools.Message
    headers["Content-type"] = "text/html; charset=utf-8"
    response.set_data(response.get_data().replace("
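The three logging lines near the end of that example are plain standard-library logging. The sketch below shows their effect in isolation, without mechanize installed. An in-memory buffer stands in for sys.stdout, and the two log messages are made up for illustration.

```python
import io
import logging

# Stand-in for sys.stdout so the output can be inspected afterwards.
buffer = io.StringIO()

logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(buffer))
logger.setLevel(logging.INFO)

logger.info("redirect followed")   # at INFO level: emitted
logger.debug("low-level detail")   # below INFO: dropped by the logger

captured = buffer.getvalue()
print(captured)
```

Without the handler, records logged under the "mechanize" name would have nowhere to go, and without setLevel the default WARNING threshold would filter out INFO messages.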

There are also other web-interaction modules similar to mechanize:

There are several wrappers around mechanize designed for functional testing of web applications:

In the end they are all wrappers around urllib2, so just pick whichever module you find easiest to use.

Reposted from: https://www.cnblogs.com/CBDoctor/p/3855738.html
