Scraping Baidu Baike Keywords

March 15, 2019

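With the basics covered earlier, crawling simple static pages is no longer a problem. This project is a crawler for Baidu Baike pages: it fetches a page, extracts its entry, and prints it. Each page links to many entries, so the crawler picks one at random from each page and keeps going from there. The work falls into three steps: fetching the page, parsing the page data, and outputting the wanted results. The key modules are the URL manager, the HTML downloader, and the HTML parser.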

GitHub repository

BaiDuBaiKeSpider

A crawler for Baidu Baike pages
Currently supports Python 2 and Python 3

Usage

1. Command line

cd baike_spider_3.7
python spider_main.py

2. PyCharm

Open baike_spider
Run 'spider_main'

Problems encountered

1. URL format

Old URL format
https://baike.baidu.com/view/10812319.htm
Current URL format
https://baike.baidu.com/item/python/407313

In html_parser.py, change

links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))

to

links = soup.find_all('a', href=re.compile(r"/item/"))
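To see what the change does, the old and new patterns can be tried against a few sample href values without touching the network. This is a minimal sketch using only the standard library. `filter_links` is a hypothetical helper, not a function from the repository, and the "/help" path is a made-up sample. It uses `pattern.search`, on the assumption that this mirrors how BeautifulSoup applies a compiled pattern to an attribute.

```python
import re
from urllib.parse import urljoin

OLD_PATTERN = re.compile(r"/view/\d+\.htm")  # old /view/<id>.htm links
NEW_PATTERN = re.compile(r"/item/")          # current /item/<name>/<id> links

def filter_links(page_url, hrefs, pattern):
    """Return absolute URLs for the hrefs that match pattern (hypothetical helper)."""
    return [urljoin(page_url, href) for href in hrefs if pattern.search(href)]

page_url = "https://baike.baidu.com/item/python/407313"
# "/help" is a made-up sample that neither pattern should match
hrefs = ["/view/10812319.htm", "/item/python/407313", "/help"]

old_links = filter_links(page_url, hrefs, OLD_PATTERN)
new_links = filter_links(page_url, hrefs, NEW_PATTERN)
print(old_links)
print(new_links)
```

Since current pages only carry /item/ links, a parser still using the old pattern finds nothing to follow and the crawl ends after the first page.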

2. HTTPS problem

Add import ssl to spider_main

3. The crawler stalls after a few pages

Modify html_downloader.py so that requests time out. If anything is unclear, see my code: https://gist.github.com/lzcdev/e215870dd3430eb184beb5015f0b319d

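import urllib2

def download(url):
    try:
        # timeout=10 stops one slow page from stalling the whole crawl
        response = urllib2.urlopen(url, timeout=10)
        if response.getcode() != 200:
            print('false')
            return None
        print('success')
        return response.read()
    except Exception:
        # any failure, including a timeout, ends up here
        print('timeout')
        return None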

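4. output.html is blank

Try deleting output.html and running again (assuming the program otherwise runs correctly). You can also lower the 1000 to a smaller number to see results sooner.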

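5. urllib.request.urlopen fails in Python 3

Modify html_downloader.py to import the error module and look at the specific error. One fix, for example, is to disable certificate verification; see html_downloader.py for details.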
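One possible shape for that change in Python 3 is sketched below. It is an assumption about how the downloader could look, not a copy of the repository's html_downloader.py: the function name `download` and the `verify` flag are hypothetical. It catches `urllib.error.HTTPError` and `urllib.error.URLError` separately so the actual cause gets printed, and it can optionally skip certificate verification through the private helper `ssl._create_unverified_context()`, which is only reasonable for experiments like this one.

```python
import ssl
import urllib.error
import urllib.request

def download(url, timeout=10, verify=True):
    """Return the page body as bytes, or None if the request fails."""
    if url is None:
        return None
    # verify=False disables certificate checks; acceptable for a toy crawler only
    context = None if verify else ssl._create_unverified_context()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            if response.getcode() != 200:
                print("unexpected status:", response.getcode())
                return None
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 403 or 404
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable answer: DNS failure, refused connection, certificate problem
        print("URL error:", e.reason)
    except OSError as e:
        # For example a socket timeout while reading the body
        print("I/O error:", e)
    return None
```

If the printed reason points to certificate verification, calling `download(url, verify=False)` is the quick workaround. `HTTPError` has to be caught before `URLError` because it is a subclass of it.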

Architecture

Scheduler -> URL manager -> page downloader -> page parser -> data
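The pipeline can be sketched as a single scheduler loop. This is an illustration under stated assumptions rather than the repository's spider_main.py: the function `crawl` and its parameters are hypothetical, and the downloader and parser are passed in as plain functions so the loop can be run against fake in-memory pages.

```python
import random

def crawl(root_url, download, parse, max_pages=1000):
    """Scheduler loop: URL manager -> downloader -> parser -> collected data.

    download(url) returns HTML or None; parse(url, html) returns
    (links, data). Both are supplied by the caller.
    """
    new_urls, old_urls = {root_url}, set()   # the URL manager is just two sets
    results = []
    while new_urls and len(results) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)
        if html is None:                     # failed download: skip this page
            continue
        links, data = parse(url, html)
        results.append(data)
        candidates = [link for link in links if link not in old_urls]
        if candidates:
            # follow one randomly chosen entry per page
            new_urls.add(random.choice(candidates))
    return results

# Fake three-page "site" so the loop can run without network access
pages = {"a": ["b"], "b": ["c"], "c": ["a"]}
titles = crawl("a", download=lambda url: url,
               parse=lambda url, html: (pages[url], url.upper()))
print(titles)
```

Keeping visited URLs in `old_urls` is what stops the loop when page "c" links back to "a".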

Concepts used

set(); alternatives: MySQL, Redis
urllib2
BeautifulSoup; alternatives: regular expressions, html.parser, lxml
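As an illustration of one of the listed alternatives, the standard library's `html.parser` module can extract entry links without any third-party package. A minimal sketch, assuming links of the /item/ form; the class name `ItemLinkParser` and the sample HTML are made up for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class ItemLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="/item/..."> links (hypothetical class)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()          # a set drops duplicate links automatically

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and re.search(r"/item/", href):
            self.links.add(urljoin(self.page_url, href))

# Made-up sample markup: one entry link, one unrelated link, one anchor without href
html = '<p><a href="/item/Guido">Guido</a> <a href="/help">help</a> <a>no href</a></p>'
parser = ItemLinkParser("https://baike.baidu.com/item/python/407313")
parser.feed(html)
print(parser.links)
```

BeautifulSoup remains more convenient once titles and summaries also have to be pulled out, since it builds a searchable tree instead of firing per-tag callbacks.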

Evan • March 15, 2019