Shandong Science ›› 2015, Vol. 28 ›› Issue (2): 101-107. doi: 10.3976/j.issn.1002-4026.2015.02.017

• Other Research Papers •

Design and implementation of a MapReduce-based web crawler

LI Chen, ZHU Shi-wei, ZHAO Yan-qing, YU Jun-feng

  1. Information Institute, Shandong Academy of Sciences, Jinan 250014, China
  • Received: 2015-01-21  Online: 2015-04-20  Published: 2015-04-20
  • About the first author: LI Chen (1988-), male, research assistant, M.S.; research interests include data mining and big data analytics. Email: jncqlc@163.com
  • Supported by: Youth Foundation of Shandong Academy of Sciences (2013QN036); Shandong Province Science and Technology Development Program (2013GGX10127; 2014GGX101013)

Abstract: To address the low efficiency and poor scalability of standalone crawlers, we design and implement a MapReduce-based web crawler system. The system stores crawled pages in HDFS and HBase and extracts page content using a line-block distribution function method. It then deduplicates pages with a strategy that combines URL analysis and content-similarity analysis, measuring the similarity of crawled pages with the Simhash algorithm. Experimental results show that the system achieves good performance and scalability, improving average crawling speed by a factor of 4.8 over a standalone crawler.
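
The paper's source code is not given here; the following is a minimal sketch of how the Simhash fingerprinting step could be expressed as a Hadoop Mapper, assuming page text has already been extracted and arrives as `url \t text` lines. The class and helper names (SimhashMapper, simhash64, fnv1a64) and the FNV-1a feature hash are illustrative assumptions, not the authors' implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch: input value = "url \t extracted page text" (assumed format). */
public class SimhashMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    /** 64-bit Simhash over whitespace-separated tokens. */
    static long simhash64(String text) {
        int[] v = new int[64];
        for (String token : text.split("\\s+")) {
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                // Each feature hash votes +1/-1 on every bit position.
                if (((h >>> i) & 1L) == 1L) v[i]++; else v[i]--;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (v[i] > 0) fingerprint |= (1L << i);
        }
        return fingerprint;
    }

    /** FNV-1a: a simple stand-in for the paper's unspecified feature hash. */
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Hamming distance between fingerprints; a small value marks near-duplicates. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        // Key by fingerprint so candidate duplicates meet in the same reducer.
        context.write(new LongWritable(simhash64(parts[1])), new Text(parts[0]));
    }
}
```

Note that keying by the exact fingerprint only groups identical fingerprints; finding near-duplicates within a small Hamming distance (a threshold of 3 out of 64 bits is a common choice) requires additional passes or fingerprint permutations, which this sketch omits.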

Key words: Hadoop, information extraction, text deduplication, MapReduce, web crawler

CLC Number: TP311.1