MapReduce based web crawler design and implementation

doi:10.3976/j.issn.1002-4026.2015.02.017

Abstract

To address the low efficiency and poor scalability of a single crawler system, we design and implement a web crawler system based on MapReduce. The system stores web information in HDFS and HBase and extracts it through a row block distribution function. It then deduplicates the acquired web information with a strategy that combines similarity analysis of URLs with similarity analysis of page content, the latter measured by the Simhash algorithm. Experimental results show that the system has better performance and scalability than a single crawler system and increases the average crawling speed by 4.8 times.
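The content-similarity step can be illustrated with a minimal Simhash sketch. This is not the authors' implementation: the word-level tokenization, the uniform token weights, the MD5-based token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint of `text` (bits <= 128)."""
    weights = [0] * bits
    # Assumption: lowercase word tokens, each with weight 1.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 on bit positions it sets and -1 on the rest.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes outweigh the negative.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two pages as duplicates if their fingerprints are close.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

In a crawler, a page whose fingerprint lies within the threshold of an already stored fingerprint would be skipped rather than written to storage; how the paper combines this check with URL similarity is not specified in the abstract.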

Key words: Hadoop, information extraction, text deduplication, MapReduce, web crawler

CLC Number: TP311.1

LI Chen, ZHU Shiwei, Zhao Yanqing,YU Junfeng . MapReduce based web crawler design and implementation[J].SHANDONG SCIENCE, 2015, 28(2): 101-107.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits third parties to freely share (i.e., copy and redistribute the material in any medium or format) and adapt (i.e., remix, transform, or build upon the material) the articles published in this journal, provided that appropriate credit is given, a link to the license is provided, and any changes made are indicated. The material may not be used for commercial purposes. For details of the CC BY-NC 4.0 license, please visit: https://creativecommons.org/licenses/by-nc/4.0

