SHANDONG SCIENCE ›› 2015, Vol. 28 ›› Issue (2): 101-107. doi: 10.3976/j.issn.1002-4026.2015.02.017

• Other Research Article

MapReduce based web crawler design and implementation

LI Chen, ZHU Shi-wei, ZHAO Yan-qing, YU Jun-feng

  1. Information Institute, Shandong Academy of Sciences, Jinan 250014, China
  • Received: 2015-01-21 Online: 2015-04-20 Published: 2015-04-20

Abstract: We design and implement a MapReduce-based web crawler system to address the low efficiency and poor scalability of a single-crawler system. The system stores web information in HDFS and HBase and extracts it through a row block distribution function. It then deduplicates the acquired web information with a strategy that combines similarity analysis of URLs and of page content, measuring content similarity with the Simhash algorithm. Experimental results show that the system offers better performance and scalability than a single-crawler system and increases average crawling speed by 4.8 times.
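
The Simhash-based similarity measurement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word-level tokenization, the truncated MD5 token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint width; the abstract does not state one


def simhash(text):
    """Return a 64-bit Simhash fingerprint of `text` as an int."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: each token is hashed with MD5, truncated to 64 bits.
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(FINGERPRINT_BITS):
            # A set bit votes +1 for that position, a clear bit votes -1.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    """Treat two pages as duplicates if their fingerprints differ in few bits.

    The threshold of 3 is an illustrative assumption, not a value from the paper.
    """
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

In a MapReduce setting, a fingerprint like this would typically be computed per page in the map phase and compared in the reduce phase; how the paper distributes this work is not stated in the abstract.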

Key words: Hadoop, information extraction, text deduplication, MapReduce, web crawler

CLC Number: 

  • TP311.1