Resource Description
Web spam is web content created to inflate the apparent importance of one or more web pages beyond what the actual relevance of their content warrants. It is one of the main challenges faced by web search engines. A typical web search engine indexes millions of web pages, so it is essential to develop automatic techniques for separating web spam from genuine content.
The goals of this project are:
---Provide an implementation of the WITCH algorithm, which showed state-of-the-art performance in the Web Spam Challenge 2008.
---Analyze the improvement in accuracy that can be obtained by combining several weak classifiers using the Weighted Majority Algorithm (WMA). This technique has the following advantages:
------The Weighted Majority Algorithm is efficient, easy to implement, and easy to analyze theoretically.
------The weak classifiers usually use simple techniques and fewer features, which usually makes them time-efficient
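
To make the combination step concrete, here is a minimal sketch of the classic deterministic Weighted Majority Algorithm for binary labels (1 = spam, 0 = non-spam). It is not taken from this project's code: the class name `WeightedMajority`, the penalty factor `beta`, the three toy classifiers, and the two-element feature vectors are all hypothetical and chosen only for illustration. Each weak classifier starts with weight 1, the ensemble predicts by weighted vote, and every classifier that errs on a labeled example has its weight multiplied by `beta`.

```python
from typing import Callable, List, Sequence

# A weak classifier maps a feature vector to 1 (spam) or 0 (non-spam).
Classifier = Callable[[Sequence[float]], int]


class WeightedMajority:
    """Illustrative Weighted Majority combiner (hypothetical helper class)."""

    def __init__(self, classifiers: Sequence[Classifier], beta: float = 0.5):
        if not 0.0 <= beta < 1.0:
            raise ValueError("beta must be in [0, 1)")
        self.classifiers: List[Classifier] = list(classifiers)
        self.beta = beta
        # Every classifier starts out equally trusted.
        self.weights: List[float] = [1.0] * len(self.classifiers)

    def predict(self, x: Sequence[float]) -> int:
        # Sum the weights behind each label; ties go to "spam" (an assumption).
        votes = [clf(x) for clf in self.classifiers]
        spam = sum(w for w, v in zip(self.weights, votes) if v == 1)
        ham = sum(w for w, v in zip(self.weights, votes) if v == 0)
        return 1 if spam >= ham else 0

    def update(self, x: Sequence[float], label: int) -> None:
        # Penalize every classifier that got this labeled example wrong.
        for i, clf in enumerate(self.classifiers):
            if clf(x) != label:
                self.weights[i] *= self.beta


# Toy weak classifiers over made-up features (link score, text score).
def by_links(x: Sequence[float]) -> int:
    return 1 if x[0] > 0.5 else 0


def by_text(x: Sequence[float]) -> int:
    return 1 if x[1] > 0.5 else 0


def always_ham(x: Sequence[float]) -> int:
    return 0


# Made-up labeled examples: (features, label).
training = [((0.9, 0.8), 1), ((0.2, 0.7), 0), ((0.8, 0.9), 1)]

wm = WeightedMajority([by_links, by_text, always_ham], beta=0.5)
for features, label in training:
    wm.update(features, label)

print(wm.weights)
print(wm.predict((0.9, 0.1)))
```

In this toy run the classifiers that make mistakes lose weight, so the ensemble ends up following `by_links` on an input where an unweighted majority vote would have sided with the other two. A real setup would replace the toy functions with trained weak classifiers and would typically interleave `predict` and `update` as labeled pages arrive.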