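Application background
This code was written as part of the Semantic Web Technologies course at our college. It is a very basic attempt to remove advertisements from a web page and show only the relevant text. Ads, Flash, JavaScript, and similar elements are stripped out, and only the text is displayed. The code is written in Python because it reduces the programmer's coding effort and offers many libraries.

Key technology
The Web has become the largest source of information, with billions of pages. However, web pages often contain content that is unrelated to their topic. For example, a page may carry many multimedia ad segments, unnecessary images, or navigation links. These parts can seriously harm Web data mining, distract users from the main topic, and affect PageRank. Several existing methods try to discover the informative content blocks of a page. The simplest approach is to identify and eliminate the clutter: advertisements, decorations, and so on.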
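The clutter-removal idea described above can be sketched with Python's standard library alone. This is a minimal illustration, not the original course code: the names `SKIP_TAGS`, `TextOnlyParser`, and `extract_text` are hypothetical, and the list of tags to drop is an assumption. It discards the contents of script, style, and Flash-style `object` elements and keeps the remaining visible text; it does not try to recognize advertisements by their content.

```python
from html.parser import HTMLParser

# Assumed list of container tags whose whole contents are dropped.
# Void tags such as <embed> are left out because they have no end tag
# and carry no text of their own.
SKIP_TAGS = {"script", "style", "noscript", "object", "iframe"}


class TextOnlyParser(HTMLParser):
    """Collects visible text and ignores anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string without scripts, styles, or objects."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style><script>var a=1;</script></head>"
        "<body><p>Hello</p><object><param name='x'>flash</object>"
        "<p>World</p></body></html>"
    )
    print(extract_text(sample))
```

A tag-based filter like this only removes elements that are clutter by type. Finding ad blocks that are built from ordinary markup would need an additional heuristic, which is beyond this sketch.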