by liuqq on Feb. 22, 2010, under Uncategorized
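I can finally get at the web page content itself. After digging through a lot of material, I ended up with the result I wanted. The days of "even the cleverest cook can't make a meal without rice" are finally over!
Java is still the more comfortable choice: less effort and easy to use. I also only found out today that nutch does not give you the page source, because it has already indexed the pages and stored them in its database. With jspider I got not only the page source but also the overall structure of the site, split into separate folders.
Here is the Taobao homepage as fetched and parsed by jspider: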
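To make the "source plus folder structure" idea concrete, here is a minimal sketch of how a page URL can be mapped to a file under an output directory so that the saved files mirror the site's layout. This is not jspider's actual code or configuration; the class name `MirrorPath`, the methods `localPath` and `save`, and the `index.html` default are my own assumptions for illustration. It needs Java 11 or later for `java.net.http`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of jspider: mirrors a URL into a folder tree.
class MirrorPath {

    // Map a URL to outputDir/<host>/<path>, using index.html for directory URLs.
    static Path localPath(Path outputDir, String url) {
        URI uri = URI.create(url).normalize();
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + "index.html"; // assumed default file name
        }
        // Drop the leading slash so the path resolves relative to the host folder.
        return outputDir.resolve(host).resolve(path.substring(1));
    }

    // Download the raw page source and store it at the mirrored location.
    static Path save(Path outputDir, String url) throws IOException, InterruptedException {
        Path target = localPath(outputDir, url);
        Files.createDirectories(target.getParent());
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        client.send(request, HttpResponse.BodyHandlers.ofFile(target));
        return target;
    }
}
```

Calling `MirrorPath.save(Path.of("out"), "http://www.taobao.com/")` would write the homepage source to `out/www.taobao.com/index.html`. A real spider also follows links, respects robots.txt and guards against unsafe paths, none of which this sketch attempts.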
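April 8th, 2010 at 11:57 AM
You could try the MetaSeeker toolkit. It is well suited to extracting web page content into structured data, which makes it easy to integrate into your own site or to use for data mining.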