Today's topic: web scraping in Java with XPath.
I had been looking for a convenient way to scrape web pages, and I have found one: parsing pages with XPath works remarkably well.
Dependency
<!-- xsoup -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>xsoup</artifactId>
    <version>0.3.2</version>
</dependency>
Xsoup bundles jsoup internally, so this one dependency is all you need.
Reference: http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/xsoup.html
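Before pointing it at a live site, the API can be tried on a hard-coded snippet. A minimal sketch (the HTML, id, and class here are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XsoupDemo {
    public static void main(String[] args) {
        // Hypothetical snippet standing in for a downloaded page
        Document doc = Jsoup.parse(
                "<div id=\"main\"><span class=\"title\">Hello XPath</span></div>");
        // Xsoup compiles the XPath once, then evaluates it against the jsoup tree;
        // get() returns the first match as a String
        String title = Xsoup.compile("//*[@id=\"main\"]/span[@class=\"title\"]/text()")
                .evaluate(doc).get();
        System.out.println(title);
    }
}
```

The same compiled expression can be reused across documents, which is handy when scraping many pages with an identical layout.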
Test code
When scraping a page, you can right-click the target element in the browser's developer tools and choose Copy XPath to get its path.
Example: scraping the content of a single article: https://www.cls.cn/detail/973228.
//Single CLS article URL
Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/detail/973228"));
//Title
System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[1]")
        .evaluate(document).getElements().get(0).text());
System.out.println("--------------------------------------------");
//Body text
System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[2]")
        .evaluate(document).getElements().get(0).text());
//System.out.println("--------------------------------------------");
//System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]").evaluate(document).get());
System.out.println("--------------------------------------------");
//A bare div at the end of the path matches every child div; these hold the images
Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div")
        .evaluate(document).getElements();
for (Element element : elements) {
    System.out.println(element.select("img").attr("src"));
}
System.out.println("--------------------------------------------");
List<String> list = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div")
        .evaluate(document).list();
System.out.println(list);
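The `HttpUtil.get` call above is assumed to come from Hutool (`cn.hutool.http.HttpUtil`); the original does not say. If you would rather not add that dependency, the JDK 11+ `java.net.http.HttpClient` does the same job. A sketch of a drop-in replacement:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {
    // Builds a GET request; a browser-like User-Agent helps with sites
    // that reject the default Java agent
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Stand-in for HttpUtil.get(url): fetches the page body as a String
    static String get(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(buildRequest(url),
                HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // No network call here; just show the request that would be sent
        HttpRequest req = buildRequest("https://www.cls.cn/detail/973228");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

The returned body string can then be fed to `Jsoup.parse` exactly as in the snippet above.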
Scraping the CLS telegraph feed:
Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/telegraph"));
//System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div").evaluate(document).getElements());
Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div")
        .evaluate(document).getElements();
for (Element element : elements) {
    Document document1 = Jsoup.parse(element.toString());
    //A class-based XPath is less brittle than the full copied path
    Elements elements1 = Xsoup.compile("//*[@class=\"b-c-e6e7ea telegraph-list\"]/div/div/span[2]")
            .evaluate(document1).getElements();
    if (null != elements1 && elements1.size() > 0) {
        System.out.println(elements1.get(0).select("span").text());
    }
}
That extracts the text of every telegraph post.
Scraping the Wallstreetcn (华尔街见闻) RSS XML feed:
Document document = Jsoup.parse(HttpUtil.get("https://dedicated.wallstreetcn.com/rss.xml"));
//Jsoup's HTML parser wraps the feed in <html><body>, hence this path
Elements elements = Xsoup.compile("/html/body/rss/channel/item")
        .evaluate(document).getElements();
for (Element element : elements) {
    //In HTML parsing <link> is a void tag, so the URL survives only as a
    //loose text node of <item>; index 2 happens to be it for this feed
    System.out.println(element.textNodes().get(2));
}
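Since RSS is well-formed XML, the JDK's built-in `javax.xml.xpath` can also extract the links, with no jsoup or Xsoup at all and no dependence on the fragile text-node index. A sketch on a hard-coded fragment (the item data here is made up, shaped like the wallstreetcn feed):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssLinks {
    public static void main(String[] args) throws Exception {
        // Hypothetical fragment standing in for the downloaded rss.xml
        String rss = "<rss><channel>"
                + "<item><title>a</title><link>https://example.com/1</link></item>"
                + "<item><title>b</title><link>https://example.com/2</link></item>"
                + "</channel></rss>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rss)));
        // A real XML parse keeps <link> as a normal element, so there is no
        // html/body wrapper and no guessing at text-node indices
        NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/rss/channel/item/link", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getTextContent());
        }
    }
}
```

For the live feed, replace the hard-coded string with the body fetched over HTTP.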
Output: the article URLs:
https://wallstreetcn.com/articles/3655852
https://wallstreetcn.com/articles/3655850
https://wallstreetcn.com/articles/3655845
https://wallstreetcn.com/articles/3655851
https://wallstreetcn.com/articles/3655846
https://wallstreetcn.com/articles/3655844
https://wallstreetcn.com/articles/3655842
https://wallstreetcn.com/articles/3655831
https://wallstreetcn.com/articles/3655785
https://wallstreetcn.com/articles/3655820
https://wallstreetcn.com/articles/3655827
https://wallstreetcn.com/articles/3655830
https://wallstreetcn.com/articles/3655829
https://wallstreetcn.com/articles/3655824
https://wallstreetcn.com/articles/3655826
https://wallstreetcn.com/articles/3655825
https://wallstreetcn.com/articles/3655821
https://wallstreetcn.com/articles/3655817
https://wallstreetcn.com/articles/3655814
https://wallstreetcn.com/articles/3655812
https://wallstreetcn.com/articles/3655810
https://wallstreetcn.com/articles/3655802
https://wallstreetcn.com/articles/3655803
https://wallstreetcn.com/articles/3655793
https://wallstreetcn.com/articles/3655799
https://wallstreetcn.com/articles/3655798
https://wallstreetcn.com/articles/3655787
https://wallstreetcn.com/articles/3655790
https://wallstreetcn.com/articles/3655789
https://wallstreetcn.com/articles/3655782
https://wallstreetcn.com/articles/3655778
https://wallstreetcn.com/articles/3655746
https://wallstreetcn.com/articles/3655763
https://wallstreetcn.com/articles/3655774
https://wallstreetcn.com/articles/3655755
https://wallstreetcn.com/articles/3655771
https://wallstreetcn.com/articles/3655761
https://wallstreetcn.com/articles/3655734
https://wallstreetcn.com/articles/3655758
https://wallstreetcn.com/articles/3655749
Process finished with exit code 0