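Note: HtmlUnit still cannot fetch results from pages such as Baidu Translate, Baidu Search, or Tencent Translate, and it is largely ineffective against encrypted JS files. For pages with complex or encrypted JS, Selenium is recommended.
1. Overview
Official documentation: https://htmlunit.sourceforge.io/
Documentation with worked demos (best read alongside the official documentation): https://www.scrapingbee.com/java-webscraping-book/
Purpose: a "GUI-less browser for Java programs". It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser.
2. Notes
2.0 JS parsing limitations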
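The official documentation lists the following JS libraries as supported: htmx, jQuery, MochiKit, GWT, Sarissa, MooTools, Prototype, Ext, Dojo, and YUI. Encrypted JS files and other libraries are therefore likely to fail. This is why pages driven by encrypted JS, such as Baidu Translate, Tencent Translate, and Youdao Translate, cannot be scraped with HtmlUnit. For those, use Selenium (Java) instead. It is a heavier tool, but it works very well: you scrape the rendered page directly and never need to analyze the browser's requests.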
2.1 Turning off HtmlUnit logging
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
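One caveat with the one-liner above: java.util.logging holds loggers only weakly, so a logger that is configured without keeping a reference to it can be garbage-collected and lose its level. Below is a minimal JDK-only sketch that keeps a strong reference and shows that child loggers inherit the OFF level. The class name QuietHtmlUnitLogging and the child logger name used in main are assumptions for illustration. The sketch also assumes that HtmlUnit's log output ends up in java.util.logging, which depends on how Commons Logging is set up in your project.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {
    // Strong reference: keeps the configured logger from being garbage-collected.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off every logger under the com.gargoylesoftware namespace. */
    public static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Returns true if the named logger would drop even SEVERE messages. */
    public static boolean isSilenced(String loggerName) {
        return !Logger.getLogger(loggerName).isLoggable(Level.SEVERE);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers have no level of their own, so they inherit OFF from the parent.
        System.out.println(isSilenced("com.gargoylesoftware.htmlunit.WebClient"));
    }
}
```

Calling silence() once before creating the WebClient should be enough in this setup.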
3. Usage
Dependency: https://search.maven.org/artifact/net.sourceforge.htmlunit/htmlunit
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.58.0</version>
</dependency>
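3.1 Scraping the IT之家 (ITHome) weekly ranking (single page)
This test scrapes the content of the weekly ranking on the ITHome front page.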
/**
 * IT之家 (ITHome)
 */
@Test
@SneakyThrows
public void test10() {
    // Browser settings
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Open the page
    HtmlPage page = webClient.getPage("https://www.ithome.com/");

    // Hover the mouse over the weekly ranking tab
    DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
    page = (HtmlPage) inputEle.mouseOver();
    DomElement ulElement = page.getFirstByXPath("//div[@id='rank']//ul[@id='d-2']");

    // Weekly ranking content
    System.out.println(ulElement.asNormalizedText());
}
3.2 Scraping the ninth article in the ITHome weekly ranking (two pages)
/**
 * Ninth article in the ITHome weekly ranking
 */
@Test
@SneakyThrows
public void test11() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    HtmlPage page = webClient.getPage("https://www.ithome.com/");

    // Hover the mouse over the weekly ranking tab
    DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
    page = (HtmlPage) inputEle.mouseOver();

    // Collect the article links
    List<DomElement> articleLinkElems = page.getByXPath("//div[@id='rank']//ul[@id='d-2']//a");
    // Make sure a ninth link exists before indexing into the list
    if (articleLinkElems.size() >= 9) {
        // Ninth article
        page = articleLinkElems.get(8).click();
        DomElement articleDivElem = page.getFirstByXPath("//div[@id='dt']//div[@class='fl content']");
        System.out.println(articleDivElem.asNormalizedText());
    }
}
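3.3 Simulating user actions (personally I find this feature of very limited use: it only copes with very simple JS, whereas actions on a typical website trigger a chain of complex JS operations, so for scraping Selenium is still the recommendation)
Sample page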
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function () {
        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
@Configuration
public class SystemConfig {

    // Allow cross-origin requests
    @Bean
    public CorsFilter corsFilter() {
        CorsConfiguration corsConfiguration = new CorsConfiguration();
        corsConfiguration.addAllowedOriginPattern("*");
        corsConfiguration.setAllowCredentials(true);
        corsConfiguration.addAllowedMethod("*");
        corsConfiguration.addAllowedHeader("*");

        UrlBasedCorsConfigurationSource configSource = new UrlBasedCorsConfigurationSource();
        // "/**" applies the CORS rules to every path, including /login
        configSource.registerCorsConfiguration("/**", corsConfiguration);
        return new CorsFilter(configSource);
    }
}

@Controller
@RequestMapping
@ResponseBody
public class LoginController {

    // Echoes the submitted parameters back as JSON, plus one extra "name" entry
    @PostMapping("login")
    public Map<String, Object> login(HttpServletRequest request) {
        Map<String, Object> parameterMap = new HashMap<>(request.getParameterMap());
        parameterMap.put("name", "嗯嗯*");
        return parameterMap;
    }
}
/**
 * Simulating user input
 */
@Test
@SneakyThrows
public void test12() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Request submitted manually via ajax
    HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
    DomElement loginNameElem = page.getElementById("uname");
    loginNameElem.setAttribute("value", "root");
    DomElement passwordElem = page.getElementById("psw");
    passwordElem.setAttribute("value", "pswroot");

    // Submit form 1 by clicking its button
    DomElement startLoginBtnElem = page.getElementById("loginBtn");
    page = startLoginBtnElem.click();
    DomElement userInfoDivElem = page.getFirstByXPath("//h1");
    System.out.println(userInfoDivElem.asNormalizedText());

    //==================================================
    // Plain form submit: the response is a JSON document rather than an HTML page,
    // so the result has to be received as an UnexpectedPage instead of an HtmlPage
    page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
    HtmlInput inputloginNameElem = (HtmlInput) page.getElementById("uname2");
    inputloginNameElem.setAttribute("value", "root2");
    HtmlInput inputpasswordElem = (HtmlInput) page.getElementById("psw2");
    inputpasswordElem.setAttribute("value", "pswroot2");

    // Submit form 2
    HtmlForm enclosingForm = inputloginNameElem.getEnclosingForm();
    UnexpectedPage page2 = webClient.getPage(enclosingForm.getWebRequest(null));

    // Read the response body
    System.out.println(page2.getWebResponse().getContentAsString(UTF_8));
}
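For reference, both submissions above send the form fields as an application/x-www-form-urlencoded body. The JDK-only sketch below shows roughly what that body looks like for a set of fields. The class FormBody and its encode method are made up for illustration and are not part of HtmlUnit or jQuery, and the real encoders may differ in small details such as how spaces are written.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBody {
    /** Encodes name/value pairs as a form-urlencoded string, keeping insertion order. */
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("uname", "root");
        fields.put("psw", "pswroot");
        System.out.println(encode(fields)); // prints uname=root&psw=pswroot
    }
}
```

On the server side, request.getParameterMap() in the LoginController decodes this body back into the individual parameters.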
3.4 File download
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
// Note: the package name is truncated in the source; adjust it to your own project
package work.linruchang..htmlunitweb.controller;

import cn.hutool.core.util.StrUtil;
import lombok.SneakyThrows;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

/**
 * Purpose:
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/09
 * @since 1.8
 */
@Controller
@RequestMapping
@ResponseBody
public class HtmlUnitController {

    /**
     * File download test
     * http://127.0.0.1:8080/download
     * @param request
     * @param httpServletResponse
     * @return
     */
    @GetMapping("download")
    @SneakyThrows
    public ResponseEntity<FileSystemResource> download(HttpServletRequest request, HttpServletResponse httpServletResponse) {
        System.out.println(request.getSession().getId() + " download started");
        FileSystemResource fileSystemResource = new FileSystemResource("E:\\微信\\文件\\WeChat Files\\wxid_n7xzf77wr3wv22\\FileStorage\\File\\2022-02\\房东符金瑞名下楼栋需要批量处理.xlsx");

        HttpHeaders headers = new HttpHeaders();
        headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
        // URL-encode the file name explicitly as UTF-8 (the single-argument encode is deprecated)
        headers.add("Content-Disposition", StrUtil.format("attachment; filename={}", URLEncoder.encode(fileSystemResource.getFilename(), "UTF-8")));
        headers.add("Pragma", "no-cache");
        headers.add("Expires", "0");

        return ResponseEntity.ok()
                .headers(headers)
                .contentLength(fileSystemResource.contentLength())
                .contentType(MediaType.parseMediaType("application/octet-stream"))
                .body(fileSystemResource);
    }
}
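A side note on the Content-Disposition header built above: URLEncoder writes a space as "+", so a file name containing spaces can arrive with plus signs on clients that only percent-decode. One alternative is the RFC 5987 style filename* parameter, sketched below with the JDK only. The class AttachmentHeader and its forFile method are hypothetical names for illustration, not Spring or Hutool APIs, and a client that parses the header itself has to understand the filename* form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AttachmentHeader {
    /** Builds an attachment header value whose file name is percent-encoded as UTF-8. */
    public static String forFile(String fileName) {
        String encoded;
        try {
            encoded = URLEncoder.encode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
        // URLEncoder targets form encoding: turn "+" into "%20" and escape "*",
        // which URLEncoder leaves as-is but RFC 5987 does not allow unescaped.
        encoded = encoded.replace("+", "%20").replace("*", "%2A");
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(forFile("a b.xlsx")); // prints attachment; filename*=UTF-8''a%20b.xlsx
    }
}
```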
// Note: the package name is truncated in the source; adjust it to your own project
package work.linruchang.;

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.IoUtil;
import cn.hutool.core.lang.Console;
import cn.hutool.core.util.StrUtil;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.host.event.KeyboardEvent;
import lombok.SneakyThrows;
import org.junit.Test;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;

import static java.nio.charset.StandardCharsets.UTF_8;

/**
 * Purpose:
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/08
 * @since 1.8
 */
public class HtmlUnitTest {

    @Test
    @SneakyThrows
    public void test13() {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        //DomElement downloadBtn = page.getElementById("downloadBtn");
        DomElement downloadBtn = page.getElementById("downloadBtn2");

        // Trigger the download link
        Page clickPage = downloadBtn.click();

        // The next two statements are equivalent
        //Page enclosedPage = webClient.getWebWindows().get(webClient.getWebWindows().size() - 1).getEnclosedPage();
        Page enclosedPage = clickPage.getEnclosingWindow().getEnclosedPage();
        InputStream contentAsStream = enclosedPage.getWebResponse().getContentAsStream();

        // Extract the file name from the Content-Disposition header
        String responseHeaderValue = enclosedPage.getWebResponse().getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION);
        String documentName = responseHeaderValue.split(";")[1].split("=")[1].trim();
        documentName = URLDecoder.decode(documentName, "UTF-8");
        Console.log("File downloaded successfully: {}", documentName);

        // Save the file to disk; try-with-resources closes the output stream
        try (FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\Desktop\\图片\\" + documentName)) {
            IoUtil.copy(contentAsStream, out);
        }
    }
}
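The split(";")[1].split("=")[1] chain in the test above works for the exact header this demo server sends, but it throws as soon as the header has no second part or the name is quoted. The JDK-only sketch below is a slightly more tolerant parser. The class DispositionFileName is a hypothetical helper, not an HtmlUnit API, and it is a simplification: it assumes the file name itself contains no ";" character.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

public class DispositionFileName {
    /** Returns the decoded file name, or null when the header carries none. */
    public static String parse(String header) {
        if (header == null) {
            return null;
        }
        String plain = null;
        for (String part : header.split(";")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) {
                continue; // e.g. the leading "attachment" token
            }
            String key = kv[0].trim().toLowerCase(Locale.ROOT);
            String value = kv[1].trim();
            if (key.equals("filename*")) {
                // Extended form: charset'language'percent-encoded-name; "+" is literal here
                String name = value.substring(value.lastIndexOf('\'') + 1);
                return decode(name.replace("+", "%2B"));
            }
            if (key.equals("filename")) {
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip surrounding quotes
                }
                plain = decode(value); // matches a server that used URLEncoder.encode
            }
        }
        return plain;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("attachment; filename=a+b.xlsx")); // prints a b.xlsx
    }
}
```

In test13 this would replace the two lines that compute documentName, with a null check before the file is written.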
3.5 Handling dialogs
Sample page
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
    <br/>
    <button id="alertBtn">Show alert</button>
    <br/>
    <button id="promptBtn">Show prompt</button>
    <br/>
    <button id="confirmBtn">Show confirm</button>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        var i = 0;
        $("#alertBtn").click(function() {
            alert("Alert triggered by click: #" + ++i)
        })
        var j = 0;
        $("#promptBtn").click(function() {
            prompt("Prompt triggered by click: #" + ++j, "default value 1111")
        })
        var k = 0;
        $("#confirmBtn").click(function() {
            confirm("Confirm triggered by click: #" + ++k)
        })

        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
Triggering the dialogs from HtmlUnit as a simulated user
@Test
@SneakyThrows
public void test15() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Alert handling: collect every alert message into the list
    List<String> alertInfos = new ArrayList<>();
    webClient.setAlertHandler(new CollectingAlertHandler(alertInfos));

    // Prompt handling
    final List<String> promptInfos = new ArrayList<>();
    webClient.setPromptHandler(new PromptHandler() {
        @Override
        public String handlePrompt(Page page, String message, String defaultValue) {
            Console.log("Prompt info: {}, {}", message, defaultValue);
            promptInfos.add(message);
            // The returned string is what the page receives as the user's input
            return StrUtil.blankToDefault(message, defaultValue);
        }
    });

    // Confirm handling
    final List<String> confirmInfos = new ArrayList<>();
    webClient.setConfirmHandler(new ConfirmHandler() {
        @Override
        public boolean handleConfirm(Page page, String message) {
            confirmInfos.add(message);
            // true = OK, false = Cancel
            return true;
        }
    });

    HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");

    DomElement alertBtn = page.getElementById("alertBtn");
    page = alertBtn.click();

    DomElement promptBtn = page.getElementById("promptBtn");
    page = promptBtn.click();
    page = promptBtn.click();

    DomElement confirmBtn = page.getElementById("confirmBtn");
    page = confirmBtn.click();
    page = confirmBtn.click();
    page = confirmBtn.click();

    Console.log("Alert messages: {}", alertInfos);
    Console.log("Prompt messages: {}", promptInfos);
    Console.log("Confirm messages: {}", confirmInfos);
}