Assessing the effectiveness of crawlers and large language models in detecting adversarial hidden link threats in meta computing

Junjie Xiong; Mingkui Wei; Zhuo Lu; Yao Liu

doi:10.1016/j.hcc.2024.100292

High-Confidence Computing, 2025, Vol. 5, Issue 3: 100292. DOI: 10.1016/j.hcc.2024.100292

Research Articles


Abstract

In the emerging field of Meta Computing, where data collection and integration are essential components, adversarial hidden link attacks pose a significant challenge to web crawlers. In this paper, we investigate how these attacks affect data collection by web crawlers and how they elude conventional detection techniques, including detection based on large language models (LLMs). Our experiments reveal vulnerabilities in current crawler mechanisms and in LLM-based detection, especially in code inspection, and we propose enhancements to mitigate these weaknesses. Our assessment of real-world web pages reveals the prevalence and impact of adversarial hidden link attacks, emphasizing the need for robust countermeasures. Furthermore, we introduce a mitigation framework that integrates visual inspection of page elements. Our evaluation demonstrates the framework’s efficacy in detecting and addressing these advanced cyber threats within the evolving landscape of Meta Computing.
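For illustration only: the abstract does not spell out how a hidden link is constructed or how the proposed framework inspects elements. The following is a minimal sketch, assuming a hidden link is an anchor whose own inline `style` attribute hides it from a human visitor. The names `HIDING_RULES`, `hides_element`, `HiddenLinkFinder`, and `find_hidden_links` are hypothetical and are not taken from the paper. It shows a simple code-level check, not the authors' visual inspection framework.

```python
from html.parser import HTMLParser

# Assumption: inline declarations treated as "hiding" for this sketch.
# The paper may consider other techniques; this set is illustrative only.
HIDING_RULES = {
    "display:none",
    "visibility:hidden",
    "opacity:0",
    "font-size:0",
    "font-size:0px",
}


def hides_element(style):
    """Return True if an inline style string contains a hiding declaration."""
    # Normalize each declaration: drop spaces, lowercase, then match exactly
    # so that values such as "opacity:0.5" are not flagged.
    declarations = [d.replace(" ", "").lower() for d in style.split(";")]
    return any(d in HIDING_RULES for d in declarations)


class HiddenLinkFinder(HTMLParser):
    """Collect href values of <a> tags hidden by their own inline style."""

    def __init__(self):
        super().__init__()
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        style = attr_map.get("style") or ""  # valueless attributes are None
        if href and hides_element(style):
            self.hidden_links.append(href)


def find_hidden_links(html):
    """Return the hrefs of anchors hidden via inline style in the given HTML."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links


sample = (
    '<a href="https://a.example/" style="display: none">x</a>'
    '<a href="https://b.example/" style="opacity: 0.5">y</a>'
    '<a href="https://c.example/">z</a>'
)
print(find_hidden_links(sample))
```

A check like this only reads the anchor's own markup. Links hidden through external stylesheets, parent elements, or scripts would pass it unnoticed, which may be one reason a rendering-based visual check is attractive.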

Keywords

Meta computing / Data integration / Adversary hidden link / Web crawling / Content deception detection / Large language model

Cite this article


Junjie Xiong, Mingkui Wei, Zhuo Lu, Yao Liu. Assessing the effectiveness of crawlers and large language models in detecting adversarial hidden link threats in meta computing. High-Confidence Computing, 2025, 5(3): 100292 DOI:10.1016/j.hcc.2024.100292


CRediT authorship contribution statement

Junjie Xiong: Writing - original draft, Investigation. Mingkui Wei: Supervision. Zhuo Lu: Supervision. Yao Liu: Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
