University of Twente Student Theses


License-aware web crawling

Ilich, S. (2024) License-aware web crawling.

Abstract: Training generative artificial intelligence (AI) models demands extensive datasets, often sourced through web scraping. However, current practices frequently overlook copyright compliance, posing significant ethical and legal challenges. This project develops a tool for license-aware web crawling that uses natural language processing (NLP) techniques to detect and extract licensing information from websites automatically. The tool achieved 100% accuracy in license type detection and moderate effectiveness in extracting license text, with ROUGE-L precision of 0.588, recall of 0.503, and an F1 score of 0.499. By identifying the specific license type, the tool facilitates the creation of legally compliant datasets essential for responsible AI training. It thus helps ensure adherence to copyright law and promotes ethical data usage, supporting the sustainable advancement of AI technologies.
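For readers unfamiliar with the ROUGE-L figures quoted in the abstract, the following is a minimal sketch of how such scores are commonly computed from the longest common subsequence (LCS) of two token sequences. It is not the thesis's evaluation code: the whitespace tokenization, the lowercasing, the function names `lcs_length` and `rouge_l`, and the sample license strings are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall and F1 for a single reference/candidate pair.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / len(cand_tokens)  # share of extracted tokens that match
    recall = lcs / len(ref_tokens)      # share of reference tokens recovered
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: a reference license statement and an extracted snippet.
reference = "This work is licensed under a Creative Commons Attribution 4.0 license"
candidate = "Content licensed under Creative Commons Attribution"
scores = rouge_l(reference, candidate)
print(scores)
```

Here precision measures how much of the extracted text appears, in order, in the reference license text, while recall measures how much of the reference was recovered. When scores are averaged over many documents, the mean F1 need not equal the harmonic mean of the mean precision and mean recall.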
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science BSc (56964)
Link to this item: https://purl.utwente.nl/essays/100804