University of Twente Student Theses
License-aware web crawling
Ilich, S. (2024) License-aware web crawling.
Abstract: | Training generative artificial intelligence (AI) models demands extensive datasets, often sourced by web scraping. However, current practices frequently overlook copyright compliance, posing significant ethical and legal challenges. This project develops a tool for license-aware web crawling that uses natural language processing (NLP) techniques to detect and extract licensing information from websites automatically. The tool reached 100% accuracy in license type detection and was moderately effective at extracting license text, with ROUGE-L scores of 0.499 (F1), 0.588 (precision), and 0.503 (recall). By identifying the specific license type, the tool facilitates the creation of legally compliant datasets, which are essential for responsible AI training. This supports adherence to copyright law and promotes ethical data usage, contributing to the sustainable advancement of AI technologies. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Computer Science BSc (56964) |
Link to this item: | https://purl.utwente.nl/essays/100804 |
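The abstract reports ROUGE-L precision, recall, and F1 for the extracted license text. As a rough illustration of what those three numbers measure, the sketch below computes sentence-level ROUGE-L from the longest common subsequence (LCS) of two token lists. It is not the thesis's evaluation code: the whitespace tokenization, the lowercasing, the unweighted F1, and the function names `lcs_length` and `rouge_l` are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L scores for a candidate against a reference.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # share of extracted tokens that match
    recall = lcs / len(ref)      # share of reference tokens recovered
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical license strings, invented for illustration only.
scores = rouge_l(
    "this work is licensed under cc by 4.0",
    "licensed under cc by 4.0 international",
)
print(scores)
```

If the reported figures are averages over several pages, the mean F1 does not have to equal the harmonic mean of the mean precision and mean recall, which may be why the F1 of 0.499 sits slightly below both other values.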