University of Twente Student Theses


Leveraging LLMs for Automating the Extraction of Users and Financial Structures from the Multilingual Unstructured Data Leak of I-Soon

Condu, Alexandru-Stefan (2025) Leveraging LLMs for Automating the Extraction of Users and Financial Structures from the Multilingual Unstructured Data Leak of I-Soon.

This is the latest version of this item.

Abstract: The I-Soon data leak provides extensive insight into the inner workings of a private cybersecurity contractor involved in state-affiliated cyber-espionage activities. This paper examines the use of Large Language Models (LLMs) to extract key users and financial structures from the multilingual, unstructured dataset leaked anonymously on GitHub. The LLM pipelines (available at https://github.com/alexCondu/LLM-pipelines-for-I-Soon-Analysis) combine data parsing, translation, enrichment and analysis: they parse files (.md, .png, .txt, .log) into CSVs, translate message contents from Chinese to English using a multi-threaded approach, identify user and financial data, and create structured profiles of the actors involved. The LLM-powered pipelines reduce the time law enforcement spends on analysis while increasing its speed, scale and consistency. Despite the challenges inherent in message translation, OCR extraction and noise within the data, the LLMs can effectively determine the company’s name, URL, CEO, financial insights and user profiles, laying a foundation for AI-driven cyber-leak investigations.
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Business & IT BSc (56066)
Link to this item: https://purl.utwente.nl/essays/107499
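
The abstract mentions translating messages "through a multi-thread approach" and writing results to CSV. The sketch below illustrates that general idea only; it is not taken from the thesis or its repository. The function names (`translate_all`, `to_csv`, `fake_translate`) and the CSV column names are assumptions made for illustration, and `fake_translate` is a hypothetical stand-in for whatever LLM translation call a real pipeline would use.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def translate_all(messages: Iterable[str],
                  translate: Callable[[str], str],
                  max_workers: int = 4) -> List[str]:
    """Translate messages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order,
        # regardless of which worker finishes first.
        return list(pool.map(translate, messages))


def to_csv(originals: List[str], translations: List[str]) -> str:
    """Serialize original/translation pairs as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["original", "translation"])  # assumed column names
    writer.writerows(zip(originals, translations))
    return buf.getvalue()


def fake_translate(text: str) -> str:
    # Hypothetical placeholder: a real pipeline would call an LLM here.
    return f"[en] {text}"


if __name__ == "__main__":
    messages = ["你好", "合同金额"]
    print(to_csv(messages, translate_all(messages, fake_translate)))
```

Threads suit this kind of workload because each translation request mostly waits on network I/O, so several requests can be in flight at once without changing the order of the output rows.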

 
