University of Twente Student Theses


Extracting Sections From PDF-Formatted CTI Reports

Koning, B. de (2022) Extracting Sections From PDF-Formatted CTI Reports.

Abstract: Extracting text from a PDF file sounds easier than it is in practice. A PDF file stores only the position of each character on the page; it does not record that those characters together form words. A further challenge is separating the extracted text into sections and paragraphs. Text and section extraction is an important pre-processing step for Computer Threat Intelligence (CTI) reports, and processing these reports is one of the tasks of Security Operation Centers (SOCs). The reports contain valuable information on active cyber threats and are therefore important for cybersecurity. This research paper focuses on extracting text from a PDF-formatted CTI report, with the aim of obtaining the text divided into the sections present in the report. After a thorough analysis of multiple candidate tools, the paper identifies which tool is preferred for text and section extraction from a PDF file using Python. This tool is then applied to a real-world CTI report.
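To illustrate why character positions alone make extraction hard, the following is a minimal sketch of how positioned characters might be regrouped into lines and words. It is not the tool or method selected in the thesis. The `Glyph` class, the `group_into_words` function, and both tolerance values are hypothetical names and numbers chosen for illustration, and the sketch assumes a single column of horizontal, left-to-right text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character, roughly as a PDF page describes it (hypothetical type)."""
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline position; here a larger y is assumed to be lower on the page


def group_into_words(glyphs, gap_tolerance=1.5, line_tolerance=2.0):
    """Group glyphs into lines, then split each line into words at large horizontal gaps.

    Both tolerances are illustrative assumptions; real documents need values
    derived from the font size.
    """
    # Step 1: cluster glyphs into lines by comparing baselines.
    lines = []
    for glyph in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    # Step 2: within each line, read left to right and start a new word
    # whenever the gap to the previous glyph exceeds the tolerance.
    result = []
    for line in lines:
        line.sort(key=lambda g: g.x0)
        words = [line[0].char]
        for prev, cur in zip(line, line[1:]):
            if cur.x0 - prev.x1 > gap_tolerance:
                words.append(cur.char)
            else:
                words[-1] += cur.char
        result.append(words)
    return result


# Glyphs deliberately given out of reading order, as a PDF may store them.
sample = [
    Glyph("P", 12.0, 17.0, 10.4),
    Glyph("o", 0.0, 5.0, 30.0),
    Glyph("H", 0.0, 5.0, 10.0),
    Glyph("F", 22.5, 27.0, 10.0),
    Glyph("i", 5.5, 8.0, 10.0),
    Glyph("k", 5.5, 10.0, 30.0),
    Glyph("D", 17.5, 22.0, 10.0),
]
words_per_line = group_into_words(sample)  # [['Hi', 'PDF'], ['ok']]
```

Even this toy version needs two tuning parameters and breaks on multi-column layouts, which hints at why comparing dedicated extraction tools is worthwhile.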
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Business & IT BSc (56066)

