Read the blog

Documentation on the IDS analysis project is at https://r-dube.github.io/CICIDS/

Peer-reviewed Research Paper

URL: https://www.researchgate.net/publication/376891771

Title: Faulty use of the CIC-IDS 2017 dataset in information security research

Abstract: The summarized traffic flow version of the Canadian Institute for Cybersecurity Intrusion Detection Evaluation dataset created at the University of New Brunswick in 2017 is popular in the information security data science research community. Typically, researchers use the summarized data to develop supervised machine learning models and test the classification performance of these models. In this paper, we explore the adequacy of the summarized data for high-performance classification. We show that machine learning models developed over summarized data are unlikely to have practical import. Finally, we postulate that researchers may have a higher probability of creating a useful system if they use raw (non-summarized) data.

Keywords: Machine learning, Classification, Network security, Intrusion detection system, Network traffic analysis

Podcast Overview of the Peer-reviewed Research Paper

URL: https://youtu.be/hYf-0nFZw-I

Title: Faulty use of the CIC-IDS 2017 dataset

Length: Approximately 11 minutes 15 seconds

Technical Report

There is also a technical report (preprint) on the project at http://dx.doi.org/10.13140/RG.2.2.25435.64809

Title: (Mis)use of the CICIDS 2017 Dataset in Information Security Research

Abstract: The summarized traffic flow version of the CICIDS 2017 dataset created at the University of New Brunswick is popular in the information security data science research community. Typically, researchers use the summarized data to develop supervised machine learning models and test the classification performance of these models. In this paper, we explore the adequacy of the summarized data for high-performance classification. We show that machine learning models developed over summarized data are unlikely to have practical import. Finally, we postulate that researchers may have a higher probability of creating a useful system if they use raw (non-summarized) data.