2018 Viterbi Masters Hackathon
Startup Garage - ISI - Marina del Rey - 6 April 2018 through 8 April 2018
Problem Description
Network Communication Policy Generation
Given a dataset of connections established by devices on a local area
network, teams will identify the kinds of devices as specifically as
possible and generate a policy to block communications that are
dangerous (i.e. that indicate or enable subversion or exfiltration
attempts).
At the end of the competition teams will apply the system they
develop against a revised dataset, identify which packets are to be
blocked, and output a census describing the devices identified as
present on the network.
Scoring will be based on correct identification of devices, as well as
the false positive and false negative rates on the connections allowed
by the generated policies. Scoring will also consider the additional
datasets and device research contributed for use by all teams.
Structure of Teams
Four teams of 8 students each have been assigned. The assignments
were made to ensure that each team has students with the course
backgrounds and technical interests necessary to be successful.
We will accept requests to switch teams if there are particular
individuals you wish to work with, but such exchanges must be made in
a way that maintains the balance of backgrounds on each team.
Specific Tasks to be Performed
It is not our intent to define the structure of the software that will
be developed by your team, but we note that developing your solution
will require the following activities.
- Policy Enforcement - This component will read packet descriptions
from the final list of test traffic and determine whether each packet
should be blocked. There should be a function that returns 0 (for
block) or 1 (for pass). Additionally there should be a function that
reads lines from the test traffic file, applies the policy enforcement
function, then writes each packet description to an output file with
an additional field containing the 0 or 1 determined by applying the
policy enforcement function.
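The two functions described above can be sketched as follows. This is a minimal outline, not a required design; the dictionary field names and the (source IP, dest IP, dest port) rule structure are assumptions for illustration, and your policy logic will be richer.

```python
import csv

def enforce_policy(packet, blocklist):
    """Return 0 (block) if the packet matches a blocked
    (source IP, dest IP, dest port) rule, else 1 (pass).
    The rule structure here is a placeholder assumption."""
    key = (packet["src_ip"], packet["dst_ip"], packet["dst_port"])
    return 0 if key in blocklist else 1

def annotate_traffic(in_path, out_path, blocklist, fieldnames):
    """Read packet descriptions from the test-traffic CSV, apply
    enforce_policy to each, and write each row to the output file
    with the 0/1 decision appended as an additional field."""
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, fieldnames=fieldnames)
        writer = csv.writer(fout)
        for row in reader:
            decision = enforce_policy(row, blocklist)
            writer.writerow(list(row.values()) + [decision])
```

A real enforcement function would consult the policies produced by the policy discovery component rather than a static blocklist.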
- Device Identification - This component will parse data from packet
traces to identify as specifically as possible the devices found on
the network. Identification could be highly specific (device type,
name, and software version), mid-range (the particular device itself),
or minimal (a class of device, e.g. television, home assistant,
laptop, phone).
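One common identification signal, mentioned under the device research task below, is the OUI: the first three octets of a MAC address, which map to a hardware vendor. A minimal sketch, assuming a hand-built prefix table (the example prefixes should be verified against the IEEE OUI registry):

```python
# Example OUI-prefix table; real mappings come from the IEEE OUI
# registry and from your team's device research.
OUI_VENDORS = {
    "44:65:0d": "Amazon Technologies",
    "18:b4:30": "Nest Labs",
}

def vendor_from_mac(mac):
    """Look up a vendor by the first three octets (OUI) of a MAC
    address; return "unknown" if the prefix is not in the table."""
    prefix = mac.lower()[:8]
    return OUI_VENDORS.get(prefix, "unknown")
```

Vendor alone is not a full identification, but combined with observed ports and destinations it narrows the device class considerably.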
- Policy Discovery - Given the identification of the device, per the
previous task, you are to generate a policy for allowing packets to
and from the device. Part of this policy will be determined by
information that you know about the device from the device research
component; e.g. you might know that Amazon Echo devices routinely
access a particular range of addresses, and other accesses might be
suspicious. You might determine some of the policy rules by learning
them from the traces provided as training data (e.g. the same kind of
device might be configured in different ways, or users may use
different services from the device that you want to encode in the
policy rules).
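Learning rules from training traces can be as simple as whitelisting each device's observed destinations. The sketch below assumes rows parsed into dicts with the field names shown; it is a deliberately naive model, and a real solution would generalize (e.g. to address ranges) rather than memorize exact tuples.

```python
from collections import defaultdict

def learn_allowed_destinations(training_rows):
    """Build a per-device whitelist from benign training traffic:
    the set of (dest IP, dest port, protocol) tuples each source
    MAC was observed using."""
    policy = defaultdict(set)
    for row in training_rows:
        policy[row["src_mac"]].add(
            (row["dst_ip"], row["dst_port"], row["protocol"]))
    return policy

def is_allowed(policy, row):
    """True if this packet matches a learned rule for its source."""
    key = (row["dst_ip"], row["dst_port"], row["protocol"])
    return key in policy.get(row["src_mac"], set())
```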
- Device and Configuration Management - This component will provide
a user interface for viewing a list of the devices that have been
identified on the network, and will enable an administrator to make
policy decisions regarding the ability of each device to communicate
with other devices on the network and with certain network ranges.
There is a lot that can be done here if you choose to do so. For
example, blocked connections could be displayed when one clicks on an
icon associated with a device, and one could override the policy to
enable those connections in the future.
- Dataset Generation - One team member should be designated as lead
for dataset generation and collection. Team members will identify and
possibly generate (or request generation of) datasets that may be used
as training data to establish the policies to be generated and the
identifications to be performed by the tools described above. They
will also generate modified datasets that include entries intended for
rejection by the policy tools described above. Datasets may be
identified by looking for similar datasets available through the web.
They may also be generated from packet traces which Professor Neuman
will collect from his home network, which has many of the devices you
may seek to identify. The lead member for this function will serve as
a liaison to other teams and will contribute data on datasets for use
by all teams. Part of the scoring for teams will include the number
and quality of the datasets (and whether they were obtained elsewhere
or generated). Teams will be allowed to delay the contribution of
datasets to the collective for up to 4 hours, giving them an advantage
in the use of the data; however, only the first to submit a dataset
will receive credit in the scoring (if you generate the dataset
yourself, it will be unique anyway, but if it was found online, only
the first team to submit it receives the credit).
- Device Research - One team member should be designated as lead for
device research. Team members will research specifications of home
network devices through the internet, discussion forums, and other
sources to identify rules that may be applied to identify devices
(e.g. ranges of MAC addresses, which ports are open, and where the
device communicates). The lead member for this function will serve as
a liaison to other teams and will contribute data on identified
devices for use by all teams. Part of the scoring for teams will
include the number and quality of the device descriptions contributed
for use by all teams. Teams will be allowed to delay the contribution
of device data to the collective for up to 4 hours, giving them an
advantage in the use of the data; however, only the first to submit
will receive credit in the scoring.
Team Members
Each team has 8 members. You should allocate members to tasks so that
each task has at least one member assigned as responsible. For the
device research and dataset generation tasks, the team must designate
one leader who will serve as a liaison to the other teams. We will
create Slack channels for this communication. These two tasks will
likely involve all team members, since you will want to generate data
and device information that is useful to your other team members.
Communication
Please join the Slack workspace for the 2018 Viterbi Hackathon. Once
teams are settled we will create a channel for each team.
Additionally, the liaisons for dataset generation and device research
will be added to special channels for those functions.
Data set formats
The structure of the data for the communications dataset will be CSV (comma-separated values) including the following fields:
- Date
- Time
- Source MAC
- Source IP
- Source Port
- Dest MAC
- Dest IP
- Dest Port
- Protocol
- Good packet = 1, Bad packet = 0
- Allowed = 1, Blocked = 0
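A row in this format can be loaded into a keyed structure as sketched below. The snake_case field names are my own shorthand for the columns listed above, and the header handling is an assumption (the provided files may or may not include a header row).

```python
import csv
from io import StringIO

# Column order as listed in the dataset format above.
FIELDS = ["date", "time", "src_mac", "src_ip", "src_port",
          "dst_mac", "dst_ip", "dst_port", "protocol",
          "good_bad", "allowed_blocked"]

def parse_traffic_csv(text):
    """Parse the communications CSV into a list of dicts keyed
    by the field names above. Assumes no header row."""
    return list(csv.DictReader(StringIO(text), fieldnames=FIELDS))
```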
The structure of the data for devices may include:
- Mac address
- IP Address
- Device Class
- Device Identification
- Software Identification
- Software Version
- Open Ports
- Communication Ranges and Protocols
Fields for additional kinds of datasets may be defined if needed.
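For the device census, one convenient in-memory representation of the fields above is a small record type. The field names here are illustrative, not a required schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeviceRecord:
    """Device census entry mirroring the device fields listed
    above; list fields default to empty."""
    mac: str
    ip: str
    device_class: str = ""
    device_id: str = ""
    software_id: str = ""
    software_version: str = ""
    open_ports: list = field(default_factory=list)
    comm_ranges: list = field(default_factory=list)
```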
Curated Datasets
Data for use in testing
We have several baseline datasets which you can use throughout your
development to investigate device types, and/or to use as training
data if you are applying an ML-based approach. These baseline
datasets were contributed by members of different teams and are posted
in the datasetgeneration channel of the Slack workspace. Among the
sets posted, the two best to work with are the CSV.rar set posted by
Senju and the dataset posted by Manal. The CSV.rar dataset is
particularly useful because it is broken into files by device type,
and it provides a good sense of the typical communication for certain
kinds of devices. While this data is in CSV format, note that the
fields do not exactly line up with the format provided to you at the
start of the hackathon, so some manipulation by your team will be
required.
While I have called out these two data sets, other posted data sets by
other teams are also useful to determine the policies to be enforced
by your policy generator.
For the evaluation of your solutions, I will generate a new dataset
constructed to embed selected components from the datasets just
mentioned, but my dataset will be formatted according to the initial
definition provided to your teams. I will modify the addresses of
devices, filter some unnecessary data, and insert attack packets which
you are to detect and flag as blocked.
To help you test the basic operation of your solutions, I am including
one such evaluation file dataset here:
This testfile contains traffic associated primarily with one IoT
device, which you should ideally identify. Additionally, the testfile
contains a single malicious packet, which should be blocked without
blocking the normal communication for this device. To help you
identify this packet (to validate your testing), it is the only packet
with a label in the Good/Bad field, and that label is 0. Note that
this field will NOT be filled in for the actual dataset. The time
entry in this file is a packet sequence number rather than an actual
time. You may assume that times in the actual test files will be
monotonically increasing, but they similarly might not reflect an
absolute time. The packets in this sample file do contain one
additional field, which is a comment. You should not rely on anything
in this field for your actual end product; it may or may not be blank
in the actual test file.
What to expect on Sunday
Sunday morning all teams should be putting the finishing touches on
their software. You may use the test file described above to make sure
you are processing inputs correctly. The expected output of your
program will be a copy of the test file above, but with a 0 (block) or
1 (allow) in the allow/block column of the output CSV file.
Additionally, during the processing of the file you should generate a
list of devices that have been identified on the network, and with
each device you should associate as much information as possible (IP
address, optional MAC address, device class, optional device
identification and software identification, and addresses with which
the device may communicate). Please generate a table with the list of
devices, which you can then include in the slides for your
presentation.
Here is early access to the actual test dataset you will run your
system against. Four attack packets have been manually inserted into
this dataset (there might be other attack traffic previously
unidentified in the original baseline data, but that is not the
primary focus of this exercise). You can use this dataset to debug,
but when your system detects anomalous packets in the final run, I
will ask how it identified them, so that I know it was the system that
detected the traffic and not other tests used to give your program the
correct results.
Around 11:30 AM we will have you run your prototype system on a
testfile to be provided.
At noon we will have lunch and an introduction of the alumni judges.
At 1 PM or slightly earlier, each team will be given 10 minutes to
make a PowerPoint presentation describing 1) your understanding of the
problem you were given to solve, 2) a description of your approach to
solving the problem, 3) a description of the datasets you generated or
used in developing your solution, 4) why you believe your solution is
effective, and 5) what you think would be needed in terms of technical
changes/development to turn your prototype into a useful product. You
should prepare this presentation beginning Sunday morning.