2018 Viterbi Masters Hackathon
Startup Garage - ISI - Marina del Rey - 6 April 2018 through 8 April 2018
Problem Description
Network Communication Policy Generation
Given a dataset of connections established by devices on a local area
network, teams will identify the kinds of devices as specifically as
possible and generate a policy to block communications that are
dangerous (i.e. that indicate or enable subversion or exfiltration
attempts).
At the end of the competition teams will apply the system they
develop against a revised dataset, identify which packets are to be
blocked, and output a census describing the devices identified as
present on the network.
Scoring will be based on correct identification of devices, as well as
the false positive and false negative rates on the connections allowed
by the generated policies. Scoring will also consider the additional
datasets and device research contributed for use by all teams.
Structure of Teams
Four teams of 8 students each have been assigned. The assignments
were made to ensure that each team has students with the course
backgrounds and technical interests necessary to be successful.
We will accept requests to switch teams if there are particular
individuals you wish to work with, but such exchanges must be made in
a way that maintains the balance of backgrounds on each team.
Specific Tasks to be Performed
It is not our intent to define the structure of the software that will
be developed by your team, but we note that developing your solution
will require the following activities.
- Policy Enforcement - This component will read packet descriptions
from the final list of test traffic and determine whether each packet
should be blocked. There should be a function that returns 0 (for
block) or 1 (for pass). Additionally there should be a function that
reads lines from the test traffic file, applies the policy enforcement
function, then writes each packet description to an output file with
an additional field containing the 0 or 1 determined by applying the
policy enforcement function.
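The two functions described above can be sketched as follows. This is a minimal outline, not a required design; the dictionary field names and the (source IP, dest IP, dest port) rule structure are assumptions for illustration, and your policy logic will be richer.

```python
import csv

def enforce_policy(packet, blocklist):
    """Return 0 (block) if the packet matches a blocked
    (source IP, dest IP, dest port) rule, else 1 (pass).
    The rule structure here is a placeholder assumption."""
    key = (packet["src_ip"], packet["dst_ip"], packet["dst_port"])
    return 0 if key in blocklist else 1

def annotate_traffic(in_path, out_path, blocklist, fieldnames):
    """Read packet descriptions from the test-traffic CSV, apply
    enforce_policy to each, and write each row to the output file
    with the 0/1 decision appended as an additional field."""
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, fieldnames=fieldnames)
        writer = csv.writer(fout)
        for row in reader:
            decision = enforce_policy(row, blocklist)
            writer.writerow(list(row.values()) + [decision])
```

A real enforcement function would consult the policies produced by the policy discovery component rather than a static blocklist.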
- Device Identification - This component will parse data from packet
traces to identify as specifically as possible the devices found on
the network. Identification could be highly specific (device type,
name, and software version), mid-range (the particular device itself),
or minimal (a class of device, e.g. television, home assistant,
laptop, phone).
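One common identification signal, mentioned under the device research task below, is the OUI: the first three octets of a MAC address, which map to a hardware vendor. A minimal sketch, assuming a hand-built prefix table (the example prefixes should be verified against the IEEE OUI registry):

```python
# Example OUI-prefix table; real mappings come from the IEEE OUI
# registry and from your team's device research.
OUI_VENDORS = {
    "44:65:0d": "Amazon Technologies",
    "18:b4:30": "Nest Labs",
}

def vendor_from_mac(mac):
    """Look up a vendor by the first three octets (OUI) of a MAC
    address; return "unknown" if the prefix is not in the table."""
    prefix = mac.lower()[:8]
    return OUI_VENDORS.get(prefix, "unknown")
```

Vendor alone is not a full identification, but combined with observed ports and destinations it narrows the device class considerably.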
- Policy Discovery - Given the identification of the device, per the
previous task, you are to generate a policy for allowing packets to
and from the device. Part of this policy will be determined by
information that you know about the device from the device research
component; e.g. you might know that Amazon Echo devices routinely
access a particular range of addresses, and other accesses might be
suspicious. You might determine some of the policy rules by learning
them from the traces provided as training data (e.g. the same kind of
device might be configured in different ways, or users may use
different services from the device that you want to encode in the
policy rules).
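Learning rules from training traces can be as simple as whitelisting each device's observed destinations. The sketch below assumes rows parsed into dicts with the field names shown; it is a deliberately naive model, and a real solution would generalize (e.g. to address ranges) rather than memorize exact tuples.

```python
from collections import defaultdict

def learn_allowed_destinations(training_rows):
    """Build a per-device whitelist from benign training traffic:
    the set of (dest IP, dest port, protocol) tuples each source
    MAC was observed using."""
    policy = defaultdict(set)
    for row in training_rows:
        policy[row["src_mac"]].add(
            (row["dst_ip"], row["dst_port"], row["protocol"]))
    return policy

def is_allowed(policy, row):
    """True if this packet matches a learned rule for its source."""
    key = (row["dst_ip"], row["dst_port"], row["protocol"])
    return key in policy.get(row["src_mac"], set())
```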
- Device and Configuration Management - This component will provide
a user interface for viewing a list of the devices that have been
identified on the network, and will enable an administrator to make
policy decisions regarding the ability of each device to communicate
with other devices on the network and with certain network ranges.
There is a lot that can be done here if you choose to do so. For
example, blocked connections could be displayed when one clicks on an
icon associated with a device, and one could override the policy to
enable those connections in the future.
- Dataset Generation - One team member should be designated as lead
for dataset generation and collection. Team members will identify and
possibly generate (or request generation of) datasets that may be used
as training data to establish the policies to be generated and the
identifications to be performed by the tools described above. They
will also generate modified datasets that include entries intended for
rejection by the policy tools described above. Datasets may be
identified by looking for similar datasets available through the web.
They may also be generated from packet traces which Professor Neuman
will collect from his home network, which has many of the devices you
may seek to identify. The lead member for this function will serve as
a liaison to other teams and will contribute data on datasets for use
by all teams. Part of the scoring for teams will include the number
and quality of the datasets (and whether they were obtained elsewhere
or generated). Teams will be allowed to delay the contribution of
datasets to the collective for up to 4 hours, giving them an advantage
in the use of the data; however, only the first to submit a dataset
will receive credit in the scoring (if you generate the dataset
yourself, it will be unique anyway, but if it was found online, only
the first team to submit it receives the credit).
- Device Research - One team member should be designated as lead for
device research. Team members will research specifications of home
network devices through the internet, discussion forums, and other
sources to identify rules that may be applied to identify devices
(e.g. ranges of MAC addresses, which ports are open, and where the
device communicates). The lead member for this function will serve as
a liaison to other teams and will contribute data on identified
devices for use by all teams. Part of the scoring for teams will
include the number and quality of the device descriptions contributed
for use by all teams. Teams will be allowed to delay the contribution
of device data to the collective for up to 4 hours, giving them an
advantage in the use of the data; however, only the first to submit
will receive credit in the scoring.
Team Members
Each team has 8 members. You should allocate members to tasks so that
each task has at least one member assigned as responsible. For the
device research and dataset generation tasks, the team must designate
one leader who will serve as a liaison to the other teams. We will
create Slack channels for this communication. These two tasks will
likely involve all team members, since you will want to generate data
and device information that is useful to your other team members.
Communication
Please join the Slack workspace for the 2018 Viterbi Hackathon. Once
teams are settled we will create a channel for each team.
Additionally, the liaisons for dataset generation and device research
will be added to special channels for those functions.
Data set formats
The structure of the data for the communications dataset will be CSV (comma-separated values) including the following fields:
- Date
- Time
- Source MAC
- Source IP
- Source Port
- Dest MAC
- Dest IP
- Dest Port
- Protocol
- Good packet = 1, Bad packet = 0
- Allowed = 1, Blocked = 0
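A row in this format can be loaded into a keyed structure as sketched below. The snake_case field names are my own shorthand for the columns listed above, and the header handling is an assumption (the provided files may or may not include a header row).

```python
import csv
from io import StringIO

# Column order as listed in the dataset format above.
FIELDS = ["date", "time", "src_mac", "src_ip", "src_port",
          "dst_mac", "dst_ip", "dst_port", "protocol",
          "good_bad", "allowed_blocked"]

def parse_traffic_csv(text):
    """Parse the communications CSV into a list of dicts keyed
    by the field names above. Assumes no header row."""
    return list(csv.DictReader(StringIO(text), fieldnames=FIELDS))
```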
The structure of the data for devices may include:
- Mac address
- IP Address
- Device Class
- Device Identification
- Software Identification
- Software Version
- Open Ports
- Communication Ranges and Protocols
Fields for additional kinds of datasets may be defined if needed.
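For the device census, one convenient in-memory representation of the fields above is a small record type. The field names here are illustrative, not a required schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeviceRecord:
    """Device census entry mirroring the device fields listed
    above; list fields default to empty."""
    mac: str
    ip: str
    device_class: str = ""
    device_id: str = ""
    software_id: str = ""
    software_version: str = ""
    open_ports: list = field(default_factory=list)
    comm_ranges: list = field(default_factory=list)
```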
Curated Datasets
Data for use in testing
We have several baseline datasets which you can use throughout your
development to investigate device types, and/or to use as training
data if you are applying an ML-based approach. These baseline
datasets were contributed by members of different teams and are posted
in the datasetgeneration channel of the Slack workspace. Among the
sets posted, the two best to work with are the CSV.rar set posted by
Senju and the dataset posted by Manal. The CSV.rar dataset is
particularly useful because it is broken into files by device type,
and it provides a good sense of the typical communication for certain
kinds of devices. While this data is in CSV format, note that the
fields do not exactly line up with the format provided to you at the
start of the hackathon, so some manipulation by your team will be
required.
While I have called out these two data sets, other posted data sets by
other teams are also useful to determine the policies to be enforced
by your policy generator.
For the evaluation of your solutions, I will generate a new dataset
constructed to embed selected components from the datasets just
mentioned, but my dataset will be formatted according to the initial
definition provided to your teams. I will modify the addresses of
devices, filter some unnecessary data, and insert attack packets which
you are to detect and flag as blocked.
To help you test the basic operation of your solutions, I am including
one such evaluation file dataset here:
This testfile contains traffic associated primarily with one IoT
device, which you should ideally identify. Additionally, the testfile
contains a single malicious packet, which should be blocked without
blocking the normal communication for this device. To help you
identify this packet (to validate your testing), it is the only packet
with a label in the Good/Bad field, and that label is 0. Note that
this field will NOT be filled in for the actual dataset. The time
entry in this file is a packet sequence number rather than an actual
time. You may assume that times in the actual test files will be
monotonically increasing, but they similarly might not reflect an
absolute time. The packets in this sample file do contain one
additional field, which is a comment. You should not rely on anything
in this field for your actual end product; it may or may not be blank
in the actual test file.
What to expect on Sunday
Sunday morning all teams should be putting the finishing touches on
their software. You may use the test file described above to make sure
you are processing inputs correctly. The expected output of your
program will be a copy of the test file above, but with a 0 (block) or
1 (allow) in the allow/block column of the output CSV file.
Additionally, during the processing of the file you should generate a
list of devices that have been identified on the network, and with
each device you should associate as much information as possible (IP
address, optional MAC address, device class, optional device
identification and software identification, and addresses with which
the device may communicate). Please generate a table with the list of
devices, which you can then include in the slides for your
presentation.
Here is early access to the actual test dataset you will run your
system against. Four attack packets have been manually inserted into
this dataset (there might be other attack traffic previously
unidentified in the original baseline data, but that is not the
primary focus of this exercise). You can use this dataset to debug,
but when your system detects anomalous packets in the final run, I
will ask how it identified them, so that I know it was the system that
detected the traffic and not other tests used to give your program the
correct results.
Around 11:30 AM we will have you run your prototype system on a
testfile to be provided.
At noon we will have lunch and an introduction of the alumni judges.
At 1 PM or slightly earlier, each team will be given 10 minutes to
make a PowerPoint presentation describing 1) your understanding of the
problem you were given to solve, 2) a description of your approach to
solving the problem, 3) a description of the datasets you generated or
used in developing your solution, 4) why you believe your solution is
effective, and 5) what you think would be needed in terms of technical
changes/development to turn your prototype into a useful product. You
should prepare this presentation beginning Sunday morning.