forked from wolever/Protocol-Informatics
The Protocol Informatics Framework
----------------------------------
Written by Marshall Beddoe <mbeddoe@baselineresearch.net>
Extended and modified by Lothar Braun <braun@net.in.tum.de>
Copyright (c) 2004 Baseline Research
Copyright (c) 2011 Lothar Braun
Source code repository available at
https://github.com/constcast/Protocol-Informatics
Overview:
The Protocol Informatics project is a software framework that allows for
advanced sequence and protocol stream analysis by utilizing bioinformatics
algorithms. The sole purpose of this software is to identify protocol fields in
unknown or poorly documented network protocol formats. The algorithms that are
utilized perform comparative analysis on a series of samples to better
understand the underlying structure of the otherwise random-looking data. The
PI framework was designed for experimentation through the use of a widget-based
component set.
The framework aims to include a number of different algorithms that help
identify protocol structures in network traces. It ships with a
command line interface that lets you interactively control the process
of inferring protocol information from network traces.
Requirements:
-------------
Python >= 2.4 http://www.python.org
numpy http://numpy.scipy.org/
PyYAML http://pyyaml.org/
Optional:
Pcapy http://oss.coresecurity.com/projects/pcapy.html
Pydot http://code.google.com/p/pydot/
Controlling PI using the command line interface:
------------------------------------------------
All commands in the interface are meant to be documented in the built-in
help. Whenever you want to learn more about a command, use the
"help" command:
inf> help quit
Quit the program.
Program start:
You can start the program with or without a configuration file:
./main.py -c config.yml
If you do not specify a configuration file, a default file
'config.yml' in the current working directory will be used. If that
file does not exist, a default configuration will be loaded and
stored in 'config.yml'.
Whenever you change the configuration from within the program,
e.g. using the "config" command, you can save it using the
"saveconfig" command. The saved configuration can then be loaded on
program start. If you set the configuration parameters "inputFile"
and "format" in your config, PI will automatically try to read input
from this file.
Reading input data:
There are basically two ways to read sequences into your
environment. You can set the "inputFile" and "format" configuration
variables, save your config using "saveconfig", and restart the
program using the "restart" command.
Or you can explicitly read input using the "read" command in your
environment. The following shows what first steps in the command line
interface can look like.
First steps:
Start the program:
$ ./main.py
No default configuration found. Creating a default config file
"config.yml".
Welcome to Protocol-Informatics. What do you want to do today?
inf>
This creates your default configuration file with default parameters
and drops you into the command line prompt. You can list the available
commands using the "help" command:
: inf> help
:
: Documented commands (type help <topic>):
: ========================================
: EOF PI config env exit help quit read restart saveconfig seqs show
:
: inf>
For each command, you can get verbose help by passing the command's
name to the help command itself:
: inf> help read
: Command syntax: read [<bro|pcap|ascii|config>] <file>
:
: Tries to read file <file> in the specified format. If format
: equals "config", a new configuration file is read from <file>.
: In all other cases, input data for the protocol inferences are
: read in the specified format (bro, pcap, ascii)
: inf>
An important command is the "config" command which can be used to read
and set configuration variables. If it is run without an argument, it
will print the configuration:
: inf> config
: ethOffset 14
: maxMessages 50
: weight 1.0
: format pcap
: graph False
: textBased False
: configFile config.yml
: messageDelimiter None
: onlyUniq False
: gnuplotFile None
: inputFile None
: interactive True
The configuration parameters are important for controlling the program
and are documented in the following sections. They can be grouped by
their meaning and by the module that uses them. For the main module,
denoted by the inf> prompt, the important parameters are:
ethOffset:
Important when pcap files are read: Defines the length of the
ETH header. The default value is 14 (use 18 if you have a
trace from a VLAN tagged network).
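As a rough illustration of what ethOffset controls, the sketch below drops
the first ethOffset bytes of a captured frame. The function name and the
sample frame are hypothetical and are not taken from the PI source code.

```python
ETH_HEADER_LEN = 14       # plain Ethernet header (the documented default)
VLAN_ETH_HEADER_LEN = 18  # Ethernet header plus one VLAN tag


def strip_link_layer(frame, eth_offset=ETH_HEADER_LEN):
    """Return the bytes that follow the link-layer header of a frame.

    Hypothetical helper: it only shows the effect of ethOffset.
    """
    if len(frame) < eth_offset:
        raise ValueError("frame is shorter than ethOffset")
    return frame[eth_offset:]


# Made-up frames: header bytes (all zero) followed by a short payload.
plain_frame = b"\x00" * ETH_HEADER_LEN + b"\x45\x00\x00\x1c"
vlan_frame = b"\x00" * VLAN_ETH_HEADER_LEN + b"\x45\x00\x00\x1c"

plain_payload = strip_link_layer(plain_frame)
vlan_payload = strip_link_layer(vlan_frame, VLAN_ETH_HEADER_LEN)
```

With the wrong offset for a VLAN tagged trace, the four tag bytes would end
up at the start of every message.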
maxMessages:
Defines how many messages will be read by default from the
input traces. If this is set to 0, all messages are read from
the input file.
onlyUniq:
Controls whether only unique messages are read from the input
file or if duplicate messages are allowed. Please note: this
parameter depends on the connection context. If this
configuration parameter is set to true, this will only remove
duplicate messages from within connections. Duplicate messages
that are distributed over multiple connections will still be
part of the input data.
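The per-connection behaviour of onlyUniq can be sketched as follows. The
data layout (a list of connection id / message list pairs), the function
name, and the choice to apply maxMessages after duplicates are removed are
assumptions made for this example, not a description of the actual code.

```python
def select_messages(connections, only_uniq=False, max_messages=0):
    """Collect (connection id, message) pairs from per-connection lists.

    connections: list of (connection_id, [message, ...]) tuples.
    only_uniq:   drop duplicates, but only within the same connection.
    max_messages: stop after this many messages; 0 means no limit.
    """
    selected = []
    for conn_id, messages in connections:
        seen = set()  # reset per connection: duplicates across connections stay
        for msg in messages:
            if only_uniq:
                if msg in seen:
                    continue
                seen.add(msg)
            selected.append((conn_id, msg))
            if max_messages and len(selected) >= max_messages:
                return selected
    return selected


# Made-up input: "GET" appears twice in c1 and once more in c2.
connections = [("c1", [b"GET", b"GET", b"OK"]), ("c2", [b"GET"])]
unique = select_messages(connections, only_uniq=True)
```

Here the second "GET" of connection c1 is dropped, while the "GET" of
connection c2 is kept because it belongs to a different connection.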
inputFile:
Defines the filename that will be used to read messages from.
format:
Defines the format that is used to read the file specified
by inputFile. Possible values:
  - pcap
      - expects a pcap file as produced by tcpdump -w <filename>
  - bro
      - expects an adu file as produced by bro with the script
        that is shipped with this source code in
        bro-scripts/adu_writer.bro
  - ascii
      - expects a text file which contains a number of messages
        separated by the newline character
PCAP files can be converted into bro adu files as follows: change to the
bro-scripts directory and run
<path_to_bro>/bin/bro -C -r <path_to_pcap> adu_writer.bro
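For the "ascii" format, a minimal sketch of a reader could look like this.
Skipping empty lines is an assumption of this example, and the function
name is hypothetical.

```python
def read_ascii_messages(text):
    """Split newline-separated text into a list of messages.

    Empty lines are skipped (an assumption of this sketch).
    """
    return [line for line in text.splitlines() if line]


# Made-up input with three messages and one blank line.
sample = "HELO example\nMAIL FROM:<a@b>\n\nQUIT\n"
messages = read_ascii_messages(sample)
```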
configFile:
Filename of the YAML configuration file in which the current
config is stored by the "saveconfig" command.
interactive:
Defines if PI should run in interactive or non-interactive
mode. Currently, only interactive mode is supported.
Other configuration parameters are only necessary in
submodules. Currently, we have the following submodules:
- seqs
    Offers methods for inspecting and changing the input
    data. This module allows you, for example, to select a
    random subsample of the input data, or to select only
    unique messages.
- PI
    Offers the original functionality of the PI
    framework. Can create distance matrices and phylogeny
    trees and can perform multiple sequence alignment. Please
    find more information on the code in PI/README.
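To give an idea of how sequence alignment helps with field identification,
the sketch below aligns two messages with the standard Needleman-Wunsch
global alignment algorithm and marks the columns in which both messages
agree. It is a generic illustration: the scoring values and function names
are assumptions, and it is not the implementation found in the PI module.

```python
GAP = None  # marker for a gap position in an aligned sequence


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b (Needleman-Wunsch); return two gapped lists."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP); i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); j -= 1
    out_a.reverse()
    out_b.reverse()
    return out_a, out_b


def conserved_columns(out_a, out_b):
    """True for every aligned column in which both sequences agree."""
    return [x == y for x, y in zip(out_a, out_b)]


aligned_a, aligned_b = align("ABCD", "ABD")
conserved = conserved_columns(aligned_a, aligned_b)
```

Columns that stay identical across many aligned samples are candidates for
constant protocol fields, while columns with gaps or varying values point
to variable fields.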
Configuration parameters for the "seqs" module:
messageDelimiter:
This configuration parameter can be used to split messages
according to a sequence of characters.
fieldDelimiter:
Currently unused.
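The effect of messageDelimiter can be sketched as follows. Dropping the
delimiter itself and discarding empty parts are assumptions of this
example, and the function name is hypothetical.

```python
def split_messages(messages, delimiter=None):
    """Split every message at each occurrence of delimiter.

    With no delimiter (the default, None) the messages are returned unchanged.
    """
    if not delimiter:
        return list(messages)
    result = []
    for msg in messages:
        result.extend(part for part in msg.split(delimiter) if part)
    return result


# Made-up input: two protocol lines that arrived in a single message.
raw = [b"USER anonymous\r\nPASS guest\r\n", b"QUIT"]
split = split_messages(raw, b"\r\n")
```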
Configuration parameters for the PI module:
graph:
Decides whether graphs are written to disk
gnuplotFile:
Currently unused
weight:
Weight used to determine how many clusters are found when
grouping messages according to their similarity.
=== Discoverer module specific config options ===
minWordLength:
The minimum length of a run of printable characters that is considered a text token.
ASCIILowerBound:
The lowest ASCII character considered a printable token (used for text classification).
ASCIIUpperBound:
The highest ASCII character considered a printable token (used for text classification).
dumpFile:
Path where the Discoverer results are written when the 'dumpresult' command is executed.
The filename is taken from the inputFile configuration parameter.
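The interplay of minWordLength, ASCIILowerBound and ASCIIUpperBound can be
sketched as below: runs of bytes inside the printable range that are at
least minWordLength long become text tokens, everything else is treated as
binary. The default values (3, 32, 126), the merging of adjacent binary
bytes into one token, and the function name are assumptions of this
example and may differ from the Discoverer module.

```python
def tokenize(message, min_word_length=3, ascii_lower=32, ascii_upper=126):
    """Split a byte message into ("text", bytes) and ("binary", bytes) tokens."""
    data = bytearray(message)
    tokens = []
    binary = bytearray()  # binary bytes collected since the last text token
    i = 0
    while i < len(data):
        # Find the end of the printable run that starts at position i.
        j = i
        while j < len(data) and ascii_lower <= data[j] <= ascii_upper:
            j += 1
        if j - i >= min_word_length:
            if binary:
                tokens.append(("binary", bytes(binary)))
                binary = bytearray()
            tokens.append(("text", bytes(data[i:j])))
            i = j
        else:
            # Run is too short (or empty): count these bytes as binary.
            end = max(j, i + 1)
            binary.extend(data[i:end])
            i = end
    if binary:
        tokens.append(("binary", bytes(binary)))
    return tokens


# Made-up message: two binary bytes, the word "GET", one binary byte, "ab".
tokens = tokenize(b"\x00\x01GET\x02ab")
```

In this example "ab" ends up in a binary token because it is shorter than
the minimum word length of 3.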