With ever-increasing advances in network protocols and growing traffic complexity, new challenges are emerging in traffic characterization and management. In this thesis, we propose a new approach that complements existing ones by providing a simple, high-level understanding of network traffic. Our approach uses a (p,n)-gram representation to analyze network traffic, where a (p,n)-gram is an n-byte string starting at byte offset p within a packet.
We argue that the (p,n)-gram representation combines the efficiency of using specific packet fields (e.g., ports) with the generalized pattern matching of n-grams, without
the complexity and overhead of full-packet pattern matching. We also show that (p,n)-grams allow traffic to be analyzed across all parts of a packet (payload content, header port/flow fields, and other header behavior fields) without conflating similar patterns that happen to occur in different fields.
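The definition above can be illustrated with a minimal sketch. The function name `pn_grams` and the optional `max_offset` parameter are assumptions made for illustration only; they are not taken from the thesis.

```python
from typing import Iterator, Optional, Tuple

# A (p,n)-gram is represented here as (offset p, n-byte string).
PNGram = Tuple[int, bytes]


def pn_grams(packet: bytes, n: int, max_offset: Optional[int] = None) -> Iterator[PNGram]:
    """Yield every (p,n)-gram of `packet`: the n-byte string starting at offset p.

    `max_offset` is a hypothetical knob that limits extraction to the first
    offsets of the packet (e.g. to cover headers only).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    last = len(packet) - n  # last offset at which a full n-byte string fits
    if max_offset is not None:
        last = min(last, max_offset)
    for p in range(last + 1):
        yield p, packet[p:p + n]


# The same two bytes at offsets 0 and 2 give one plain n-gram but two
# distinct (p,n)-grams, so patterns in different fields are kept apart.
example = list(pn_grams(b"\x00\x50\x00\x50", 2))
```

Keeping the offset as part of the key is what separates identical byte strings that appear in different fields; a plain n-gram would merge them.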
As a proof of concept, we develop ADHIC, a lightweight, unsupervised, (p,n)-gram-based clustering algorithm that makes no prior assumptions about the protocols involved. We show that ADHIC can automatically cluster network traffic, using a binary decision tree, into equivalence
classes that closely approximate standard measures of network traffic. We also show that ADHIC can be used to monitor network traffic by observing the dynamic updates to the clustering tree. These incremental updates highlight temporal changes in network traffic that are not easily detected using standard network analysis methods.
We then study the characteristics and distributions of (p,n)-grams in network packets, and how they can be utilized for traffic analysis. In particular, we argue that (p,n)-grams have an automatic fingerprinting capability: a simple frequency
analysis of network packets can capture structural (p,n)-grams based on their relatively high frequencies. These (p,n)-grams represent protocol and sub-protocol structures as well as cross-protocol patterns.
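The frequency analysis described above might be sketched as follows. This is an illustrative sketch, not the procedure used in the thesis: the function names and the `min_freq` threshold are assumptions introduced here.

```python
from collections import Counter
from typing import Iterable, List, Tuple

PNGram = Tuple[int, bytes]  # (offset p, n-byte string)


def pn_gram_frequencies(packets: Iterable[bytes], n: int) -> List[Tuple[PNGram, float]]:
    """Return (p,n)-grams ranked by the fraction of packets that contain them."""
    counts: Counter = Counter()
    total = 0
    for pkt in packets:
        total += 1
        # Each offset occurs once per packet, so a count equals a packet count.
        for p in range(len(pkt) - n + 1):
            counts[(p, pkt[p:p + n])] += 1
    if total == 0:
        return []
    return [(gram, c / total) for gram, c in counts.most_common()]


def frequent_pn_grams(ranked: List[Tuple[PNGram, float]], min_freq: float) -> List[PNGram]:
    """Keep (p,n)-grams at or above `min_freq` (a hypothetical cut-off)."""
    return [gram for gram, freq in ranked if freq >= min_freq]


packets = [b"GET /a", b"GET /b", b"POST /"]
ranked = pn_gram_frequencies(packets, 3)
frequent = frequent_pn_grams(ranked, min_freq=0.5)
```

In this toy input, `(0, b"GET")` appears in two of three packets and passes the cut-off, while `b"T /"` is counted separately at offsets 2 and 3 rather than being merged.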
We observe that (p,n)-grams follow a power-law-like distribution in which the structural ones constitute the rapidly dropping head of the curve that precedes the long tail. We argue that this distribution adds to the efficiency of (p,n)-gram-based traffic analysis because it characterizes structural (p,n)-grams as 1) a small set of (p,n)-grams that 2) can be easily distinguished from the long tail. Our
observation relies on a thorough empirical analysis using independent network traffic traces.