]>
2019-11-12T16:05:07+00:00
Negative selection of written language using character multiset statistics
0
We study the combination of symbol frequence analysis and negative selection for anomaly detection of discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both generated artificial data and real-world language data including the English Wikipedia. Simulation results on large reference corpora are used to study the effects of the assumptions made in the theoretical model in comparison to real-world data.
Added on initial load
Longitudinal
Edits between successive Article versions
Applying the NSA to sparse data such as written language requires special attention in terms of matching rules and representation of artificial antibodies due to the sparsity of data. This study has introduced a general method based on first-order statistics of character strings to implement a compact negative representation of text documents and for detecting changes in them. The distinctive feature of this method is to apply a special matching rule which compares the statistics of two strings instead of traditional bit-wise matching rules. As a side product this matching rule has the ability to detect deletions, which is a well known pitfall of many NSA algorithms.
We have shown that a reasonably simple Markov
model and a prior character probability distribution
can be effective in predicting the detection sensitivity of SF-NSA. Further, this model can be used in optimized
versions of the detector generating algorithm.
Simulation results
Websites
6
1256-1266
Yes
Journal article
Mathematical modeling
Simulation
Statistical analysis
A common data mining problem is how to locate and analyze changes in the sequences in order to detect either intentional modifications or unintentional corruption in the data.
In the following, we present a theoretical analysis on locating modifications in a symbol sequence using a combination of negative selection and character frequency analysis as a preprocessing stage. Also, we analyze how a na¨ıve first-order statistical language model can be used to approximate the detection probability.
10880
In Section 3, we present the
symbol frequency algorithm for anomaly detection in
discrete sequences. In Section 4, we present two Markov
chain models for analyzing the theoretical detection
probabilities.
Design and action
Negative selection of written language using character multiset statistics
Edit
25
Sample data
Live Wikipedia
English
Article
2010Z
2455197.5
2012-03-15T20:29:44Z
2456002.35398
2014-01-30T20:30:03Z
2456688.3542
Negative selection of written language using character multiset statistics