Thursday, July 12, 2012

Aho–Corasick_string_matching_algorithm

codechef July Challenge: Favorite Number problem.

The algorithm finds all matching positions that matches one or more in the given set of patterns.

It implements a finite state automaton.  States are 0, 1, 2, ....

goto(s, a) = s'  move from state s to s', when input is letter a

fail(s, a) = s' move from state s to s', when input letter a does not match any next state

output(s) = { set of strings in pattern }


When do matching text to patterns, just walk through the trie and follow fail link when necessary, similar to Knuth-Morris-Pratt.

First we show how to use the finite state machine to process text

state = 0
for each ch in text
  while goto(state, ch) == fail do, state = fail(state)  // must stop at state = 0
  state = goto(state, ch)
  if output(state) not empty, then emit output(state)

Now we show how to construct the functions.

goto() is easy, for each keywords y[i] , i=1..k (we have k keywords)
state = 0
for each ch in y[i]
  if goto(state, ch) == fail, then goto(state, ch) = cntstate++
  else do nothing
append y[i] to output[state]

fail() function is constructed using a BFS on the graph. We first do depth=1 nodes, which have parent as state=0, then do it with increasing depth.

queue = empty
state = 0
// do depth=1
for each next_state of state, fail(next_state) = state, enqueue(next_state)
// do depth=d using d-1
while queue not empty do
  r = queue.pop()
  for each symbol ch
    if goto(r, ch) = s != fail, then  // s is at depth=d with parent=r
      enqueue(s)
      state = fail(r)
      while goto(state, ch) == fail, do state = fail(state)
      fail(s) = state
      output(s).append(output(state))

No comments:

Post a Comment