Parsing text in Haskell with Alex

by Mithrandir

Recently I remembered that I’ve started HaCoTeB as a project in which to test text lexing and parsing with Alex and Happy. Things didn’t work so well and HaCoTeB died without a single Alex specification file written.

However, during this little free time I had this week, I tried again to parse a simple text file with Alex.

We are given an input file consisting of text and properly enclosed mark-ups like ** for bold text, // for italic text, — for stroked-out text and __ for underline text (each of them enclosing the text enriched with the format). For example, //this should be in italic// should be an italic text.

In order to do this, we begin be specifying the regular expressions used to split the text into tokens (also known as lexing or scanning) and the actions associated with each expression. First, we define the special symbols, used as a mark-up.

00021  
00022 -- put here all characters used to select format 
00023 $formatspec = [\*\/_\-] 
00024  

Then we specify the expressions and their actions.

00027  
00028 [^$formatspec]+ { takeText } 
00029 $formatspec { takeText } -- for single chars (think a * b) 
00030 \/{2} { flipFormat Italic } 
00031 \*{2} { flipFormat Bold } 
00032 _{2} { flipFormat Underline } 
00033 \-{2} { flipFormat Strikeout } 
00034 

As you see in line 29, we need to handle single appearances of mark-up characters in a separate regular expression. Otherwise, text like “mark-up” or “a * b” will result in a scan error.

Before defining the action, let’s look at how the task at hand was solved. Basically, we will gather as much text as possible without including any format specifier into a single value, constructed using takeText. When we encounter a format specifier we change a global state to inform the subsequent takeText call that the gathered string is enriched with a new format. Because that data is not defined anywhere in the Alex templates, we will have to implement it ourselves. To do that, we tell Alex that we want a monadic wrapper with user data

00018  
00019 -- the monadUserState allows for a monadic parser with user data 
00020 %wrapper "monadUserState" 
00021  

Then, we define the user data and the initial value for it (don’t change the name of the following functions, they are called this way from the template)

00071 type AlexUserState = [Format] 
00072  
00073 alexInitUserState = [] 
00074  

The Format data is another defined type, its definition may be left as an exercise since it is not relevant here.

Also, although the wrapper is called monadUserState there are no functions defined there to do that. You’ll have to implement them by hand for now, a mail was sent describing the problem on haskell-cafe, maybe it will be solved soon. (Also, this wrapper is not presented in the official documentation for Alex, I have found it while trying to solve other problems by looking at the templates).

00074  
00075 getUserData :: Alex AlexUserState
00076 getUserData = Alex $ \s@AlexState{alex_ust=udata} -> Right (s, udata) 
00077  
00078 setUserData :: AlexUserState -> Alex () 
00079 setUserData udata = Alex $ \s -> Right (s{alex_ust=udata}, ()) 
00080  

The last thing that you need to define is what should be done when the end of the file  (EOF) is encountered. We will return Nothing, making all actions have return type of Alex (Maybe Token) instead of simply Alex Token.

00097  
00098 {- 
00099 Remember that this function must also return a token!! We have made it to 
00100 return Nothing to enable you to forget about it. 
00101 -} 
00102 alexEOF :: Alex (Maybe Token) 
00103 alexEOF = return Nothing
00104 } 
00105  

While this has the advantage that you can use only the good tokens in the Token data definition, it has the disadvantage that two layers of monads are present. Of course, this function can be implemented by each user.

In the end, we need something to start the whole parsing process.  We do this via two functions: one to loop over the text returning it token by token and collecting the results into a list and another to start the whole computation.

00055  
00056 run :: String -> Either String [Token] 
00057 run content = runAlex content $ loop [] 
00058  
00059 loop end = do --alexMonadScan >>= \t -> loop end >>= \e -> return $ t : e 
00060   tok <- alexMonadScan; 
00061   case tok of 
00062     Nothing -> return end
00063     Just t -> loop end >>= \e -> return $ t : e
00064  

The end result looks like this:

[Text [Bold] "This is a bolded paragraph, every word here should be bold ",Text [Strikeout,Bold] "and this part should be stroked out",Text [Bold] " while ",Text [Italic,Bold] "this should be in italic",Text [Bold] " and ",Text [Underline,Bold] "this underlined",Text [Bold] ". ",Text [Italic,Strikeout,Underline,Bold] "This part should have all of the attributes",Text [Bold] "."]

However, this is not yet complete, I guess that no user would want to look at something like that. A conversion function is needed but that is not in the scope of this article. I’ll return on this topic later.

About these ads