Parsing text in Haskell with Alex
by Mithrandir
Recently I remembered that I’ve started HaCoTeB as a project in which to test text lexing and parsing with Alex and Happy. Things didn’t work so well and HaCoTeB died without a single Alex specification file written.
However, during this little free time I had this week, I tried again to parse a simple text file with Alex.
We are given an input file consisting of text and properly enclosed mark-ups like ** for bold text, // for italic text, — for stroked-out text and __ for underline text (each of them enclosing the text enriched with the format). For example, //this should be in italic// should be an italic text.
In order to do this, we begin be specifying the regular expressions used to split the text into tokens (also known as lexing or scanning) and the actions associated with each expression. First, we define the special symbols, used as a mark-up.
00021 00022 -- put here all characters used to select format 00023 $formatspec = [\*\/_\-] 00024
Then we specify the expressions and their actions.
00027 00028 [^$formatspec]+ { takeText } 00029 $formatspec { takeText } -- for single chars (think a * b) 00030 \/{2} { flipFormat Italic } 00031 \*{2} { flipFormat Bold } 00032 _{2} { flipFormat Underline } 00033 \-{2} { flipFormat Strikeout } 00034
As you see in line 29, we need to handle single appearances of mark-up characters in a separate regular expression. Otherwise, text like “mark-up” or “a * b” will result in a scan error.
Before defining the action, let’s look at how the task at hand was solved. Basically, we will gather as much text as possible without including any format specifier into a single value, constructed using takeText. When we encounter a format specifier we change a global state to inform the subsequent takeText call that the gathered string is enriched with a new format. Because that data is not defined anywhere in the Alex templates, we will have to implement it ourselves. To do that, we tell Alex that we want a monadic wrapper with user data
00018 00019 -- the monadUserState allows for a monadic parser with user data 00020 %wrapper "monadUserState" 00021
Then, we define the user data and the initial value for it (don’t change the name of the following functions, they are called this way from the template)
00071 type AlexUserState = [Format] 00072 00073 alexInitUserState = [] 00074
The Format data is another defined type, its definition may be left as an exercise since it is not relevant here.
Also, although the wrapper is called monadUserState there are no functions defined there to do that. You’ll have to implement them by hand for now, a mail was sent describing the problem on haskell-cafe, maybe it will be solved soon. (Also, this wrapper is not presented in the official documentation for Alex, I have found it while trying to solve other problems by looking at the templates).
00074 00075 getUserData :: Alex AlexUserState 00076 getUserData = Alex $ \s@AlexState{alex_ust=udata} -> Right (s, udata) 00077 00078 setUserData :: AlexUserState -> Alex () 00079 setUserData udata = Alex $ \s -> Right (s{alex_ust=udata}, ()) 00080
The last thing that you need to define is what should be done when the end of the file (EOF) is encountered. We will return Nothing, making all actions have return type of Alex (Maybe Token) instead of simply Alex Token.
00097 00098 {- 00099 Remember that this function must also return a token!! We have made it to 00100 return Nothing to enable you to forget about it. 00101 -} 00102 alexEOF :: Alex (Maybe Token) 00103 alexEOF = return Nothing 00104 } 00105
While this has the advantage that you can use only the good tokens in the Token data definition, it has the disadvantage that two layers of monads are present. Of course, this function can be implemented by each user.
In the end, we need something to start the whole parsing process. We do this via two functions: one to loop over the text returning it token by token and collecting the results into a list and another to start the whole computation.
00055 00056 run :: String -> Either String [Token] 00057 run content = runAlex content $ loop [] 00058 00059 loop end = do --alexMonadScan >>= \t -> loop end >>= \e -> return $ t : e 00060 tok <- alexMonadScan; 00061 case tok of 00062 Nothing -> return end 00063 Just t -> loop end >>= \e -> return $ t : e 00064
The end result looks like this:
[Text [Bold] "This is a bolded paragraph, every word here should be bold ",Text [Strikeout,Bold] "and this part should be stroked out",Text [Bold] " while ",Text [Italic,Bold] "this should be in italic",Text [Bold] " and ",Text [Underline,Bold] "this underlined",Text [Bold] ". ",Text [Italic,Strikeout,Underline,Bold] "This part should have all of the attributes",Text [Bold] "."]
However, this is not yet complete, I guess that no user would want to look at something like that. A conversion function is needed but that is not in the scope of this article. I’ll return on this topic later.