Learning OCaml: Parsing Data with Scanf
In my previous article I mentioned that OCaml’s Stdlib leaves a lot to be desired when it comes to regular
expressions. One thing I didn’t discuss back then was that
the problem is somewhat mitigated by the excellent module
Scanf, which makes it easy to parse structured data.
Imagine that we’re dealing with a simple investment portfolio, where we have multiple records containing:
- Ticker symbol (e.g. AAPL)
- Number of shares
- Current share price
Let’s assume this portfolio is stored in a .csv file and each
entry there looks something like AAPL,10,150.50. While there
are many ways to parse this data, I think Scanf is probably
the simplest and most elegant of them:
Scanf.sscanf "AAPL, 10, 150.5" "%[^,], %d, %f" (fun ticker shares price -> (ticker, shares, price));;
- : string * int * float = ("AAPL", 10, 150.5)
As you can see, we’re using a parsing format specifier that’s pretty similar to what we’d
normally use with printf. %[^,] is kind of weird and it means “read a string until the next ,”.
We can’t use the regular %s format specifier here, as it expects whitespace-separated strings.
Whitespace-separated input, however, works fine with %s:
Scanf.sscanf "John Doe 33" "%s %s %d" (fun name surname age -> (name, surname, age));;
- : string * string * int = ("John", "Doe", 33)
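If the character range syntax feels too cryptic, another option for delimited fields is a scanning indication: an @ followed by a character, placed right after %s, tells Scanf to end the string just before that character and to skip it. Here's a minimal sketch of the idea (parse_with_indication is just a name I made up for illustration):

```ocaml
(* Hypothetical helper: parse "TICKER,SHARES,PRICE" using the scanning
   indication "@," after %s instead of the character range %[^,]. *)
let parse_with_indication line =
  Scanf.sscanf line "%s@,%d,%f" (fun ticker shares price ->
    (ticker, shares, price))

let () =
  let (ticker, shares, price) = parse_with_indication "AAPL,10,150.5" in
  Printf.printf "%s: %d shares at $%.2f\n" ticker shares price
```

Keep in mind that with a scanning indication the string no longer stops at whitespace, so any spaces before the comma end up in the ticker.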
Here’s a more complete example that parses a few portfolio records and calculates the value of the portfolio:
(* Example portfolio entries as strings *)
let portfolio_lines = [
  "AAPL,10,178.23";
  "GOOG,5,150.50";
  "MSFT,20,299.01";
]

(* Parse a line into (ticker, shares, price) *)
let parse_line line =
  Scanf.sscanf line "%[^,],%d,%f" (fun ticker shares price ->
    (ticker, shares, price)
  )

(* Compute total value of the portfolio *)
let total_value entries =
  List.fold_left (fun acc (_ticker, shares, price) ->
    acc +. float_of_int shares *. price
  ) 0.0 entries

let () =
  let open Printf in
  let portfolio = List.map parse_line portfolio_lines in
  List.iter (fun (ticker, shares, price) ->
    printf "%s: %d shares at $%.2f\n" ticker shares price
  ) portfolio;
  let total = total_value portfolio in
  printf "Total portfolio value: $%.2f\n" total
Not bad, right?
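One thing the example glosses over is malformed input. The scanning functions raise an exception when the input doesn't fit the format: Scanf.Scan_failure on a mismatch, Failure when a number can't be converted, and End_of_file when the input runs out too early. Below is a rough sketch of a more forgiving parser, assuming you'd rather skip bad lines than crash (parse_line_opt is a hypothetical name, and List.filter_map needs OCaml 4.08 or newer):

```ocaml
(* Hypothetical safe variant of a line parser: returns None instead of
   raising when the line doesn't look like "TICKER,SHARES,PRICE". *)
let parse_line_opt line =
  try
    Scanf.sscanf line "%[^,],%d,%f" (fun ticker shares price ->
      Some (ticker, shares, price))
  with
  | Scanf.Scan_failure _  (* the input doesn't match the format *)
  | Failure _             (* a number couldn't be converted *)
  | End_of_file ->        (* the input ended too early *)
    None

let () =
  let lines = ["AAPL,10,178.23"; "not a record"; "MSFT,20,299.01"] in
  (* Keep only the lines that parsed successfully *)
  let portfolio = List.filter_map parse_line_opt lines in
  Printf.printf "Parsed %d of %d lines\n"
    (List.length portfolio) (List.length lines)
```

Returning an option keeps the caller in charge of what to do with broken records. If you need to report why a line failed, returning a result with the exception message would be a natural next step.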
Scanf has several functions in it and quite a lot of format specifiers that you can
leverage in various situations.
The formatted input functions can read from any kind of input, including
strings, files, or anything that can return characters. The more general source
of characters is named a formatted input channel (or scanning buffer) and has
type Scanf.Scanning.in_channel. The more general formatted input function reads
from any scanning buffer and is named bscanf.
Generally speaking, the formatted input functions have 3 arguments:
- the first argument is a source of characters for the input,
- the second argument is a format string that specifies the values to read,
- the third argument is a receiver function that is applied to the values read.
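To make those three arguments a bit more concrete, here's a small sketch that uses bscanf on a scanning buffer and keeps reading records until the input is exhausted. The helper name read_records is hypothetical, and I'm assuming the input holds one well-formed record per line:

```ocaml
(* Hypothetical helper: read every "TICKER,SHARES,PRICE" record from a
   scanning buffer. The spaces in the format skip any whitespace
   (including newlines) around each record. *)
let read_records ib =
  let rec loop acc =
    if Scanf.Scanning.end_of_input ib then List.rev acc
    else
      let record =
        Scanf.bscanf ib " %[^,],%d,%f " (fun ticker shares price ->
          (ticker, shares, price))
      in
      loop (record :: acc)
  in
  loop []

let () =
  (* First argument: the source of characters, here built from a string *)
  let ib = Scanf.Scanning.from_string "AAPL,10,178.23\nGOOG,5,150.50\n" in
  List.iter (fun (ticker, shares, price) ->
    Printf.printf "%s: %d shares at $%.2f\n" ticker shares price
  ) (read_records ib)
```

Since read_records only cares about the scanning buffer, the same function should work unchanged with a buffer created by Scanf.Scanning.from_channel, or with Scanf.Scanning.stdin.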
My trivial examples dealt only with input strings, but you can easily leverage other input sources. Here’s an example reading from the standard input:
Scanf.scanf "%s %f\n" (fun name price ->
  Printf.printf "Item: %s, Price: %.2f\n" name price)
(* input -> Table 100.20 *)
(* output -> Item: Table, Price: 100.20 *)
- : unit = ()
Enter something like “Chair 20.25” and observe the results.
I’d encourage everyone to get familiar with the module’s documentation for all the nitty-gritty details.
Please share in the comments how you’re using Scanf in your OCaml projects and any tips
you might have about making the most of it.
That’s all I have for you. Keep hacking!