blob: 4c8ca7ebe5923f4673d4576b83cde5ccb5789c41 [file] [log] [blame]
/*!
A module for building and searching with lazy determinstic finite automata
(DFAs).
Like other modules in this crate, lazy DFAs support a rich regex syntax with
Unicode features. The key feature of a lazy DFA is that it builds itself
incrementally during search, and never uses more than a configured capacity of
memory. Thus, when searching with a lazy DFA, one must supply a mutable "cache"
in which the actual DFA's transition table is stored.
If you're looking for fully compiled DFAs, then please see the top-level
[`dfa` module](crate::dfa).
# Overview
This section gives a brief overview of the primary types in this module:
* A [`regex::Regex`] provides a way to search for matches of a regular
expression using lazy DFAs. This includes iterating over matches with both the
start and end positions of each match.
* A [`dfa::DFA`] provides direct low level access to a lazy DFA.
# Example: basic regex searching
This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:
```
use regex_automata::{hybrid::regex::Regex, MultiMatch};
let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let mut cache = re.create_cache();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<MultiMatch> =
re.find_leftmost_iter(&mut cache, text).collect();
assert_eq!(matches, vec![
MultiMatch::must(0, 0, 10),
MultiMatch::must(0, 11, 21),
]);
# Ok::<(), Box<dyn std::error::Error>>(())
```
# Example: searching with regex sets
The lazy DFAs in this module all fully support searching with multiple regexes
simultaneously. You can use this support with standard leftmost-first style
searching to find non-overlapping matches:
```
use regex_automata::{hybrid::regex::Regex, MultiMatch};
let re = Regex::new_many(&[r"\w+", r"\S+"])?;
let mut cache = re.create_cache();
let text = b"@foo bar";
let matches: Vec<MultiMatch> =
re.find_leftmost_iter(&mut cache, text).collect();
assert_eq!(matches, vec![
MultiMatch::must(1, 0, 4),
MultiMatch::must(0, 5, 8),
]);
# Ok::<(), Box<dyn std::error::Error>>(())
```
Or use overlapping style searches to find all possible occurrences:
```
use regex_automata::{hybrid::{dfa, regex::Regex}, MatchKind, MultiMatch};
// N.B. For overlapping searches, we need the underlying lazy DFA to report all
// possible matches.
let re = Regex::builder()
.dfa(dfa::Config::new().match_kind(MatchKind::All))
.build_many(&[r"\w{3}", r"\S{3}"])?;
let mut cache = re.create_cache();
let text = b"@foo bar";
let matches: Vec<MultiMatch> =
re.find_overlapping_iter(&mut cache, text).collect();
assert_eq!(matches, vec![
MultiMatch::must(1, 0, 3),
MultiMatch::must(0, 1, 4),
MultiMatch::must(1, 1, 4),
MultiMatch::must(0, 5, 8),
MultiMatch::must(1, 5, 8),
]);
# Ok::<(), Box<dyn std::error::Error>>(())
```
# When should I use this?
Generally speaking, if you can abide the use of mutable state during search,
and you don't need things like capturing groups or Unicode word boundary
support in non-ASCII text, then a lazy DFA is likely a robust choice with
respect to both search speed and memory usage. Note however that its speed
may be worse than a general purpose regex engine if you don't select a good
[prefilter](crate::util::prefilter).
If you know ahead of time that your pattern would result in a very large DFA
if it was fully compiled, it may be better to use an NFA simulation instead
of a lazy DFA. Either that, or increase the cache capacity of your lazy DFA
to something that is big enough to hold the state machine (likely through
experimentation). The issue here is that if the cache is too small, then it
could wind up being reset too frequently and this might decrease searching
speed significantly.
# Differences with fully compiled DFAs
A [`hybrid::regex::Regex`](crate::hybrid::regex::Regex) and a
[`dfa::regex::Regex`](crate::dfa::regex::Regex) both have the same capabilities
(and similarly for their underlying DFAs), but they achieve them through
different means. The main difference is that a hybrid or "lazy" regex builds
its DFA lazily during search, where as a fully compiled regex will build its
DFA at construction time. While building a DFA at search time might sound like
it's slow, it tends to work out where most bytes seen during a search will
reuse pre-built parts of the DFA and thus can be almost as fast as a fully
compiled DFA. The main downside is that searching requires mutable space to
store the DFA, and, in the worst case, a search can result in a new state being
created for each byte seen, which would make searching quite a bit slower.
A fully compiled DFA never has to worry about searches being slower once
it's built. (Aside from, say, the transition table being so large that it
is subject to harsh CPU cache effects.) However, of course, building a full
DFA can be quite time consuming and memory hungry. Particularly when it's
so easy to build large DFAs when Unicode mode is enabled.
A lazy DFA strikes a nice balance _in practice_, particularly in the
presence of Unicode mode, by only building what is needed. It avoids the
worst case exponential time complexity of DFA compilation by guaranteeing that
it will only build at most one state per byte searched. While the worst
case here can lead to a very high constant, it will never be exponential.
# Syntax
This module supports the same syntax as the `regex` crate, since they share the
same parser. You can find an exhaustive list of supported syntax in the
[documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax).
There are two things that are not supported by the lazy DFAs in this module:
* Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top
of them) can only find the offsets of an entire match, but cannot resolve
the offsets of each capturing group. This is because DFAs do not have the
expressive power necessary.
* Unicode word boundaries. These present particularly difficult challenges for
DFA construction and would result in an explosion in the number of states.
One can enable [`dfa::Config::unicode_word_boundary`] though, which provides
heuristic support for Unicode word boundaries that only works on ASCII text.
Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work
on any input.
There are no plans to lift either of these limitations.
Note that these restrictions are identical to the restrictions on fully
compiled DFAs.
# Support for `alloc`-only
This crate comes with `alloc` and `std` features that are enabled by default.
One can disable the `std` feature and still use the full API of a lazy DFA.
(You should use `std` when possible, since it permits providing implementations
of the `std::error::Error` trait, and does enable some minor internal
optimizations.)
This module does require at least the `alloc` feature though. It is not
available in any capacity without `alloc`.
*/
pub use self::{
error::{BuildError, CacheError},
id::{LazyStateID, OverlappingState},
};
pub mod dfa;
mod error;
mod id;
pub mod regex;
mod search;