Regular Expression State Machines: Converting Regular Expressions to Finite Automata



According to Kleene's theorem, every regular expression can be associated with a finite automaton, which is a formal model of the algorithm for recognizing the lexemes denoted by that regular expression. In the most general terms, a recognizing finite automaton is defined by a finite set of characteristic states of the input stream and the transitions between them. A state change occurs when symbols of the input stream, drawn from a given alphabet, arrive, in accordance with a transition function that determines the possible next states from the input symbol and the current state. Among the possible states, an initial (start) state and final (accepting) states are distinguished; the recognizing automaton is in them, respectively, at the beginning and at the end of processing the tokens of the input stream. If an input sequence of symbols can generate a sequence of transitions that takes the automaton from the initial state to one of the final states, it is considered accepted and belongs to the regular set recognized by the automaton.


As an example, consider the regular expression below, which denotes the set of binary strings containing an even number of zeros and an even number of ones; Table 1 gives the transition table of a DFA recognizing this set.

(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*

Table 1. Transition table of the DFA

State | 0  | 1
Q1    | Q4 | Q2
Q2    | Q3 | Q1
Q3    | Q2 | Q4
Q4    | Q1 | Q3

The columns of the transition table correspond to the symbols of the input alphabet, and the rows correspond to the current states of the DFA. The entries of each row indicate the states to which the DFA must move from the current state upon receiving the corresponding input symbol. In particular, the first row of this transition table shows that receiving the symbols 0 and 1 in the initial state Q1 takes the DFA to states Q4 and Q2, respectively.

When recognizing an input sequence using the transition table, it is easy to trace changes in the state of the DFA in order to determine whether one of the allowing states is achieved or not. In particular, for the binary vector 01001000 with an even number of zeros and ones, the considered DFA generates the following sequence of transitions, where each transition is labeled by the symbol of the input alphabet that causes it:


Q1 --0--> Q4 --1--> Q3 --0--> Q2 --0--> Q3 --1--> Q4 --0--> Q1 --0--> Q4 --0--> Q1


This sequence of transitions ends with the admitting state Q1, therefore, the binary vector 01001000 belongs to the regular set recognized by the considered DFA and satisfies the above regular expression.
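As a minimal illustration, the DFA of Table 1 can be simulated directly from its transition table; the following Python sketch does exactly that (the state names and the acceptance check mirror the example above):

    # A minimal sketch of table-driven DFA simulation for the DFA of Table 1.
    # The transition table maps (state, symbol) to the next state.
    TRANSITIONS = {
        ("Q1", "0"): "Q4", ("Q1", "1"): "Q2",
        ("Q2", "0"): "Q3", ("Q2", "1"): "Q1",
        ("Q3", "0"): "Q2", ("Q3", "1"): "Q4",
        ("Q4", "0"): "Q1", ("Q4", "1"): "Q3",
    }
    START, ACCEPTING = "Q1", {"Q1"}

    def accepts(word: str) -> bool:
        """Run the DFA over the word and report whether it ends in an accepting state."""
        state = START
        for symbol in word:
            state = TRANSITIONS[(state, symbol)]   # a DFA has exactly one successor
        return state in ACCEPTING

    print(accepts("01001000"))   # True: an even number of 0s and of 1s
    print(accepts("0100100"))    # False: five zeros, an odd number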

In conclusion, it should be noted that the informal construction considered above yields a DFA. A DFA is a special case of an NFA, in which:

    there is no state with ε-transitions;

    for each state S and input symbol a, there is at most one arc emanating from S and labeled a.

A DFA has at most one transition from each state on any input symbol. If a table is used to represent the DFA's transition function, then each entry contains a single state. It is therefore easy to check whether a given DFA accepts a given string, since there is only one path from the start state labeled by that string.

Figure 3 shows the transition graph of a DFA that accepts the same language (a|b)*a(a|b)(a|b) as the NFA in Figure 1.

Figure 3. A DFA accepting the language (a|b)*a(a|b)(a|b).

A deterministic finite automaton M that accepts the given language:

M = ({1, 2, 3, 4, 5, 6, 7, 8}, {a, b}, D, 1, {3, 5, 6, 8})

The transition function D is defined as follows:

Constructing an NFA from a regular expression

1. For ε, the NFA has the following form (0 is the initial state, 1 is the final state):

2. For a symbol a of the alphabet of the given language, the NFA is:

3. Let N(s) and N(t) be the NFAs for the regular expressions s and t.

a. For the regular expression s|t, the composite NFA has the following form:

b. For the regular expression st, the NFA is:

c. For the expression s*, the NFA has the form:

d. For a parenthesized expression (s), the NFA N(s) itself is used, as in point a.

Each new state receives an individual name. The NFA N(r) constructed in this way has the following properties:

N(r) has at most twice as many states as there are symbols and operators in r.

N(r) has exactly one initial and one final state. The final state has no outgoing transitions.

Each state of N(r) has either one outgoing transition on a symbol of the alphabet or at most two outgoing ε-transitions.
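As an illustration of points 1-3, the construction can be sketched in a few lines of Python; the representation of an NFA as a triple (start state, final state, list of transitions) and the helper names below are illustrative choices, with None standing for ε:

    # Sketch of Thompson's construction: each case returns (start, final, transitions),
    # where transitions is a list of (from_state, label, to_state) and label None means ε.
    # Each call to new_state() yields a fresh, individually named state, as required above.
    from itertools import count

    _counter = count()
    def new_state():
        return next(_counter)

    def nfa_epsilon():                      # case 1: NFA for ε
        s, f = new_state(), new_state()
        return s, f, [(s, None, f)]

    def nfa_symbol(a):                      # case 2: NFA for a single symbol a
        s, f = new_state(), new_state()
        return s, f, [(s, a, f)]

    def nfa_union(n1, n2):                  # case 3a: NFA for s|t
        s1, f1, t1 = n1; s2, f2, t2 = n2
        s, f = new_state(), new_state()
        return s, f, t1 + t2 + [(s, None, s1), (s, None, s2), (f1, None, f), (f2, None, f)]

    def nfa_concat(n1, n2):                 # case 3b: NFA for st
        s1, f1, t1 = n1; s2, f2, t2 = n2
        return s1, f2, t1 + t2 + [(f1, None, s2)]

    def nfa_star(n1):                       # case 3c: NFA for s*
        s1, f1, t1 = n1
        s, f = new_state(), new_state()
        return s, f, t1 + [(s, None, s1), (f1, None, s1), (f1, None, f), (s, None, f)]

    # Example: build the NFA for (a|b)*a
    nfa = nfa_concat(nfa_star(nfa_union(nfa_symbol("a"), nfa_symbol("b"))), nfa_symbol("a"))
    print(nfa)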

Converting an NFA to a DFA

The NFA in Figure 1 has two transitions from state 0 on the symbol a: to states 0 and 1. Such a transition is ambiguous, as is a transition on ε. Simulating such automata with a computer program is considerably more difficult. The definition of acceptance only states that there must be some path from the initial state to a final state; when there are multiple paths for the same input string, all of them must be examined to find a path to a final state or to establish that no such path exists.

In the NFA transition table each entry is a set of states, whereas in the DFA transition table each entry is a single state. The essence of the transformation is that each DFA state corresponds to a set of NFA states. The DFA uses its states to keep track of all possible states the NFA may be in after reading the next input symbol. That is, after reading the input, the DFA is in a state that represents the set of NFA states reachable from the initial state along paths labeled by the input string. The number of such DFA states can significantly exceed the number of NFA states (an exponential dependence), but in practice this is extremely rare, and sometimes the DFA even has fewer states than the NFA.

Let us consider such a transformation on a concrete example. Figure 4 shows another NFA that accepts the language (a|b)*a(a|b)(a|b) (the same as in Figures 1 and 3).

Figure 4. An NFA accepting the language (a|b)*a(a|b)(a|b)

The transition from state 13 to state 14 shown in the figure can be represented similarly to the transition from state 8 to state 13.

Let us construct a DFA for this language. The start state of the equivalent DFA is the state A = {0, 1, 2, 4, 7}, i.e. the set of NFA states reachable from 0 by ε-transitions.

The input alphabet is {a, b}. From the initial state A one can compute the state reachable on a. Call this state B = {1, 2, 3, 4, 6, 7, 8, 9, 11, 13, 14}.

Among the states in A, only state 4 has a transition on b, to state 5, so the DFA has a transition on b from A to the state C = {1, 2, 4, 5, 6, 7}.

If we continue this process with states B and C, all sets of NFA states will be marked. In the end we obtain the sets of states:

A = {0, 1, 2, 4, 7}

B = {1, 2, 3, 4, 6, 7, 8, 9, 11, 13, 14}

C = {1, 2, 4, 5, 6, 7}

D = {10, 12, 13, 14}

State A is the initial state, and states B, D, E are final.

The full transition table is shown below.

Below in Figure 5 is the DFA itself for this language.

Figure 5. A DFA accepting the language (a|b)*a(a|b)(a|b)

References:

1. Pentus A. E., Pentus M. R. Theory of Formal Languages.

2. Aho A., Sethi R., Ullman J. Compilers: Principles, Techniques, and Tools.

In this section we formulate important questions about regular languages. First, one must understand what it means to ask a question about a language. A typical language is infinite, so it makes no sense to present someone with the strings of the language and ask a question that requires testing an infinite number of strings. It is far more reasonable to use one of the finite representations of a language, namely a DFA, an NFA, an ε-NFA or a regular expression.

Obviously, languages represented in this way are regular. There is really no way to represent completely arbitrary languages. The following chapters propose finite methods for describing classes wider than the class of regular languages, and it will then be possible to consider questions about the languages in those classes. However, algorithms for deciding many questions about languages exist only for the class of regular languages. These same questions become "undecidable" (there is no algorithm for answering them) if they are posed using more "expressive" notations (ones capable of expressing a wider variety of languages) than the representations developed for regular languages.

Let us begin our study of algorithms for questions about regular languages by looking at the ways in which one representation of a language can be converted into another. In particular, we consider the time complexity of the algorithms that perform the conversions. Then we look at three basic questions about languages.

1. Is the language being described an empty set?

2. Does some string w belong to the represented language?

3. Do two different descriptions really represent the same language? (This question is often called the "equivalence" of languages.)

2.1 Conversions between different language representations

Each of the four representations of regular languages can be converted into any of the other three. Fig. 3.1 shows the transitions from one representation to another. Although algorithms exist for each of these conversions, we are sometimes interested not only in the feasibility of a conversion but also in the time required to perform it. In particular, it is important to distinguish algorithms that take exponential time (as a function of the size of the input) and can therefore be executed only for inputs of relatively small size, from algorithms whose running time is linear, quadratic, or a polynomial of low degree in the size of the input. The latter algorithms are "realistic", since they can be executed for a much wider class of problem instances. Let us consider the time complexity of each of the conversions discussed.



Converting an NFA to a DFA

The time required to convert an NFA or ε-NFA into a DFA can be an exponential function of the number of NFA states. To begin with, computing the ε-closures of n states takes O(n^3) time: one must find all arcs labeled ε leading from each of the n states. If there are n states, there can be at most n^2 such arcs. Careful use of data structures ensures that the closure of each state can be explored in at most O(n^2) time. In fact, the entire ε-closure can be computed once using a transitive-closure algorithm such as Warshall's algorithm.

After computing the ε-closure, we can proceed to the synthesis of the DFA using the subset construction. The dominant cost is the number of DFA states, which can be as large as 2^n. For each state, the transitions can be computed in O(n^3) time using the ε-closure and the NFA transition table for each input symbol. Suppose we need to compute δ({q1, q2, …, qk}, a) for the DFA. From each state qi at most n states are reachable along paths labeled ε, and each of these states can have at most n arcs labeled a. By creating an array indexed by states, we can compute the union of at most n sets of at most n states in time proportional to n^2.

In this way, for each state qi one can compute the set of states reachable from qi along a path labeled a (possibly including arcs labeled ε). Since k ≤ n, there are at most n such states qi, and for each of them computing the reachable states takes O(n^2) time. Thus, the total time to compute the reachable states is O(n^3). Combining the sets of reachable states requires only O(n^2) additional time, so computing one DFA transition takes O(n^3) time.



Note that the number of input symbols is considered constant and does not depend on n. Thus, in both this and other running time estimates, the number of input symbols is not considered. The size of the input alphabet affects only the constant factor hidden in the “Big O” notation.

So the time for converting an NFA into a DFA, including the case when the NFA contains ε-transitions, is O(n^3 2^n). Of course, in practice the number of states actually constructed is usually much smaller than 2^n; sometimes there are only n of them. We can therefore state the running time as O(n^3 s), where s is the number of states the DFA actually has.

Converting a DFA to an NFA

This is a simple conversion that takes O(n) time for a DFA with n states. All that needs to be done is to modify the DFA transition table by enclosing each state in braces {}, and also to add a column for ε if an ε-NFA is desired. Since the number of input symbols (i.e., the width of the transition table) is assumed constant, copying and processing the table takes O(n) time.

Converting an automaton to a regular expression

In each of the n stages (where n is the number of DFA states), the size of the constructed regular expression can grow by a factor of four, since each expression is built from four expressions of the previous round. Thus simply writing out the n^3 expressions can take O(n^3 4^n) time. The improved construction from Section 3.2.2 reduces the constant factor but does not remove the exponential nature of the problem (in the worst case).

A similar construction requires the same running time when an NFA, or even an ε-NFA, is converted, although this is not proved here. However, the way the construction is used for NFAs matters greatly. If one first converts the NFA into a DFA and then the DFA into a regular expression, the DFA may have 2^n states, so the time can be on the order of (2^n)^3 4^(2^n) = 8^n 4^(2^n), which is doubly exponential.

Converting a regular expression to an automaton

Converting a regular expression to an ε-NFA takes linear time. The regular expression must be parsed efficiently, using a method that takes O(n) time for a regular expression of length n. The result is a tree with one node for each symbol of the regular expression (the parentheses do not appear in the tree, since they only control the parsing of the expression).

The resulting tree for the given regular expression can then be processed by constructing an ε-NFA for each node. The transformation rules presented in Section 3.2.3 never add more than two states and four arcs for each node of the expression tree. Consequently, both the number of states and the number of arcs of the resulting ε-NFA are O(n). Moreover, the work of creating these elements at each node of the parse tree is constant, provided that the function processing each subtree returns pointers to the initial and accepting states of that automaton.

We conclude that constructing an ε-NFA from a regular expression takes time linear in the size of the expression. ε-transitions can be eliminated from an ε-NFA with n states, turning it into an ordinary NFA, in O(n^3) time and without increasing the number of states. However, the conversion to a DFA can take exponential time.

Basic definitions. Regular expressions over an alphabet Σ and the regular sets they denote are defined recursively as follows: 1) ∅ is a regular expression denoting the regular set ∅; 2) e is a regular expression denoting the regular set {e}; 3) if a ∈ Σ, then a is a regular expression denoting the regular set {a}; 4) if p and q are regular expressions denoting the regular sets P and Q, then a) (p+q) is a regular expression denoting P ∪ Q; b) pq is a regular expression denoting PQ; c) p* is a regular expression denoting P*; 5) nothing else is a regular expression.

Basic definitions. Operator precedence: * (iteration) has the highest priority, then concatenation, then + (union). Thus 0 + 10* = (0 + (1(0*))). Examples: 1. 01 denotes {01}; 2. 0* denotes {0}*; 3. (0+1)* denotes {0, 1}*; 4. (0+1)*011 denotes the set of all strings of 0s and 1s ending in 011; 5. (a+b)(a+b+0+1)* denotes the set of all strings over {0, 1, a, b} beginning with a or b.

Basic definitions. Lemma: 1) α + β = β + α; 2) ∅* = e; 3) α + (β + γ) = (α + β) + γ; 4) α(βγ) = (αβ)γ; 5) α(β + γ) = αβ + αγ; 6) (α + β)γ = αγ + βγ; 7) αe = eα = α; 8) α∅ = ∅α = ∅; 9) α + α* = α*; 10) (α*)* = α*; 11) α + α = α; 12) α + ∅ = α.

Relationship between regular expressions and regular sets. Regular sets are the languages denoted by regular expressions. For example: x = a + b denotes X = {a, b}; y = c + d denotes Y = {c, d}; x + y denotes X ∪ Y = {a, b, c, d}. Concatenation: xy denotes XY = {ac, ad, bc, bd}. The expression k(i+o)t denotes {k}{i, o}{t} = {kit, kot} (in the original Russian example these are the words for "whale" and "cat"); equivalently, by lemmas 5 and 6, k(i+o)t = kit + kot, which denotes {kit, kot}. Iteration: for x = a, x* denotes X* = {e, a, aa, aaa, …}, i.e. x* = e + x + xx + xxx + …

Relationship between regular expressions and regular sets. Iteration of concatenation and union: (xy)* = e + xy + xyxy + …; (x + y)* = e + (x + y) + (x + y)(x + y) + … = e + x + y + xx + xy + yx + yy + xxx + … Example: 0 + 1(0+1)* denotes {0} ∪ {1}{0, 1}* = {0, 1, 10, 11, 100, 101, 110, 111, …}. Union is commutative: x + y = y + x. Concatenation is not: xy ≠ yx.

Relationship between regular expressions and regular sets. Examples illustrating precedence: x + yz denotes {x, yz}; (x + y)z denotes {xz, yz}; x + y* denotes {x, e, y, yy, yyy, …}; (x + y)* denotes {e, x, y, xx, xy, yx, yy, xxx, …}; (xy)* denotes {e, xy, xyxy, …}; xy* denotes {x, xy, xyy, xyyy, …}. Further lemmas: a* + e = a*; (a + e)* = a*; a*a* = a*; e* = e; etc.

Regular systems of equations. An equation with regular coefficients X = aX + b has the solution (least fixed point) a*b: a(a*b) + b = (aa* + e)b = a*b. A system of equations with regular coefficients has the form:

X1 = α10 + α11 X1 + α12 X2 + … + α1n Xn
X2 = α20 + α21 X1 + α22 X2 + … + α2n Xn
…
Xn = αn0 + αn1 X1 + αn2 X2 + … + αnn Xn

The unknowns are Δ = {X1, X2, …, Xn}.

Regular systems of equations. Solution algorithm (Gaussian elimination): Step 1. Set i = 1. Step 2. If i = n, go to step 4. Otherwise, write the equation for Xi in the form Xi = αXi + β (where β = β0 + βi+1 Xi+1 + … + βn Xn). Then, in the right-hand sides of the equations for Xi+1, …, Xn, replace Xi with the regular expression α*β. Step 3. Increase i by 1 and return to step 2. Step 4. Write the equation for Xn as Xn = αXn + β. Go to step 5 (with i = n). Step 5. The equation for Xi has the form Xi = αXi + β. Output Xi = α*β, substituting α*β for Xi in the equations for Xi−1, …, X1. Step 6. If i = 1, stop; otherwise decrease i by 1 and return to step 5.
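As a small illustration of the algorithm, consider a system with two unknowns (an illustrative example):

X1 = aX1 + bX2
X2 = cX2 + d

Forward pass (step 2 with i = 1): the equation for X1 has the form X1 = αX1 + β with α = a and β = bX2, so X1 = a*bX2; X1 does not occur on the right-hand side of the equation for X2, so nothing is substituted. Backward pass (steps 4 and 5): X2 = cX2 + d gives X2 = c*d, and substituting this into the expression for X1 yields X1 = a*bc*d.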

Converting a DFA to a regular expression. For a DFA M = (Q, Σ, δ, q0, F) we set up a system with regular coefficients, where Δ = Q: 1. initialize αij := ∅; 2. if δ(Xi, a) = Xj for a ∈ Σ, then αij := αij + a; 3. if Xi ∈ F (or δ(Xi, ·) = HALT), then αi0 := e. After solving the system, the desired regular expression is the solution X1 corresponding to the initial state q0.

Converting a DFA to a regular expression. Example: for a number with a fixed (decimal) point we obtain the system

q0 = ∅ + ∅q0 + sq1 + pq2 + dq3 + ∅q4
q1 = ∅ + ∅q0 + ∅q1 + pq2 + dq3 + ∅q4
q2 = ∅ + ∅q0 + ∅q1 + ∅q2 + ∅q3 + dq4
q3 = e + ∅q0 + ∅q1 + ∅q2 + dq3 + pq4
q4 = e + ∅q0 + ∅q1 + ∅q2 + ∅q3 + dq4

Here: s is the sign of the number, s = "+" + "−"; p is the decimal point, p = "."; d is a digit, d = "0" + "1" + … + "9".

Converting a DFA to a regular expression. Solution (forward pass):

q0 = ∅*(sq1 + pq2 + dq3 + ∅q4 + ∅) = sq1 + pq2 + dq3,
q1 = ∅ + ∅q0 + ∅q1 + pq2 + dq3 + ∅q4 = pq2 + dq3,
q2 = ∅ + ∅q0 + ∅q1 + ∅q2 + ∅q3 + dq4 = dq4,
q3 = e + ∅q0 + ∅q1 + ∅q2 + dq3 + pq4 = dq3 + pq4 + e,
q4 = e + ∅q0 + ∅q1 + ∅q2 + ∅q3 + dq4 = dq4 + e.

From the third equation: q3 = dq3 + pq4 + e = d*(pq4 + e). From the fourth equation: q4 = dq4 + e = d*.

Converting a DFA to a regular expression. Back substitution:

q3 = d*(pq4 + e) = d*(pd* + e),
q2 = dq4 = dd*,
q1 = pq2 + dq3 = pdd* + dd*(pd* + e),
q0 = sq1 + pq2 + dq3 = s(pdd* + dd*(pd* + e)) + pdd* + dd*(pd* + e).

Thus, this DFA corresponds to the regular expression s(pdd* + dd*(pd* + e)) + pdd* + dd*(pd* + e). Simplifying:

s(pdd* + dd*(pd* + e)) + pdd* + dd*(pd* + e) = spdd* + sdd*(pd* + e) + pdd* + dd*(pd* + e) = (s + e)(pdd* + dd*(pd* + e)).

For a shorter notation one can use the positive iteration aa* = a*a = a+:

(s + e)(pdd* + dd*(pd* + e)) = (s + e)(pd+ + d+pd* + d+) = (s + e)(pd+ + d+(pd* + e)).

Converting a DFA to a regular expression. [Slide diagram: mapping fragments of the DFA transition graph to the basic regular-expression operations, e.g. union (a + b), concatenation (ab) and iteration (a*).]

Converting a DFA to a regular expression. [Slide diagram: more complex combinations of operations, e.g. (a + e)b, ab(cab)*, (a + b)*, aa* = a+, (ab)+, e + (a + b)c*.]

Converting a DFA to a regular expression. [Slide diagram: the transition graph of the DFA corresponding to the regular expression (s + e)(pd+ + d+(pd* + e)), with states q0 … q5 and transitions on s, p and d.]

Programming with regular expressions. Regular expressions are built into many programming languages (PHP, JavaScript, …) and are implemented as library components (for example, the Regex class on the .NET platform). The notation differs slightly between implementations: x? = x + e, x{1,3} = x + xx + xxx, etc.

Programming with regular expressions. Constructs of the Regex class (System.Text.RegularExpressions), escape sequences: \b inside square brackets matches the backspace character (u0008); \t, \r, \n, \a, \f, \v match tab (u0009), carriage return (u000D), new line (u000A), etc.; \cX matches a control character (e.g. \cC is Ctrl+C, u0003); \e matches Escape (u001B); \ooo an ASCII character given in octal; \xhh an ASCII character given in hexadecimal; \uhhhh a Unicode character; a backslash before a character that is not a special regex character matches that character literally, and the backslash must be used to escape all special characters. Example (a pattern and a search string; the matches found are underlined in the original slide): @"\r\n\w+" applied to "\r\nThere are two lines here."

Programming with regular expressions. Character classes: . matches any character except the end of line (\n); [xxx] any character from the set; [^xxx] any character except those in the set; [a-z] any character from the range; [a-z-[d-w]] subtraction of one set or range from another; \p{name} any character from the Unicode category named name; \P{name} any character not in the Unicode category name; \w the set of word characters (used in identifiers); \W the set of non-word characters; \s whitespace; \S everything except whitespace; \d digits; \D non-digits. Examples (the bracketed sets of some examples were lost in the source and are shown as […]): @".+" on "\r\nThere are\ntwo lines"; @"[…]+" on "0xabcfx"; @"[^fx]+" on "0xabcfx"; @"[…]+" on "0xabcfx"; @"[^a-f]+" on "0xabcfx"; @"[…]+" on "0xabcfx"; @"\p{Lu}" on "City Lights"; // Lu: uppercase letters @"\P{Lu}" on "City"; @"\p{IsCyrillic}" on "ha.OS"; // IsCyrillic: Russian letters

Programming with regular expressions. Anchors: ^, \A at the beginning of the string; $, \Z at the end of the string or before a "\n" at the end of the string; \z at the very end of the string; \G where the previous match ended; \b a word boundary; \B any position that is not a word boundary. Examples: @"\G\(\d\)" on "(1)(3)(5)(9)"; // three matches @"\bn\S*ion\b" on "nation donation"; @"\Bend\w*\b" on "end sends endure lender".

Programming with regular expressions. Operations (quantifiers): *, *? iteration; +, +? positive iteration; ?, ?? zero or one match; {n}, {n}? exactly n matches; {n,}, {n,}? at least n matches; {n,m}, {n,m}? from n to m matches. Examples (the first quantifier of each pair is greedy, matching as much as possible; the second is lazy, matching as little as possible): @"\d{3,}" on "888-5555"; @"^\d{3}" on "913-913"; @"-\d{3}$" on "913-913"; @"5+?5" on "888-5555"; // three matches: 55, 55 and 55 @"5+5" on "888-5555".

Programming with regular expressions. Grouping: (…) a group that automatically receives a number; (?:…) do not capture the group; (?<name>…) or (?'name'…) if a match is found, a named group is created; (?<name1-name2>…) or (?'name1-name2'…) delete the previously defined group name2 and save in the new group name1 the substring between the previously defined group and the new group (a balancing group); (?imnsx:…) or (?-imnsx:…) enable or disable any of the five options within the group: i case insensitive; s single line (then "." matches any character); m multi-line mode ("^", "$" match the beginning and end of each line); n do not capture unnamed groups; x exclude unescaped whitespace from the pattern and allow comments after the number sign (#); (?=…) a zero-width positive lookahead assertion.

Programming with regular expressions. (?!…) a zero-width negative lookahead assertion; (?>…) a non-backtracking ("greedy", atomic) subexpression. Examples: @"(an)+" on "bananas annals"; @"an+" on "bananas annals"; // compare: three matches, an, an and ann @"(?i:an)+" on "baNAnas annals"; @"\w+(?=\d)" on "abc xyz 12 555 w"; @"(?

Programming with regular expressions. Backreferences: \number is a reference to a numbered group; \k<name> is a reference to a named group. Examples: @"(\w)\1" on "deep"; @"(?<char>\w)\k<char>" on "deep". Alternation constructs: | alternation (corresponds to the union operation); (?(expression)yes|no) matches the "yes" part if the expression matches, otherwise matches the optional "no" part; (?(name)yes|no), (?(number)yes|no) matches the "yes" part if the named (or numbered) capture has a match, otherwise matches the optional "no" part. Example: @"th(e|is|at)" on "this is the day".

Programming with regular expressions. Substitutions: $number replaces with the part of the string matched by the group with the given number; ${name} replaces with the part matched by the group with the given name; $$ substitutes a literal $; $& replaces with a copy of the full match; $` replaces with the text of the input string before the match; $' replaces with the text of the input string after the match; $+ replaces with the last captured group; $_ replaces with the entire input string. Comments: (?#…) an inline comment; # a comment to the end of the line (in x mode).

Programming with regular expressions. The result object model of Regex: Regex.Matches() returns a MatchCollection of Match objects; Match.Groups returns a GroupCollection of Group objects; Group.Captures returns a CaptureCollection of Capture objects.

Programming with regular expressions. Example in C++/CLI (Visual C++ / CLR / CLR Console Application), using System::Text::RegularExpressions:

    int main() {
        Regex ^r = gcnew Regex(L"((\\d)+)+");
        Match ^m = r->Match(L"123 456");
        int matchCount = 0;
        while (m->Success) {
            Console::WriteLine(L"Match {0}", ++matchCount);
            for (int i = 1; i < m->Groups->Count; i++) {
                Group ^g = m->Groups[i];
                Console::WriteLine(L"  Group {0} = \"{1}\"", i, g->Value);
                for (int j = 0; j < g->Captures->Count; j++) {
                    Capture ^c = g->Captures[j];
                    Console::WriteLine(L"    Capture {0} = \"{1}\", position = {2}, length = {3}",
                                       j, c, c->Index, c->Length);
                }
            }
            m = m->NextMatch();
        }
        return 0;
    }

Enabling actions and finding errors. Limiting the number of significant digits in a number. Starting from the regular expression (s + e)(pd+ + d+(pd* + e)) with s = \+|-, p = \., d = \d, we have s + e = s? = (\+|-)? and pd* + e = (pd*)? = (\.\d*)?, which gives the pattern @"(\+|-)?(\.\d+|\d+(\.\d*)?)" or, anchored, @"^(\+|-)?(\.\d+|\d+(\.\d*)?)$". Capturing each digit into a named group:

    Regex r = new Regex(@"^(\+|-)?(\.(?'digit'\d)+|(?'digit'\d)+(\.(?'digit'\d)*)?)$");
    Match m = r.Match("+1.23456789");
    if (m.Success) {
        Group g = m.Groups["digit"];
        if (g.Captures.Count

Enabling actions and finding errors. Determining the error position:

    Regex r = new Regex(@"(\+|-)?(\.(?'digit'\d)+|(?'digit'\d)+(\.(?'digit'\d)*)?)");
    string str = "+1.2345!678";
    Match m = r.Match(str);
    if (m.Success) {
        Group g = m.Groups["digit"];
        if (g.Captures.Count 0)
            Console.WriteLine("Error at position 1: unexpected character \"{0}\"", str);
        else if (m.Length

Enabling actions and finding errors. Determining the position of the error: 1. the first position of the input string (position 1), if the first match does not start at position Index = 0; 2. the position following the last match (match.Length + 1), if it does not coincide with the last position of the input string; 3. the position of the first gap between matches (match[i].Index + match[i].Length + 1), if the character following the previous match is not the first character of the next match.

Index) break; index = m[i].Index + m[i].Length; } Console.WriteLine("Error at position {0} \"{1}\"", index + 1, str); } Examples: "abc.xyz.pqr" is correct; "+abc.xyz.pqr" gives an error at position 1 ("+"); "abc.xyz.pqr!" gives an error at position 12 ("!"); "abc.xyz!.pqr" gives an error at position 8 ("!").

Enabling actions and finding errors. However: "abc.xyz.+pqr" gives an error at position 8 ("."). A new variant of the pattern: @"\w+(\.\w+)*(\.(?!$))?". Checking: "abc.xyz.+pqr" gives an error at position 9 ("+"); "abc.xyz.pqr." gives an error at position 12 (".").

Balancing group definitions: "(?'x'…)" adds one element to the collection named "x"; "(?'-x'…)" removes one element from the collection "x"; "(?(x)(?!))" checks that no elements remain in the collection "x". The language L of nested Pascal "begin … end;" operators is described by: @"^\s*((?'begin'begin\s+)+(?'-begin'end\s*;\s*)+)*(?(begin)(?!))$".

Suppose we are given a non-deterministic finite automaton M = (Q, T, D, q0, F) and need to build a deterministic finite automaton M' = (Q', T, D', q0', F'). The initial state of the automaton under construction is the ε-closure of the initial state of the original automaton; the ε-closure of a state is the set of states reachable from it by ε-transitions. Then, while there are states for which transitions have not yet been constructed (transitions are built for the symbols on which the original automaton has transitions), for each symbol we compute the ε-closure of the set of states reachable from the state under consideration by a transition on that symbol. If a state corresponding to the resulting set already exists, a transition to it is added; if not, the new state is added.
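A sketch of this procedure in Python (the encoding of the NFA as a dictionary from (state, symbol) to a set of successor states, with None standing for ε, is an illustrative choice):

    # Sketch of the subset construction with ε-closures.
    from collections import deque

    def eps_closure(states, delta):
        """All states reachable from `states` by ε-transitions (symbol None)."""
        stack, closure = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in delta.get((q, None), set()):
                if r not in closure:
                    closure.add(r)
                    stack.append(r)
        return frozenset(closure)

    def nfa_to_dfa(delta, start, finals, alphabet):
        """Return the DFA as (start_set, transitions, final_sets); DFA states are frozensets."""
        d_start = eps_closure({start}, delta)
        d_trans, worklist, seen = {}, deque([d_start]), {d_start}
        while worklist:
            S = worklist.popleft()
            for a in alphabet:
                move = set()
                for q in S:
                    move |= delta.get((q, a), set())
                if not move:
                    continue
                T = eps_closure(move, delta)
                d_trans[(S, a)] = T
                if T not in seen:                    # add the new state if it is not there yet
                    seen.add(T)
                    worklist.append(T)
        d_finals = {S for S in seen if S & finals}
        return d_start, d_trans, d_finals

    # Tiny usage example: NFA for a(b|c) with ε-transitions 1->2, 3->4 and 3->6
    delta = {(1, None): {2}, (2, "a"): {3}, (3, None): {4, 6}, (4, "b"): {5}, (6, "c"): {7}}
    print(nfa_to_dfa(delta, 1, {5, 7}, {"a", "b", "c"}))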

Example

Initialization

The states belonging to the ε-closure of the initial state are marked. These states will correspond to state A of the future DFA.


First iteration

From the ε-closure there are transitions to NFA states 3 and 10 (on a and c, respectively). For state 3 the ε-closure is the set of states {3, 4, 6}; for state 10 it is {10}. Let us denote the new DFA states corresponding to these sets by B and C.

DFA state | Set of NFA states | a | b | c
A         | {1, 2, 9}         | B | - | C
B         | {3, 4, 6}         | - | - | -
C         | {10}              | - | - | -


Second iteration

From the set of NFA states {3, 4, 6}, corresponding to DFA state B, there are two transitions: to state 5 (on b) and to state 7 (on c). Their ε-closures intersect, but the sets themselves are different, so two new DFA states, D and E, are associated with them. There are no transitions from the NFA states corresponding to DFA state C.

DFA state | Set of NFA states | a | b | c
A         | {1, 2, 9}         | B | - | C
B         | {3, 4, 6}         | - | D | E
C         | {10}              | - | - | -
D         | {2, 5, 8, 9}      | - | - | -
E         | {2, 7, 8, 9}      | - | - | -


Third iteration

From the sets of NFA states corresponding to DFA states D and E, transitions lead to sets corresponding to already existing states (from the set {2, 5, 8, 9} corresponding to state D, the transition on a goes to state 3, which belongs to the set {3, 4, 6} corresponding to DFA state B, and the transition on c goes to state 10, corresponding to state C; similarly for the set corresponding to DFA state E). The construction of the DFA state and transition table is complete.

DFA state | Set of NFA states | a | b | c
A         | {1, 2, 9}         | B | - | C
B         | {3, 4, 6}         | - | D | E
C         | {10}              | - | - | -
D         | {2, 5, 8, 9}      | B | - | C
E         | {2, 7, 8, 9}      | B | - | C


Result:

Construction of a right-linear grammar using a finite automaton

We assign a nonterminal to each state. If there is a transition from state X to state Y on symbol a, add the rule X → aY. For each final state X, add the rule X → ε. For ε-transitions, add X → Y. (A code sketch of this mapping is given after the examples below.)

Example 1 (deterministic finite state machine)
  • A → a B | c C
  • B → b D | c E
  • C → ε
  • D → a B | c C
  • E → a B | c C
Example 2 (non-deterministic finite state machine)
  • 1 → 2 | 9
  • 2 → a 3
  • 3 → 4 | 6
  • 4 → b 5
  • 5 → 8
  • 6 → c 7
  • 7 → 8
  • 8 → 2 | 9
  • 9 → c 10
  • 10 → ε
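A sketch of this mapping in Python (the encoding of the automaton as lists of transitions is an illustrative choice); applied to the automaton of Example 1 it prints exactly the rules listed there:

    # Sketch: right-linear grammar from a finite automaton.
    def grammar_from_automaton(transitions, eps_transitions, finals):
        """transitions: iterable of (X, a, Y); eps_transitions: iterable of (X, Y)."""
        rules = []
        for x, a, y in transitions:
            rules.append(f"{x} -> {a} {y}")      # X -a-> Y   gives   X -> aY
        for x, y in eps_transitions:
            rules.append(f"{x} -> {y}")          # ε-transition gives X -> Y
        for x in finals:
            rules.append(f"{x} -> eps")          # final state gives  X -> ε
        return rules

    # Usage on Example 1 (the deterministic automaton above):
    trans = [("A", "a", "B"), ("A", "c", "C"), ("B", "b", "D"), ("B", "c", "E"),
             ("D", "a", "B"), ("D", "c", "C"), ("E", "a", "B"), ("E", "c", "C")]
    for rule in grammar_from_automaton(trans, [], {"C"}):
        print(rule)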
Constructing a DFA from a regular expression

Let there be a regular expression r. Using this regular expression, it is necessary to build a deterministic finite automaton D such that L(D) = L(r).

Regular Expression Modification

Let us add to it a symbol marking the end of the regular expression, "#". As a result we get the regular expression (r)#.

Building a tree

Let's imagine a regular expression as a tree, the leaves of which are terminal symbols, and the internal vertices are the operations of concatenation ".", union "∪" and iteration "*". We will assign a unique number to each leaf of the tree (except for ε-leaves) and refer to it, on the one hand, as a position in the tree and, on the other hand, as the position of the symbol corresponding to the leaf.

Evaluation of functions nullable, firstpos, lastpos

Now, traversing the tree T bottom-up and from left to right, we compute three functions: nullable, firstpos and lastpos. These functions are defined on the nodes of the tree. The value of every function except nullable is a set of positions. The function firstpos(n) gives, for each node n of the regular-expression syntax tree, the set of positions that correspond to the first characters of the substrings generated by the subexpression rooted at n. Likewise, lastpos(n) gives the set of positions corresponding to the last characters of the substrings generated by the subexpression rooted at n. For nodes n whose subtrees (the tree of which n is the root) can generate the empty word, we define nullable(n) = true; for the remaining nodes it is false. The table for computing nullable, firstpos and lastpos:

node n     | nullable(n)                 | firstpos(n)                                                    | lastpos(n)
ε          | true                        | ∅                                                              | ∅
leaf i ≠ ε | false                       | {i}                                                            | {i}
u ∪ v      | nullable(u) or nullable(v)  | firstpos(u) ∪ firstpos(v)                                      | lastpos(u) ∪ lastpos(v)
u . v      | nullable(u) and nullable(v) | if nullable(u) then firstpos(u) ∪ firstpos(v) else firstpos(u) | if nullable(v) then lastpos(u) ∪ lastpos(v) else lastpos(v)
v*         | true                        | firstpos(v)                                                    | lastpos(v)
Building followpos

The function followpos is computed via nullable, firstpos and lastpos. It is defined on the set of positions, and its value is a set of positions. If i is a position, then followpos(i) is the set of positions j such that there exists some string …cd… in the language described by the regular expression in which i corresponds to that occurrence of c and j to the occurrence of d. The function followpos can be computed in one traversal of the tree using the following two rules.

  • Let n be an internal node with the operation "." (concatenation), and let a and b be its left and right children. Then for each position i in lastpos(a), the set firstpos(b) is added to followpos(i).
  • Let n be an internal node with the operation "*" (iteration), and let a be its child. Then for each position i in lastpos(a), the set firstpos(a) is added to followpos(i).
  • Example

    Calculate the values of the function followpos for the regular expression (a(b|c))*c.

    Position | Symbol | followpos
    1        | a      | {2, 3}
    2        | b      | {1, 4}
    3        | c      | {1, 4}
    4        | c      | {5}
    Construction of DFA

    A DFA is a set of states and a set of transitions between them. Each DFA state represents a set of positions. The construction of the DFA consists of gradually adding the required states and building the transitions for them. Initially there is a single state, firstpos(root) (where root is the root of the tree), for which no transitions have been built yet. Transitions are built on the symbols occurring in the regular expression. Each symbol corresponds to a set of positions (p_i). The union of all followpos(x), where x ranges over the positions that occur both among the positions of the current state and among the positions of the symbol on which the transition is made, is the state to which the transition must lead. If no such state exists yet, it is added. The process is repeated until transitions have been constructed for all states. All states containing the position of the "#" symbol appended to the end of the regular expression are declared final.

    The DFA obtained from the regular expression (a(b|c))*c

    Example

    Construct a DFA from the regular expression (a(b|c))*c.

    DFA state   | a (1)     | b (2)     | c (3, 4)
    A {1, 4}    | B {2, 3}  | -         | C {5}
    B {2, 3}    | -         | A {1, 4}  | A {1, 4}
    C {5}       | -         | -         | -
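    The computation of nullable, firstpos, lastpos and followpos can be sketched as follows (Python; the encoding of the syntax tree as nested tuples is an illustrative choice). Run on (a(b|c))*c# with positions numbered as above, it reproduces the followpos values and the start state {1, 4}:

    # Sketch: nullable/firstpos/lastpos/followpos over a regex syntax tree.
    # Tree nodes: ("sym", position, char), ("cat", left, right), ("or", left, right), ("star", child).
    from collections import defaultdict

    followpos = defaultdict(set)

    def analyze(node):
        """Return (nullable, firstpos, lastpos) of the node and fill in followpos."""
        kind = node[0]
        if kind == "sym":
            _, i, _ = node
            return False, {i}, {i}
        if kind == "or":
            n1, f1, l1 = analyze(node[1]); n2, f2, l2 = analyze(node[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = analyze(node[1]); n2, f2, l2 = analyze(node[2])
            for i in l1:                      # rule for concatenation
                followpos[i] |= f2
            return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
        if kind == "star":
            n1, f1, l1 = analyze(node[1])
            for i in l1:                      # rule for iteration
                followpos[i] |= f1
            return True, f1, l1

    # (a(b|c))*c#  with positions a=1, b=2, c=3, c=4, #=5
    tree = ("cat",
            ("cat",
             ("star", ("cat", ("sym", 1, "a"), ("or", ("sym", 2, "b"), ("sym", 3, "c")))),
             ("sym", 4, "c")),
            ("sym", 5, "#"))

    nullable, firstpos_root, lastpos_root = analyze(tree)
    print(dict(followpos))        # {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {5}}
    print(firstpos_root)          # {1, 4}, the start state A of the DFA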
    Construction of a DFA with a minimum number of states

    A DFA with a minimum number of states is constructed for an everywhere-defined DFA, i.e. one whose transition function is defined for every state and every input symbol. For any DFA there is an equivalent everywhere-defined DFA.

    Construction of an everywhere defined DFA

    Let us introduce a new ("dead") state q_d and define the new set of states Q' = Q ∪ {q_d}.

    The new transition function δ' is defined as follows: δ'(q, a) = δ(q, a) wherever δ(q, a) is defined, and δ'(q, a) = q_d otherwise; in particular, δ'(q_d, a) = q_d for every input symbol a.

    Construction of a partition (formally)

    Let us construct the initial partition P of the set of states into two groups: the final states F and the rest Q\F, i.e. P = {F, Q\F}.

    Apply the following procedure to each group G ∈ P: split G into subgroups so that states s and t from G are in the same subgroup if and only if, for each input symbol a, states s and t have transitions on a into states belonging to the same group of P.

    We add the resulting subgroups to a new partition P new.

    We accept P = P new and repeat the construction of the partition until stabilization, i.e., until the partition stops changing.

    Construction of a partition (algorithm)

    To construct a partition, there is the following algorithm. We build a transition table for the original automaton, we build the initial partition.

    We assign an identifier to each group from the partition (for the initial partition, for example, 0 and 1).

    To each state (each row of the table) we assign a string of the form "a.bcd…xyz", where a is the identifier of the group containing the state itself (the first column, i.e. where we go from), and b, c, d, …, z are the identifiers of the groups containing the states in the second column (where we go on the first symbol), …, the last column (where we go on the last symbol).

    We build new groups based on the coincidence of strings, i.e., so that states with identical assigned strings fall into one group.

    Then, a new iteration. We assign new identifiers to the resulting groups, for example (0, 1, ..., n). And we repeat the construction of the partition until stabilization.

    Note that when there is only one state left in the group, at subsequent stages of constructing the partition there is no need to write out a string of identifiers for this state, since this string will in any case be unique due to the first character of the string. In other words, when splitting, nothing will happen to a group from one state - it is transferred to the new split as is.
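    A sketch of this signature-string refinement in Python (the representation of the automaton is an illustrative choice; the usage data follow the everywhere-defined exam DFA below, where the b-transition of E is taken to lead to F, as its identifier strings indicate):

    # Sketch: partition refinement for DFA minimization via signature strings.
    def minimize_partition(states, alphabet, delta, finals):
        """Return the stabilized partition as a set of frozensets of equivalent states."""
        partition = {frozenset(finals), frozenset(set(states) - set(finals))}
        while True:
            group_of = {q: g for g in partition for q in g}      # state -> its current group
            buckets = {}
            for q in states:
                # signature: own group plus the groups reached on every input symbol
                sig = (group_of[q],) + tuple(group_of[delta[(q, a)]] for a in alphabet)
                buckets.setdefault(sig, set()).add(q)
            new_partition = {frozenset(b) for b in buckets.values()}
            if new_partition == partition:                       # stabilized: nothing changed
                return new_partition
            partition = new_partition

    # Usage on the everywhere-defined exam DFA below (states A..F plus the dead state Z):
    states = ["A", "B", "C", "D", "E", "F", "Z"]
    alphabet = ["a", "b"]
    delta = {("A", "a"): "B", ("A", "b"): "C", ("B", "a"): "D", ("B", "b"): "E",
             ("C", "a"): "F", ("C", "b"): "C", ("D", "a"): "D", ("D", "b"): "Z",
             ("E", "a"): "D", ("E", "b"): "F", ("F", "a"): "Z", ("F", "b"): "Z",
             ("Z", "a"): "Z", ("Z", "b"): "Z"}
    finals = {"A", "B", "D", "E", "F"}
    print(minimize_partition(states, alphabet, delta, finals))   # every state in its own group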

    Construction of a reduced automaton

    Each of the resulting groups becomes the state of a new DFA. If a group contains the initial (final) state of the original automaton, this group becomes the initial (respectively final) state of the new DFA. Transitions are constructed in an obvious way: a transition to a state from a group is considered a transition to a group.

    Finally, we reduce the automaton: first we remove the non-generating (sterile, "dead") states, then the unreachable states (the definitions are given for grammar symbols but obviously carry over to the states of the automaton).

    Generally speaking, removing dead states turns a DFA into an NFA, since in a DFA all transitions must be defined. However, in the Dragon Book such a deviation from the formal definition is still considered acceptable.

    Example

    Construct a DFA with the minimum number of states for the DFA of the following form:

    • Initial partition: {C} (the final state), {A, B, D, E} (all other states).
    • {C} (unchanged), {A, D, E}, {B} (since from A, D and E the transitions on a and c lead to B and C, respectively).
    • No further splits are possible.

    Let the group {C} correspond to state C, the group {A, D, E} to state A, and the group {B} to state B. Then we obtain the DFA with the minimum number of states:

    Example (algorithm for constructing a partition)

    The transition table for the everywhere-defined DFA (state Z added) corresponding to the regular expression (ab|ε)a*|abb|b*a. From the 2012 exam tasks.

    State | a | b | I0   | I1
    →A*   | B | C | 0.01 | 0.12
    B*    | D | E | 0.00 | 1.01
    C     | F | C | 1.01 |
    D*    | D | Z | 0.01 | 0.04
    E*    | D | F | 0.00 | 1.03
    F*    | Z | Z | 0.11 |
    Z     | Z | Z | 1.11 |

    (The columns I0 and I1 contain the identifier strings assigned at the first and second iterations; * marks final states and → marks the initial state.)

    Iterations:

    • I 0: ABDEF (0), CZ (1).
    • I 1: AD(0), BE(1), C(2), F(3), Z(4).
    • I 2: A, B, C, D, E, F, Z.

    Result: the machine already had a minimum number of states :-)

    A DFA accepting the complement of a language

    The algorithm for constructing a DFA that accepts the complement of the language L (L̅) consists of two steps:

    • Construction of a complete DFA
    • Constructing the desired automaton from it

    Strictly speaking, there is no separate notion of a "complete DFA". By a complete DFA some instructors mean an automaton whose transition table has no empty cells. However, according to the definition of a DFA, δ: Q × Σ → Q, there cannot be empty cells in any case. An automaton "with empty cells" does, however, satisfy the definition of an NFA. In the course of a solution we often end up with precisely such an NFA, which only lacks some transitions to be a DFA.

    In order to complete it, it suffices to add a new state X and, in place of the missing transitions, add transitions to this new state. One must also remember to add the transitions from X to X. It is easy to see that wherever the original automaton did not accept some string because of a missing transition, the new automaton will enter state X and get stuck there. Since the new state is not accepting (final), the new automaton will likewise not accept this string.

    Now, to build the required automaton, all that remains is to swap the roles of the accepting and non-accepting states. In other words, F' = Q\F.
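    Both steps fit in a few lines; the following Python sketch (with an illustrative data layout) completes the automaton with a dead state X and then inverts the set of accepting states:

    # Sketch: complementing a DFA by completing it and flipping the accepting states.
    def complement_dfa(states, alphabet, delta, start, finals, dead="X"):
        """delta maps (state, symbol) -> state and may be partial; returns a total DFA for the complement."""
        new_states = set(states) | {dead}
        new_delta = {}
        for q in new_states:
            for a in alphabet:
                # missing transitions (and all transitions of the dead state) go to the dead state
                new_delta[(q, a)] = delta.get((q, a), dead)
        new_finals = new_states - set(finals)          # F' = Q \ F
        return new_states, new_delta, start, new_finals

    # Tiny usage example: a partial two-state DFA; its b-transitions are missing.
    states, alphabet = {"q0", "q1"}, {"a", "b"}
    delta = {("q0", "a"): "q1", ("q1", "a"): "q1"}
    print(complement_dfa(states, alphabet, delta, "q0", {"q1"}))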

    Building an LL(k) parser. Grammar transformations

    Not every grammar admits LL(k) analysis. A context-free grammar belongs to the class LL(1) if it has no FIRST-FIRST conflicts (left recursion is a special case of such a conflict) and no FIRST-FOLLOW conflicts.

    Sometimes it is possible to convert non-LL(1) grammars so that they become LL(1). Some (more precisely, those discussed in the course) transformations are given below.

    Removing left recursion

    Let us have a rule of the form (hereinafter in this section, capital letters - non-terminal symbols, lowercase - chains of any characters):

    • A → Aa | Ab | ... | Ak | m | n | … | z

    It does not lend itself to unambiguous analysis, so it must be transformed.

    It is easy to show that this rule is equivalent to the following pair of rules:

    • A → m B | n B | ... | z B
    • B → a B | b B | ... | k B | ε
    Left factorization

    The essence of this procedure is to eliminate ambiguity in the choice of a rule based on the next symbol. To do this, a common left prefix is found, and what can follow it is moved into a new rule (lowercase letters denote strings of arbitrary symbols).

    Example
    • A → a c | a df | a dg | b

    Converts to

    • A → a B | b
    • B → c | d f | d g

    Which in turn will turn into

    • A → a B | b
    • B → c | d C
    • C → f | g
    Grammar Conversion Example

    G= ((S, A, B), (a, b, c), P, S)

    • S → SAbB | a
    • A → ab | aa | ε
    • B → c | ε

    Removing left recursion for S:

    • S → aS 1
    • S 1 → AbBS 1 | ε

    Left factorization for A:

    • A → aA 1 | ε
    • A 1 → b | a

    Final grammar:

    • S → aS 1
    • S 1 → AbBS 1 | ε
    • A → aA 1 | ε
    • A 1 → b | a
    • B → c | ε
    Construction of FIRST and FOLLOW

    FIRST(α), where α ∈ (N ∪ T)*, is the set of terminals with which α can begin. If α ⇒* ε, then ε ∈ FIRST(α). Correspondingly, the value of FOLLOW(A) for a nonterminal A is the set of terminals that can appear immediately after A in any sentential form. If A can be the rightmost symbol of some sentential form, then the end marker $ also belongs to FOLLOW(A).

    FIRST computation. For terminals:
    • For any terminal x ∈ T, FIRST(x) = {x}
    For nonterminals:
    • If X is a nonterminal, then we initialize FIRST(X) = ∅
    • If the grammar has a rule X → ε, then add ε to FIRST(X)
    • For each nonterminal X and each production X → Y1…Yk, add to FIRST(X) the FIRST sets of all symbols on the right-hand side up to and including the first one from which ε is not derivable
    For strings:
    • For a string of symbols X1…Xk, FIRST is the union of the FIRST sets of the symbols of the string up to and including the first one whose FIRST does not contain ε.
    Example
    • S → aS 1
    • S 1 → AbBS 1 | ε
    • A → aA 1 | ε
    • A 1 → b | a
    • B → c | ε

    FIRST non-terminals in order of dependency resolution:

    • FIRST(S) = {a}
    • FIRST(A) = {a, ε}
    • FIRST(A 1) = {b, a}
    • FIRST(B) = {c, ε}
    • FIRST(S 1) = {a, b, ε}

    FIRST for inference rules:

    • FIRST(aS 1) = {a}
    • FIRST(AbBS 1) = {a, b}
    • FIRST(ε) = {ε}
    • FIRST(aA 1) = {a}
    • FIRST(a) = {a}
    • FIRST(b) = {b}
    • FIRST(c) = {c}
    FOLLOW calculation

    Computing the FOLLOW set of a symbol X:

    • Initialize FOLLOW(X) = ∅
    • If X is the axiom (start symbol) of the grammar, then add the marker $ to FOLLOW(X)
    • For all rules of the form A → αXβ, add FIRST(β)\{ε} to FOLLOW(X) (X can be followed by the symbols with which β can begin)
    • For all rules of the form A → αX, and A → αXβ with ε ∈ FIRST(β), add FOLLOW(A) to FOLLOW(X) (that is, X can be followed by every symbol that can follow A, if X can be the rightmost symbol in the production)
    • Repeat the previous two steps while it is still possible to add symbols to the sets
    Example
    • S → aS 1
    • S 1 → AbBS 1 | ε
    • A → aA 1 | ε
    • A 1 → b | a
    • B → c | ε

    Result:

    • FOLLOW(S) = {$}
    • FOLLOW(S 1) = {$} (S 1 is the rightmost symbol in the rule S → aS 1)
    • FOLLOW(A) = {b} (A is followed by b in the rule S 1 → AbBS 1)
    • FOLLOW(A 1) = {b} (A 1 is the rightmost symbol in the rule A → aA 1, so FOLLOW(A) is added to FOLLOW(A 1))
    • FOLLOW(B) = {a, b, $} (add FIRST(S 1)\{ε} (from the rule S 1 → AbBS 1) and FOLLOW(S 1) (since S 1 → ε exists))
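    The iterative computation of FIRST and FOLLOW described above can be written out directly; the following Python sketch (with EPS and END standing for ε and the $ marker, and a grammar encoding of our own choosing) reproduces the sets listed above for the example grammar:

    # Sketch: iterative computation of FIRST and FOLLOW for a context-free grammar.
    EPS, END = "eps", "$"

    def first_of_string(symbols, FIRST):
        """FIRST of a string of grammar symbols."""
        out = set()
        for X in symbols:
            out |= FIRST[X] - {EPS}
            if EPS not in FIRST[X]:
                return out
        out.add(EPS)                       # every symbol of the string can derive ε
        return out

    def compute_first_follow(grammar, terminals, start):
        FIRST = {t: {t} for t in terminals}
        FIRST.update({A: set() for A in grammar})
        FOLLOW = {A: set() for A in grammar}
        FOLLOW[start].add(END)
        changed = True
        while changed:
            changed = False
            for A, productions in grammar.items():
                for alpha in productions:
                    # FIRST rules
                    add = first_of_string(alpha, FIRST) if alpha else {EPS}
                    if not add <= FIRST[A]:
                        FIRST[A] |= add
                        changed = True
                    # FOLLOW rules
                    for i, X in enumerate(alpha):
                        if X not in grammar:           # only nonterminals get FOLLOW sets
                            continue
                        rest = alpha[i + 1:]
                        tail = first_of_string(rest, FIRST) if rest else {EPS}
                        add = (tail - {EPS}) | (FOLLOW[A] if EPS in tail else set())
                        if not add <= FOLLOW[X]:
                            FOLLOW[X] |= add
                            changed = True
        return FIRST, FOLLOW

    # The example grammar (S1 and A1 written as "S1", "A1"; ε-productions are empty tuples):
    grammar = {
        "S":  [("a", "S1")],
        "S1": [("A", "b", "B", "S1"), ()],
        "A":  [("a", "A1"), ()],
        "A1": [("b",), ("a",)],
        "B":  [("c",), ()],
    }
    FIRST, FOLLOW = compute_first_follow(grammar, {"a", "b", "c"}, "S")
    print(FIRST["S1"], FOLLOW["B"])   # FIRST(S1) = {a, b, ε}; FOLLOW(B) = {a, b, $}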
    Compiling a table

    In the table M, the cell for a nonterminal-terminal pair (cell M[A, a]) indicates the rule to be applied to the input word. The table is filled in as follows: for each production A → α of the given grammar (where α is the string on the right-hand side of the rule), the following actions are performed:

  • For each terminal a ∈ FIRST(α), add the rule A → α to M[A, a]
  • If ε ∈ FIRST(α), then for each b ∈ FOLLOW(A), add A → α to M[A, b]
  • If ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add A → α to M[A, $]
  • All empty cells signal an error in the input word
  • Example

    Build a table for grammar

    • S → aS 1
    • S 1 → AbBS 1 | ε
    • A → aA 1 | ε
    • A 1 → b | a
    • B → c | ε

    Result:

         | a             | b             | c       | $
    S    | S → aS 1      | error         | error   | error
    S 1  | S 1 → AbBS 1  | S 1 → AbBS 1  | error   | S 1 → ε
    A    | A → aA 1      | A → ε         | error   | error
    A 1  | A 1 → a       | A 1 → b       | error   | error
    B    | B → ε         | B → ε         | B → c   | B → ε

    (Each entry of the form A → α with a terminal in FIRST(α) is placed by the first rule; the entries S 1 → ε, A → ε and B → ε are placed by the second and third rules, since the corresponding terminal or $ belongs to the FOLLOW set of the nonterminal; all empty cells are errors by the fourth rule.)
    Parsing a string

    The process of parsing a string is quite simple. Its essence is as follows: at each step the symbol v on top of the stack and the current symbol c of the input string are examined.

    • If v is a terminal symbol
      • If v coincides with c, then both are discarded and a shift occurs
      • If v does not coincide with c, a parse error is signaled
    • If v is a nonterminal symbol, c stays at the front of the input; v is popped from the stack and replaced by the right-hand side of the rule taken from the table cell M[v, c]

    The process ends when both the string and the stack reach their end markers ($ and #).

    Example

    Let's parse the string "aabbaabcb":

    Stack      | Input       | Action
    S#         | aabbaabcb$  | S → aS 1
    aS 1#      | aabbaabcb$  | shift
    S 1#       | abbaabcb$   | S 1 → AbBS 1
    AbBS 1#    | abbaabcb$   | A → aA 1
    aA 1bBS 1# | abbaabcb$   | shift
    A 1bBS 1#  | bbaabcb$    | A 1 → b
    bbBS 1#    | bbaabcb$    | shift
    bBS 1#     | baabcb$     | shift
    BS 1#      | aabcb$      | B → ε
    S 1#       | aabcb$      | S 1 → AbBS 1
    AbBS 1#    | aabcb$      | A → aA 1
    aA 1bBS 1# | aabcb$      | shift
    A 1bBS 1#  | abcb$       | A 1 → a
    abBS 1#    | abcb$       | shift
    bBS 1#     | bcb$        | shift
    BS 1#      | cb$         | B → c
    cS 1#      | cb$         | shift
    S 1#       | b$          | S 1 → AbBS 1
    AbBS 1#    | b$          | A → ε
    bBS 1#     | b$          | shift
    BS 1#      | $           | B → ε
    S 1#       | $           | S 1 → ε
    #          | $           | done
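    The parsing loop itself takes only a few lines. The sketch below (Python) is driven by the table constructed above, with "#" as the bottom-of-stack marker and "$" as the end-of-input marker; the table is written out by hand here rather than generated:

    # Sketch: table-driven LL(1) parsing for the example grammar.
    TABLE = {
        ("S",  "a"): ["a", "S1"],
        ("S1", "a"): ["A", "b", "B", "S1"],
        ("S1", "b"): ["A", "b", "B", "S1"],
        ("S1", "$"): [],
        ("A",  "a"): ["a", "A1"],
        ("A",  "b"): [],
        ("A1", "a"): ["a"],
        ("A1", "b"): ["b"],
        ("B",  "a"): [], ("B", "b"): [], ("B", "$"): [],
        ("B",  "c"): ["c"],
    }
    NONTERMINALS = {"S", "S1", "A", "A1", "B"}

    def ll1_parse(word):
        stack = ["#", "S"]                 # the top of the stack is the end of the list
        tape = list(word) + ["$"]
        pos = 0
        while True:
            v, c = stack[-1], tape[pos]
            if v == "#":
                return c == "$"            # success only if the input is exhausted too
            if v not in NONTERMINALS:      # terminal on top of the stack
                if v != c:
                    return False           # parse error
                stack.pop(); pos += 1      # shift
            else:
                rule = TABLE.get((v, c))
                if rule is None:
                    return False           # empty cell: error
                stack.pop()
                stack.extend(reversed(rule))

    print(ll1_parse("aabbaabcb"))   # True, as in the trace above
    print(ll1_parse("aa"))          # False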
    Building an LR(k) analyzer Calculating k in LR(k)

    There is no algorithm that computes k in the general case for an arbitrary grammar. It is usually worth trying to build an LR(1) parser first. If it has at most one operation (Shift, Reduce or Accept) for each set, then the grammar is LR(0). If a conflict (collision) arises while constructing the LR(1) analyzer, then the grammar is not LR(1) and it is worth trying to construct an LR(2) analyzer; if that cannot be constructed, then LR(3), and so on.

    Augmenting the grammar

    Let us add a new rule S" → S and make S" the axiom of the grammar. This additional rule is needed to determine when the analyzer stops and accepts the input string. Acceptance occurs if and only if a reduction by the rule S" → S can be performed.

    Construction of a canonical system of sets of admissible LR(1) situations

    At the beginning there is the set I 0 with the analyzer configuration S" → .S, $. The closure operation is applied to this configuration until it no longer adds new configurations. Then transitions to new sets are constructed by shifting the dot one symbol to the right (a transition is made on the symbol that stood after the dot before the transition and before the dot after it), and the configurations obtained from the existing ones in this way are added to those sets. The closure operation is applied to them as well, and the whole process is repeated until no new sets appear.

    Example

    Construct a canonical system of sets of admissible LR(1) situations for the specified grammar:

    • S" → S
    • S → ABA
    • A → Aa | ε
    • B → cBc | d

    Solution:

    • We build a closure for the configuration S" → .S, $:
      • S → .ABA, $
    • For the resulting configurations (S → .ABA, $) we also construct a closure:
      • A → .Aa, c
      • A → .Aa, d
      • A → ., c
      • A → ., d
    • For the resulting configurations (A → .Aa, c; A → .Aa, d) we also construct a closure:
      • A → .Aa, a
      • A → ., a
    • It is impossible to build more configurations in state I 0 - the closure is built
    • From I 0 you can make transitions along S and A and get a set of configurations I 1 and I 2, consisting of the following elements:
      • I 1 = (S" → S., $)
      • I 2 = (S → A.BA, $; A → A.a, c; A → A.a, d; A → A.a, a)
    • I 1 does not require closure
    • Let's construct the closure I 2:
      • B → .cBc, a
      • B → .cBc, $
      • B → .d, a
      • B → .d, $
    • All other sets are constructed similarly.
    Building an analyzer table

    The final stage of building an LR(1) analyzer is the construction of the Action and Goto tables. The Action table is built for the symbols of the input string, i.e. for the terminals and the end-of-input marker $; the Goto table is built for the grammar symbols, i.e. for terminals and nonterminals.

    Building a Goto Table

    The Goto table shows which state to go to when the next grammar symbol is encountered. Therefore, if the canonical system of sets has a transition from I i to I j on the symbol A, then we put the state I j into Goto(I i, A). After the table is filled in, we set Goto(I i, A) = Error for all empty cells.

    Building the Actions table
    • If there is a transition via terminal a from state I i to state I j, then Action(I i, a) = Shift(I j)
    • If A ≠ S" and there is a configuration A → α., a, then Action(I i , a) = Reduce(A → α)
    • For a state I i in which there is a configuration S" → S., $, Action(I i, $) = Accept
    • For all empty cells Action(I i , a) = Error
    Example

    Build Action and Goto tables for grammar

    • S" → S
    • S → ABA
    • A → Aa | ε
    • B → cBc | d

    Solution:

    Action table:

    State | a               | c               | d               | $
    I 0   | Reduce(A → ε)   | Reduce(A → ε)   | Reduce(A → ε)   |
    I 1   |                 |                 |                 | Accept
    I 2   | Shift(I 6)      | Shift(I 4)      | Shift(I 5)      |
    I 3   | Reduce(A → ε)   |                 |                 | Reduce(A → ε)
    I 4   |                 | Shift(I 8)      | Shift(I 9)      |
    I 5   | Reduce(B → d)   |                 |                 | Reduce(B → d)
    I 6   | Reduce(A → Aa)  | Reduce(A → Aa)  | Reduce(A → Aa)  |
    I 7   |                 | Shift(I 10)     |                 |
    I 8   |                 | Shift(I 8)      | Shift(I 9)      |
    I 9   |                 | Reduce(B → d)   |                 |
    I 10  | Reduce(B → cBc) |                 |                 | Reduce(B → cBc)
    I 11  |                 | Shift(I 12)     |                 |
    I 12  |                 | Reduce(B → cBc) |                 |
    I 13  | Shift(I 14)     |                 |                 | Reduce(S → ABA)
    I 14  | Reduce(A → Aa)  |                 |                 | Reduce(A → Aa)

    Goto table:

    State | S   | S" | A    | B    | a    | c    | d
    I 0   | I 1 |    | I 2  |      |      |      |
    I 2   |     |    |      | I 3  | I 6  | I 4  | I 5
    I 3   |     |    | I 13 |      |      |      |
    I 4   |     |    |      | I 7  |      | I 8  | I 9
    I 7   |     |    |      |      |      | I 10 |
    I 8   |     |    |      | I 11 |      | I 8  | I 9
    I 11  |     |    |      |      |      | I 12 |
    I 13  |     |    |      |      | I 14 |      |
    Parsing the chain

    At each step, the state v on top of the analyzer stack is examined and the current symbol c of the input string is read.

    If the Action table at the intersection of v and c contains:

    • Shift(I k): c is pushed onto the stack, followed by I k; c is removed from the input string.
    • Reduce(A → u): all terminal and nonterminal symbols making up the string u (together with their states) are removed from the top of the stack, after which the state I m left on top is examined. In the Goto table, at the intersection of I m and A, the next state I s is found. Then A is pushed onto the stack, followed by I s. The input string remains unchanged.
    • Accept: the analysis is completed successfully.
    • An empty cell signals an error. A sketch of this loop in code is given below.
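    The sketch below (Python) takes the Action and Goto tables as dictionaries; the action encoding, ('shift', k) / ('reduce', A, n) / 'accept', is an illustrative choice:

    # Sketch: the LR(1) parsing loop driven by precomputed Action and Goto tables.
    def lr_parse(tokens, action, goto, start_state=0):
        """tokens: list of terminals without '$'. Returns True if the input is accepted."""
        tape = list(tokens) + ["$"]
        stack = [start_state]              # states and grammar symbols interleaved; a state on top
        pos = 0
        while True:
            state, c = stack[-1], tape[pos]
            act = action.get((state, c))
            if act is None:
                return False               # empty cell: error
            if act == "accept":
                return True
            if act[0] == "shift":          # push the symbol and the new state, consume the input
                stack.append(c)
                stack.append(act[1])
                pos += 1
            else:                          # ("reduce", A, n): pop n symbols with their states
                _, A, n = act
                del stack[len(stack) - 2 * n:]
                prev = stack[-1]
                stack.append(A)
                stack.append(goto[(prev, A)])

    # Tiny usage example for the grammar S' -> S, S -> d:
    action = {(0, "d"): ("shift", 2), (2, "$"): ("reduce", "S", 1), (1, "$"): "accept"}
    goto = {(0, "S"): 1}
    print(lr_parse(["d"], action, goto))   # True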
    Example

    Let us parse the string aaaccdcc:

    Stack                          | Input     | Action
    I 0                            | aaaccdcc$ | Reduce(A → ε), goto I 2
    I 0 A I 2                      | aaaccdcc$ | Shift(I 6)
    I 0 A I 2 a I 6                | aaccdcc$  | Reduce(A → Aa), goto I 2
    I 0 A I 2                      | aaccdcc$  | Shift(I 6)
    I 0 A I 2 a I 6                | accdcc$   | Reduce(A → Aa), goto I 2
    I 0 A I 2                      | accdcc$   | Shift(I 6)
    I 0 A I 2 a I 6                | ccdcc$    | Reduce(A → Aa), goto I 2
    I 0 A I 2                      | ccdcc$    | Shift(I 4)
    I 0 A I 2 c I 4                | cdcc$     | Shift(I 8)
    I 0 A I 2 c I 4 c I 8          | dcc$      | Shift(I 9)
    I 0 A I 2 c I 4 c I 8 d I 9    | cc$       | Reduce(B → d), goto I 11
    I 0 A I 2 c I 4 c I 8 B I 11   | cc$       | Shift(I 12)
    I 0 A I 2 c I 4 c I 8 B I 11 c I 12 | c$   | Reduce(B → cBc), goto I 7
    I 0 A I 2 c I 4 B I 7          | c$        | Shift(I 10)
    I 0 A I 2 c I 4 B I 7 c I 10   | $         | Reduce(B → cBc), goto I 3
    I 0 A I 2 B I 3                | $         | Reduce(A → ε), goto I 13
    I 0 A I 2 B I 3 A I 13         | $         | Reduce(S → ABA), goto I 1
    I 0 S I 1                      | $         | Accept
    Translation of arithmetic expressions (the Sethi-Ullman algorithm)

    Note: the code is generated in a Motorola-like two-address style, that is,

    Op Arg1, Arg2

    stands for

    Arg2 = Arg1 Op Arg2

    Building a tree

    The tree is built in the usual way for an arithmetic expression: at the root is the operation with the lowest priority, below it the operations with higher priority, and so on; parenthesized subexpressions have the highest priority. If there are several operations of the same priority, a op b op c, the tree is built as for the expression (a op b) op c.

    Example

    Construct a tree for the expression a + b / (d + a − b × c / d − e) + c × d

    Solution: Let's write the expression in the form

    ((a) + ((b) / ((((d) + (a)) − ((b) × (c)) / (d)) − (e)))) + ((c) × ( d))

    Then there will be an operation at the root of each subtree, and the expressions in brackets to the left and right of it will be its subtrees. For example, for the subexpression ((b) × (c)) / (d), the root of the corresponding subtree will have the operation “/”, and its subtrees will be the subexpressions ((b) × (c)) and (d).

    Tree layout (calculating the number of registers)
    • If the vertex is the left leaf (that is, a variable), then we mark it with zero.
    • If the vertex is the right leaf, then we mark it with one
    • If we have marked both of its subtrees for a vertex, then we mark it as follows:
      • If the left and right subtrees are marked with different numbers, then choose the largest of them
      • If the left and right subtrees are labeled with the same numbers, then we associate with this subtree a number that is one greater than the one with which the subtrees are labeled
    [Slide diagrams: labeling of leaves, and labeling of trees whose subtrees have equal labels, a larger label on the left, or a larger label on the right.]

    Registers are then assigned from the root downwards (in the example below the root receives R0):
  • If the label of the right child is greater than or equal to the label of the left child, then the left child is assigned a register one greater than that of its parent, and the right child the same register as the parent.
  • If the label of the left child is greater than the label of the right child, then the right child is assigned a register one greater than that of the parent, and the left child the same register as the parent.
  • The code is generated by traversing the tree from bottom to top as follows:

    1. For a vertex with label 0, the code is not generated

    2. If the vertex is a leaf X with label 1 and register R i, then it corresponds to the code

    MOVE X, Ri

    3. If the vertex is an internal node with register R i and its left child is a leaf X with label 0, then the vertex corresponds to the code

    Op X, Ri

    4. If the subtrees of a vertex with register R i are not leaves and the label of the right child is greater than or equal to the label of the left child (the left child is assigned register R j, j = i + 1, and the right child register R i), then the vertex corresponds to the code

    Op Rj, Ri

    5. If the subtrees of a vertex with register R i are not leaves and the label of the right child (which is assigned register R j, j = i + 1, while the left child keeps R i) is less than the label of the left child, then the vertex corresponds to the code below, which computes the result in R j and then moves it into R i (cf. the ADD R0, R1 / MOVE R1, R0 pair at the root of the example):

    Op Ri, Rj
    MOVE Rj, Ri

    The register distribution is shown in the graph on the right. Generated code:

    MOVE d, R0  ; R0 = d
    MOVE c, R1  ; R1 = c
    MUL b, R1   ; R1 = b × c
    DIV R1, R0  ; R0 = (b × c) / d
    MOVE a, R1  ; R1 = a
    ADD d, R1   ; R1 = a + d
    SUB R1, R0  ; R0 = (a + d) − ((b × c) / d)
    MOVE e, R1  ; R1 = e
    SUB R0, R1  ; R1 = ((a + d) − ((b × c) / d)) − e
    MOVE R1, R0 ; R0 = ((a + d) − ((b × c) / d)) − e
    DIV b, R0   ; R0 = b / (((a + d) − ((b × c) / d)) − e)
    ADD a, R0   ; R0 = a + (b / (((a + d) − ((b × c) / d)) − e))
    MOVE d, R1  ; R1 = d
    MUL c, R1   ; R1 = c × d
    ADD R0, R1  ; R1 = (a + (b / (((a + d) − ((b × c) / d)) − e))) + (c × d)
    MOVE R1, R0 ; R0 = (a + (b / (((a + d) − ((b × c) / d)) − e))) + (c × d)
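    The labeling step itself is easy to express in code. The following Python sketch (the tuple encoding of the expression tree is an illustrative choice) computes the labels by the rules above; on the example expression it yields 2, matching the two registers R0 and R1 used in the generated code:

    # Sketch: computing Sethi-Ullman labels (minimum register counts) for an expression tree.
    def label(node, is_left=True):
        """node is either a variable name (a leaf) or a tuple (op, left, right)."""
        if isinstance(node, str):
            return 0 if is_left else 1          # left leaves get 0, right leaves get 1
        _, left, right = node
        l = label(left, True)
        r = label(right, False)
        return max(l, r) if l != r else l + 1   # equal labels need one extra register

    # a + b / (d + a - b*c/d - e) + c*d, built left-associatively as in the text:
    expr = ("+",
            ("+",
             "a",
             ("/", "b", ("-", ("-", ("+", "d", "a"), ("/", ("*", "b", "c"), "d")), "e"))),
            ("*", "c", "d"))
    print(label(expr))   # 2: two registers suffice, as in the generated code above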

    Translation of logical expressions

    This section shows how to generate code for the lazy (short-circuit) evaluation of Boolean expressions. The result of the algorithm is a fragment of code that, using the TST, BNE and BEQ instructions, evaluates a logical expression by transferring control to one of the labels TRUELAB or FALSELAB.

    Building a tree

    The tree of a logical expression reflects the order of its evaluation in accordance with the priority of operations, that is, to evaluate the value of a certain tree node (which is an operation on two operands that are subtrees of the node), we must first evaluate the values ​​of its subtrees.

    Operator priority: the NOT operation has the highest priority, followed by AND, and then OR. If the expression uses other logical operations, they must first be expressed in terms of these three (usually there are no other operations and no transformation of the expression is required). Associativity for operations of the same priority is from left to right, that is, A and B and C is treated as (A and B) and C.

    Example

    Construct a tree for the logical expression not A or B and C and (B or not C).

    Solution: See diagram on the right.

    For each vertex of the tree, 4 attributes are calculated:

    • Node number
    • Label where to go if the expression in the node is false (false label, fl)
    • Label where to go if the expression in the node is true (true label, tl)
    • Sign (see below for more details)

    The vertices are numbered in any order; the only condition is that the node numbers are unique.

    The tree is marked as follows:

    • fl is the number of the vertex (or falselab) to which control is transferred if the expression at this vertex is false
    • tl is the number of the vertex (or truelab) to which control is transferred if the expression at this vertex is true

    Sign indicates on which value the evaluation of the current subtree may stop early: at a leaf, sign = true means a jump is made when the subexpression is true (to tl), and sign = false means a jump is made when it is false (to fl).

    For the root of the tree fl=falselab, tl=truelab, sign=false.

    Thus, the attributes are propagated from a vertex to its children as follows:

    • for an OR vertex: the left child receives the tl of the vertex, fl equal to the number of the right child, and sign = true; the right child receives the tl, fl and sign of the vertex unchanged;
    • for an AND vertex: the left child receives the fl of the vertex, tl equal to the number of the right child, and sign = false; the right child receives the tl, fl and sign of the vertex unchanged;
    • for a NOT vertex: its child receives the tl and fl of the vertex swapped and the negation of its sign.

    Example

    Label the tree constructed for the logical expression not A or B and C and (B or not C).

    Code generation

    Machine commands used in the generated code:

    • TST - tests the argument for truth and sets the flag if the argument is false
    • BNE - jump to the label if the flag is not set, that is, the condition checked by TST is true
    • BEQ - jump to the label if the flag is set, that is, the condition checked by TST is false

    The code is constructed as follows:

    • the tree is traversed from the root; for AND and OR the left subtree is traversed first, then the right one
    • for each vertex passed, its number is printed as a label
    • for a leaf A with attributes (number, tl, fl, sign), TST A is printed, and then
      • if sign == true, BNE tl is printed
      • if sign == false, BEQ fl is printed
    Example

    For the expression discussed above, the following code will be generated:

    1: 2: 4:  TST A
              BEQ TRUELAB
    3: 5: 7:  TST B
              BEQ FALSELAB
    8:        TST C
              BEQ FALSELAB
    6: 9:     TST B
              BNE TRUELAB
    10: 11:   TST C
              BNE FALSELAB
    TRUELAB:
    FALSELAB:
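    The following Python sketch puts the numbering, the attribute propagation and the leaf rules together; the BNode class and the function names are ours, and the propagation of tl, fl and sign follows the rules given above. Run on the example expression, it prints the vertex numbers as labels together with the same TST/BNE/BEQ instructions.

    from collections import deque

    class BNode:
        """A vertex of the logical-expression tree: op is 'and', 'or', 'not' or a variable name."""
        def __init__(self, op, left=None, right=None):
            self.op, self.left, self.right = op, left, right
            self.num = None

    def V(name): return BNode(name)          # a leaf: a variable

    # not A or B and C and (B or not C)
    tree = BNode('or',
                 BNode('not', V('A')),
                 BNode('and',
                       BNode('and', V('B'), V('C')),
                       BNode('or', V('B'), BNode('not', V('C')))))

    def number(root):
        """Number the vertices; here breadth-first, which matches the numbering in the example."""
        n, queue = 0, deque([root])
        while queue:
            node = queue.popleft()
            n += 1
            node.num = n
            queue.extend(c for c in (node.left, node.right) if c is not None)

    def generate(node, tl, fl, sign, out):
        out.append(f"{node.num}:")                   # print the vertex number as a label
        if node.left is None:                        # a leaf: test the variable and jump
            out.append(f"    TST {node.op}")
            out.append(f"    BNE {tl}" if sign else f"    BEQ {fl}")
        elif node.op == 'not':                       # swap the labels and flip the sign
            generate(node.left, fl, tl, not sign, out)
        elif node.op == 'and':                       # false -> fl, true -> fall through to the right child
            generate(node.left, node.right.num, fl, False, out)
            generate(node.right, tl, fl, sign, out)
        else:                                        # 'or': true -> tl, false -> fall through to the right child
            generate(node.left, tl, node.right.num, True, out)
            generate(node.right, tl, fl, sign, out)

    number(tree)
    code = []
    generate(tree, 'TRUELAB', 'FALSELAB', False, code)
    code += ['TRUELAB:', 'FALSELAB:']
    print('\n'.join(code))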

    Pattern matching method

    The idea of the method is that the code for the same section of the program can be generated in different ways and, as a result, can be optimized for one parameter or another (here, for execution time).

    Formulation of the problem

    A set of samples is given; for each sample, the fragment of the intermediate representation to which it is applicable, its weight, and the code it generates are defined. There is also an intermediate representation tree: the fragment of the program for which code needs to be generated. The goal is to construct a covering of the intermediate representation tree by samples such that the total weight of the samples used is minimal.

    Samples are assembly instructions and parse trees that correspond to them. For each sample, the execution time (in clock cycles) is known. With their help, we will generate optimal (in terms of execution time) code.

    Building an intermediate representation

    First, we build a parse tree for the entire expression.

    Coverage construction

    Now for each vertex (processing them in order from the leaves to the root) we generate the optimal code for its subtree. To do this, we simply try all the samples applicable at this vertex. The execution time when using a specific sample is the sum of the time needed to compute its arguments (the optimal code for them is already known thanks to the traversal order) and the execution time of the sample itself. From all the options obtained we select the best one; it is the optimal code for the subtree of this vertex. At the root of the tree we obtain the optimal code for the entire expression.

    Code generation

    It is not necessary to write out the code for all vertices: it is enough to record, for each vertex, the minimum required time and the sample that has to be used; everything else is easily restored from this information.

    In these tasks the number of registers is unbounded, so a new register can be used every time.
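    A minimal Python sketch of this bottom-up covering step is given below. The intermediate representation and the pattern set here are hypothetical (the actual samples, their weights and the IR tree are shown only in figures not reproduced here); the sketch merely illustrates the dynamic-programming idea: for every subtree, try every applicable sample and memoize the cheapest choice.

    # A tree node is a variable name (leaf) or a tuple (op, left, right).
    tree = ('+', ('*', 'a', 'b'), 'c')        # hypothetical expression a * b + c

    # Each pattern: (name, cost in clock cycles, matcher). The matcher returns the list of
    # subtrees that still have to be covered, or None if the pattern is not applicable.
    def leaf(t):    return [] if isinstance(t, str) else None
    def add(t):     return [t[1], t[2]] if isinstance(t, tuple) and t[0] == '+' else None
    def mul(t):     return [t[1], t[2]] if isinstance(t, tuple) and t[0] == '*' else None
    def mul_add(t):                           # a fused multiply-add pattern covering two vertices at once
        if isinstance(t, tuple) and t[0] == '+' and isinstance(t[1], tuple) and t[1][0] == '*':
            return [t[1][1], t[1][2], t[2]]
        return None

    PATTERNS = [('LOAD', 1, leaf), ('ADD', 1, add), ('MUL', 3, mul), ('MADD', 3, mul_add)]

    def best_cover(t, memo=None):
        """Return (minimal cost, chosen pattern name) for the subtree t."""
        if memo is None:
            memo = {}
        if id(t) in memo:
            return memo[id(t)]
        best = None
        for name, cost, match in PATTERNS:
            args = match(t)
            if args is None:
                continue
            total = cost + sum(best_cover(a, memo)[0] for a in args)
            if best is None or total < best[0]:
                best = (total, name)
        memo[id(t)] = best
        return best

    print(best_cover(tree))   # (6, 'MADD'): cheaper than MUL (5 in total) followed by ADD

    In a real generator the chosen sample would be recorded for every vertex, and the code would then be emitted from these records, as described above.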

    Reduction of a grammar

    In order to convert an arbitrary context-free (CF) grammar to reduced form, you must perform the following steps:

    • remove all sterile (non-generating) symbols;
    • remove all unreachable symbols.
    Removing sterile symbols

    Input: CF grammar G = (T, N, P, S).

    Output: CF grammar G' = (T, N', P', S), not containing sterile symbols, for which L(G) = L(G').

    Method:

    We iteratively construct the sets N0, N1, ...

  1. N0 = ∅, i = 1.
  2. Ni = (A | (A → α) ∈ P and α ∈ (Ni−1 ∪ T)*) ∪ Ni−1.
  3. If Ni ≠ Ni−1, then i = i + 1 and go to step 2; otherwise N' = Ni; P' consists of those rules of P that contain only symbols from N' ∪ T; G' = (T, N', P', S).

    Definition: A symbol x ∈ (T ∪ N) is called unreachable in a grammar G = (T, N, P, S) if it does not appear in any sentential form of that grammar.

    Example

    Remove the sterile symbols from the grammar G = ((A, B, C, D, E, F, S), (a, b, c, d, e, f, g), P, S) with the productions P:

    • S → AcDe | CaDbCe | SaCa | aCb | dFg
    • A → SeAd | cSA
    • B → CaBd | aDBc | BSCf | bfg
    • C → Ebd | Seb | aAc | cfF
    • D → fCE | ac | dEdAS | ε
    • E → ESacD | aec | eFf

    Solution

    • N0 = ∅
    • N1 = (B (B → bfg), D (D → ac), E (E → aec))
    • N2 = (B, D, E, C (C → Ebd))
    • N3 = (B, D, E, C, S (S → aCb))
    • N4 = (B, D, E, C, S) = N3

    G"((B, C, D, E, S), (a, b, c, d, e, f, g), P", S)

    • S → CaDbCe | SaCa | aCb
    • B → CaBd | aDBc | BSCf | bfg
    • C → Ebd | Seb
    • D → fCE | ac | ε
    • E → ESacD | aec
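    The step-by-step construction of the sets Ni is easy to mechanize. Below is a minimal Python sketch (the representation of the grammar and the function name are ours): it computes the generating nonterminals of the example grammar and filters the productions, yielding exactly the grammar G' above.

    # Productions of the example grammar G; nonterminals are upper-case letters,
    # terminals are lower-case letters, '' denotes ε.
    P = {
        'S': ['AcDe', 'CaDbCe', 'SaCa', 'aCb', 'dFg'],
        'A': ['SeAd', 'cSA'],
        'B': ['CaBd', 'aDBc', 'BSCf', 'bfg'],
        'C': ['Ebd', 'Seb', 'aAc', 'cfF'],
        'D': ['fCE', 'ac', 'dEdAS', ''],
        'E': ['ESacD', 'aec', 'eFf'],
    }
    N = {'A', 'B', 'C', 'D', 'E', 'F', 'S'}

    def generating(P, N):
        """Iteratively build N0 ⊆ N1 ⊆ ... until the set stabilises."""
        Ni = set()
        while True:
            new = {A for A, alts in P.items()
                   if any(all(s not in N or s in Ni for s in alpha)   # α ∈ (Ni ∪ T)*
                          for alpha in alts)} | Ni
            if new == Ni:
                return Ni
            Ni = new

    Nprime = generating(P, N)                       # the set (B, C, D, E, S)
    Pprime = {A: [alpha for alpha in alts
                  if all(s not in N or s in Nprime for s in alpha)]
              for A, alts in P.items() if A in Nprime}
    print(Nprime)
    print(Pprime)                                   # the productions of G' listed above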
    Removing unreachable symbols

    Input: CF grammar G = (T, N, P, S).

    Output: CF grammar G' = (T', N', P', S), not containing unreachable symbols, for which L(G) = L(G').

    Method:

  1. V0 = (S), i = 1.
  2. Vi = (x | x ∈ (T ∪ N), (A → αxβ) ∈ P and A ∈ Vi−1) ∪ Vi−1.
  3. If Vi ≠ Vi−1, then i = i + 1 and go to step 2; otherwise N' = Vi ∩ N, T' = Vi ∩ T; P' consists of those rules of P that contain only symbols from Vi; G' = (T', N', P', S).

    Definition: A CF grammar G is called reduced if it contains neither unreachable nor sterile symbols.

    Example

    Remove the unreachable symbols from the grammar G' = ((B, C, D, E, S), (a, b, c, d, e, f, g), P', S) with the productions P':

    • S → CaDbCe | SaCa | aCb
    • B → CaBd | aDBc | BSCf | bfg
    • C → Ebd | Seb
    • D → fCE | ac | ε
    • E → ESacD | aec

    Solution

    • V0 = (S)
    • V1 = (S, C (S → CaDbCe), D (S → CaDbCe), a (S → CaDbCe), b (S → CaDbCe), e (S → CaDbCe))
    • V2 = (S, C, D, a, b, e, E (C → Ebd), d (C → Ebd), c (D → ac), f (D → fCE))
    • V3 = (S, C, D, E, a, b, c, d, e, f) = V2

    G""((C, D, E, S), (a, b, c, d, e, f), P"", S)

    • S → CaDbCe | SaCa | aCb
    • C → Ebd | Seb
    • D → fCE | ac | ε
    • E → ESacD | aec
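    A matching sketch for the second step (again, the grammar representation and the names are ours): starting from the axiom S of the grammar G' above, it collects the reachable symbols and keeps only the productions of reachable nonterminals, which gives the grammar G''.

    # Productions of the grammar G' obtained after removing the sterile symbols.
    P = {
        'S': ['CaDbCe', 'SaCa', 'aCb'],
        'B': ['CaBd', 'aDBc', 'BSCf', 'bfg'],
        'C': ['Ebd', 'Seb'],
        'D': ['fCE', 'ac', ''],
        'E': ['ESacD', 'aec'],
    }
    N = {'B', 'C', 'D', 'E', 'S'}
    T = set('abcdefg')

    def reachable(P, start):
        """Iteratively build V0 ⊆ V1 ⊆ ... starting from the axiom."""
        Vi = {start}
        while True:
            new = Vi | {s for A in P if A in Vi for alpha in P[A] for s in alpha}
            if new == Vi:
                return Vi
            Vi = new

    V = reachable(P, 'S')
    Nprime, Tprime = V & N, V & T                   # (C, D, E, S) and (a, b, c, d, e, f)
    # Every symbol occurring in a rule of a reachable nonterminal is itself reachable,
    # so it is enough to keep the rules whose left-hand side is reachable.
    Pprime = {A: alts for A, alts in P.items() if A in Nprime}
    print(Nprime, Tprime)
    print(Pprime)                                   # the productions of G'' listed above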