Chapter 3 - REGULAR EXPRESSIONS AND REGULAR LANGUAGES

Thus far we have described languages by an English sentence, by set-builder
notation, or by listing the elements. We need a concise notation for
describing languages. In the following we define a simple type of language
which can be described by such a notation - a regular expression.
(See page 72)

Def. Let E be an alphabet. The class R of regular languages over E, and the
corresponding regular expressions, are defined by:

1. phi ( the empty set ) is a regular language, and the corresponding
   regular expression is phi.
2. {lambda} is a regular language, and the corresponding regular expression
   is lambda.
3. For each a in E, {a} is a regular language, and a is the corresponding
   regular expression.
4. If L1 and L2 are regular languages and r1 and r2 are the corresponding
   regular expressions, then
   (a) L1 + L2 (set union) is a regular language with the regular
       expression (r1 + r2).
   (b) L1L2 is a regular language with the regular expression (r1r2).
   (c) L1* is a regular language with the regular expression (r1*).

i.e. a new reg exp can be formed by applying the * operator to a reg exp;
     a new reg exp can be formed by concatenating two reg exps or by
     writing + between them.

Note that 1 and 2 are trivial cases:
1. says that the empty set of strings is regular.
2. says that the set that contains only one string - the empty string -
   is regular.

Ex. ((a+b)* + (ab+bb)) is a reg exp
    ((a+) is not a reg exp

Ex  (a+b)*(ab) denotes all strings that end with ab

Ex  (a+b)^2 + (ab)* denotes the union of
    (1) all strings of length 2 with
    (2) all strings consisting of 0 or more copies of ab

Ex  a(a+b)*b + b(a+b)*a denotes all strings whose first and last
    symbols differ.
Ex  (ab + b)* (b + ab)^2 denotes the concatenation of
    (1) all strings made up of ab and b with
    (2) all strings formed by concatenating two strings, each either b or ab
    (1) = {lambda,b,bb,ab,abb,bab,bbb,abab,...}
    (2) = {bb,bab,abb,abab}

Precedence: Just as in arithmetic expressions, we have a hierarchy which
allows us to write expressions without fully parenthesizing.

Hierarchy :  arithmetic         regular expressions
             --------------------------------------
             parentheses        parentheses
             exponentiation     Kleene star *
             multiplication     concatenation
             addition           union +

Examples of the hierarchy for regular expressions:
    b+cb  = b+(cb)     { concatenation has higher priority than union }
    b+ab* = b+a(b*)    { the star has higher priority than concatenation }

EXAMPLE
    (0+1)*00(0+1)*
    This denotes all strings containing the substring 00.

    Reg. expr. for all strings without the substring 00:
    ((01 + 1)*) (0+lambda)

    Note that there is no clear relationship between the regular expression
    for a language and a regular expression for its complement.

EXAMPLES
    Pascal Identifiers
    Remember: Any string of letters and/or digits beginning with a letter.
        letter(letter + digit)*
        where letter = a+b+c...+z+A+B+C...+Z
              digit  = 0+1+2..+9

    Pascal Integer
        ('+' + '-' + lambda)(digit)(digit)*

    Pascal Real
        ('+' + '-' + lambda) digit digit* (lambda + '.' digit digit*)
        (lambda + 'E' (lambda + '+' + '-') digit digit*)

EXAMPLE
    (a + bc)* (c + lambda) = {lambda, a, c, aa, ac, bc, ...}
    The set of all strings made up of a or bc, with an optional c at the end.

EX: Let L be the language of the following regular expression.
        c* (b + (ac*))*
    Is cba in L?      yes
    Is ccbaac in L?   yes
    Is cbc in L?      no, we can't get a second c without an a.
    Is acabacc in L?  yes
    Is bbaaacc in L?  yes

EX: Given a finite language, find a regular expression.
        L = {a, bcc, acb}
        reg. expr: a + bcc + acb
    Clearly, there is a reg. expr. for any finite language: it is the union
    of all the elements of L. Thus, any finite language is regular.
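The membership questions for c* (b + (ac*))* above can be checked mechanically. The following is a minimal sketch in C, written by hand for this one expression only (it is not a general regular expression matcher); the function name in_language is an assumption made for illustration.

```c
#include <stddef.h>

/* Returns 1 if w is in the language of  c* (b + (ac*))*  and 0 otherwise.
   Hand-written for this single expression; in_language is a hypothetical name. */
int in_language(const char *w)
{
    size_t i = 0;

    while (w[i] == 'c')              /* the leading c* */
        i++;

    for (;;) {                       /* the outer star: repeat b or ac* */
        if (w[i] == 'b') {
            i++;                     /* one b */
        } else if (w[i] == 'a') {
            i++;                     /* one a ... */
            while (w[i] == 'c')      /* ... followed by c* */
                i++;
        } else {
            break;                   /* neither b nor a: stop */
        }
    }

    return w[i] == '\0';             /* accept only if every symbol was used */
}
```

Scanning greedily is safe here because a c can only belong to the leading c* or to the c* that follows an a. On cbc the scan stops at the second c with input left over, which matches the answer "no" given above.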
EX: E = {a,b}
    L = {w : w = (a^n)(b^n) , 0<=n<=3 }
    lambda + ab + aabb + aaabbb is a regular expression for L.
    ( lambda is needed because n = 0 is allowed. )
    This language is regular because it is finite.

EX: E = {a,b}
    L = {w : w = (a^n)(b^n) , 0<=n }
    a* b* does NOT work, because it includes (eg) aab
    (ab)* does NOT work, because it includes (eg) abab
    There is no regular expression for this language. ( proved later )

EX: L = the set of all palindromes over {a,b}
    for example: a   abba   bab   bb
    We cannot write a regular expression for this language. We could write
    a regular expression for a finite set of palindromes, but not for the
    set of all palindromes. Thus, the language of palindromes is not
    regular. ( proved later )

FINITE AUTOMATA

Example: Newspaper Vending Machine
    The control device for a newspaper vending machine keeps track of the
    amount of money ( nickels, dimes, quarters ) that has been entered, and
    allows the door to be opened if the amount is at least 25 cents.
    We can think of the amount entered as the state of the machine. When
    another coin is entered, we move to a new state according to the
    following transition table:

    Delta table for newspaper vending machine:
          |  n   d   q
        --+-----------
         0|  5  10  25
         5| 10  15  25
        10| 15  20  25
        15| 20  25  25
        20| 25  25  25
        25| 25  25  25

    e.g. If we are in state 10 and an n (nickel) is entered, we move to
    state 15.
    We accept the input string if we end up in state 25.
    For example, we accept the string dnd. We reject the string dnn.

Note: The symbols used to denote the states in a machine are arbitrary.
Hereafter, the states will always be labeled 0,1,2,...
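The vending machine table can be run directly as a table lookup. The following is a minimal sketch in C, assuming the six states are stored as indices 0..5 (cents divided by 5) and the coins are the characters n, d, q; the names vend_delta, coin_column, vend_final_state and vend_accepts are made up for illustration.

```c
#include <stddef.h>

/* Delta table from the notes. Row index = cents / 5, columns = n, d, q.
   Entries are row indices, so 5 stands for the state "25 cents". */
static const int vend_delta[6][3] = {
    /*  0 */ {1, 2, 5},
    /*  5 */ {2, 3, 5},
    /* 10 */ {3, 4, 5},
    /* 15 */ {4, 5, 5},
    /* 20 */ {5, 5, 5},
    /* 25 */ {5, 5, 5},
};

/* Maps a coin symbol to its column, or -1 if it is not in the alphabet. */
static int coin_column(char c)
{
    switch (c) {
    case 'n': return 0;
    case 'd': return 1;
    case 'q': return 2;
    default:  return -1;
    }
}

/* Returns the final state in cents (0..25), or -1 on an invalid symbol. */
int vend_final_state(const char *coins)
{
    int q = 0;                               /* start state: 0 cents */
    for (size_t i = 0; coins[i] != '\0'; i++) {
        int col = coin_column(coins[i]);
        if (col < 0)
            return -1;
        q = vend_delta[q][col];              /* one step of delta */
    }
    return q * 5;
}

/* A string is accepted if it ends in state 25. */
int vend_accepts(const char *coins)
{
    return vend_final_state(coins) == 25;
}
```

With this sketch, vend_accepts("dnd") is 1 and vend_accepts("dnn") is 0, in agreement with the two strings discussed above.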
An FA ( finite automaton ) is given by: (Q, E, q0, d, A) where
    Q  = finite set of states
    E  = finite input alphabet
    q0 = start state
    d  = transition function
    A  = set of final (accepting) states

    d : QxE -> Q   (Given a state and an input symbol, outputs a new state)

Extended Transition Function: d*
    d* : QxE* -> Q   (Given a state and a string, outputs a new state)

    Recursive Definition:
        d* (q, lambda) = q
        If w = va then   (a is the last character, v is everything before it)
        d* (q, w) = d* (q, va) = d(d*(q,v),a)

    Note that the definition implies that d*(q, a) = d(q, a)
    { To see this, let v = lambda }

Def - The language of an FA:
    Let M be an FA with start state q0. Then L(M) = {w : d*(q0,w) in A }
    ( With our labeling convention, q0 = 0. )

The extended transition function could be recursively defined in C as
follows. We assume that we have two easy string functions:
    lastsymbol(string)        // returns the last symbol in a string
    allbutlastsymbol(string)  // returns the string with the last symbol
                              // stripped off - string can't be empty

C Language Version of deltastar

    // deltastar takes a state and a string as input and returns a state.
    // The types state and string, and the constant lambda (the empty
    // string), are assumed to be defined elsewhere.
    state deltastar( state q, string v)
    {
        if ( v == lambda) return q;      // empty string: we stay in q
        char   a = lastsymbol(v);        // v = ua, a is the last symbol
        string u = allbutlastsymbol(v);
        state  r = deltastar(q,u);       // r is the state u takes us to
        return( delta(r,a));             // the usual delta fun from QxE -> Q
    }

EXAMPLE
      |a b      { The states are 0 and 1. The input symbols are a and b. }
    --+---      A = { 1 }.
    0 |0 1
    1 |0 1

    The computation below is indicated by a sequence of configurations of
    the form [ current state, remaining string ].
        [0,aba] -> [0,ba] -> [1,a] -> [0, lambda]
    aba is not accepted because the input is exhausted, and we are not in a
    final state.
    L(M) = {w in E* | w ends in b}

EXAMPLE
      |a b      A = { 3 }
    --+---
    0 |2 1
    1 |1 1      Note that state 1 is a trap state.
    2 |3 2
    3 |3 2

    L(M) = the set of all strings of length at least 2 which start and end
    with a.
    This can be described concisely by the regular expression a ( a + b )* a

    Computation: [0,aaba] -> [2, aba] -> [3,ba] -> [2,a] -> [3,lambda]
    accepted
    Sequence of states: 0, 2, 3, 2, 3   This is called a path.

If no transition is specified for some [state, symbol] pair, then the FA is
incompletely specified, and a string is rejected if it reaches a state where
there is no transition for the next symbol.

Note: The number of states in a path is |w| + 1. Thus, if the length of the
string is at least the number of states, a state must be repeated,
i.e. the path must have a loop.

EXAMPLE
      |a b      A = { 2 }
    --+---
    0 |0 1
    1 |0 2
    2 |2 2

    L(M) = the set of all strings which contain the substring bb.
    This can be described concisely by the regular expression
        ( a + b )* bb( a + b )*

    Note that we can construct an FA for the complement language by simply
    interchanging accepting and non-accepting states. Thus, if we change A
    to A = {0,1}, the language is the set of all strings which do not
    contain bb. This is true in general, provided the FA is completely
    specified ( an incompletely specified FA must first be given an explicit
    trap state ). In this way we can always transform an FA for a language L
    into an FA for L'.

Finite state machines with output ( This topic is not in the text. )
    1 Mealy Machines: transition output (output during transition)
    2 Moore Machines: state output (output when you get to a state)

EXAMPLE: Adder
            | (0,0)  (0,1)  (1,0)  (1,1)
    --------+-------------------------------
    No Carry| NC/0   NC/1   NC/1   C/0
       Carry| NC/1   C/0    C/0    C/1

    i.e. If we are in state NC and the input is (0,0), we stay in NC and
    output 0.
    The string consists of pairs and is processed from least to most
    significant digit pair. After the last pair, if we are in the carry
    state, we append a 1 to the output string.

EXAMPLE: FA for L={w in {0,1}* : 001 is a substring of w}
          |  0   1      A = { 3 }
    ------+---------
      0   |  1   0
      1   |  2   0
      2   |  2   3
      3   |  3   3

    For w = 10100  path = 0, 0, 1, 0, 1, 2  rejected
    For w = 10010  path = 0, 0, 1, 2, 3, 3  accepted

    If we replace A by A', then the new machine accepts the complement.
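The two paths listed for the 001 machine can be reproduced by stepping through the table. The following is a minimal sketch in C, assuming the input contains only the characters 0 and 1; the names delta001, run001, accepts001 and accepts001_complement are made up for illustration.

```c
#include <stddef.h>

/* Transition table of the FA for "001 is a substring".
   Rows are states 0..3, columns are the input symbols 0 and 1. */
static const int delta001[4][2] = {
    {1, 0},
    {2, 0},
    {2, 3},
    {3, 3},
};

/* Runs the FA on w and returns the final state, or -1 on an invalid symbol.
   If path is not NULL, the |w| + 1 visited states are written to
   path[0 .. |w|], so the caller must supply at least |w| + 1 entries. */
int run001(const char *w, int *path)
{
    int q = 0;                                /* start state */
    if (path != NULL)
        path[0] = q;
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != '0' && w[i] != '1')
            return -1;
        q = delta001[q][w[i] - '0'];          /* one step of delta */
        if (path != NULL)
            path[i + 1] = q;
    }
    return q;
}

/* Original machine: A = { 3 }. */
int accepts001(const char *w)
{
    return run001(w, NULL) == 3;
}

/* Complement machine: same table, A replaced by A' = { 0, 1, 2 }. */
int accepts001_complement(const char *w)
{
    int q = run001(w, NULL);
    return q >= 0 && q != 3;
}
```

Calling run001("10100", path) fills path with 0, 0, 1, 0, 1, 2 and returns state 2, so the string is rejected by the original machine and accepted by the complement machine. Only the test on the final state differs between the two; the table is shared.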
EXAMPLE: L={ab,aabb,aaabbb}={a^i b^i : 1<=i<=3}
     | a b
    -+------      A = { 6 }
    0| 1 ?
    1| 2 6        ( ? marks a missing transition; a string that needs it
    2| 3 5          is rejected )
    3| ? 4
    4| ? 5
    5| ? 6
    6| ? ?

    What about {a^i b^i : 1<=i} ? ( There is no limit on i )
    It can't be recognized by an FA. (Not regular)

EXAMPLE: Pascal Reals
    (The usual definition is that a real constant must have a . or an E or
    both. Integers are not accepted.)

     | dig  sig   .    E     where dig=any digit,
    -+--------------------         sig=arithmetic sign (+,-)
    0|  2    1
    1|  2
    2|  2         3    5
    3|  4                    A = { 4, 7 }
    4|  4              5
    5|  7    6
    6|  7
    7|  7

    Rejected    Accepted
    7.          3.4
    E42         +2.6
    +           -7.3
    ++6         7.4E7
    +-7         +7.26E-3
                -8.3E+12

Note : Float constants in C are of one of the following forms:
    ( Here ip is integer part, fp is fractional part, ep is exponent part )
    ip.fp ep    e.g. 6.24E+9
    ip.fp       e.g. 3.14
    ip.         e.g. 7.
    .fp         e.g. .32
    .fp ep      e.g. .16E-5
    ip ep       e.g. 4E12

Distinguishing Strings With Respect to a Language

Example: If we look back at the FA which accepts strings which have 001 as
a substring, we see that the 4 states can be thought of as
    (0) no progress so far
    (1) one 0 has been encountered
    (2) 00 has been encountered
    (3) 001 has been encountered ( accept )

As we scan a string, if 00 is a prefix, we have progressed farther than if
10 is a prefix. We will say that these two strings are distinguishable
because, if the rest of the string is 1, we will accept in one case ( 001 )
and reject in the other ( 101 ). Our FA must have enough states so that
distinguishable strings lead us to different states. In this case, we say
that the string consisting of 1 distinguishes 00 from 10.

We can use this idea to classify strings with respect to a language:

Definition 3.5: (Re-phrased) Two strings x and y are indistinguishable with
respect to L if appending any string z to x and to y produces a pair of
strings that are either both in L or both not in L. If there is any string z
such that exactly one of xz and yz is in L, then x and y are distinguishable
with respect to L.
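Whether one particular z distinguishes x from y can be tested directly from the definition. The following is a minimal sketch in C for L = { w : 001 is a substring of w }, assuming each concatenated string fits in 255 characters; the names in_L and distinguishes are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Membership test for L = { w in {0,1}* : 001 is a substring of w }. */
static int in_L(const char *w)
{
    return strstr(w, "001") != NULL;
}

/* Returns 1 if exactly one of xz and yz is in L, i.e. z separates x from y.
   Assumption: strlen(x) + strlen(z) and strlen(y) + strlen(z) are below 256;
   longer inputs would be truncated by snprintf. */
int distinguishes(const char *x, const char *y, const char *z)
{
    char xz[256];
    char yz[256];

    snprintf(xz, sizeof xz, "%s%s", x, z);   /* build xz */
    snprintf(yz, sizeof yz, "%s%s", y, z);   /* build yz */
    return in_L(xz) != in_L(yz);
}
```

For example, distinguishes("00", "10", "1") returns 1, since 001 is in L and 101 is not. A result of 0 only says that this one z fails to separate x and y; showing that two strings are indistinguishable requires an argument covering every z, which no finite number of calls can supply.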
Given two strings u and v, if uz is in L and vz is not in L, or if uz is not
in L and vz is in L, we say that z is a distinguishing string for u,v. Thus,
strings u and v are indistinguishable with respect to L if there is no
distinguishing string.

Note that the definition implies that, if x is in L and y is not in L, then
x and y are distinguishable. ( Let z = lambda )

Also note that all strings which are not prefixes of strings in L are
indistinguishable, i.e. if x and y are not prefixes of strings in L, then xz
and yz are both not in L for any z. These are the strings that lead us to a
trap state in an FA.

Referring to the example above, the following are indistinguishable with
respect to the language:
    (1) lambda, 1, 11, 101
Also the following are all indistinguishable:
    (2) 00, 100, 0100, 110100
Note that the string 1 distinguishes any string in (1) from any string in
(2). So does any string that starts with 1 and does not itself contain 001.
( A string such as 1001 does not work: appending it puts both strings
in L. )

Example: Let L = { a^n b^n : n >= 0 }
    Then a^i is distinguishable from a^j if i differs from j.
    ( Take z = b^i: a^i b^i is in L but a^j b^i is not. )

The next theorem connects the number of states in an FA to the number of
mutually distinguishable strings for the corresponding language.

Theorem 3.2: If L has n mutually distinguishable strings, then a finite
automaton recognizing L must have at least n states.

The idea of the proof is that distinguishable strings must lead us to
distinct states. If x and y lead us to the same state, then, clearly, xz and
yz will lead us to the same state for every z, so no z can distinguish them.

It follows immediately that a language that has infinitely many mutually
distinguishable strings can't be recognized by a finite state machine.

In Theorem 3.4 we assume the as yet unproved assertion that regular
languages are exactly those languages that can be recognized by finite
automata. Theorem 3.4 could be re-stated without this assumption as follows:

Theorem 3.4 : Let M1 and M2 be FA whose languages are L1 and L2.
Then we can construct machines M3, M4, M5, and M6 such that:
    L(M3) = L1 + L2    ( union )
    L(M4) = L1 * L2    ( intersection )
    L(M5) = L1 - L2    ( difference )
    L(M6) = L1'        ( complement )

To construct M6, we simply change A1 to Q1 - A1.
To construct M3, M4, and M5, we use Q1XQ2 as our state set, define
transitions from state pair to state pair componentwise, i.e.
d((p,q),a) = (d1(p,a), d2(q,a)), and define the set of accepting states
to be
    {(p,q) : p in A1 or  q in A2 }      for M3,
    {(p,q) : p in A1 and q in A2 }      for M4,
    {(p,q) : p in A1 and q not in A2 }  for M5.

Example:
    M1:   | a   b      A1 = { 0 }
       ---+--------
        0 | 1   0
        1 | 0   1

    M2:   | a   b      A2 = { 2 }
       ---+--------
        0 | 0   1
        1 | 0   2
        2 | 2   2

    Let M have states labeled by Q1XQ2:
            |  a     b
       -----+------------
        0,0 | 1,0   0,1
        0,1 | 1,0   0,2
        0,2 | 1,2   0,2
        1,0 | 0,0   1,1
        1,1 | 0,0   1,2
        1,2 | 0,2   1,2

    For L1 + L2, A = {(0,0),(0,1),(0,2),(1,2)}
    For L1 * L2, A = {(0,2)}
    For L1 - L2, A = {(0,0), (0,1)}
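The pair construction does not require writing out the table for Q1XQ2: it is enough to step both machines in parallel and combine their accepting conditions at the end. The following is a minimal sketch in C for the M1 and M2 of the example, assuming input over {a,b}; the names d1, d2, product_op and product_accepts are made up for illustration.

```c
#include <stddef.h>

/* M1: rows are states 0..1, columns are a, b.  A1 = { 0 }. */
static const int d1[2][2] = { {1, 0}, {0, 1} };

/* M2: rows are states 0..2, columns are a, b.  A2 = { 2 }. */
static const int d2[3][2] = { {0, 1}, {0, 2}, {2, 2} };

enum product_op { OP_UNION, OP_INTERSECTION, OP_DIFFERENCE };

/* Runs the pair machine on w, starting in state (0,0), and applies the
   accepting condition for the chosen operation.
   Returns 1 for accept, 0 for reject or for a symbol other than a, b. */
int product_accepts(const char *w, enum product_op op)
{
    int p = 0;                          /* current state of M1 */
    int q = 0;                          /* current state of M2 */

    for (size_t i = 0; w[i] != '\0'; i++) {
        int col;
        if (w[i] == 'a')
            col = 0;
        else if (w[i] == 'b')
            col = 1;
        else
            return 0;
        p = d1[p][col];                 /* (p,q) -> (d1(p,x), d2(q,x)) */
        q = d2[q][col];
    }

    int in1 = (p == 0);                 /* p in A1 */
    int in2 = (q == 2);                 /* q in A2 */

    switch (op) {
    case OP_UNION:        return in1 || in2;
    case OP_INTERSECTION: return in1 && in2;
    case OP_DIFFERENCE:   return in1 && !in2;
    }
    return 0;
}
```

Here M1 accepts the strings with an even number of a's and M2 accepts the strings containing bb. The string aa ends in the pair (0,0), so it is accepted for L1 + L2 and L1 - L2 but not for L1 * L2; the string bb ends in (0,2) and is accepted for L1 + L2 and L1 * L2 only.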