Abstract: We explain what the 500 language problem is, why it is a relevant problem, and why solutions are needed. We propose a solution: rapid development of renovation parsers by stealing grammars. We illustrate this by applying the approach to two non-trivial but representative languages: a proprietary real-time language from the telecommunications industry, and a well-known dialect of the most popular language in the world, IBM's VS Cobol II. We share the lessons we learned in our efforts to solve the 500 language problem.

Introduction

Capers Jones estimates that there are at least 500 languages and dialects available in commercial form or in the public domain. On top of that, he estimates that some 200 proprietary languages have been developed by corporations for their own use [1, p. 321]. In his book on estimating the costs of the Year 2000 problem [2] he furthermore indicated that systems written in all those 500 plus 200 languages were affected. The findings of Jones inspired many Y2K whistle-blowers to cite his estimates as a major impediment to solving the Year 2000 problem. Let us have a look at what these people had to say. For instance, Ed Yourdon replied with a boilerplate email whenever you sent him mail containing the words Y2K and solution. He mentions the 500 language problem, and it is worthwhile to quote this part in its entirety:

  I recognize that there is always a chance that someone will come up with a brilliant solution that everyone else has overlooked, but at this late date, I think it's highly unlikely. In particular, I think the chances of a "silver bullet" solution that will solve ALL y2k problems is virtually zero. If you think you have such a solution, I have two words for you: embedded systems. If that's not enough, I have three words for you: 500 programming languages. The immense variety of programming languages (yes, there really are 500!), hardware platforms, operating systems, and environmental conditions virtually eliminates any chance of a single tool, method, or technique being universally applicable.

The number 500 should be taken poetically, like the 1000 in the preserving process for so-called thousand-year-old eggs, which lasts only 100 days. For a start, the 200 proprietary languages should be added; moreover, other estimates indicate that 700 is rather conservative. Weinberg estimated already in 1971 that in 1972 programming languages would be invented at the rate of one per week--or more, if we consider the ones which never make it to the literature, and enormously more if we consider dialects, too [3, p. 242]. Also Peter de Jager created awareness for the 500 language problem. He writes about the availability of Y2K tools [4]:

  There are close to 500 programming languages used to develop applications. Most of these conversion or inventory tools are directed toward a very small subset of those 500 languages. A majority of the tools are focused on Cobol, the most popular business programming language in the world. Very few tools, if any, have been designed to help in the area of APL or JOVIAL, for example.

If everyone were using Cobol, and only a few systems were written in uncommon languages, the 500 language problem would be of limited importance. Therefore, it is useful to know what the actual language distribution of installed software is. First, there are about 300 Cobol dialects: each compiler product has a few versions, each with many patch levels, and Cobol often contains embedded languages like DMS, DML, CICS and SQL.
So there is no such thing as the Cobol language. Cobol is a polyglot: a confusing mixture of dialects and embedded languages--a 500 language problem of its own. Second, Yourdon and de Jager were right about the importance of the 500 language problem: 40% of all software is written in less common languages. To be precise, the distribution of the world's installed software by language, according to Capers Jones, is as follows:

  Cobol: 30% (225 billion LOC)
  C/C++: 20% (180 billion LOC)
  Assembler: 10% (140-220 billion LOC)
  less common languages: 40% (280 billion LOC)

In contrast, Y2K search engines existed for about 50 languages, and automated repair engines for about 10 [2, p. 325]. So automated modification support exists for only a very small fraction of the languages. This lack caused people to worry about the 500 language problem.

What is the 500 Language Problem?

What is it, and is it still relevant? We entered the new millennium without too much trouble, so one could conclude that maybe there was this 500 language problem, but whatever it was, it is not relevant anymore. Of course the 500 language problem already existed before it was popularized by the Y2K gurus, and it did not go away when we entered the new millennium. Capers Jones identified and named the problem. But what is a good description of this problem? Here is a succinct formulation:

  The 500 language problem is the most prominent impediment to constructing tools for analyzing and modifying existing software written in those languages.

Removing this impediment solves the 500 language problem. Let us illustrate what this impediment comprises. If you want tools that accurately probe and manipulate source code, a prerequisite is that the code be converted from text format into a tree format. To make this conversion you need a so-called syntactic analyzer, or parser. Constructing a parser for analysis or modification is a major effort, and in many cases the up-front investment hampers initiatives by commercial tool builders, which explains the lack of tools. Indeed, Tom McCabe told us that McCabe & Associates developed parsers for 23 languages, which was already a huge investment. But 500 would be insurmountable; therefore, he dubbed the 500 language problem "the number one problem in software renovation".

A sometimes heard solution for the 500 language problem is to just convert from uncommon languages to mainstream ones for which tool support is available. Eliminating all these languages would make the 500 language problem go away. This is not a solution, for you need a full-blown tool suite to make the conversion, including a serious parser. And obtaining a parser is part of the 500 language problem. So language conversion will not eliminate the 500 language problem; on the contrary, you need a solution for the 500 language problem to aid in solving conversion problems.

A second suggestion to solve the 500 language problem is reported in Usenet discussions, where a researcher proposed to generate grammars from the source code only, in the same way linguists try to generate a grammar from a piece of natural language. In search of solutions, we studied this idea and consulted the relevant literature. We did not find any successful effort where the linguistic approach helped to create a grammar for a parser in a cost-effective way.
Our conclusion is that the linguistic approach does not lead to useful grammar inferences from which you can build parsers [5].

Another suggestion to solve the 500 language problem is to reuse the parser from compilers. This solution already works a bit better: you just tap the parser output from a compiler and feed it to a renovation tool. In fact, this is what Prem Devanbu is doing with his GENOA/GENII system [6]. He developed a programmable tool that can turn a certain idiosyncratic output format of a parser into another format that is more suitable for code analysis. There is, however, one major drawback to this approach: as Devanbu points out in his paper [6], the GENOA system does not allow for modifying code. This is not a surprise, since a compiler removes comments, expands macros, includes files, minimizes syntax, and thus irreversibly deforms the original source code. The intermediate format is good enough for analysis in some cases, but the code can never be turned into acceptable text format again. Another very real limitation is that for many compilers you do not get access to their sources (for economic reasons). In renovation projects, such as Y2K, Euro, code restructuring, language conversion, and so on, it is a requirement that you can automatically modify code: the code volume prohibits effective and efficient renovation by hand.

Summarizing, the availability of grammars is very sparse, and solutions to obtain them are far from optimal. The 500 language problem is a real problem, it was a real problem before the Y2K gurus baptized it, and it did not go away after the millennium passed. Its solution is a first step toward enabling tool support for analyzing and modifying our 800-900 billion LOC of existing software assets written in numerous languages.

How We Are Cracking the 500 Language Problem

Ed Yourdon claimed that the large number of programming languages would virtually eliminate any chance of a single tool, method, or technique being universally applicable. It turns out that the 500 language problem does have a single solution. And it is not too hard either. This is what we mean with the word solution:

  The 500 language problem is cracked when there is a cheap, rapid and reliable method to produce grammars for the myriad of languages, so that analysis and modification of existing code is enabled.

We explain what this all means. Cheap is in the 25.000 ± 5.000 US dollar range, rapid is in the 2 weeks range (one person), and reliable means that the parser based on the produced grammar passes the test of parsing millions of lines of code provided by the customer in need of tool support. Why is this a solution? After all, a grammar is hardly a Euro conversion tool or a Y2K analyzer. Next we explain that the most dominating factor in constructing renovation tools is constructing the underlying parser.

From Grammar to Renovation Tool

Renovation tools routinely comprise the following main components: preprocessors, parsers, analyzers, transformers, visualizers, pretty printers, and postprocessors. In many cases, language-parametrized (or generic) tools are available to construct these components. Think of parser generators, pretty printer generators, graph visualization packages, rewrite engines, generic data flow analyzers, and the like. Workbenches providing this functionality are, for instance, Elegant, Refine, and ASF+SDF, but there are many more. This is the generic core of all renovation tools. Figure 1 depicts how you go from a grammar to actual renovation tools.
Figure 1: Effort shift for renovation tool development.

We expressed effort by the length of arrows (longer arrows imply more effort). As you can see, if you have a generic core and a grammar, it does not take too much effort to construct parsers, tree walkers, pretty printers, and so on. Although these components depend on a particular language, their implementation uses generic language technology: a parser is generated using a parser generator, a pretty printer is generated using a formatter generator [7], and likewise, tree walkers for analysis or modification are generated similarly [8]. What all these generators share is that they heavily rely on the grammar. Once you have the grammar and the relevant generators you can rapidly set up this core for developing software renovation tools. You could call this the grammar-centric approach. Leading Y2K companies indeed constructed generic Y2K analyzers so that dealing with a new language ideally reduced to constructing a parser. The bottleneck is obtaining complete and correct grammar specifications. The dashed part in Figure 1 expresses the current situation: it takes a lot of effort to create those grammars. In Table 1 we quantify the effort for a typical Cobol renovation project using the solution we propose. We discuss the project shortly, but first notice that the grammar part took two weeks. Implementing a quality Cobol parser can take 2 to 3 years, as Vadim Maslov of Siber Systems posted on Usenet (he constructed Cobol parsers for about 16 dialects). Also, adapting an existing Cobol parser to cope with new dialects easily takes 3 to 5 months, as we learned from several estimates done by others. Moreover, patching existing grammars using mainstream parser technology leads to unmaintainable grammars [9,10], significantly increasing the time it takes to adapt parsers. Using our approach this effort is reduced significantly (in this example to 2 weeks of effort), so that you can much more quickly start developing actual renovation tools.

To illustrate how to go from a grammar to an actual renovation task, we briefly describe a Cobol renovation project [11] in which others applied our grammar-centric approach. This project concerned one of the largest financial enterprises in the world. They needed an automatic converter from Cobol 85 back to Cobol 74 (the 8574 project). The Cobol 85 code was machine-generated by a 4GL tool (KEY), so the conversion problem was fortunately limited, due to the limited vocabulary of the code generator. It took some time to find solutions for intricate problems such as how to simulate Cobol 85 features like explicit scope terminators (END-IF, END-ADD), or how to express the INITIALIZE statement in the less rich Cobol 74 dialect. The solutions were discussed with the customers and tested for equivalence. Once these problems were solved, it was not much work to implement the components, due to the generic core assets generated from the recovered Cobol 85 grammar. The problem could be cut into 6 separate tools, taking 5 days to implement. The hand-written programming was limited (less than 500 LOC), but it compiled into about 100.000 lines of C code and 5.000 lines of makefile code (linking in all the generated generic renovation functionality). After compilation to 6 executables (2.6 Mb each), it took 25 lines of code to coordinate them into a distributed component-based software renovation factory that converts Cobol 85 code back to Cobol 74 at a rate of 500.000 LOC/hour, using 11 Sun workstations.

Table 1: Effort for the 8574 project.
  8574 project   effort
  -------------  -------
  grammar        2 weeks
  generation     1 day
  6 tools        5 days
  assemblage     1 hour
  total          3 weeks

Measuring this and other projects, it became clear to us that the total effort of writing a grammar by hand is orders of magnitude larger than constructing the renovation tools themselves. So the most dominating factor in producing renovation tools is constructing the parser. Building parsers using our approach reduces the effort to the same order of magnitude as constructing the renovation tools. Building parsers in turn is not hard: use a parser generator. But the input of a parser generator is a grammar description. So the most important artifacts that we need to enable tool support for software renovation are complete and correct grammars. When we find an effective solution for producing grammars quickly for many languages, we solve the largest impediment to constructing tools for those languages, and we thus solve the 500 language problem.

But how to produce grammars quickly? Recapturing the syntax of an existing language is usually done by hand: take a huge amount of sources, manuals, books, and a parser generator, and start working. We, and many others, have worked like this for years. But then we realized that this hand-work is not necessary. Since we are dealing with existing languages, the grammars have already been constructed. This is what we discovered about grammars: do not create them, steal them, and then massage them to your needs.

Grammar Stealing Covers Almost All Languages

We commence with an important argument showing that our approach covers virtually all languages: we found only two actually problematic cases. We discuss the coverage diagram depicted in Figure 2.

Figure 2: Coverage diagram for grammar stealing.

Recall that we need to produce grammars for existing software, e.g., legacy systems. So the deployed software is compilable (or can be interpreted). After passing the start here box, we enter the compiler sources diamond. There are two possibilities: either the source code of the compiler is available to you, or it is not. First we discuss the yes path. Then the only thing you have to do is find the part that turns the text into an intermediate form; that part contains the grammar. You can find it by grepping the compiler sources for keywords of the language (a minimal sketch of such a scan is shown below).

There are three possibilities: either the grammar part is hard-coded, or a parser generator is used, or both (in a complex multi-language compiler, for instance). We only need to cover the first two cases (they are present in the diagram). In the hard-coded case, you have to reverse engineer the actual grammar from the hand-written code. Fortunately, the comments of such code often provide BNF rules, giving you an indication of what the grammar comprises. Moreover, compiler construction is a well-understood subject: there is even a known reference architecture. Therefore, compilers are often implemented with well-known implementation algorithms. So usually the quality of a hard-coded parser is good, e.g., a recursive descent algorithm is used. In such cases you can easily recover the grammar from the code, the comments, or both. We know of one case where the grammar is not easily extractable: the language Perl [12]. In all the other cases we encountered, the quality of the code was always sufficient to recover the grammar.

If the parser is not hard-coded, it is generated (the BNF branch in Figure 2). But then there must be some BNF description of it in the compiler sources. With a simple tool that parses the BNF itself you can extract the BNF.
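To give a concrete impression of how little machinery the initial scan requires, here is a minimal sketch in Python (not the tooling used in the projects described in this article). It flags files in a compiler source tree that look like they contain grammar rules, either as parser generator input or as BNF in comments; the keyword list, the rule pattern, and the thresholds are illustrative assumptions, not part of the original work:

    import re
    import sys
    from pathlib import Path

    # Keywords of the subject language that we expect to see inside grammar
    # rules (illustrative; for a real compiler you would list its actual
    # keywords, e.g. taken from a few sample programs).
    KEYWORDS = {"PROGRAM", "END", "IF", "THEN", "ELSE"}

    # A rough pattern for BNF-like rules: "name = ...", "name ::= ...", "name : ...".
    RULE = re.compile(r"^\s*[-<>\w]+\s*(::=|=|:)\s+\S", re.MULTILINE)

    def looks_like_grammar(text: str) -> bool:
        """Heuristic: several rule-shaped lines plus some language keywords."""
        rules = RULE.findall(text)
        upper_words = set(re.findall(r"[A-Z][A-Z-]+", text))
        return len(rules) >= 10 and len(KEYWORDS & upper_words) >= 2

    def scan(compiler_root: str) -> None:
        """Print the files in a compiler source tree that look grammar-related."""
        for path in Path(compiler_root).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            if looks_like_grammar(text):
                print(path)

    if __name__ == "__main__":
        # e.g. python scan_grammar_files.py /path/to/compiler/sources
        scan(sys.argv[1] if len(sys.argv) > 1 else ".")

A hit list like this is usually enough to locate the handful of files from which the grammar can then be extracted.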
So in all cases where we have the compiler sources we can recover the grammar, except for Perl. This finishes the case where you have access to the source code of a compiler. Later on we discuss a published recovery case from compiler sources, to give you an idea of grammar stealing when the compiler sources are available to you.

Now we look at the case where there is no access to the compiler sources (we enter the language reference manual diamond in Figure 2). In that case there are two possibilities: there is a language reference manual or there is not. Let us first discuss the case where a language reference manual is available. This can be a compiler vendor manual or an official language standard. There are three possibilities: either the language is explained by example, or using general rules, or both. We only need to treat the first two cases. Let us first assume that there are general rules. Then there is the quality issue. Reference manuals and language standards are known to be full of errors. To our surprise, we discovered that the myriad of errors is of a repairable category. We were surprised since we had experienced a total failure to recover a grammar from a manual in 1998, for a proprietary language for which--obviously--the compiler sources were also available (so this case is covered by the upper half of Figure 2). As you can see in the coverage diagram, we have not found low-quality language reference manuals containing general rules for cases where we did not have access to compiler sources. This is explained as follows: compiler vendors do not give away the source code of a compiler for economic reasons, but in order to be successful as a company, accurate documentation explaining the entire language is necessary. We discovered that the quality level of those manuals is good enough to recover the grammar. Later on we discuss a published recovery case from a language reference manual, to give you an idea of grammar stealing when no compiler sources are available to you.

For an uncommon language it is much rarer to have a high-quality manual: either there is none (e.g., if the language is proprietary), or the company has only a few customers. In the proprietary case you have the compiler sources, so we have coverage by the upper half of the diagram. In the other case you can buy the sources, since their business value is not too high. For instance, when Wang went bankrupt, their important clients bought the sources of the Wang operating system and the Wang compilers to create their own platform and dialect migration tools. This explains why we do not know of low-quality manuals containing general rules. We know of one case where the language is explained by code examples and where general rules are absent. This is RPG. We think that it is possible to extract the general rules of RPG from the code examples in a systematic manner. We plan to examine this case in more detail in a future renovation project involving a large amount of RPG code.

Finally, we have to deal with the case without access to the compiler and without access to a language reference manual. We have not (yet) seen such cases. Capers Jones mailed us: "For a significant number of applications with Y2K problems, the compilers may no longer be available either because the companies that wrote them have gone out of business or for other reasons." But he did not come up with actual examples. We just mentioned the Wang case, where you could buy the sources, and hence solve the problems using the upper half of Figure 2.
But we do not exclude that occasionally such a thing can happen. In any case, a lesson that can be learned from this is: contracts between you and a business-critical language vendor should include a solution for source access in case of bankruptcy or terminated support (we have seen examples where the sealed sources were given to key customers).

Summarizing, our coverage diagram shows that virtually all languages are in a class for which you can recover the grammar, as we will see later on.

But What About Semantics?

Some people think that you need up-front, in-depth knowledge of the semantics of a programming language in order to change code. When the BNF is recovered, you can generate a syntax analyzer that produces trees, but the trees are not decorated with extra knowledge, such as control flow, data flow, type annotation, name resolution, and so on. Some people also think that you need a lot of semantic knowledge to analyze and modify existing software. We experienced that this is not true. There are three levels on which you can try to capture semantic knowledge of a language:

  - for all compilers of a language (think of different dialects);
  - for one compiler product only;
  - on a project-by-project basis.

Recall that the 500 language problem is about removing the impediment to building tools that work on existing software. So we are not trying to build a compiler, in which all semantic knowledge has to be implemented. Consider the following Cobol excerpt:

    01 A PIC X(5) JUSTIFIED RIGHT VALUE 'IEEE'.
    DISPLAY A.

The OS/VS Cobol compiler prints the expected result, namely ' IEEE' with a leading space, which is right justified. The Cobol/370 compiler displays the output 'IEEE ' with a trailing space, which is left justified. There are many more such cases, so trying to deal with the semantics of all compilers up-front is not feasible. Even when you restrict yourself to one compiler, this problem does not go away. Consider the Cobol fragment below:

    01 A PIC 9999999.
    MOVE ALL '123' TO A.
    DISPLAY A.

Depending on the compiler flags, this code displays the number 3123123 or the entirely different number 1231231. There are hundreds of such problems, so even for a single compiler it is infeasible to capture the semantics up-front. So there is no single semantics available, and gathering all variants is prohibitively expensive and error-prone, given the semantic differences between compilers, compiler versions, and even the compiler flags used.

The good news is that you can deal with semantics on a per-project basis. It is our experience that you only need specific, ad hoc elements of the semantics. We call this demand-driven semantics. Let us explain. For instance, NEXT SENTENCE in Cobol directs control to the statement after the next separation period (denoted with a dot). So, depending on where people put an optional dot, the code jumps directly behind the dot. Omitting a dot can lead to different behavior. One of our customers wanted tools to get rid of this implicit jump instruction. Indeed, you have to do an in-depth analysis of what disasters can possibly happen. Luckily, it turned out that for this project the implicit jump instruction could be replaced by the innocent no-op CONTINUE. So, after our semantic investigation, we knew we could safely ignore the hazardous jumping behavior and use a relatively dumb tool to make this change. In another project this tool might break down, depending on the source code or the semantics of the compiler.
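To illustrate what such a relatively dumb tool can look like, here is a minimal sketch in Python (the projects described in this article were built with generic language technology, not with this script). It blindly substitutes CONTINUE for NEXT SENTENCE and assumes that the per-project semantic analysis described above has already established that this is safe for the system at hand; the fixed-format comment handling is a simplifying assumption:

    import re
    import sys

    # Replace the implicit jump NEXT SENTENCE by the no-op CONTINUE.
    # Deliberately "dumb": it does not parse the code, and it assumes a prior,
    # per-project semantic analysis has already shown the substitution to be
    # safe for the system at hand (as in the project described above).
    NEXT_SENTENCE = re.compile(r"\bNEXT\s+SENTENCE\b", re.IGNORECASE)

    def rewrite(line: str) -> str:
        # In fixed-format Cobol, column 7 is the indicator area; '*' or '/'
        # marks a comment line, which we leave untouched (a simplification).
        if len(line) > 6 and line[6] in "*/":
            return line
        return NEXT_SENTENCE.sub("CONTINUE", line)

    if __name__ == "__main__":
        # usage: python next_sentence_to_continue.py < in.cob > out.cob
        for line in sys.stdin:
            sys.stdout.write(rewrite(line.rstrip("\n")) + "\n")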
To give you an idea of how far you can get with per-project, demand-driven semantics: for some Cobol systems we developed relatively dumb tools able to wipe out very complex GO TO logic [13]. Overall, it is our experience that for many and diverse tasks you do need to know the semantics, but it is not necessary to encode this knowledge up-front in a parse tree or otherwise. The tool developer uses the knowledge to construct the proper (mostly) syntactic tools.

How Grammar Stealing Works in Practice

We--and others from industry and academia--have applied grammar stealing successfully to a number of projects, among which Java, PL/I, Ericsson PLEX, C++, Ada 95, VS Cobol II, Lucent SDL, AT&T SDL, SWIFT messages, and more. To show how the solution works out in practice, we share our experience with these projects. In particular, we focus on the two main branches of Figure 2: one proprietary, nontrivial real-time embedded systems language (for which the compiler sources were accessible) and one well-known business language at the other end of the gamut (for which no compiler sources were available). Both languages, PLEX and VS Cobol II, are used for business-critical systems. PLEX (Programming Language for EXchanges) is used in the AXE 10 public branch exchange, and VS Cobol II systems run on IBM mainframes (AXE 10 is the name of the switch; it is not an abbreviation).

Our approach uses a unique combination of powerful techniques:

  - automated grammar extraction;
  - sophisticated parsing techniques;
  - automated testing;
  - automated grammar transformation support.

If one of these ingredients is omitted, the synergy is gone. Extraction by hand is error-prone. Basic parsing technology limits you to working with grammars in severely restricted formats; with powerful technology you can work with arbitrary context-free grammars, so you can directly test them irrespective of their format. Without automated testing you never find so many errors in such a short time. Without tool support to transform grammar specifications, analyses are inaccurate and corrections are not done consistently, and without transformations you cannot repeat what you have done or change initial decisions easily (we record transformations in scripts). This also gives you an idea of the kind of people who are capable of stealing grammars: they should know about grammars, powerful parsing techniques, how to set up testing, and automated transformations.

Grammar Stealing from Compiler Sources

We illustrate grammar stealing with a published industrial project [14] with access to the compiler sources. We applied our approach to the exceptionally complex proprietary language PLEX, used by Ericsson to program public telephone switches. PLEX consists of about 20 (sub)languages, called sectors. In fact, we are dealing with a mixed language containing high-level programming sectors, assembly sectors, finite state machine sectors, marshaling sectors, etc.
What we did is straightforward, and can be summarized in a list:

  1. we reverse engineered the PLEX compiler on-site (63 Meg of source code) to look for grammar-related files;
  2. we found the majority of the grammars in some BNF form;
  3. we found a hand-written proprietary assembly parser with erroneous BNF in the comments;
  4. we wrote 6 BNF parsers (there were 6 different BNF dialects in use);
  5. we extracted the plain BNF from the compiler sources, and converted it to another syntax definition formalism (SDF) for technical reasons;
  6. we found the lexer files and converted them to SDF;
  7. we combined all the converted grammars into one overall grammar;
  8. we generated an overall parser with a sophisticated parser generator;
  9. we successfully parsed 8 million lines of PLEX code as a test.

The total effort was 2 weeks for two persons, including constructing tools, testing time, etc. It was done for 25.000 US dollars. We heard from Ericsson that a cutting-edge reengineering company had earlier estimated this task at a few million US dollars. When we contacted this company, they told us that 25.000 US dollars was nothing for such a grammar.

To illustrate the limited complexity of the work, consider a fragment of raw compiler source:

    <plex-program> = <program-header> <statement-row>
        'END' 'PROGRAM' ';'
        %% xnsmtopg(1) ; %%
        -- -- PROGRAM-HEADER.sect.as_prog_stat : ix_stat_list_p
        -- PROGRAM-HEADER.sect : ix_sect_node_p) ;

This expresses that a PLEX program consists of a header, a list of statements, and the phrase END PROGRAM to end a PLEX program. The other code deals with semantic actions relevant for the compiler. Our tools converted this to some common BNF while removing the idiosyncratic semantic actions:

    plex-program ::= program-header statement-row 'END' 'PROGRAM' ';'

Then our tools converted this into SDF, which was subsequently fed to a sophisticated parser generator accepting arbitrary context-free grammars:

    Program-header Statement-row "END" "PROGRAM" ";" -> Plex-program

We only show one production rule to give you an idea of the low complexity (a small sketch of this conversion step is shown below). The tools automatically recovered the majority of the 3000+ production rules in an afternoon. Then we tested each sector grammar separately, and we used a duplicate detector to weed out productions that were used in more than one sector grammar, so that we could construct an overall grammar able to parse complete PLEX programs. One assembly sector parser was hard-coded (viz. Figure 2), so we had to recover its grammar by reverse engineering. We had no problems with this task: the comments accompanying the code contained BNF, so the effort was very limited. With all the sector grammars combined, we generated a parser and tested it with an 8 MLOC PLEX test suite. Two files did not parse: they were compiler test files that were not supposed to parse. The rest passed the test. In addition, we generated a web-enabled version of the BNF description as a basis for a complete and correct manual.
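As promised above, here is a minimal sketch of the BNF-to-SDF conversion step, in Python (the actual extraction tools were different and handled six BNF dialects plus lexer files; this sketch only covers flat rules of the shape shown above):

    def bnf_to_sdf(rule: str) -> str:
        """Rewrite one flat BNF rule, 'lhs ::= sym1 sym2 ...', into an
        SDF-style production 'Sym1 Sym2 ... -> Lhs'. Terminals are quoted
        with single quotes in the BNF and with double quotes in SDF;
        nonterminals are capitalized. Alternatives, repetition and the like
        are deliberately not handled in this sketch."""
        lhs, rhs = (part.strip() for part in rule.split("::=", 1))
        symbols = []
        for token in rhs.split():
            if token.startswith("'") and token.endswith("'"):
                symbols.append('"%s"' % token[1:-1])   # terminal
            else:
                symbols.append(token.capitalize())     # nonterminal
        return "%s -> %s" % (" ".join(symbols), lhs.capitalize())

    if __name__ == "__main__":
        print(bnf_to_sdf(
            "plex-program ::= program-header statement-row 'END' 'PROGRAM' ';'"))
        # prints: Program-header Statement-row "END" "PROGRAM" ";" -> Plex-program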
Although this project was successful [14], an earlier (published) attempt to recover the PLEX grammar failed. In a first attack [15,16] we failed to recover the PLEX grammar from on-line PLEX manuals. Those manuals were not good enough to reconstruct the language from. Later, we could check that the manual lacked over 50% of the language definition, so the recovery process had to be incomplete by definition. We recall that initially we concluded that you cannot recover grammars from manuals. But we concluded this too fast, as we will see next.

Grammar Stealing from Reference Manuals

We illustrate grammar stealing with a published industrial project [5] without access to the compiler sources. Some of our colleagues felt a little fooled by the PLEX result: "they are not really constructing a parser, they only convert an existing one. Hey, we can do that, too! Now try it without the compiler." Indeed, at first sight, not having this valuable knowledge source available is a different issue (after all, we failed doing this for PLEX). However, we discovered that this failure was not due to the tools we developed, but due to the nature of proprietary manuals: their audience is so limited that major omissions can go unnoticed for a long time. When there is a large audience, the language vendor has to deliver better quality.

In another two-week effort we recovered the VS Cobol II grammar [17] from a manual, also by stealing the grammar. For the fully recovered VS Cobol II grammar, browse to www.cs.vu.nl/grammars/vs-cobol-ii/. Again, what we did is straightforward, and can be summarized in a list:

  1. we retrieved the on-line VS Cobol II manual from www.ibm.com;
  2. we extracted its syntax diagrams;
  3. we wrote a parser for the syntax diagrams;
  4. we extracted the BNF from the diagrams;
  5. we added 17 lexical rules by hand;
  6. we corrected the BNF using grammar transformations;
  7. we generated an error-detection parser;
  8. we incrementally parsed 2 million lines of VS Cobol II code;
  9. we reiterated steps 6-8 until all errors vanished;
  10. we converted the BNF to SDF for technical reasons;
  11. we generated a production parser;
  12. we incrementally parsed VS Cobol II code to detect ambiguities;
  13. we solved ambiguities using grammar transformations;
  14. we reiterated steps 11-13 until no more ambiguities were found.

So apart from some error correction cycles and ambiguity removal sessions, the process is the same as in the case where you have access to the compiler sources. An error-detection parser is a parser used to detect errors in the grammar it is generated from. In this case, we used an inefficient top-down parser with infinite lookahead. It accepts practically all context-free grammars, and does not bother about ambiguities at all. We use this kind of parser to test the grammar, not to produce parse trees. Since the code is correct according to some compiler, all errors this parser detects point to potential grammar problems. In this way we were pointed to the majority of the omissions, except ambiguities. When all our test code passed the top-down parser, we converted the grammar to SDF and generated a parser that detects ambiguities. In the same vein we corrected ambiguities. This project also took us two weeks of effort (two persons) in total, including the construction of tools, testing, and so on. It was done for zero US dollars. In that way we could freely publish the grammar on the Internet [18], as a gift for the 40th birthday party of Cobol.

To give you an idea of the limited complexity of this effort, we depicted an original syntax diagram [17] in Figure 3.

Figure 3: The original syntax diagram for the SEARCH statement.

After conversion to BNF and correction, it looks like this:

    search-statement =
      "SEARCH" identifier ["VARYING" (identifier | index-name)]
        [["AT"] "END" statement-list]
        ("WHEN" condition (statement-list | "NEXT" "SENTENCE"))+
      ["END-SEARCH"]

A dash is removed between NEXT and SENTENCE. Furthermore, both occurrences of imperative-statement are replaced by statement-list.
This is an example of a diagram that was too restrictive: only one statement was allowed, but in the informal text we learned that "A series of imperative statements can be specified whenever an imperative statement is allowed." Both errors were found using our error-detection parser. We parsed code where NEXT SENTENCE was used, but without a dash; upon inspection of the manual and the grammar, we wrote a grammar transformation repairing the error. The other error was also detected with our error-detection parser: we parsed code where more statements were allowed by the compiler than by the manual. We repaired this error with a grammar transformation as well.

After all these errors were corrected, we removed ambiguities in a separate phase. We illustrate this phase with an example. In the Cobol CALL statement, the following fragment of a syntax diagram is present, a stack of three alternatives:

    identifier
    ADDRESS OF identifier
    file-name

This stack of three alternatives can lead to an ambiguity: both identifier and file-name eventually reduce to the same lexical category. So when we parsed a CALL statement without an occurrence of ADDRESS OF, the parser reported an ambiguity, since the other two alternatives were both valid. Without using type information we cannot separate an identifier from a file-name. We first show the ambiguous extracted BNF fragment:

    (identifier | "ADDRESS" "OF" identifier | file-name)

With a grammar transformation we eliminated the file-name alternative:

    (identifier | "ADDRESS" "OF" identifier)

With the adapted grammar the same language is recognized as before, only an ambiguity is gone (a toy sketch of such a scripted transformation step is given at the end of this section). Note that this approach is much simpler than tweaking the parser and scanner to deal with types of names. In this way we recovered the entire VS Cobol II grammar and tested it with all our Cobol code from earlier software renovation projects, and with code from colleagues who were curious about the outcome of this project. For the final test we used about 2 million lines of pure VS Cobol II code. As in the PLEX case, we generated a fully web-enabled version of both the corrected BNF and the syntax diagrams, which could serve as the core of a complete and correct language reference manual. It is freely available on the Internet [18].
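To make the idea of a scripted, replayable grammar transformation slightly more concrete, here is a toy sketch in Python (the real work used dedicated grammar-transformation support operating on parsed grammars, not on raw text; this stand-in only records the CALL repair shown above as one guarded rewrite step):

    # One grammar transformation recorded as a replayable script step:
    # drop the ambiguous file-name alternative from the extracted BNF.
    # Purely illustrative; real grammar transformations operate on the
    # parsed grammar rather than on raw text.
    OLD = '(identifier | "ADDRESS" "OF" identifier | file-name)'
    NEW = '(identifier | "ADDRESS" "OF" identifier)'

    def drop_file_name_alternative(grammar_text: str) -> str:
        """Apply the rewrite, failing loudly if the expected fragment is
        absent, so the recorded step cannot silently go stale when the
        grammar changes."""
        if OLD not in grammar_text:
            raise ValueError("expected BNF fragment not found")
        return grammar_text.replace(OLD, NEW)

    if __name__ == "__main__":
        print(drop_file_name_alternative(OLD))
        # prints: (identifier | "ADDRESS" "OF" identifier)

Recording repairs as scripted steps like this is what makes the recovery process repeatable, and it lets you revisit earlier decisions without redoing the whole recovery by hand.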
Conclusion

We explained what the 500 language problem is and how it can be solved in general, by showing that our method covers almost all languages. We illustrated the general solution with two very complex and representative cases: one proprietary language with access to the compiler sources (PLEX), and one well-known business language with access only to a reference manual (VS Cobol II). Both cases are backed up by publications in the scientific literature. We sketched a cost-effective and accurate method to quickly produce parsers for the myriad of languages, so that existing code can be analyzed and modified using tools that work on the trees produced by the parsers. We illustrated our solution by providing details of the process for PLEX and Cobol. Apart from those two languages, more grammars have been recovered by us and others. From our effort in solving the 500 language problem we learned two interesting lessons:

  - The more uncommon a language is, the greater the chance that you have direct access to the compiler sources, which are an excellent source for grammar recovery.
  - The more mainstream a language is, the greater the chance that you have direct access to a reasonably good language reference manual, debugged by its many users, which is an excellent source for grammar recovery.

Acknowledgements

Thanks to Terry Bollinger, Prem Devanbu, Capers Jones, Tom McCabe, Harry Sneed, Ed Yourdon, and the reviewers for their substantial contributions.

Bibliography