
Charming Python: Parsing with the Spark module

David Mertz, Ph.D. ([email protected])
Analyst, Gnosis Software, Inc.
August 2002

Spark is a powerful and general-purpose parser/compiler framework written in Python. In some respects, Spark offers more than SimpleParse or other Python parsers do. Because it is written entirely in Python, however, it is also slower. In this article, David discusses the Spark module, gives some code samples, explains its usage, and offers suggestions about its areas of application.
Following the previous installment of "Charming Python", which was devoted to SimpleParse, this article continues my introduction to some basic parsing concepts and discusses the Spark module. Parsing frameworks are a rich topic that repays spending time to understand fully; these two articles make a good start, for readers and for myself.

In my everyday programming, I frequently need to identify the parts and structures that occur in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little languages" governing what may appear within them. The way I write programs for these informal parsing tasks has always been a bit of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly: "read some text, figure out whether anything can be done with it, then maybe read a bit more text, and keep trying."

Parsers distill the descriptions of the parts and structures in a document into concise, clear, declarative rules for what the document is made of. Most formal parsers use a variant of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they describe. Basically, an EBNF grammar assigns names to the parts you might find in a document; in addition, larger parts are usually composed of smaller parts. How frequently, and in what order, the small parts appear within the larger parts is specified by operators. As an example, Listing 1 is the EBNF grammar typographify.def, which we saw in the SimpleParse article (other tools run it in slightly different ways):

Listing 1. typographify.def
para           := (plain / markup)+
plain          := (word / whitespace / punctuation)+
whitespace     := [ ]+
alphanums      := [a-zA-Z0-9]+
word           := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct      := [-_]
contraction    := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup         := emph / strong / module / code / title
emph           := '-', plain, '-'
strong         := '*', plain, '*'
module         := '[', plain, ']'
code           := "'", plain, "'"
title          := '_', plain, '_'
punctuation    := (safepunct / mdash)
mdash          := '--'
safepunct      := [!@#$%^&()+=|{}:;<>,.?/"]



Introducing Spark

The Spark parser has something in common with EBNF grammars, but it breaks the parsing/processing work into smaller components than a traditional EBNF grammar allows. Spark's advantage is fine-grained control over the operation of every step of the process, plus the ability to insert custom code into that process. If you have read the SimpleParse article in this series, you will recall that our process there was a rough one: 1) generate a complete list of tags from the grammar (and from a source file); 2) use the tag list as data for custom programming operations.

Spark's disadvantages compared with standard EBNF-based tools are that it is more verbose and that it lacks direct occurrence quantifiers (that is, "+" for existence, "*" for possibility, and "?" for optionality). Quantifiers can be used in the regular expressions of the Spark tokenizer, and can be simulated with recursion in the parse-expression grammar. It would be nicer still if Spark allowed quantification in its grammar expressions. Another disadvantage worth mentioning is that Spark's speed suffers greatly compared with the C-based underlying mxTextTools engine that SimpleParse uses.

In "Compiling Little Languages in Python" (see Resources), Spark's creator John Aycock breaks a compiler into four stages. The problems discussed in this article involve only the first two and a half of them, both because of limits on article length and because we will address only the same comparatively simple "text markup" problem taken up in the previous article. Spark can also be taken further along, as a full-cycle code compiler/interpreter, rather than being used only for the "parse and process" tasks I describe. Let's look at the four stages Aycock describes (quoted with some elision):

ɨÃ裬Ҳ³Æ´Ê·¨·ÖÎö¡£½«ÊäÈëÁ÷·Ö³ÉÒ»ÁмǺš£
½âÎö£¬Ò²³ÆÓï·¨·ÖÎö¡£È·±£¼ÇºÅÁбíÔÚÓï·¨ÉÏÊÇÓÐЧµÄ¡£
ÓïÒå·ÖÎö¡£±éÀú³éÏóÓï·¨Ê÷£¨abstract syntax tree£¬AST£©Ò»´Î»ò¶à´Î£¬ÊÕ¼¯ÐÅÏ¢²¢¼ì²éÊäÈë³ÌÐò makes sense¡£
Éú³É´úÂë¡£ÔٴαéÀú AST£¬Õâ¸ö½×¶Î¿ÉÄÜÓà C »ò»ã±àÖ±½Ó½âÊͳÌÐò»òÊä³ö´úÂë¡£
¶Ôÿ¸ö½×¶Î£¬Spark ¶¼ÌṩÁËÒ»¸ö»ò¶à¸ö³éÏóÀàÒÔÖ´ÐÐÏàÓ¦²½Ö裬»¹ÌṩÁËÒ»¸öÉÙ¼ûµÄЭÒ飬´Ó¶øÌØ»¯ÕâЩÀà¡£Spark ¾ßÌåÀಢ²»Ïó´ó¶àÊý¼Ì³ÐģʽÖеÄÀàÄÇÑù½ö½öÖØж¨Òå»òÌí¼ÓÌض¨µÄ·½·¨£¬¶øÊǾßÓÐÁ½ÖÖÌØÐÔ£¨Ò»°ãµÄģʽÓë¸÷½×¶ÎºÍ¸÷ÖÖ¸¸Ä£Ê½¶¼Ò»Ñù£©¡£Ê×ÏÈ£¬¾ßÌåÀàËùÍê³ÉµÄ´ó²¿·Ö¹¤×÷¶¼ÔÚ·½·¨µÄÎĵµ×Ö·û´®£¨docstring£©ÖÐÖ¸¶¨¡£µÚ¶þ¸öÌØÊâµÄЭÒéÊÇ£¬ÃèÊöģʽµÄ·½·¨¼¯½«±»¸³Óè±íÃ÷Æä½ÇÉ«µÄ¶ÀÌØÃû³Æ¡£¸¸Àà·´¹ýÀ´°üº¬²éÕÒʵÀýµÄ¹¦ÄÜÒÔ½øÐвÙ×÷µÄÄÚÊ¡£¨introspective£©·½·¨¡£ÎÒÃÇÔڲο´Ê¾ÀýµÄʱºî»á¸üÇå³þµØÈÏʶµ½ÕâÒ»µã¡£
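Before the full example below, a tiny, hypothetical scanner may make this protocol more concrete. Everything here other than the Spark conventions themselves (the GenericScanner base class, the "t_" method prefix, and the regular expression living in the docstring) is my own illustration rather than code from the article's archive:

from spark import GenericScanner

class Token:
    # Minimal token holder, assumed to resemble the class elided from Listing 3
    def __init__(self, type, attr=None):
        self.type, self.attr = type, attr
    def __cmp__(self, o):
        return cmp(self.type, o)      # a token compares equal to its type name
    def __repr__(self):
        return str(self.attr or self.type)

class ToyScanner(GenericScanner):
    "Collect runs of digits and skip spaces (illustration only)"
    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)   # parent introspects for 't_' methods
        return self.rv
    def t_number(self, s):
        r" \d+ "                               # the docstring *is* the pattern
        self.rv.append(Token('number', s))
    def t_whitespace(self, s):
        r" \s+ "                               # matched, but produces no token
        pass

# ToyScanner().tokenize('12 34 5') should hand back three 'number' tokens,
# displayed roughly as [12, 34, 5].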

ʶ±ðÎı¾±ê¼Ç
ÎÒÒѾ­Óü¸ÖÖÆäËüµÄ·½·¨½â¾öÁËÕâÀïµÄÎÊÌâ¡£ÎÒ½«Ò»ÖÖÎÒ³Æ֮Ϊ¡°ÖÇÄÜ ASCII¡±µÄ¸ñʽÓÃÓÚ¸÷ÖÖÄ¿µÄ¡£ÕâÖÖ¸ñʽ¿´ÆðÀ´ºÜÏóΪµç×ÓÓʼþºÍÐÂÎÅ×éͨÐÅ¿ª·¢µÄÄÇЩЭ¶¨¡£³öÓÚ¸÷ÖÖÄ¿µÄ£¬ÎÒ½«ÕâÖÖ¸ñʽ×Ô¶¯µØת»»ÎªÆäËü¸ñʽ£¬Èç HTML¡¢XML ºÍ LaTeX¡£ÎÒÔÚÕâÀﻹҪÔÙÕâÑù×öÒ»´Î¡£ÎªÁËÈÃÄúÖ±¹ÛµØÀí½âÎÒµÄÒâ˼£¬ÎÒ½«ÔÚ±¾ÎÄÖÐʹÓÃÏÂÃæÕâ¸ö¼ò¶ÌµÄÑù±¾£º

Listing 2. Smart ASCII sample text (p.txt)
Text with *bold*, and -itals phrase-, and [module]--this
should be a good 'practice run'.


Apart from what the sample file shows, there is a little more to the format, but not much (although there are a few subtleties about how markup and punctuation interact).

Generating the tokens
The first thing our Spark "smart ASCII" parser needs to do is split the input text into its relevant parts. At this level of tokenization, we do not yet want to discuss how the tokens should be structured; keeping them as they are is fine. Later we will combine the token sequences into parse trees.

The grammar shown in typographify.def above provides a design guide for the Spark lexer/scanner. Note that we can only use names that are "primitive" at the scanner stage. That is, any (compound) pattern that includes other named patterns must be deferred to the parsing stage. Other than that, we can really just copy over our old grammar.

Listing 3. Abridged wordscanner.py Spark script
class WordScanner(GenericScanner):
    "Tokenize words, punctuation and markup"
    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv
    def t_whitespace(self, s):
        r" [ ]+ "
        self.rv.append(Token('whitespace', ' '))
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        print "{word}",
        self.rv.append(Token('alphanums', s))
    def t_safepunct(self, s): ...
    def t_bracket(self, s): ...
    def t_asterisk(self, s): ...
    def t_underscore(self, s): ...
    def t_apostrophe(self, s): ...
    def t_dash(self, s): ...

class WordPlusScanner(WordScanner):
    "Enhance word/markup tokenization"
    def t_contraction(self, s):
        r" (?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve) "
        self.rv.append(Token('contraction', s))
    def t_mdash(self, s):
        r' -- '
        self.rv.append(Token('mdash', s))
    def t_wordpunct(self, s): ...



There is an interesting point here. WordScanner is a perfectly good scanner class on its own; but a Spark scanner class can itself be further specialized by inheritance: child regular-expression patterns are matched ahead of their parents', and child methods/regexes can override the parents' where needed. So WordPlusScanner will try its specializations before WordScanner does (and may thereby grab some bytes first). Any regular expression is allowed in the pattern docstrings (for example, the .t_contraction() method contains a "lookbehind assertion" in its pattern).
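The lookbehind assertion in .t_contraction() can be tried out on its own with the standard re module, independently of Spark; the sample string is just my own illustration:

import re

# An apostrophe-suffix only counts as a contraction when a letter precedes it,
# which is what the lookbehind assertion (?<=[a-zA-Z]) enforces.
contraction = re.compile(r"(?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve)")
print contraction.findall("o'clock isn't what it's cracked up to be")
# -> ['clock', 't', 's']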

Unfortunately, Python 2.2 breaks this scanner inheritance logic somewhat. In Python 2.2, all the defined patterns are matched in alphabetical order (by name), regardless of where in the inheritance chain they are defined. To fix this, you can change one line of code in the Spark function _namelist():

Listing 4. Corrected spark.py function
def _namelist(instance):
    namelist, namedict, classlist = [], {}, [instance.__class__]
    for c in classlist:
        for b in c.__bases__:
            classlist.append(b)
        # for name in dir(c):   # dir() behavior changed in 2.2
        for name in c.__dict__.keys():   # <-- USE THIS
            if not namedict.has_key(name):
                namelist.append(name)
                namedict[name] = 1
    return namelist



ÎÒÒѾ­Ïò Spark ´´Ê¼ÈË John Aycock ֪ͨÁËÕâ¸öÎÊÌ⣬½ñºóµÄ°æ±¾»áÐÞÕýÕâ¸öÎÊÌ⡣ͬʱ£¬ÇëÔÚÄú×Ô¼ºµÄ¸±±¾ÖÐ×÷³öÐ޸ġ£

ÈÃÎÒÃÇÀ´¿´¿´£¬WordPlusScanner ÔÚÓ¦Óõ½ÉÏÃæÄǸö¡°ÖÇÄÜ ASCII¡±Ñù±¾Öкó»á·¢Éúʲô¡£Ëü´´½¨µÄÁбíÆäʵÊÇÒ»¸ö Token ʵÀýµÄÁÐ±í£¬µ«ËüÃÇ°üº¬Ò»¸ö .__repr__ ·½·¨£¬¸Ã·½·¨ÄÜÈÃËüÃǺܺõØÏÔʾÒÔÏÂÐÅÏ¢£º

Listing 5. Tokenizing "smart ASCII" with WordPlusScanner
>>> from wordscanner import WordPlusScanner
>>> tokens = WordPlusScanner().tokenize(open('p.txt').read())
>>> filter(lambda s: s<>'whitespace', tokens)
[Text, with, *, bold, *, ,, and, -, itals, phrase, -, ,, and, [,
module, ], --, this, should, be, a, good, ', practice, run, ', .]



It is worth noting that although methods like .t_alphanums() are recognized by Spark introspection based on their "t_" prefix, they are still regular methods. Any extra code inside such a method will run whenever the corresponding token is encountered. The .t_alphanums() method contains a tiny example of this, with a print statement.
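As a purely hypothetical illustration of such extra code, a further subclass might keep a running tally of word tokens as a side effect of scanning. The CountingScanner name and its .counts attribute are my own invention, and I assume the Token class used in Listing 3 is importable from wordscanner.py:

from wordscanner import WordPlusScanner, Token

class CountingScanner(WordPlusScanner):
    "Tokenize as usual, but also count how many alphanumeric words go by"
    def tokenize(self, input):
        self.counts = {}
        return WordPlusScanner.tokenize(self, input)
    def t_alphanums(self, s):
        r" [a-zA-Z0-9]+ "
        # Same pattern as the parent, plus a side effect on each match
        self.counts['alphanums'] = self.counts.get('alphanums', 0) + 1
        self.rv.append(Token('alphanums', s))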

Generating the abstract syntax tree
Finding the tokens is mildly interesting, but the really interesting part is applying a grammar to the list of tokens. The parsing stage creates arbitrary tree structures on the basis of the token list; it is just a matter of specifying an expression grammar.

Spark Óкü¸ÖÖ´´½¨ AST µÄ·½·¨¡£¡°ÊÖ¹¤¡±µÄ·½·¨ÊÇÌØ»¯ GenericParser Àà¡£ÔÚÕâÖÖÇé¿öÏ£¬¾ßÌå×Ó½âÎöÆ÷»áÌṩºÜ¶à·½·¨£¬·½·¨ÃûµÄÐÎʽΪ p_foobar(self, args)¡£Ã¿¸öÕâÑùµÄ·½·¨µÄÎĵµ×Ö·û´®¶¼°üº¬Ò»¸ö»ò¶à¸öģʽµ½Ãû³ÆµÄ·ÖÅä¡£Ö»ÒªÓï·¨±í´ïʽƥÅ䣬ÿÖÖ·½·¨¾Í¿ÉÒÔ°üº¬ÈκÎÒªÖ´ÐеĴúÂë¡£

However, Spark also offers an "automatic" way of generating the AST. This style inherits from the GenericASTBuilder class. All the grammar expressions are listed in a single topmost method, and the .terminal() and .nonterminal() methods can be specialized to manipulate the subtrees during generation (or to perform any other operation you like, if needed). The result is still an AST, but the parent class does most of the work for you. My grammar class looks much like the following:

Listing 6. Abridged markupbuilder.py Spark script
class MarkupBuilder(GenericASTBuilder):
    "Write out HTML markup based on matched markup"
    def p_para(self, args):
        '''
        para   ::= plain
        para   ::= markup
        para   ::= para plain
        para   ::= para emph
        para   ::= para strong
        para   ::= para module
        para   ::= para code
        para   ::= para title
        plain  ::= whitespace
        plain  ::= alphanums
        plain  ::= contraction
        plain  ::= safepunct
        plain  ::= mdash
        plain  ::= wordpunct
        plain  ::= plain plain
        emph   ::= dash plain dash
        strong ::= asterisk plain asterisk
        module ::= bracket plain bracket
        code   ::= apostrophe plain apostrophe
        title  ::= underscore plain underscore
        '''
    def nonterminal(self, type_, args):
        #  Flatten AST a bit by not making nodes if only one child.
        if len(args)==1:  return args[0]
        if type_=='para':   # let the base-class builder make 'para' nodes
            return GenericASTBuilder.nonterminal(self, type_, args)
        if type_=='plain':
            args[0].attr = foldtree(args[0])+foldtree(args[1])
            args[0].type = type_
            return GenericASTBuilder.nonterminal(self, type_, args[:1])
        phrase_node = AST(type_)
        phrase_node.attr = foldtree(args[1])
        return phrase_node



My .p_para() should contain only a set of grammar rules in its docstring (no code). I decided to use the .nonterminal() method specifically to flatten the AST a bit. "plain" nodes made up of a series of "plain" subtrees get those subtrees compressed into one longer string. Likewise, the markup subtrees (that is, "emph", "strong", "module", "code", and "title") are collapsed into a single node of the right type that contains the compound string.
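The foldtree() helper is referenced above but elided from the abridged listing. A minimal reconstruction of what such a helper has to do, namely gather up the text found at or below a node, might read as follows (my sketch, not the archive's actual code):

def foldtree(node):
    # Leaf tokens and already-folded nodes carry their text in .attr;
    # interior nodes only hold their children in ._kids (compare Listing 8).
    if hasattr(node, 'attr'):
        return node.attr
    return ''.join(map(foldtree, node._kids))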

ÎÒÃÇÒѾ­Ìáµ½¹ý£¬Spark Óï·¨ÖÐÏÔȻȱÉÙÒ»Ñù¶«Î÷£ºÃ»ÓмÆÁ¿·û¡£Í¨¹ýÏÂÃæÕâÑùµÄ¹æÔò£¬

plain ::= plain plain


we can aggregate subtrees of the "plain" type pairwise. But I would prefer it if Spark allowed a more EBNF-style grammar expression, such as:

plain ::= plain+


We could then more simply create n-ary subtrees of "as many plains as possible". Given that, our trees would start out flatter, even without the extra manipulation in .nonterminal().

ʹÓÃÊ÷
Spark Ä£¿éÌṩÁ˼¸¸öʹÓà AST µÄÀà¡£±ÈÆðÎÒµÄÄ¿µÄÀ´Ëµ£¬ÕâЩÔðÈαÈÎÒÐèÒªµÄ¸ü´ó¡£Èç¹ûÄúÏ£ÍûµÃµ½ËüÃÇ£¬GenericASTTraversal ºÍ GenericASTMatcher ÌṩÁ˱éÀúÊ÷µÄ·½·¨£¬Ê¹Óõļ̳ÐЭÒéÀàËÆÓÚÎÒÃÇΪɨÃè³ÌÐòºÍ½âÎöÆ÷ËùÌṩµÄ¡£

But walking a tree with nothing more than recursive functions is not difficult at all. I have created a few such examples in the article's archive file prettyprint.py (see Resources). One of them is showtree(). That function displays an AST following a few conventions (a minimal sketch of such a function appears after this list):

Each line shows the descent depth
Nodes with only children (and no content) begin with dashes
Node types are enclosed in double angle brackets
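A minimal sketch of a display function that follows those conventions (my own sketch, not the prettyprint.py original) might be:

def showtree(node, depth=0):
    if hasattr(node, 'attr'):          # node carries text: show type and content
        print '%2d  <<%s>>  %s' % (depth, node.type, node.attr)
    else:                              # node with only children: leading dashes
        print '%2d %s <<%s>>' % (depth, '-'*depth, node.type)
        for kid in node._kids:
            showtree(kid, depth+1)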
Let's look at the AST generated for the example above:

Listing 7. The AST built from the "smart ASCII" sample
>>> from wordscanner import tokensFromFname
>>> from markupbuilder import treeFromTokens
>>> from prettyprint import showtree
>>> showtree(treeFromTokens(tokensFromFname('p.txt')))
0 <>
1 - <>
2 -- <>
3 --- <>
4 ---- <>
5 ----- <>
6 ------ <>
7 ------- <>
8 -------- <>
9 <> Text with
8 <> bold
7 ------- <>
8 <> , and
6 <> itals phrase
5 ----- <>
6 <> , and
4 <> module
3 --- <>
4 <> --this should be a good
2 <> practice run
1 - <>
2 <> .



Understanding the tree structure is intuitive enough, but what about the transformed markup we are really after? Fortunately, it takes only a few lines of code to traverse the tree and generate it:

Listing 8. Outputting markup from the AST (prettyprint.py)
import sys

def emitHTML(node):
    from typo_html import codes
    if hasattr(node, 'attr'):
        beg, end = codes[node.type]
        sys.stdout.write(beg+node.attr+end)
    else: map(emitHTML, node._kids)


The typo_html.py file is the same as the one in the SimpleParse article: it just contains a dictionary mapping names to pairs of begin and end tags. Obviously, we could use the same approach for markup other than HTML. In case it is unclear, here is what our example produces:

Listing 9. HTML output of the whole process
Text with <strong>bold</strong>, and <em>itals phrase</em>,
and <em><code>module</code></em>--this should be a good
<code>practice run</code>.
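Since typo_html.py itself is not reproduced in this article, a dictionary along the following lines would produce output like Listing 9; the exact tag choices here are my assumption about that file rather than a quotation from it:

# typo_html.py (sketch): map node types to begin/end tag pairs
codes = {
    'plain':  ('', ''),
    'emph':   ('<em>', '</em>'),
    'strong': ('<strong>', '</strong>'),
    'module': ('<em><code>', '</code></em>'),
    'code':   ('<code>', '</code>'),
    'title':  ('<cite>', '</cite>'),
}
# Any other leaf type that reaches emitHTML() would need an entry as well.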


Conclusion
Many Python programmers have recommended Spark to me. Although the unusual protocol Spark uses takes a little getting used to, and although its documentation can be ambiguous in places, Spark's power is still quite surprising. The programming style Spark implements lets the end programmer insert code blocks anywhere within the scanning/parsing process, a process that is usually a "black box" to end users.

For all its strengths, the real drawback I found in Spark is its speed. Spark is the first Python program I have used in which the speed penalty of an interpreted language turned out to be the main issue. Spark really is slow; not merely "I wish it were a little snappier" slow, but "took a long lunch and hoped it would finish" slow. In my experiments the tokenizer is reasonably fast, but the parsing process is very slow, even with quite small test cases. To be fair, John Aycock has pointed out to me that the Earley parsing algorithm Spark uses is far more comprehensive than the simpler LR algorithm, and that this is the main reason for the slowness. It is also possible that, given my inexperience, I designed inefficient grammars; but even if so, most users are likely to end up in the same position I am in.

Resources

׫д±¾ÎĵĻù´¡ÊÇ David ÔÚǰһƪ¡°Charming Python£ºParsing with the SimpleParse module¡±£¨developerWorks£¬2002 Äê 1 Ô£©ÖеÄÒ»¸öÌÖÂÛ¡£


ÔÚ John Aycock µÄ Spark Ö÷Ò³ÉÏÏÂÔØ Spark ²¢»ñÈ¡¸ü¶àÐÅÏ¢¡£


ÔÚ Spark Ö÷Ò³ÉÏ£¬ÄúÒª¿´µÄ×îÖØÒªµÄÎĵµÊÇ Spark ¿ò¼Ü×î³õµÄ½éÉÜ´Ç¡°Compiling Little Languages in Python¡±£¬ÓÉ John Aycock ±àд¡£


Äú¿ÉÒÔÔÚ Mike Fletcher µÄ SimpleParse Ò³ÃæÉÏÕÒµ½ SimpleParse ºÍ¶ÔÆäʹÓÃÇé¿öµÄÒ»·Ý¼òµ¥½éÉÜ¡£


mxTextTools ÏÖÔÚÊǸü´óµÄ eGenix À©Õ¹°üµÄÒ»²¿·Ö¡£


Äú¿ÉÒÔÔÚ Markus Kuhn ÌṩµÄÒ»¸öÒ³ÃæÉÏÕÒµ½¹ØÓÚ EBNF Óï·¨µÄ ISO 14977 ±ê×¼µÄÐÅÏ¢¡£


David ×Ô¼ºµÄÎĵµÖÐÓб¾ÎÄÖÐÌáµ½µÄÎļþ¡£


ÔÚ developerWorks Linux רÇø²éÕÒ¸ü¶à Linux ÎÄÕ¡£

About the author
David Mertz would like to write, with Nietzsche, that these are the musings of an old philologist, but that lie would give itself away. Perhaps, though, his forthcoming (and freely accessible) book, Text Processing in Python, will one day be mistaken for a cybernetics of linguistics. You can contact David at [email protected]; his work is collected at http://gnosis.cx/publish/. Suggestions and comments about this, past, or future columns are welcome.