Question

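How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?

For example:

Can I do it from within the code?

Is there a static analysis tool that has this feature?

Edit:

I wanted this for an application in Python 2.5, but it turns out this is not really possible because:

2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings

So I am accepting the answer that says it is not possible, even though it is for different reasons :)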

Answer 1

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can still be written as b'...', just as they can in Python 3.
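As a minimal sketch of the point above, compiling a module body that enables unicode_literals still leaves a b'...' literal as a byte string. The source snippet and variable names are made up for illustration, and the sketch assumes Python 2.6 or later, where both the future import and the b prefix exist:

```python
# Hypothetical module source: unicode_literals is on, yet one literal is bytes.
SOURCE = (
    "from __future__ import unicode_literals\n"
    "plain = 'text'\n"
    "raw = b'bytes'\n"
)

namespace = {}
exec(compile(SOURCE, "<demo>", "exec"), namespace)

# bytes is an alias of str on Python 2.6+, and the byte type on Python 3.
plain_is_bytes = isinstance(namespace["plain"], bytes)
raw_is_bytes = isinstance(namespace["raw"], bytes)
print("plain is a byte string: %s" % plain_is_bytes)
print("raw is a byte string: %s" % raw_is_bytes)
```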



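There was an option that gave the same effect as unicode_literals globally: the command-line option -U. However, it was abandoned early in the 2.x series because it basically broke every script.

What is your purpose for this? Abolishing byte strings is not desirable. They are not "bad", and Unicode strings are not universally "good"; they are two different animals and you will need both. Byte strings are certainly needed to talk to binary files and network services.

If you want to be prepared for the transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else: places where you don't care which type you get, whether or not Python 3 changes the default string type.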
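A small sketch of that convention, with all names invented for illustration (none of them come from the answer). It should run unchanged on Python 2.6+ and Python 3.3+, since both accept the b and u prefixes:

```python
# Illustrative names only; none of these come from the original answer.
MAGIC = b"\x89PNG"        # bytes: binary file or wire content
GREETING = u"caf\u00e9"   # text: inherently Unicode
OPEN_MODE = "rb"          # default literal: either string type is fine here


def frame(message):
    """Encode text as UTF-8 and prepend the binary marker."""
    return MAGIC + message.encode("utf-8")


payload = frame(GREETING)
```

The explicit encode call marks the single place where text becomes bytes, which is the boundary the prefixes are meant to make visible.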

Answer 2


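It seems to me that you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the AST that the parser produces to see whether it contains any string literals.

It looks like Python comes with a parser out of the box. From this documentation I got this code sample working: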

import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] not in "uU": # Kind of hacky: checks only the first prefix character.
            print "%s is not unicode!" % stringValue
            returnValue = False

    else:
        for subNode in lst[1:]:
            if isinstance(subNode, list):
                # Recurse first so every offending literal is reported,
                # not just the first one found.
                returnValue = checkForNonUnicodeHelper(subNode) and returnValue

    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")


which prints out

'This should blow up!' is not unicode!
False
True

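Now, docstrings aren't unicode but should be allowed, so you might have to do something more complicated, such as from symbol import sym_name, where you can look up which node types are used for class and function definitions. Then the first sub-node that is simply a string, i.e. not part of an assignment or anything else, should be allowed to be non-unicode.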
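One way to sketch the docstring exemption, using a different technique from the parser-based code above: the standard tokenize module, with any statement made up only of string literals treated as a docstring. This is an approximation under stated assumptions, not the approach described in this answer. The function name find_non_unicode_literals is invented here, the sketch is written to run on a Python 3 interpreter, and it assumes the code under test tokenizes under that interpreter with prefixes it recognizes.

```python
import io
import re
import tokenize

# Token types that carry no statement content.
_SKIP = (tokenize.NL, tokenize.COMMENT, tokenize.INDENT, tokenize.DEDENT)


def find_non_unicode_literals(source):
    """Return (line, literal) pairs for string literals lacking a u prefix.

    A statement made up only of string literals (a docstring, in practice)
    is exempt.
    """
    offenders = []
    statement = []  # significant tokens of the current logical line
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        tok_type, text, start = tok[0], tok[1], tok[2]
        if tok_type in _SKIP:
            continue
        if tok_type not in (tokenize.NEWLINE, tokenize.ENDMARKER):
            statement.append((tok_type, text, start[0]))
            continue
        # End of a logical line: inspect it unless it is a bare string.
        if not all(t == tokenize.STRING for t, _, _ in statement):
            for t, literal, line in statement:
                if t != tokenize.STRING:
                    continue
                prefix = re.match(r"[A-Za-z]*", literal).group().lower()
                if "u" not in prefix:
                    offenders.append((line, literal))
        statement = []
    return offenders


SAMPLE = '''
def foo():
    """Docstrings are exempt."""
    a = 'This should be reported'
    b = u'although this is ok.'
'''
offenders = find_non_unicode_literals(SAMPLE)
for line, literal in offenders:
    print("line %d: %s is not unicode" % (line, literal))
```

Note that this exempts every bare string statement, not only the first one in a class or function body, so it is more permissive than a true docstring check.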



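Good question!

Edit

Just a follow-up comment. Conveniently, parser.suite does not actually evaluate your Python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py, which contains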

from ..obscure.relative.path import whatever


You can do

checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
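The parser module was removed in Python 3.10, but the same property can be illustrated with the standard ast module, used here as a stand-in. In this small sketch the source text is hypothetical: it contains a relative import and an undefined name, and parsing it succeeds because nothing is imported or executed.

```python
import ast

# Hypothetical file content: the relative import and the undefined name
# would both fail if this code were actually executed.
SOURCE = (
    "from ..obscure.relative.path import whatever\n"
    "name = undefined_thing\n"
)

tree = ast.parse(SOURCE)  # parses only; nothing is imported or run
node_types = [type(node).__name__ for node in tree.body]
print(node_types)
```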

Answer 3

Our SD Source Code Search Engine (SCSE) can provide this result directly.

The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.

It uses lexical information from the source languages as the basis for queries, which are composed of various language keywords and pattern tokens that match varying-content language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").

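So a search for:

  'for' ... I=ij*

finds the keyword 'for' near ("...") an identifier whose prefix is "ij", and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)

A trivial search:

  S

finds all string literals. This is often a pretty big set :-}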

The search

 UnicodeStrings

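finds all string literals that are lexically defined as Unicode strings (u"...").

What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "which strings aren't unicode", is expressed concisely as: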

  S-UnicodeStrings

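All hits shown will be strings that aren't unicode strings, which is your precise question.

The SCSE provides logging facilities so that you can record hits. You can run SCSE from the command line, which enables a scripted query for your answer. Putting this into a command script would give you a tool that provides your answer directly.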
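To make the subtraction idea concrete, here is a rough Python analogue. It is not SCSE and does not use its query language; it is only a sketch that collects string-literal hits with the standard tokenize module and subtracts one hit set from another. The helper string_hits and the sample source are invented for illustration, and the sketch assumes a Python 3 interpreter.

```python
import io
import re
import tokenize


def string_hits(source, wanted=lambda prefix: True):
    """Return the set of (row, col, literal) hits for matching string literals."""
    hits = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok[0] == tokenize.STRING:
            prefix = re.match(r"[A-Za-z]*", tok[1]).group().lower()
            if wanted(prefix):
                hits.add((tok[2][0], tok[2][1], tok[1]))
    return hits


SAMPLE = "a = 'plain'\nb = u'text'\nc = b'raw'\n"

all_strings = string_hits(SAMPLE)                               # like S
unicode_strings = string_hits(SAMPLE, lambda p: "u" in p)       # like UnicodeStrings
non_unicode = sorted(all_strings - unicode_strings)             # like S-UnicodeStrings
print(non_unicode)
```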
