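Is there a good open source engine for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and that doesn't query Google or Bing? I'd like to detect the language of each page in about 15 million pages of OCR'ed text.
Not all documents will contain languages that use the Latin alphabet.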
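Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.
In general, letter and word frequencies would probably be the fastest evaluation, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you find that the first two methods have too high an error rate.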
You could surely build your own, given some statistics about letter frequencies (http://www.letterfters.org/#UK-english-language-letter-fter), including the relative frequencies of letters in other languages,
and then release it as open source. And voilà, you have an open source engine for detecting the language of text!
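The build-your-own idea can be sketched roughly as follows. This is a minimal illustration, not a production detector: the function names and the three tiny sample sentences are assumptions made up for the example, and real profiles would have to be built from a large corpus per language. With samples this small it can only separate languages written in different scripts; telling apart languages that share the Latin alphabet would need much more data and probably digraph or trigram statistics.

```python
import math
from collections import Counter

def letter_profile(text):
    """Return the relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(ch, 0.0) for ch, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Toy reference texts (illustrative only; far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and runs into the forest",
    "russian": "съешь же ещё этих мягких французских булок да выпей чаю",
    "greek": "ξεσκεπάζω την ψυχοφθόρα βδελυγμία και γράφω ένα μικρό κείμενο",
}
PROFILES = {lang: letter_profile(sample) for lang, sample in SAMPLES.items()}

def guess_language(text):
    """Return (language, similarity) for the closest profile, or (None, 0.0)."""
    profile = letter_profile(text)
    if not profile:
        return None, 0.0  # no letters at all, e.g. a page of digits
    scores = {lang: cosine_similarity(profile, ref) for lang, ref in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

print(guess_language("This is a sample text."))
print(guess_language("Это пример текста"))
```

The similarity value is only a rough score in the range 0 to 1, not a calibrated probability.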
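For future reference, the engine I ended up using is released under the BSD license but does not appear to have been maintained since 2003. Still, it does a good job and was easy to integrate into my toolchain.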
Try CLD:
Install:
export CPPFLAGS="-std=c++98" # https://github.com/CLD2Owners/cld2/issues/47
pip install cld2-cffi --user
Usage:
import cld2
res = cld2.detect("This is a sample text.")
print(res)
res = cld2.detect("Dies ist ein Beispieltext.")
print(res)
res = cld2.detect("Je ne peut pas parler cette language.")
print(res)
res = cld2.detect(" هذه هي بعض النصوص العربية")
print(res)
res = cld2.detect("这是一些阿拉伯文字") # Chinese?
print(res)
res = cld2.detect("これは、いくつかのアラビア語のテキストです")
print(res)
print("Supports {} languages.".format(len(cld2.LANGUAGES)))
Gives:
Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Supports 282 languages.
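When running this over millions of pages, you will probably want to reduce each result to a single language code and skip unreliable detections. The sketch below assumes only the field names visible in the output above (is_reliable, details, language_code, percent). The namedtuples are stand-ins so the example runs without cld2 installed, and top_language is a hypothetical helper, not part of the library.

```python
from collections import namedtuple

# Stand-ins mirroring the field names printed above. With cld2-cffi
# installed, the object returned by cld2.detect(...) would be passed instead.
Detection = namedtuple("Detection", "language_name language_code percent score")
Detections = namedtuple("Detections", "is_reliable bytes_found details")

def top_language(result, unknown_code="un"):
    """Return (language_code, percent) for the best guess, or None.

    None is returned when the detector flags the result as unreliable
    or when every candidate is the 'Unknown' placeholder.
    """
    if not result.is_reliable:
        return None
    candidates = [d for d in result.details if d.language_code != unknown_code]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d.percent)
    return best.language_code, best.percent

unknown = Detection("Unknown", "un", 0, 0.0)
english_page = Detections(
    True, 23, (Detection("ENGLISH", "en", 95, 1675.0), unknown, unknown)
)
unreliable_page = Detections(False, 29, (unknown, unknown, unknown))

print(top_language(english_page))
print(top_language(unreliable_page))
```

Pages for which the helper returns None could be set aside for a second pass with another method.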
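I don't think you need anything very sophisticated. For example, to detect with fairly high certainty whether a document is in English, you can simply test whether it contains the most common English words, such as: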
"the a an is to are in on in it"
If it contains all of these, I would say it is almost certainly English.
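A minimal sketch of that check, using the word list quoted above. The function names and the 0.5 threshold are assumptions for illustration; requiring every word (threshold 1.0) matches the rule as stated but may be too strict for short or noisy OCR pages.

```python
import re

# The common words quoted above (duplicates removed).
COMMON_ENGLISH_WORDS = {"the", "a", "an", "is", "to", "are", "in", "on", "it"}

def english_word_coverage(text, common_words=COMMON_ENGLISH_WORDS):
    """Return the fraction of the common words that occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(common_words & words) / len(common_words)

def looks_english(text, threshold=0.5):
    """Guess that text is English if enough common words are present."""
    return english_word_coverage(text) >= threshold

print(looks_english("The cat is on a mat in the house, and it is going to sleep."))
print(looks_english("Dies ist ein Beispieltext."))
```

Note that some of these words also exist in other languages ("in" and "an" are German words too), so a low threshold can produce false positives.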
Otherwise, you can try Ruby's WhatLanguage gem. It is nice and simple, and I have used it for Twitter data analysis. Check http://www.you Programme.com/watch?v=lNqZ2cqOReo&list=UJ_3fstMOH-g4yBxtvgkw&index=0&fe;pcpla for a quick demo.
Franc on GitHub: http://github.com/wooorm/franc. It is written in JavaScript, so you can use it in the browser and also in Node.
- franc supports more languages than any other library, or Google;
- franc is easily forked to support 335 languages;
- franc is just as fast as the competition.