How to detect language

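Are there any good open-source engines for detecting which language a text is written in, ideally with a probability metric? I need one that I can run locally and that does not query Google or Bing. I would like to detect the language of each page in about 15 million pages of OCR'ed text.

Not all of the documents contain languages that use the Latin alphabet.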

Best answer

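Depending on what you are doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.

In general, letter and word frequencies are probably the fastest way to evaluate a text, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you find that the error rate of the two frequency-based methods is too high.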
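
As a rough illustration of combining the two ideas above (frequency counts and a Bayesian classifier), here is a minimal sketch of a naive Bayes classifier over character trigrams, using only the standard library. It is not NLTK's API; the function names, the uniform prior over languages, the add-one smoothing, and the tiny training samples are all assumptions made for the example, and real use would need far larger training corpora.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the overlapping character trigrams of a lowercased, padded string."""
    padded = "  " + text.lower() + "  "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Build one trigram Counter per language from {language: [texts]}."""
    return {lang: Counter(g for t in texts for g in trigrams(t))
            for lang, texts in samples.items()}

def classify(text, models):
    """Return (language, log-score) pairs, best first.

    Assumes a uniform prior over languages and add-one smoothing.
    """
    vocab = len({g for counts in models.values() for g in counts})
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab)) for g in trigrams(text)
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical, deliberately tiny training data for illustration only.
SAMPLES = {
    "en": ["this is a sample text written in the english language",
           "the quick brown fox jumps over the lazy dog"],
    "de": ["dies ist ein beispieltext in der deutschen sprache",
           "der schnelle braune fuchs springt über den faulen hund"],
}

models = train(SAMPLES)
ranking = classify("the dog is in the house", models)
print(ranking[0][0])  # language code with the highest log-score
```

The gap between the best and second-best log-scores can serve as a crude confidence measure, which is one way to approximate the probability metric the question asks for.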

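Other answers

For future reference, the engine I ended up using is released under the BSD license, but it does not appear to have been maintained since 2003. Still, it does a good job and was easy to integrate into my toolchain.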

Try CLD:

Install

export CPPFLAGS="-std=c++98"  # https://github.com/CLD2Owners/cld2/issues/47
pip install cld2-cffi --user

Usage

import cld2

res = cld2.detect("This is a sample text.")
print(res)
res = cld2.detect("Dies ist ein Beispieltext.")
print(res)
res = cld2.detect("Je ne peut pas parler cette language.")
print(res)
res = cld2.detect(" هذه هي بعض النصوص العربية")
print(res)
res = cld2.detect("这是一些阿拉伯文字")  # Chinese?
print(res)
res = cld2.detect("これは、いくつかのアラビア語のテキストです")
print(res)
print("Supports {} languages.".format(len(cld2.LANGUAGES)))

Gives

Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Supports 282 languages.
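
Note that the Chinese sample comes back with is_reliable=False, so results should be checked before they are trusted. The following is a minimal sketch of one way to reduce such a result to a single language code. The helper name best_language, the min_percent threshold, and the "un" fallback are hypothetical; the namedtuples are stand-ins that mirror the field names visible in the output above so the sketch runs without cld2 installed, and a real cld2.detect() result is assumed to expose the same attributes.

```python
from collections import namedtuple

# Stand-ins mirroring the fields shown in the printed output above.
Detection = namedtuple("Detection", "language_name language_code percent score")
Detections = namedtuple("Detections", "is_reliable bytes_found details")

def best_language(result, min_percent=50, fallback="un"):
    """Return the top language code, or `fallback` if the result is not trustworthy.

    `min_percent` is an arbitrary illustrative threshold, not a cld2 setting.
    """
    if not result.is_reliable or not result.details:
        return fallback
    top = max(result.details, key=lambda d: d.percent)
    return top.language_code if top.percent >= min_percent else fallback

english = Detections(True, 23, (
    Detection("ENGLISH", "en", 95, 1675.0),
    Detection("Unknown", "un", 0, 0.0),
    Detection("Unknown", "un", 0, 0.0),
))
unreliable = Detections(False, 29, (Detection("Unknown", "un", 0, 0.0),) * 3)

print(best_language(english))     # en
print(best_language(unreliable))  # un
```

Pages that fall back to "un" could then be routed to a second detector or set aside for manual review.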

Others

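I don't think you need anything very sophisticated. For example, to detect with a fairly high level of certainty whether a document is in English, simply test whether it contains the most common English words, something like: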

"the a an is to are in on in it"

If it contains all of those, I would say it is almost certainly English.
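
A minimal sketch of that check, assuming whole-word matching on lowercased text. The function names and the threshold parameter are hypothetical; lowering the threshold below 1.0 tolerates short pages that happen to miss a word or two.

```python
import re

# The word list from the answer above, de-duplicated ("in" appears twice there).
COMMON_ENGLISH = {"the", "a", "an", "is", "to", "are", "in", "on", "it"}

def english_word_coverage(text, words=COMMON_ENGLISH):
    """Return the fraction of `words` that occur at least once in `text`."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return len(words & tokens) / len(words)

def looks_english(text, threshold=1.0):
    """True if at least `threshold` of the common words are present."""
    return english_word_coverage(text) >= threshold

page = ("It is an open question whether the pages are in a format "
        "that is easy to read on a screen.")
print(looks_english(page))                          # True
print(looks_english("Dies ist ein Beispieltext."))  # False
```

Because the match is on whole tokens, the "in" inside the German word "ein" does not count as a hit.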

Otherwise, you could try Ruby's WhatLanguage gem. It is nice and simple, and I have used it for Twitter data analysis. For a quick demo, check out: http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UJ_3fstMOH-g4yBxtvgkw&index=0

Try Franc on GitHub: http://github.com/wooorm/franc. It is written in JavaScript, so you can use it in a browser as well, and perhaps in Node.

  • franc supports more languages than any other library, or Google;
  • franc is easily forked to support 335 languages;
  • franc is just as fast as the competition.



