Regular Expression to extract words, names, hashtags, and phrases from tweets
  • Time: 2012-01-13 18:42:37
  • Tags:
  • c#
  • regex

I am working with tweets and trying to pull out words, names, hashtags, and phrases from them.

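I am assuming that names are several consecutive words that each start with a capital letter, hashtags are a # followed by anything except whitespace, phrases are anything in quotes, and words are just words.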

It would also be nice to pull out any links too, but that is not necessary.

I would like to use Regex, but if there is a better solution I would like to know about it.

Example tweet:

You know you watch a lot of Wes Anderson films when you see his new trailer and think, "Wait, where's the Futura font?" #MoviesILike http://bit.ly/HklUk

The Regex I have right now:

Regex _wordRegex = new Regex(@"(?:""(?<Item>.*?)"")|(?<Item>(?:[A-Z][a-z]*?[.\s])+)|(?<Item>#\S+)|(?<Item>\w+)");

Best answer

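I have worked with my fair share of Twitter data. I have found that the best approach is to split the message on whitespace and then analyze each token individually. This works really well. Take a look: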

@bobjones let's go watch the game at @hooters #nfl #broncos #tebow

For the @ and # tokens, you only have to check the first character. For URLs, you may want to do something with a regex. Basically:

if token[0] == '@' then mention
else if token[0] == '#' then hashtag
else if token looks like a url then url
else then word
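
The pseudocode above could be sketched in C# roughly as follows. This is only an illustration: the TweetTokenizer class, the TokenKind enum, and the LooksLikeUrl helper are hypothetical names, and the URL check uses Uri.TryCreate restricted to http/https as a stand-in assumption rather than the regex the answer alludes to.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Mention, Hashtag, Url, Word }

// Hypothetical helper class, not part of the original answer.
public static class TweetTokenizer
{
    // Splits the tweet on whitespace and classifies each token by its first character.
    public static List<KeyValuePair<TokenKind, string>> Classify(string tweet)
    {
        var result = new List<KeyValuePair<TokenKind, string>>();

        // A null separator array makes Split use whitespace as the delimiter.
        string[] tokens = tweet.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            TokenKind kind;
            if (token[0] == '@') kind = TokenKind.Mention;
            else if (token[0] == '#') kind = TokenKind.Hashtag;
            else if (LooksLikeUrl(token)) kind = TokenKind.Url;
            else kind = TokenKind.Word;

            result.Add(new KeyValuePair<TokenKind, string>(kind, token));
        }
        return result;
    }

    // Assumption: only absolute http/https URIs count as URLs.
    private static bool LooksLikeUrl(string token)
    {
        Uri uri;
        return Uri.TryCreate(token, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Calling TweetTokenizer.Classify on the example tweet above should tag @bobjones and @hooters as mentions, the three # tokens as hashtags, and everything else as words. Tokens keep any punctuation attached to them, so trimming it is left to the caller.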

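I don't think there is any need to overcomplicate things in this case, especially since you are extracting several different kinds of items from the same text.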

You mentioned phrases in quotes... you may want to handle that as a special case of tokenization.
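
One way to treat quoted phrases as a special case of tokenization is sketched below, assuming that only straight double quotes delimit a phrase and that quotes do not nest. The QuoteAwareTokenizer class and its Tokenize and Flush methods are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class, not part of the original answer.
public static class QuoteAwareTokenizer
{
    // Whitespace tokenization in which a double-quoted span is kept as a single token.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in text)
        {
            if (c == '"')
            {
                // A quote ends whatever token is in progress (word or phrase).
                Flush(tokens, current);
                inQuotes = !inQuotes;
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                // Outside quotes, whitespace separates tokens.
                Flush(tokens, current);
            }
            else
            {
                current.Append(c);
            }
        }

        // Emit the trailing token, including the contents of an unterminated quote.
        Flush(tokens, current);
        return tokens;
    }

    private static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
            current.Clear();
        }
    }
}
```

This sketch returns phrases and plain words in the same list without marking which is which. If that distinction matters, the method would need to record whether each token came from inside quotes.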

Answers

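I found that the answer above works well as long as there is no punctuation or other cruft right up against your hashtags. For example, #programming tokenizes successfully, but when a hashtag is followed directly by punctuation, as in #programming, the split produces an incorrectly identified hashtag: #programming,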

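There are several ways to solve this. I suggest an iterative approach that inspects each character. It comes at the cost of being slower, but it is more accurate.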

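using System.Collections.Generic;
using System.Text;

string raw = "hello this is #Totally #Awesome, right? #yeah!";
List<string> hashtags = new List<string>();
StringBuilder sb = null;
bool track = false;  // true while we are inside a hashtag

foreach (char c in raw.ToLower())
{
    if (c == '#')
    {
        // Start (or restart) collecting a hashtag; the '#' itself is not kept.
        sb = new StringBuilder();
        track = true;
    }
    else if (track)
    {
        if (char.IsLetterOrDigit(c))
        {
            sb.Append(c);
        }
        else
        {
            // Any other character (space, punctuation) ends the hashtag.
            hashtags.Add(sb.ToString());
            track = false;
        }
    }
}

if (track)
{
    hashtags.Add(sb.ToString());  // Make sure to grab the last one!
}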

It strips the hash symbol (so you won't end up with ###### or something like that), but you should get:

totally, awesome, yeah
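
Since the question asks about Regex, a regex-based alternative to the loop above might look like the following. This is a different technique from the one in this answer, shown only as a sketch. The HashtagRegex class and its Extract method are hypothetical names, and the character class [\p{L}\p{Nd}] is an assumption chosen to approximate char.IsLetterOrDigit.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class HashtagRegex
{
    // Returns lowercased hashtags without the leading '#'.
    public static List<string> Extract(string raw)
    {
        var hashtags = new List<string>();

        // '#' followed by one or more letters or decimal digits;
        // group 1 captures the tag text, so punctuation and the '#' are excluded.
        foreach (Match m in Regex.Matches(raw.ToLower(), @"#([\p{L}\p{Nd}]+)"))
        {
            hashtags.Add(m.Groups[1].Value);
        }
        return hashtags;
    }
}
```

Because the pattern requires at least one letter or digit after the #, a lone # yields no entry here, whereas the loop above would add an empty string in that case.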




