Spam detection in (Objective-)C

I'm currently writing an iPhone application which gets some data from the user and uploads it to a server. The uploaded data will be displayed to other users of the same program (there's more to it than that, but to keep the idea simple...). The uploaded data is basically just three strings: a name (max. 50 characters), a title (max. 50 characters) and some text (virtually unlimited length). What I need is a function, service or algorithm which can detect how valid the input is. It would have to be able to detect series of repetitive characters, certain illegal words, abnormal whitespace, etc. So my question is: is there a C or Objective-C library (built-in or open source) for this sort of data validation, or else, how would I go about doing this kind of check?

Here are two examples of good and bad data:

GOOD:

Name: "John Aaron Smith"  
Title: "Why am I still here?"  
Text: "Can anybody please help me? I'm feeling lonely!"

BAD:

Name: "f**k you, kldsanfklds"   
Title: "Only $99. Buy Now. Only $99"  
Text: "ndsaklgnvds lakævndsaklæfhadsæhdsjka fhdskjafhdskj lafhsdkhf. €#&/ #&()(/&%& ># €%€#% €#& hidosæahviædshvidshfiodsa. adsifjDSILFJIDSH 


















"

I know taking precautions for so many cases will be difficult, but this algorithm/library would just have to filter the worst spam. I will also be looking through the data before the final database submission, but of course the less spam there is, the easier I'll have it.

Yours, BEN.

EDIT: My most fluent language is Objective-C, but I'm also doing pretty well with C, and I have some knowledge of PHP and Java. Libraries/examples in other languages might be difficult for me to understand and translate into a valid iPhone language.

EDIT-EDIT: I'm not looking for something overly sophisticated. Just a simple way for me to do the rough cut.

Best answer

This is a very difficult problem to solve. I would not attempt to create my own spam detection; I would use a solution which already exists and has a good reputation, such as SpamAssassin.

Other answers

Have you seen Mollom? It has a bunch of developer libraries (PHP, Ruby, Perl, etc.) that communicate with the Mollom servers to determine the spamminess of an entry. It wouldn't be hard to translate one of those to Objective-C.

I've made something similar to what you want, but it's in PHP. All the text I deal with is entered behind a captcha, so what I'm blocking is useless comment spam similar to your bad example. Here's what I've got so far, which has been blocking a good 80% of the junk. It may block some valid text from people with bad spelling habits, but I prefer that over manually editing text.

  1. Check that the text is not empty, and verify that it's not all spaces.
  2. Check the length. I use a minimum of 3 characters.
  3. Check for series of matching characters, e.g. !!!!!! I allow no more than 3.
  4. Check for words longer than 15 characters, e.g. lakævndsaklæfhadsæhdsjka
  5. Convert a copy of the text to lowercase and run it through a dictionary of bad words.
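
Since the question asks for C, here is a rough sketch of how the five checks above might look in plain C. It is not the PHP code this answer describes: the names `rough_check` and `text_verdict`, the two-entry `BAD_WORDS` list and the `TEXT_ERROR` result are all made up for illustration, and only the thresholds (3, 3 and 15) come from the list. It also assumes UTF-8 input and the default "C" locale.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Thresholds taken from the list above; the names are hypothetical. */
enum { MIN_LENGTH = 3, MAX_RUN = 3, MAX_WORD = 15 };

/* Placeholder dictionary (assumption); a real list would be much longer. */
static const char *const BAD_WORDS[] = { "viagra", "casino" };

typedef enum {
    TEXT_OK,
    TEXT_EMPTY,          /* step 1: empty or whitespace only */
    TEXT_TOO_SHORT,      /* step 2: fewer than MIN_LENGTH bytes */
    TEXT_REPEATED_CHARS, /* step 3: same byte more than MAX_RUN times in a row */
    TEXT_LONG_WORD,      /* step 4: a word longer than MAX_WORD characters */
    TEXT_BAD_WORD,       /* step 5: contains a dictionary entry */
    TEXT_ERROR           /* out of memory */
} text_verdict;

text_verdict rough_check(const char *text)
{
    size_t len = strlen(text);
    size_t visible = 0;

    /* Steps 1 and 2: reject empty, whitespace-only and very short input. */
    for (size_t i = 0; i < len; i++)
        if (!isspace((unsigned char)text[i]))
            visible++;
    if (visible == 0)
        return TEXT_EMPTY;
    if (len < MIN_LENGTH)
        return TEXT_TOO_SHORT;

    /* Steps 3 and 4: one pass tracking the current run and word length. */
    size_t run = 0, word = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)text[i];

        /* Runs are compared byte by byte, so this catches repeated ASCII
           characters and repeated newlines, but not repeated "æ". */
        run = (i > 0 && c == (unsigned char)text[i - 1]) ? run + 1 : 1;
        if (run > MAX_RUN)
            return TEXT_REPEATED_CHARS;

        /* Skip UTF-8 continuation bytes so "æ" counts as one character. */
        if (isspace(c))
            word = 0;
        else if ((c & 0xC0) != 0x80 && ++word > MAX_WORD)
            return TEXT_LONG_WORD;
    }

    /* Step 5: lowercase a copy (ASCII letters only) and search it. */
    char *lower = malloc(len + 1);
    if (lower == NULL)
        return TEXT_ERROR;
    for (size_t i = 0; i <= len; i++)
        lower[i] = (char)tolower((unsigned char)text[i]);

    text_verdict verdict = TEXT_OK;
    for (size_t i = 0; i < sizeof BAD_WORDS / sizeof BAD_WORDS[0]; i++) {
        if (strstr(lower, BAD_WORDS[i]) != NULL) {
            verdict = TEXT_BAD_WORD;
            break;
        }
    }
    free(lower);
    return verdict;
}
```

From Objective-C you could call it as `rough_check([someString UTF8String])`. Note that the dictionary step is a plain substring match, so a short bad word will also match inside longer harmless words; matching on word boundaries would be stricter but takes more code.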

You could add to this by blocking text with suspicious characters, e.g. %^[]. Additionally, you could compile a list of characters that should never appear next to each other, e.g. fd, gf, kp, yt, vnd. Beyond that point you need to automate by adding to the algorithm. This would mean that the algorithm needs to understand some grammar, and the overall process will quickly grow in complexity. Anything else is beyond my comprehension at this point.
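
The two extra ideas (suspicious characters and unlikely letter pairs) could be sketched in C roughly like this. The function names `symbol_ratio` and `odd_pair_count` are hypothetical, the `ODD_PAIRS` list just reuses the two-letter examples from this answer, and whatever cutoff you compare the results against is something you would have to tune on your own data. Only ASCII is handled, so symbols such as "€" are not counted.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list built from the two-letter examples above. */
static const char *const ODD_PAIRS[] = { "fd", "gf", "kp", "yt" };

/* Fraction of non-whitespace bytes that are ASCII punctuation or symbols.
   Returns 0.0 for empty or whitespace-only input. */
double symbol_ratio(const char *text)
{
    size_t symbols = 0, visible = 0;

    for (const char *p = text; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isspace(c))
            continue;
        visible++;
        if (ispunct(c))
            symbols++;
    }
    return visible == 0 ? 0.0 : (double)symbols / (double)visible;
}

/* Number of adjacent letter pairs that appear in ODD_PAIRS,
   compared case-insensitively. */
size_t odd_pair_count(const char *text)
{
    size_t count = 0;

    for (const char *p = text; p[0] != '\0' && p[1] != '\0'; p++) {
        char pair[3] = {
            (char)tolower((unsigned char)p[0]),
            (char)tolower((unsigned char)p[1]),
            '\0'
        };
        for (size_t i = 0; i < sizeof ODD_PAIRS / sizeof ODD_PAIRS[0]; i++)
            if (strcmp(pair, ODD_PAIRS[i]) == 0)
                count++;
    }
    return count;
}
```

Treat both numbers as hints rather than verdicts: pairs like "yt" occur in ordinary words ("anything"), so a single hit should not be enough to reject a submission.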




