Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?

I'm using Amazon Elastic MapReduce.

I have log files that look something like this:

   random text foo="1" more random text foo="2"
   more text notamatch="5" noise foo="1"
   blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...

How can I write a Pig expression to pick out all the numbers in the foo tags?

I'd prefer tuples that look something like this:

(1,2)
(1)
(1,3,4)

I tried the following:

TUPLES = foreach LINES generate FLATTEN(EXTRACT(line, 'foo="([0-9]+)"'));

But this yields only the first match in each line:

(1)
(1)
(1)
Answers

You could use STRSPLIT: http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#STRSPLIT

The regex to split on would be [^0-9]+ (i.e., anything that is not a digit). This splits on each run of non-digit characters, leaving only the numeric tokens.
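To see what splitting on [^0-9]+ produces for the sample lines, here is a small sketch in plain Python rather than Pig. It uses Python's re.split as a stand-in for STRSPLIT, which is an assumption: STRSPLIT is built on Java regexes, and its handling of empty leading or trailing tokens may differ. The helper name digit_tokens is made up for illustration.

```python
import re

def digit_tokens(line):
    # Split on every run of non-digit characters, then drop the empty
    # strings that appear when the line starts or ends with non-digits.
    return [tok for tok in re.split(r'[^0-9]+', line) if tok]

lines = [
    'random text foo="1" more random text foo="2"',
    'more text notamatch="5" noise foo="1"',
    'blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...',
]

for line in lines:
    print(digit_tokens(line))
# ['1', '2']
# ['5', '1']
# ['1', '3', '4']
```

The second line shows the trade-off: splitting keeps every number in the line, so the 5 from notamatch="5" comes through as well. If only the foo values are wanted, the pattern has to anchor on foo= rather than on digits alone.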

Another option is to write a Pig UDF.
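As a rough sketch of what such a UDF might look like, here is one written in Python for Pig's Jython UDF support. The function name extract_foos, the schema field names, and the file name mentioned below are all hypothetical. Pig supplies the outputSchema decorator when it loads the script; the fallback definition is only an assumption to let the file run under plain Python for local testing.

```python
import re

# Pig's Jython runtime provides outputSchema. This fallback (an assumption
# for local testing) makes the decorator a no-op outside of Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Same pattern as in the question: capture the digits inside foo="...".
FOO_PATTERN = re.compile(r'foo="([0-9]+)"')

@outputSchema("foos:bag{t:tuple(value:chararray)}")
def extract_foos(line):
    # Return every foo value in the line as a list of one-field tuples,
    # which Pig treats as a bag of tuples.
    if line is None:
        return []
    return [(value,) for value in FOO_PATTERN.findall(line)]
```

Assuming the script is saved as foo_udf.py, it could be registered with something like REGISTER 'foo_udf.py' USING jython AS foo_udf; and called as foo_udf.extract_foos(line). Note that the result is a bag such as {(1),(3),(4)} per line rather than a single tuple (1,3,4), so a further step would be needed if the flat tuple form is required.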

The REGEX_EXTRACT function may help you get the expected output.

REGEX_EXTRACT(input, 'foo=(.*)', 1) AS input;




