Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3

I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log into the master node and submit the following command

bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3>

It throws one of the following errors (not both at the same time). The first error is thrown when I don't replace the slashes in my secret key with %2F, and the second when I do replace them with %2F:

1) java.lang.IllegalArgumentException: Invalid hostname in URI S3://<ID>:<SECRETKEY>@<BUCKET>/<path-to-inputfile>
2) org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for / XML Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.

Note:

1) When I ran jps to see what daemons were running on the master, it showed only

1116 NameNode
1699 Jps
1180 JobTracker

with no DataNode or TaskTracker.

2) My secret key contains two forward slashes (/), which I replace with %2F in the S3 URI.

PS: The program runs fine on EC2 when run on a single node. It's only when I launch a cluster that I run into issues copying data between S3 and HDFS. Also, what does distcp do? Do I need to distribute the data even after I copy it from S3 to HDFS? (I thought HDFS took care of that internally.)

If you could direct me to a link that explains running MapReduce programs on a Hadoop cluster using Amazon EC2/S3, that would be great.

Regards,

Deepak.

Best answer

You can also use Apache Whirr for this workflow. Check the Quick Start Guide and the 5-minute guide for more info.

Disclaimer: I'm one of the committers.

Other answers

You probably want to use s3n:// URLs, not s3:// URLs. s3n:// means "a regular file, readable from the outside world, at this S3 URL", whereas s3:// refers to a Hadoop block filesystem mapped into an S3 bucket.
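
As a rough illustration of the difference (the bucket and path here are made up), the same filesystem command behaves very differently depending on the scheme:

# s3n:// reads and writes ordinary S3 objects, so Hadoop sees the same files your other tools do
hadoop fs -ls s3n://mybucket/input/

# s3:// treats the bucket as Hadoop's own block store; ordinary S3 objects uploaded
# by other tools will not show up here as regular files
hadoop fs -ls s3://mybucket/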

To avoid the URL-escaping issue with the keys (and to make life much easier), put them into the /etc/hadoop/conf/core-site.xml file:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>

There was at one point an outstanding issue with secret keys that contained a slash -- the URL was decoded in some contexts but not in others. I don't know if it's been fixed, but I do know that with the keys in the config file the problem goes away.
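
A minimal sketch, assuming the four properties above are in core-site.xml (the HDFS destination path is made up):

# no credentials -- and therefore no %2F escaping -- are needed in the URI any more
hadoop fs -cp s3n://<BUCKET>/<path-to-inputfile> /user/hadoop/input/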

Other quickies:

  • You can debug your problem most quickly using the hadoop filesystem commands, which work just fine on s3n:// (and s3://) URLs. Try hadoop fs -ls s3n://myhappybucket/ to check connectivity, or hadoop fs -cp s3n://myhappybucket/happyfile.txt /tmp/dest1, and even hadoop fs -cp /tmp/some_hdfs_file s3n://myhappybucket/will_be_put_into_s3
  • The distcp command runs a mapper-only job to copy a tree from there to here. Use it if you want to copy a very large number of files into HDFS; for everyday use, hadoop fs -cp src dest works just fine. (A short distcp sketch follows this list.)
  • You don't have to move the data into HDFS if you don't want to. You can pull all the source data straight from S3 and do all further manipulations targeting either HDFS or S3 as you see fit.
  • Hadoop can become confused if there is a file s3n://myhappybucket/foo/bar and a "directory" (many files with keys s3n://myhappybucket/foo/bar/something). Some old versions of the s3sync command would leave just such 38-byte turds in the S3 tree.
  • If you start seeing SocketTimeoutExceptions, apply the patch for HADOOP-6254. We were, and we did, and they went away.
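
Putting the distcp bullet above into concrete commands -- a minimal sketch with made-up HDFS paths, assuming the AWS keys are already in core-site.xml:

# mapper-only copy of a whole S3 tree into HDFS
hadoop distcp s3n://myhappybucket/logs hdfs:///user/hadoop/logs

# and back out to S3 once a job has finished
hadoop distcp hdfs:///user/hadoop/output s3n://myhappybucket/output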

Try using Amazon Elastic MapReduce. It removes the need to configure the Hadoop nodes, and you can just access objects in your S3 account in the way you expect.

Use

-Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key>

e.g.

hadoop distcp -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> <src> <dst>

or

hadoop fs -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -<command> <args>
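
For example, with a made-up bucket and HDFS path (the key placeholders stay as placeholders):

# pass the credentials on the command line instead of editing core-site.xml
hadoop distcp -Dfs.s3n.awsAccessKeyId=<your-key> \
    -Dfs.s3n.awsSecretAccessKey=<your-secret-key> \
    s3n://mybucket/input hdfs:///user/hadoop/input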



