Question

I have a file, File1, which contains this data:

NC_009066   5239    5308    trnA(tgc)   2.10899859667e-09   -
NC_009066   5309    5382    trnN(gtt)   7.03000463545e-10   -
NC_009066   5422    5487    trnC(gca)   7.09999799728e-08   -
NC_009066   5487    5557    trnY(gta)   3.72200156562e-11   -
NC_009066   5549    7097    cox1    291081744.81    +
NC_009066   7109    7180    trnS2(tga)  1.83000043035e-09   -
NC_009066   7183    7256    trnD(gtc)   2.5720000267e-09    +

and another file in FASTA format, File2:

> NC_009066,1,0-17045,
GCTATCGTAGCTTAATTAAAGCATAACACTGAAGATGTTAAGATGAACCCTAGAAA

I read File1 into an array, line by line, so that I can split each line on /\s+/ and access each column.

for $line (@array) {
    @column = split(/\s+/, $line);
    # print $column[5] . "\n";

    # $seq is the sequence extracted from File2
    $gene = substr($seq, $column[1], $column[2]);
}

What I actually want to do is take the second column from the first line together with the third column from the second line (substr($seq,5239,5382)), then the second column from the second line with the third column from the third line (substr($seq,5309,5487)), and so on. What is the best way to do it?

Answer 1

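First, note that the default behaviour of split is to split $_ on whitespace, discarding leading and trailing empty fields. Most often this is what you want, and split /\s+/ is unnecessary. If you want to invoke the default split on a variable other than $_, you must pass a single literal space, not a regex, as the pattern argument, for example split ' ', $line.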
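To make the difference between the two forms concrete, here is a minimal sketch. The padded input line is made up for illustration and is not taken from File1.

```perl
use strict;
use warnings;

# Hypothetical line with leading and trailing whitespace
my $line = "  NC_009066   5239    5308  ";

# A single literal space: leading whitespace is skipped
my @special = split ' ', $line;

# A regex: the leading whitespace produces an empty first field
my @regex = split /\s+/, $line;

print scalar(@special), " fields: @special\n";
print scalar(@regex), " fields, first is '$regex[0]'\n";
```

With the literal-space form the column indices stay the same whether or not a line happens to start with whitespace.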



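I suggest you first use map to create an array containing only the data from the second and third columns.

Then you can loop over the data, pulling out the start and end values, and extract the gene from the sequence.

The code looks like this: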

use strict;
use warnings;

open my $fh, '<', 'f1.txt' or die $!;

my @data = map [ (split)[1,2] ], <$fh>;

my $seq = 'GCTATCGTAGCTTAATTAAAGCATAACACTGAAGATGTTAAGATGAACCCTAGAAA';

for my $i (1 .. $#data) {
  my ($start, $end) = ( $data[$i-1][0], $data[$i][1] );
  my $gene = substr($seq, $start, $end - $start);
  print "$gene\n";
}


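Note that the loop runs from index 1 (the second element of the array) to $#data (the last element). This is because the body of the loop pairs the first column of the previous element with the second column of the current element, and the first element has no previous element.

Note also that you may need to adjust the parameters to substr, as I don't know whether your indices are zero-based or one-based, or whether the end index includes the character at that position.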

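For example, with $start = 1; $end = 2, the call above applied to a string beginning ATC returns T, whereas the coordinates may actually have been meant to refer to A, AT, or TC.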
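If the coordinates turn out to be one-based and inclusive at both ends (an assumption; the question does not say), the adjustment could be sketched as follows. The helper name region is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: extract a region given 1-based, inclusive
# start and end coordinates (an assumption about the input file).
sub region {
    my ($seq, $start, $end) = @_;
    # Shift the start to a 0-based offset; inclusive ends add 1 to the length
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = 'GCTATCGTAG';            # first ten bases of the sample sequence
print region($seq, 1, 2), "\n";    # GC
print region($seq, 3, 6), "\n";    # TATC
```

For zero-based, end-exclusive coordinates no adjustment is needed beyond $end - $start as the length.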

Answer 2

You have already figured everything out; you are just using substr incorrectly. The synopsis in perldoc -f substr (http://p3rl.org/substr) says:

substr EXPR,OFFSET,LENGTH

But you are giving it two offsets. Instead, subtract one offset from the other to calculate the correct length argument.
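A small sketch of the difference, using a made-up string and treating both numbers as zero-based offsets with an exclusive end (an assumption for illustration):

```perl
use strict;
use warnings;

my $seq = 'ABCDEFGHIJ';      # made-up sequence, not from File2
my ($from, $to) = (2, 5);    # two zero-based offsets, end exclusive

# Passing the second offset directly: 5 is interpreted as a length
my $wrong = substr($seq, $from, $to);

# Subtracting the offsets gives the intended length, 5 - 2 = 3
my $right = substr($seq, $from, $to - $from);

print "$wrong\n";    # CDEFG
print "$right\n";    # CDE
```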

Answer 3

Use a two-dimensional array:

for (my $i = 0; $i < scalar(@array); ++$i) {
    $$table[$i] = [ split(/\s+/, $array[$i]) ];
}

# you may put this into a loop
$start = $$table[0][1];
$length = $$table[1][2] - $$table[0][1];   # a length, not an end offset
$gene = substr($seq, $start, $length);

See also perllol (http://perldoc.perl.org/perllol.html).
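One way the two-dimensional approach might be put into a loop is sketched below. The rows and sequence are invented so the result is easy to check by eye, and the coordinates are assumed to be zero-based with an exclusive end.

```perl
use strict;
use warnings;

# Made-up rows in the same column layout as File1
my @array = (
    "ID 2  5  geneA 0 -",
    "ID 6  9  geneB 0 -",
    "ID 10 13 geneC 0 +",
);
my $seq = 'ABCDEFGHIJKLMNOP';    # made-up sequence

# Build the table: one array reference of columns per line
my @table = map { [ split ' ', $_ ] } @array;

my @genes;
for my $i (0 .. $#table - 1) {
    my $start  = $table[$i][1];               # 2nd column of this line
    my $length = $table[$i + 1][2] - $start;  # 3rd column of the next line
    push @genes, substr($seq, $start, $length);
}

print "$_\n" for @genes;
```

The loop stops one row before the end because the last line has no following line to pair with.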
