Question

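I'm relatively new to Perl and have only used it for converting small files into different formats and feeding data between programs.

Now I need to step it up a little. I have a file of DNA data that is 5,905 lines long, with 32 fields per line. The fields are not delimited by anything and vary in length within a line, but each field is the same size on all 5,905 lines.

I need each line of the file read into a separate array, and each field within the line stored as its own variable. I have no problem storing one line, but I'm having difficulty storing each line in turn through the entire file.

This is how I separate the first line of the full array into individual variables: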

my $SampleID = substr("@HorseArray", 0, 7);
my $PopulationID = substr("@HorseArray", 9, 4);
my $Allele1A  = substr("@HorseArray", 14, 3);
my $Allele1B = substr("@HorseArray", 17, 3);
my $Allele2A  = substr("@HorseArray", 21, 3);
my $Allele2B = substr("@HorseArray", 24, 3);

......

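My questions are: (1) I need to store each of the 5,905 lines as a separate array. (2) I need to be able to reference each line by its sample ID, or a group of lines by population ID, and sort them.

I can sort and manipulate the data fine once it is defined in variables. What I'm having trouble with is building a multidimensional array from the fields so that I can reference each line at will. Any help or direction is much appreciated. I've pored over the Q&A sections here, but have not found an answer to my question yet.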

Answer 1

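Do not store each line in its own array. You need to build a data structure. Start by reading the Perl tutorials on references and nested data structures (perlreftut and perldsc in perldoc are the usual starting points).

Here's some starter code: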

use strict;
use warnings;

# Array of data samples. We could use a hash as well; which is better 
# depends on how you want to use the data.
my @sample;

while (my $line = <DATA>) {
    chomp $line;

    # Parse the input line
    my ($sample_id, $population_id, $rest) = split(/\s+/, $line, 3);

    # extract A/B allele pairs
    my @pairs;
    while ($rest =~ /(\d{1,3})(\d{3})|(\d{1,3}) (\d{1,2})/g) {
        push @pairs, {
            A => defined $1 ? $1 : $3,
            B => defined $2 ? $2 : $4,
        };
    }

    # Add this sample to the list of samples. Store it as a hashref so
    # we can access attributes by name
    push @sample, {
        sample     => $sample_id,
        population => $population_id,
        alleles    => \@pairs,   # store a reference; a bare @pairs would flatten into the hash
    };
}


# Print out all the values of alleles 2A and 2B for the samples in
# population py18. Note that array indexing starts at 0, so allele 2
# is at index 1.
foreach my $sample (grep { $_->{population} eq 'py18' } @sample) {
    printf("%s: %d / %d\n",
        $sample->{sample},
        $sample->{alleles}[1]{A},
        $sample->{alleles}[1]{B},
    );
}

__DATA__
00292-97 py17 97101 129129 152164 177177 100100 134136 163165 240246 105109 124124 166166 292292 000000 000000 000000
00293-97 py18 89 97 129139 148154 179179 84 90 132134 167169 222222 105105 126128 164170 284292 000000 000000 000000
00294-97 py17 91 97 129133 152154 177183 100100 134140 161163 240240 103105 120128 164166 290292 000000 000000 000000
00295-97 py18 97 97 131133 148162 177179 84100 132134 161167 240252 111111 124128 164166 284290 000000 000000 000000
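
If you mostly look samples up by ID, the hash variant mentioned in the comment at the top of the code could look roughly like the sketch below. The names @records and %sample_by_id are made up for illustration, and the records are hard-coded (trimmed to two allele pairs each) in the same shape the loop above builds.

```perl
use strict;
use warnings;

# Hypothetical records in the same shape the parsing loop produces,
# trimmed to two allele pairs each to keep the example short.
my @records = (
    { sample => '00292-97', population => 'py17',
      alleles => [ { A => 97, B => 101 }, { A => 129, B => 129 } ] },
    { sample => '00293-97', population => 'py18',
      alleles => [ { A => 89, B => 97 },  { A => 129, B => 139 } ] },
    { sample => '00295-97', population => 'py18',
      alleles => [ { A => 97, B => 97 },  { A => 131, B => 133 } ] },
);

# Index the same hashrefs by sample ID for direct lookup.
my %sample_by_id = map { ( $_->{sample} => $_ ) } @records;

# Look up one sample by its ID and print allele pair 1 (index 0).
my $one = $sample_by_id{'00293-97'};
print "$one->{sample} allele 1: $one->{alleles}[0]{A} / $one->{alleles}[0]{B}\n";

# Sort by population, then by allele 2A (index 1) descending within a population.
my @sorted = sort {
    $a->{population} cmp $b->{population}
        || $b->{alleles}[1]{A} <=> $a->{alleles}[1]{A}
} @records;

print join(' ', map { $_->{sample} } @sorted), "\n";
```

Both structures hold references to the same hashes, so keeping the array for ordered work and the hash for lookups costs very little extra memory.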

Answer 2

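I'd read the file line by line, parse each line into a hash of fields, and then build one hash per index whose values reference those per-line hashes.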

my %by_sample_id;           # this will be a hash of hashes
my %by_population_id;       # a hash of lists of hashes
while (<FILEHANDLE>) {      # reads one line at a time into $_
    chomp;  # remove newline
    my %h;  # new hash
    $h{SampleID} = substr($_, 0, 7);
    $h{PopulationID} = substr($_, 9, 4);
    # etc...

    $by_sample_id{ $h{SampleID} } = \%h;   # a reference to %h
    push @{$by_population_id{ $h{PopulationID} }}, \%h;  # pushes hashref onto list
}

Then you can use the indexes to get at the data you're interested in:

use feature 'say';   # say requires Perl 5.10 or later

say "Allele1A for sample 123123: ", $by_sample_id{123123}->{Allele1A};
say "all the Allele1A values for population 432432: ",
     join(", ", map {$_->{Allele1A}} @{$by_population_id{432432}});
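
For completeness, here is one way the whole thing might fit together, using unpack instead of repeated substr calls. This is only a sketch: the two input records are invented and laid out to match the substr offsets in the question (ID at column 0, population at column 9, alleles from column 14), and the template and field list cover only the first six fields.

```perl
use strict;
use warnings;

# Two made-up fixed-width records matching the offsets in the question.
my $input = <<'END';
1234567  pop1 097101 129129
7654321  pop2 089097 129139
END

# Field names and a matching unpack template: A = text field, x = skip a column.
my @names    = qw(SampleID PopulationID Allele1A Allele1B Allele2A Allele2B);
my $template = 'A7 x2 A4 x A3 A3 x A3 A3';

my (%by_sample_id, %by_population_id);

# Read from an in-memory filehandle so the example is self-contained.
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my %h;
    @h{@names} = unpack $template, $line;      # hash slice: one field per name

    $by_sample_id{ $h{SampleID} } = \%h;                    # hash of hashes
    push @{ $by_population_id{ $h{PopulationID} } }, \%h;   # hash of lists of hashes
}
close $fh;

print "Allele1A for sample 1234567: $by_sample_id{'1234567'}{Allele1A}\n";
for my $pop (sort keys %by_population_id) {
    my @ids = map { $_->{SampleID} } @{ $by_population_id{$pop} };
    print "$pop: @ids\n";
}
```

Extending the template and @names to all 32 fields keeps the column layout in one place, which is easier to check against the file than 32 separate substr calls.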

Answer 3

I'm going to assume this isn't a one-off program, so my approach would be slightly different. I've done a fair amount of data-mashing, and after a while, I get tired of writing queries against data structures.

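So:

I would load the data into a SQLite database (or another SQL database) and then write Perl queries against it using Perl DBI. This pushes the complexity well past a simple "parse and hack", but after you've written several scripts that query the same data, it becomes obvious that this is a pain and that there must be a better way.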

You would have a schema that looks similar to this:

create table brians_awesome_data (id integer, population_id varchar(32), chunk1 integer, chunk2 integer...);

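Then, after using some of mob's and Michael's excellent parsing, you would loop over the lines and do some INSERT INTO statements against your awesome data table.

After that, you can use the database's CLI to run your "select ..." queries and quickly get the data you need.

Or, if the work is more analysis- or pipeline-heavy, you can write a Perl script with DBI and feed the data into your analysis routines.

Believe me, this is a better way to do it than writing queries against in-memory data structures over and over.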
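
A minimal sketch of that workflow, assuming the DBD::SQLite driver is installed. The table name, column names, and rows are invented for illustration; in practice the rows would come from one of the parsers in the other answers.

```perl
use strict;
use warnings;
use DBI;   # needs the DBD::SQLite driver installed

# In-memory database so the sketch leaves no file behind.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Hypothetical table and columns, modelled loosely on the schema above.
$dbh->do(q{
    CREATE TABLE samples (
        sample_id     TEXT PRIMARY KEY,
        population_id TEXT,
        allele1a      INTEGER,
        allele1b      INTEGER
    )
});

# Rows that a parser would normally produce; hard-coded here.
my @rows = (
    [ '00292-97', 'py17', 97, 101 ],
    [ '00293-97', 'py18', 89,  97 ],
    [ '00295-97', 'py18', 97,  97 ],
);

# Prepare once, execute once per row.
my $insert = $dbh->prepare('INSERT INTO samples VALUES (?, ?, ?, ?)');
$insert->execute(@$_) for @rows;

# Let the database do the filtering and sorting.
my $py18 = $dbh->selectall_arrayref(
    'SELECT sample_id, allele1a, allele1b FROM samples
      WHERE population_id = ? ORDER BY allele1a DESC',
    undef, 'py18',
);

printf "%s: %d / %d\n", @$_ for @$py18;

$dbh->disconnect;
```

Swapping the ':memory:' name for a file path keeps the data between runs, so later scripts can skip the parsing step entirely.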
