Using grep or other tools for searching a large CSV file

I have a 4 GB CSV file that I'm trying to search with a set of keywords. I have a file containing the keywords I'm searching for (those keywords will appear in the first column of the big CSV).

I tried the following, but it ended up taking over an hour. I need the tr to get rid of the Windows carriage returns.

LC_ALL=C grep -F -i -f <(tr -d '\r' < keywords.csv) big_csv.csv > output.csv

Am I getting the best performance I can here? Is there something I'm missing, or a better way to do this with grep or another tool? I even thought about splitting the big file up first, so that I could match a keyword against file names and then grep only the relevant file. Is there a best practice for this? I'm trying to keep this POSIX-compliant.

As requested, here is some sample data.

ADLV,-1.741774,0.961072,-0.751392,-0.935572,-2.269994,1.081103,-0.831244,1.540083,0.474326,-1.322924,2.199037,-0.919939,0.641496,-0.584152,0.729028,0.608351,-0.522026,0.966026,-0.793949,-1.623368,1.16177,-0.642438,-0.675811,-0.214964,-2.263053,2.188642,0.302449,0.770106

There will be many rows like this one.

Each row has more data than this, but it was too long to include in full.

The keywords file looks like this:

ADLV
ADVG

At most, the keywords CSV will have around 1,000 keywords. Each keyword is 4 letters.

Here is a gist with sample data: https://gist.github.com/fishnibble/9d95658c352a1acab3cec3e965defb3f

Best answer

Using any awk, this sounds like what you need:

awk -F, 'NR==FNR{sub(/\r$/,""); keys[$1]; next} $1 in keys' keywords.csv big_csv.csv

The sub(/\r$/,"") is there because you have tr -d '\r' < keywords.csv in your code; if you don't have DOS line endings in your keywords file, then you don't need it.
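A quick way to check whether a file really has DOS line endings (my own sketch, not from the answer; the file names are made up, and od -c is POSIX):

```shell
# CRLF-terminated files show a literal \r before each \n in a character dump.
printf 'ADLV\r\nADVG\r\n' > dos_keywords.csv
od -c dos_keywords.csv                    # look for \r \n pairs

# tr -d '\r' (as used in the question) removes the carriage returns:
tr -d '\r' < dos_keywords.csv > unix_keywords.csv
od -c unix_keywords.csv                   # now just \n at the end of each line
```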

I also see you have -i in your grep command, which implies you need the matching to be case-insensitive. If so, this is what you'd need, again using any awk:

awk -F, '{key=tolower($1); sub(/\r$/,"",key)} NR==FNR{keys[key]; next} key in keys' keywords.csv big_csv.csv

By the way, the pipeline you tried would not only take a long time to finish, it would also produce incorrect output, because it searches the whole line instead of just the first field. It would produce false matches if a keyword happened to appear somewhere else on a line, and also if one of your keywords is a substring of some other keyword.
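To make that false-match problem concrete, here is a small reproduction (file names made up for the demo): a keyword that appears in a later field is matched by grep -F but correctly ignored by the field-anchored awk.

```shell
printf 'ADLV\n' > demo_keywords.csv
printf 'WXYZ,0.5,ADLV,0.9\n' > demo_big.csv

# grep searches the whole line: false positive (ADLV is in field 3, not field 1)
grep -F -f demo_keywords.csv demo_big.csv

# awk compares only the first field: no match, which is correct here
awk -F, 'NR==FNR{keys[$1]; next} $1 in keys' demo_keywords.csv demo_big.csv
```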

Other answers

Here's how you'd do it with awk. I'd be curious to hear whether the run time is longer or shorter.

awk -F, '
  NR==FNR { words[$1]; next }
  $1 in words
' keywords.csv big_csv.csv > output.csv

It reads the first file into an array using the usual "is this the first file" test, then checks the first field of each line of the second file against the keys of that array.

I can update this further if you add more detail to your question.
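If you do want to stay with grep, one option (my own sketch, not from either answer) is to turn each keyword into a pattern anchored to the start of the line and followed by a comma. That removes both kinds of false match, at the cost of losing the speed of fixed-string (-F) matching; it is safe here only because the 4-letter keywords contain no regex metacharacters.

```shell
printf 'ADLV\r\n' > keywords.csv                      # demo stand-in for the real file
printf 'ADLV,1.0,2.0\nXADLV,3.0,4.0\n' > big_csv.csv  # demo stand-in for the real file

# Strip any \r, then build anchored patterns like ^ADLV,
tr -d '\r' < keywords.csv | sed -e 's/^/^/' -e 's/$/,/' > anchored.txt
grep -f anchored.txt big_csv.csv > output.csv
```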

Assumptions:

  • first field (both files) does not contain commas and is not wrapped in double quotes
  • data from keywords.csv may have Windows/DOS line endings (\r\n)

Setup:

$ cat keywords.csv                         # run through unix2dos to add "\r\n"
ADLV
ADVG

$ cat big_csv.csv
ADLV,-1.741774,0.961072,-0.751392,-0.935572,-2.269994,1.081103,-0.831244,1.540083,0.474326,-1.322924,2.199037,-0.919939,0.641496,-0.584152,0.729028,0.608351,-0.522026,0.966026,-0.793949,-1.623368,1.16177,-0.642438,-0.675811,-0.214964,-2.263053,2.188642,0.302449,0.770106
WXYZ,-1.741774,0.961072,-0.751392,-0.935572,-2.269994,1.081103,-0.831244,1.540083,0.474326,-1.322924,2.199037,-0.919939,0.641496,-0.584152,0.729028,0.608351,-0.522026,0.966026,-0.793949,-1.623368,1.16177,-0.642438,-0.675811,-0.214964,-2.263053,2.188642,0.302449,0.770106

One awk approach:

awk -F,                                    # input field delimiter is a comma; for the 1st file this means the entire line == the 1st field
'FNR==NR { sub(/\r$/,""); a[$0]; next }    # 1st file: strip off "\r", save line as an index in array a[]; skip to next line of input (from 1st file)
 $1 in a                                   # 2nd file: if the 1st field is an index in array a[] then print the current line to stdout
' keywords.csv big_csv.csv > output.csv

This generates:

$ cat output.csv
ADLV,-1.741774,0.961072,-0.751392,-0.935572,-2.269994,1.081103,-0.831244,1.540083,0.474326,-1.322924,2.199037,-0.919939,0.641496,-0.584152,0.729028,0.608351,-0.522026,0.966026,-0.793949,-1.623368,1.16177,-0.642438,-0.675811,-0.214964,-2.263053,2.188642,0.302449,0.770106

This can also be done in Ruby:

ruby -e '
require "set"
# split(/\R/) works the same with DOS or Unix line endings
keys=File.open(ARGV[0]).read.split(/\R/).map(&:downcase).to_set
File.open(ARGV[1]).each_line{|line|
    tst=line.split(/,/,2)[0].downcase
    puts line if keys.include?(tst)
}
' sample_keyword.csv sample_input.csv >out.csv

There is no real advantage to Ruby over the simpler awk example here. In fact, if the actual use is just as you describe, I would do this in awk: it is faster, though often not by much.

However, if you need to do something like producing the output in a different format (JSON, XML, a more complex CSV), that is easier in Ruby and more challenging in awk.
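To illustrate that point, here is a sketch of mine (not part of the answer) that emits the matches as JSON; the inline strings stand in for the keywords file and the big CSV:

```ruby
require "json"
require "set"

keywords = "ADLV\nADVG\n"                        # stand-in for keywords.csv
data     = "ADLV,-1.74,0.96\nWXYZ,0.5,0.9\n"     # stand-in for big_csv.csv

keys = keywords.split(/\R/).map(&:downcase).to_set
rows = data.each_line.filter_map do |line|
  fields = line.chomp.split(",")
  # Keep only rows whose first field is a keyword; parse the rest as floats
  { "key" => fields[0], "values" => fields[1..].map(&:to_f) } if keys.include?(fields[0].downcase)
end
puts JSON.generate(rows)   # a JSON array with one object per matching row
```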

You can also replicate curl inside Ruby and read your gist example directly:

ruby -e '
require "net/http"
require "uri"
require "set"

uri1 = URI.parse("https://gist.githubusercontent.com/fishnibble/9d95658c352a1acab3cec3e965defb3f/raw/21fc5153a0b78cdb3eab88c72d700cdf74f20ae7/sample_keyword.csv")
keys = Net::HTTP.get(uri1).split(/\R/).map(&:downcase).to_set

# This could be done in a streaming mode for huge data...
uri2 = URI.parse("https://gist.githubusercontent.com/fishnibble/9d95658c352a1acab3cec3e965defb3f/raw/21fc5153a0b78cdb3eab88c72d700cdf74f20ae7/sample_input.csv")
Net::HTTP.get(uri2).split(/\R/).each{|line|
    tst=line.split(/,/,2)[0].downcase
    puts line if keys.include?(tst)
}
' >out.csv

That is exactly the kind of thing that is easy in Ruby and hard to do in awk.




