I have a process that goes something like this:
- Read a row from a csv file.
- Do some transformations on it.
- Break it up into the actual rows as they would be written to the database.
- Write those rows to individual csv files.
- Go back to step 1 unless the file has been totally read.
- Run SQL*Loader and load those files into the database.
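For concreteness, the steps above can be sketched roughly as follows. This is only a sketch: the `transform` and `split_into_db_rows` bodies are hypothetical stand-ins, since the actual transformations aren't described here.

```python
import csv

def transform(row):
    # Hypothetical stand-in for step 2's transformations.
    return [field.strip() for field in row]

def split_into_db_rows(row):
    # Hypothetical stand-in for step 3: one source row becomes
    # several rows as they would be written to the database.
    return [[row[0], value] for value in row[1:]]

def convert(in_path, out_path):
    # Steps 1-5: stream the source file row by row and write the
    # database-ready rows out to a load file.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:                                # steps 1 and 5
            db_rows = split_into_db_rows(transform(row))  # steps 2 and 3
            writer.writerows(db_rows)                     # step 4
    # Step 6 (SQL*Loader) runs afterwards, outside this script.
```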
Step 6 certainly takes a lot of time, but it appears to be step 4 that accounts for most of the time. For the most part, I'll be optimizing this to handle a set of records covering several million low-income individuals, running on a quad-core server with some kind of RAID setup.
I've had a few ideas for attacking the problem:
- Read the entire file from step 1 (or at least read it in very large chunks) and write the output to disk as a whole or in very large chunks. The idea is that the hard disk would spend less time seeking back and forth between files. Would this do anything that buffering wouldn't?
- Parallelize steps 1, 2&3, and 4 into separate processes. This would let steps 1, 2, and 3 proceed without waiting on step 4 to complete.
- Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with idea 2 somehow.
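On idea 1: in Python, one way to approximate "write in very large chunks" without restructuring the code is simply to give the output file a large user-space buffer, so the process hands the OS a few large writes instead of many small ones. A minimal sketch (the 16 MiB size is an arbitrary assumption, and `write_buffered` is a hypothetical helper):

```python
import csv

# Idea 1 as a buffer-size change: 16 MiB is an arbitrary choice;
# the right size would have to come from measurement.
BUF = 16 * 1024 * 1024

def write_buffered(rows, out_path):
    # The buffering argument sets the size of the underlying buffer,
    # so writerows() results in few, large writes to disk.
    with open(out_path, "w", newline="", buffering=BUF) as f:
        csv.writer(f).writerows(rows)
```

Whether this beats the default buffering is exactly the kind of thing that needs measuring; it mainly helps if the bottleneck really is many small interleaved writes rather than CPU time in the transformation.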
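Ideas 2 and 3 combined might look something like the following sketch, using `multiprocessing` to transform independent chunks in parallel. The chunk size, output file names, and the `transform_and_split` body are all assumptions for illustration; each worker writes its own output file so no locking is needed, and SQL*Loader can then be pointed at the resulting set of files.

```python
import csv
import itertools
import multiprocessing as mp

def transform_and_split(row):
    # Hypothetical stand-in for steps 2-3: one input row becomes
    # several database-ready rows.
    return [[row[0], value] for value in row[1:]]

def process_chunk(args):
    chunk_id, rows = args
    # One output file per chunk, so workers never contend on a file.
    out_path = f"load_{chunk_id}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerows(transform_and_split(row))
    return out_path

def chunks(reader, size):
    # Yield (chunk_id, list_of_rows) blocks from a row iterator.
    for i in itertools.count():
        block = list(itertools.islice(reader, size))
        if not block:
            return
        yield i, block

def run(in_path, chunk_size=50_000):
    # Reading stays sequential; transformation and writing fan out
    # across one worker per core.
    with open(in_path, newline="") as f, mp.Pool() as pool:
        return pool.map(process_chunk, chunks(csv.reader(f), chunk_size))
```

Since the rows don't need any sequential order, the only coordination point is collecting the list of output file names to feed to SQL*Loader afterwards.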
Of course, the real answer to this question is "test it and find out what's fastest." However, I'm mainly trying to get a sense of where I should spend my time first. Does anyone with more experience have advice on any of these ideas?