I don't know how portable GNU Gawk's filefuncs extension is. The basic syntax is
time gawk -e '@load "filefuncs"; BEGIN {
    fnL[1] = ARGV[ARGC-1];
    fts(fnL, FTS_PHYSICAL, arr); print "";
    for (fn0 in arr) {
        print arr[fn0]["path"] " :: " arr[fn0]["stat"]["size"]; };
    print ""; }' genieMV_204583_1.mp4
genieMV_204583_1.mp4 :: 259105690
real 0m0.013s
ls -Aln genieMV_204583_1.mp4
---------- 1 501 20 259105690 Jan 25 09:31 genieMV_204583_1.mp4
That syntax allows checking multiple files at once. For a single file, it's
time gawk -e '@load "filefuncs"; BEGIN {
    stat(ARGV[ARGC-1], arr);
    printf("\n%s :: %s\n", arr["name"], arr["size"]); }' genieMV_204583_1.mp4
genieMV_204583_1.mp4 :: 259105690
real 0m0.013s
That is hardly any incremental saving over the fts() version, and admittedly it is slightly slower than calling stat directly:
time stat -f %z genieMV_204583_1.mp4
259105690
real 0m0.006s (BSD-stat)
time gstat -c %s genieMV_204583_1.mp4
259105690
real 0m0.009s (GNU-stat)
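Since the BSD and GNU flavors of stat take different flags, a small wrapper can hide the difference. This is only a sketch: the function name file_size is hypothetical, and the wc -c fallback is an assumption for systems where neither stat form works.

```shell
# Hypothetical helper: print the size of a file in bytes.
file_size() {
    # GNU coreutils form first; BSD stat rejects -c, so we fall through.
    stat -c %s -- "$1" 2>/dev/null && return
    # BSD/macOS form.
    stat -f %z -- "$1" 2>/dev/null && return
    # Last resort (assumption): count bytes with wc and strip any padding.
    wc -c < "$1" | tr -d ' '
}
```

Usage would be file_size genieMV_204583_1.mp4, which should print the same byte count as the stat calls above.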
And finally, a terse method that reads every single byte of the file into AWK as one record. This method works for binary files (doing it in BEGIN or in END makes no difference):
time mawk2 'BEGIN { RS = FS = "^$";
    FILENAME = ARGV[ARGC-1]; getline;
    print "\n" FILENAME " :: " length "\n"; }' genieMV_204583_1.mp4
genieMV_204583_1.mp4 :: 259105690
real 0m0.270s
time mawk2 'BEGIN { RS = FS = "^$";
} END { print "\n" FILENAME " :: " length "\n"; }' genieMV_204583_1.mp4
genieMV_204583_1.mp4 :: 259105690
real 0m0.269s
But that's not the fastest way, because you're storing the whole file in RAM. The normal AWK paradigm operates on lines. The issue is that for binary files like MP4s, if they don't end exactly on \n, summing length + NR would overcount by one. The code below is a catch-all that explicitly uses the file's last 1 or 2 bytes as the record separator RS.
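To make the overcount concrete, here is a minimal sketch using plain POSIX awk rather than mawk2. The helper name count_bytes_by_lines is hypothetical, and LC_ALL=C is an assumption so that length counts bytes rather than characters. It is only an illustration of the naive length + NR sum, not the catch-all approach described above.

```shell
# Hypothetical helper: naive byte count = sum of line lengths
# plus one newline per record (length + NR).
count_bytes_by_lines() {
    LC_ALL=C awk '{ n += length($0) } END { print n + NR }' "$1"
}

tmp=$(mktemp)

# 4-byte file that ends in a newline: the sum is correct.
printf 'a\nb\n' > "$tmp"
count_bytes_by_lines "$tmp"

# 3-byte file with no trailing newline: the last record still
# adds 1 through NR, so the result is one too many.
printf 'a\nb' > "$tmp"
count_bytes_by_lines "$tmp"

rm -f "$tmp"
```

Both calls print 4, even though the second file is only 3 bytes long.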
I found that the 2-byte method is much faster for binaries, and the 1-byte method is the better fit for a typical text file that ends with a newline. With binaries, the 1-byte separator may end up splitting into rows far too often, which slows it down.
But we're close to nitpicking here: all it took mawk2 to read in every single byte of that 1.83 GB .txt file was 0.95 seconds, so unless you're processing massive volumes, the difference is negligible.
Nonetheless, stat is still by far the fastest, as mentioned by others, since it's an OS filesystem call.
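time mawk2 'BEGIN { FS = "^$";
    FILENAME = ARGV[ARGC-1];
    # fetch the last 2 bytes of the file
    cmd = "tail -c 2 \""FILENAME"\"";
    cmd | getline XRS;
    close(cmd);
    # file ends in a newline: getline leaves 1 byte, so split on ORS;
    # otherwise split on the 2-byte tail itself
    RS = ( length(XRS) == 1 ) ? ORS : XRS ;
} { bytes += length } END {
    print FILENAME " :: " bytes + NR * length(RS) }' genieMV_204583_1.mp4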
genieMV_204583_1.mp4 :: 259105690
real 0m0.092s
m23lyricsRTM_dict_15.txt :: 1961512986
real 0m0.950s
ls -AlnFT "${m3t}" genieMV_204583_1.mp4
-rw-r--r-- 1 501 20 1961512986 Mar 12 07:24:11 2021 m23lyricsRTM_dict_15.txt
-r--r--r--@ 1 501 20 259105690 Jan 25 09:31:43 2021 genieMV_204583_1.mp4
(The file permissions for the MP4 were updated because the AWK method requires read access.)