In this article I'll show my approach to solving a real-world performance problem using Bash and standard command-line tools.
The problem statement is: how do you delete 1 million files from a file system, and how do you do it fast?
File system maintenance is an important aspect of a sysadmin's job.
It's not only important to maintain a file system; it's also important to know what limitations a particular file system has.
I've been involved in a case where a whole file system (ACFS, which was introduced with the 11g version of the Oracle database) became slow and almost unresponsive due to a large number of files.
To solve the problem I needed to delete a huge number of log files.
Based on that production experience, I decided to write this article, as I was not able to find anything similar by searching the web.
Let's first generate 1 million small test files that will stand in for log output on a poorly maintained file system.
Although you can find some solutions using the dd and split commands, I prefer to use a real programming language instead of bash for this step.
For that purpose I’ll use the following Python script:
# file_generator.py - create 1 million small files that stand in for old logs
for i in range(1000000):
    # each file gets a short dummy payload
    with open("test" + str(i) + ".txt", "w") as f:
        f.write("Test.")
Save the script as file_generator.py in a directory dedicated to testing and execute it:
python3 file_generator.py
One warning here: I've tested the script on a Unix/Linux machine. If you are on Windows, be careful and perform all tests on an unimportant USB drive first.
This is how I first tried to remove the 1 million files, simply by using the regular "rm" command.
$:~/Downloads/xxx>rm -f test*.txt
bash: /bin/rm: Argument list too long
As you can see, I got the "Argument list too long" error.
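The error comes from the shell, not from rm itself: the unquoted pattern test*.txt is expanded by bash into the full list of matching file names before rm even starts, and that list exceeds the kernel's limit on the combined size of command-line arguments and environment variables. If you want to inspect that limit on your own system, the standard getconf utility reports it (the value is system-dependent; it is the same kernel limit that the xargs --show-limits output later in this article is derived from):

$:~/Downloads/xxx>getconf ARG_MAX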
At this point you'll probably search the web and find that people use the find command instead of rm.
As find is a very powerful command, the next question is which of its options to use.
First I need a way to find out which files are there, to confirm the file names and the pattern.
In this case I would try to execute something like:
$:~/Downloads/xxx>find . -name test*.txt
bash: /bin/find: Argument list too long
Again, the same error as with the rm command.
Although the -name switch of the find command supports wildcards like * or ?, they are not the same as Bash globs; when the pattern is left unquoted, the shell expands it before find ever sees it.
For that reason I always quote the pattern, as in the following example.
$:~/Downloads/xxx>find . -name "test*.txt" |more
./test18.txt
./test30.txt
./test54.txt
./test59.txt
./test64.txt
./test78.txt
...
This time you will get 1 million files listed on the screen (but not sorted).
The next logical step would be to try the rm command again, this time putting the wildcard pattern inside single or double quotes.
$:~/Downloads/xxx>rm "test*.txt"
rm: cannot remove 'test*.txt': No such file or directory
$:~/Downloads/xxx>rm 'test*.txt'
rm: cannot remove 'test*.txt': No such file or directory
In both cases rm throws an error. Unlike find, rm does not interpret patterns itself: the quotes make the shell pass the literal string test*.txt, and no file with that exact name exists. A small demonstration below makes both effects concrete.
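Here is a sketch, assuming a hypothetical scratch directory that contains just three files named test1.txt, test2.txt and test3.txt; echo simply prints the arguments it receives:

$:~/Downloads/scratch>echo rm test*.txt
rm test1.txt test2.txt test3.txt
$:~/Downloads/scratch>echo rm "test*.txt"
rm test*.txt

In the first case bash expands the glob before the command runs; with 1 million matches that expansion is exactly what triggers "Argument list too long". In the second case the quotes suppress the expansion, so rm receives the literal string test*.txt and reports "No such file or directory" because no file has that exact name.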
As the options with plain rm have been exhausted, in the following attempts I'll combine the find command with other commands and command-line options.
It's time to introduce the -exec action, which makes find run an external command for the files it matches, rm in this case.
$:~/Downloads/xxx>time find . -name "test*.txt" -exec rm {} \;
real 16m59,131s
user 1m38,099s
sys 14m31,322s
Finally, I successfully removed 1 million files from the file system.
You might be thinking: great, I've deleted all my files, something I wasn't able to do before.
Problem solved.
Yes, the first problem really is solved, but look at the timings.
Almost 17 minutes on a laptop with a fast SSD, a 4-core Xeon CPU and an ext4 file system.
The solution works, but it works slowly.
To figure out why it's so slow, we must understand how this command actually works.
As explained, "-exec" means that find is going to run an external command.
What many bash programmers don't know is what the escaped semicolon at the end ("\;") really means.
It means that the files found by find are substituted into the {} placeholder one by one, and the command (rm) is executed once for each file.
In this case the rm command is executed 1 million times, once per file, and each execution forks a new process, which is extremely inefficient.
OK, now it should be clear what the main reason for such terrible performance is.
The find command accepts another terminator we can use instead of the escaped semicolon: the plus sign ("+").
A plus sign at the end means that find passes as many file names as fit on a command line to rm at once, so rm is executed only once or a few times.
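The difference is easy to see with a harmless command such as echo instead of rm (a sketch, again assuming a scratch directory with just three files named demo1.txt, demo2.txt and demo3.txt; the order of the names may differ):

$:~/Downloads/scratch>find . -name "demo*.txt" -exec echo {} \;
./demo1.txt
./demo2.txt
./demo3.txt
$:~/Downloads/scratch>find . -name "demo*.txt" -exec echo {} +
./demo1.txt ./demo2.txt ./demo3.txt

With \; echo is launched once per file and prints one line per invocation; with + it is launched once with all the names, so everything lands on a single line. Replace echo with rm and the same pattern explains the difference we are about to measure.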
First I'll generate the 1 million files again and run the next test using the plus ("+") terminator at the end of the command.
$:~/Downloads/xxx>python3 file_generator.py
$:~/Downloads/xxx>time find . -name "test*.txt" -exec rm {} +
real 0m14,533s
user 0m1,724s
sys 0m10,891s
What an improvement! From almost 17 minutes down to only 14.5 seconds.
Can we do better than that?
Yes we can, but we won't get such a dramatic improvement again.
I'll generate the 1 million files once more and run the next test using find's -delete action instead of -exec.
$:~/Downloads/xxx>python3 file_generator.py
$:~/Downloads/xxx>time find . -name "test*.txt" -delete
real 0m11,344s
user 0m1,127s
sys 0m8,393s
The result is slightly better, about 3 seconds faster than the previous solution.
Since -delete is built into find, a faster result is expected: there is no external command execution at all. The majority of the elapsed-time difference shows up in sys time (kernel call handling), because there is one command less to spawn than before, while user time is almost the same.
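If you want to verify that claim rather than take it on faith, one rough way (a sketch only; strace adds considerable overhead of its own, so keep it out of the timing runs) is to count execve calls for each variant:

$:~/Downloads/xxx>strace -f -e trace=execve find . -name "test*.txt" -exec rm {} + 2>&1 | grep -c execve
$:~/Downloads/xxx>strace -f -e trace=execve find . -name "test*.txt" -delete 2>&1 | grep -c execve

The first command should report only a handful of execve lines (one per rm batch, plus find itself), while the second should report essentially just the execve of find, since -delete unlinks the files from inside the find process.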
What else can be done?
Although I said that all ideas with the rm command had been exhausted, let's see what happens if I combine find with rm again, this time using xargs instead of -exec.
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs rm
real 0m11,786s
user 0m1,629s
sys 0m9,410s
As you can see, the result is better than the find/exec/+ combination, but still a bit slower than the find/-delete combination.
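One caveat is worth mentioning about this pipeline (it doesn't bite here, because the generated names contain no whitespace): a plain find | xargs breaks on file names that contain spaces, quotes or newlines. The robust form uses NUL-separated output, and it works with every combination of -n and -P used below:

$:~/Downloads/xxx>find . -name "test*.txt" -print0 | xargs -0 rm

I'll keep the shorter form in the remaining tests so the numbers stay comparable with what has already been measured.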
Can we do even better than the find/-delete combination?
To answer that question, let's first check what limits are in place for the xargs command on my system.
$:~/Downloads/xxx>xargs --show-limits
Your environment variables take up 6648 bytes
POSIX upper limit on argument length (this system): 2088456
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2081808
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647
From the available parameters, I'll first try to raise the maximum number of arguments xargs passes to each rm invocation (the -n switch, set here to 2088456, the limit figure reported above) and see whether there is any positive impact.
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -n 2088456 rm
real 0m11,409s
user 0m1,604s
sys 0m9,183s
The result is almost identical to the find/-delete combination, which is the best result so far.
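The effect of -n is easier to see on a toy input than on 1 million files. This tiny illustration (just seq piped into xargs) shows how the argument list is cut into batches; each output line corresponds to one invocation of echo:

$:~/Downloads/xxx>seq 1 7 | xargs -n 3 echo
1 2 3
4 5 6
7

With rm instead of echo, -n simply controls how many files each rm process receives.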
The next attempt is to use the parallel option of the xargs command (the -P 2 switch).
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -P 2 rm
real 0m9,781s
user 0m1,624s
sys 0m10,646s
Using the default number of arguments (no -n option) with 2 parallel processes gives the best result so far: about 9.8 seconds versus 11.3 seconds for the find/-delete combination.
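Again, a toy example shows what -P actually does (a sketch using sleep, where the timing is easy to predict): four one-second sleeps run back to back take about four seconds, but with -P 4 they overlap and the whole pipeline finishes in roughly one second.

$:~/Downloads/xxx>time printf '1\n1\n1\n1\n' | xargs -n 1 sleep
$:~/Downloads/xxx>time printf '1\n1\n1\n1\n' | xargs -n 1 -P 4 sleep

With rm the same overlap happens, except that the parallel processes now compete for the same file system and the same kernel, which is exactly what the sys times below will show.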
In the next test I'm going to increase the number of parallel processes to 4 (the -P 4 switch; 4 is the actual number of cores on my system, not the number of threads), combined with the raised upper limit on the number of arguments (-n 2088456).
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -n 2088456 -P 4 rm
real 0m6,681s
user 0m1,782s
sys 0m14,492s
The result is about 1.7x faster than find/-delete, and roughly 1.5x faster than the find/xargs -P 2 combination.
The next test uses the default number of arguments (no -n switch) with the same 4 parallel processes (-P 4).
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -P 4 rm
real 0m6,162s
user 0m1,788s
sys 0m14,307s
The result is again slightly better than the previous one.
I'll try two more tests: one with 16 and the other with 8 parallel processes (the -P 16 and -P 8 switches).
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -P 16 rm
real 0m7,453s
user 0m1,819s
sys 0m18,137s
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -P 8 rm
real 0m9,337s
user 0m1,841s
sys 0m19,070s
In both cases I got a slower result than with 4 parallel processes.
To understand why, you need to know the overhead of parallelism, and you need to be able to find bottlenecks both in the command being executed and in the kernel itself.
You can continue to play around with different combinations of argument counts and levels of parallelism, but on my system I didn't get a much better result. A small sweep script like the sketch below makes that kind of experimentation repeatable.
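If you do want to explore that space systematically, a small wrapper keeps the experiment repeatable (a sketch only; sweep.sh is a hypothetical name, it reuses the file_generator.py script from the beginning of the article and assumes it is run from the test directory):

#!/bin/bash
# sweep.sh - regenerate the test files and time the find/xargs deletion
# at several levels of parallelism
for p in 1 2 4 8 16; do
    python3 file_generator.py                      # recreate the 1 million files
    echo "=== xargs -P $p ==="
    time find . -name "test*.txt" | xargs -P "$p" rm
done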
One important point to mention here.
If you compare sys time (time spent in the kernel) across the scenarios, the last two tests, with 16 and 8 parallel processes, spent 18 and 19 seconds respectively in the kernel, because they ran with a higher level of parallelism.
The concept is the same as with databases: parallel execution does not scale indefinitely.
It's just brute force for when there is no other way to improve performance but you have a powerful machine at your disposal.
With 4 parallel processes, kernel time is 14.3 seconds and real elapsed time is 6.1 seconds, which is roughly 1.2x faster than -P 16 and 1.5x faster than -P 8.
The same happens with Oracle or any other database.
At some point (depending on available resources, database configuration and the number of records to process), increasing parallelism makes performance worse, while someone unfamiliar with the concept expects execution to get faster.
What is guaranteed when you increase the degree of parallelism is higher resource consumption, as you can see from all the examples in this article (visible through sys time).
As the level of parallelism rises, the kernel has to manage all those parallel processes, split the load across child processes, and so on.
For a smaller number of files (e.g. 100,000), the single-threaded find/-delete command can even be faster than the find/xargs parallel option.
I generate 100K files and compare the best serial combination (find/-delete) with the best parallel one (find/xargs -P 4):
$:~/Downloads/xxx>time find . -name "test*.txt" -delete
real 0m0,081s
user 0m0,008s
sys 0m0,069s
$:~/Downloads/xxx>time find . -name "test*.txt" | xargs -P 4 rm
real 0m0,082s
user 0m0,014s
sys 0m0,080s
Serial execution is even a millisecond faster than the parallel run.
Even with 1 million files, 11.3 seconds for find/-delete, or 11.4 and 11.8 seconds with find/xargs, all running serially, doesn't sound bad, so the question is: do we need parallelism at all?
The answer depends on SLAs, available resources, system load, disk performance and configuration, the file system, the improvement we can realistically get, and many other elements that make up the whole picture of your production system.
The next question you may ask is: when should I stop tuning?
The answer is: when you find the balance between the results you get (you meet the SLA or you are happy with the speed achieved) and resource consumption.
In all the examples provided so far we can observe only two radical improvements:
– solving the inability to delete 1 million files at all
– the performance improvement from almost 17 minutes for find/exec/rm to around 11 seconds for the find/-delete or find/xargs/rm commands.
The improvement from 11.3 to 6.1 seconds gained by introducing parallelism with find/xargs -P is nice to have, but it's not a radical improvement, and it comes at the cost of increased resource consumption.
Summary:
Although people believe that bash is very simple to learn, there are many books around that can help you acquire the required knowledge and teach you how to use it properly.
As you can see from this article, it's not only about bash as a scripting language.
You also need to know the limitations of the most important commands to be able to find the right way to solve a problem.
For that reason I encourage you to keep learning, so you can squeeze the maximum performance and scalability out of the technology you are using to solve the problem at hand.
And what is important to remember: performance tuning as a discipline, which has never been more important than it is today, is all about getting maximum performance at minimal cost (resource consumption in this case).
Thus the optimal solution is not the fastest solution; it's the balance between good performance and minimal resource consumption that will allow you to scale (add more load) without affecting existing performance.
It's not beneficial to improve some process by a couple of seconds using parallelism if you consume all available resources and leave other users waiting until your job completes.
Having a global picture of the system you are tuning is the most important knowledge you need.
In this case I will most probably use one of the two single-threaded, serial options, find/-delete or find/xargs/rm, as with those options I accomplish all the goals:
– deleting 1 million files from the file system
– a radical improvement in elapsed time (from 17 minutes to about 11 seconds) with minimal resources consumed.