Optimizing Automated File Processing

We are often asked what the best way to optimize automated file processing is, or users are simply not sure which settings are best for them.  To determine this, there are a few factors to take into consideration:

  1. Is your processing offloaded to a tail server?
  2. Is the server dedicated to just processing, or are there other pieces of software running on it?
  3. When are you processing documents? 24 hours a day? Only during non-production hours?
After taking these into consideration, the best scenario for optimizing your automated file processing is to offload the processing to a tail server and have that server dedicated only to processing.  In the real world this isn't always possible due to hardware limitations, but with most users moving to VMs of some kind it is getting easier and easier.  If you are not processing on a tail server, that is OK; what I am about to describe below will help you as well.
The three major settings that control how fast your processing will go are located in ProjectWise Administrator under Document Processors.  Below is a brief description of each:
  1. Retry extraction in (minutes): If a document hasn't processed in the amount of time specified, the system will try to process it again.  So if the retry time is set too low and a document takes longer than that to process, the output is ignored because the retry time ran out.
  2. Max documents processed in a single pass: This setting determines how many documents are passed to the file processors on each pass.  The system will not add more documents to the queue if the queue is already above this number.  For example, if you set this to 100, on the first pass 100 documents are added; on the second pass, if 80 remain, that is lower than 100, so the scheduler adds another 100 and you now have 180 documents in the queue; if on the third pass there are 150 documents in the queue, no documents are added because that is above the 100-document limit.  This is based on overall queue size, so if you have 20 datasources at 100 documents each, you will not have 2,000 documents being added to the queue; the scheduler will not add more once the queue is above 100 documents.  (A small sketch of this refill rule appears after this list.)
  3. Check for updated documents every (minutes): This determines how often ProjectWise starts another job, measured from when the previous job was started.  So once one job starts, regardless of whether its processing is done or not, ProjectWise will pass another set of documents (up to the limit in #2 above) after whatever interval you have set here.
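If it helps to see the refill rule from #2 written out, here is a minimal sketch in plain Python.  This is not ProjectWise code, and the function and constant names are just ours for illustration:

```python
# Minimal sketch (not ProjectWise code) of the refill rule described in #2:
# a new batch is only queued when the current queue depth is below the
# "Max documents processed in a single pass" value.

MAX_PER_PASS = 100  # example value matching the description above


def scheduler_pass(queue_depth, pending_documents):
    """Return how many documents the scheduler would add on this pass."""
    if queue_depth >= MAX_PER_PASS:
        return 0  # queue is already at or above the limit; add nothing
    return min(MAX_PER_PASS, pending_documents)  # otherwise add up to one full batch


print(scheduler_pass(0, 10_000))    # pass 1: empty queue  -> adds 100 (queue becomes 100)
print(scheduler_pass(80, 10_000))   # pass 2: 80 remain    -> adds 100 (queue becomes 180)
print(scheduler_pass(150, 10_000))  # pass 3: 150 in queue -> adds 0 (above the limit)
```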
One major problem we see is that users set Max documents processed in a single pass to a very high number, like 1000.  We do not recommend this, as it tends to clog up the system; most users' servers cannot process 1,000 documents in the default time.  This is the first setting we normally change when users have problems with their Automated File Processing, and it is best set to something low.
The best way to think of Automated File Processing is like a production line: you want to configure it so that it is processing just enough documents in the time allocated.  A good starting point is the following settings:
  1. Retry extraction in (minutes): 1440
  2. Max documents processed in a single pass: 300
  3. Check for updated documents every (minutes): 3
With the settings above, documents marked for processing will be sent to Automated File Processing in batches of up to 300 every 3 minutes, and any document that fails for whatever reason will go back through again in 1440 minutes (24 hours), which allows new documents to be processed faster than failed documents.  (A rough throughput calculation for these settings follows below.)
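To put rough numbers on that baseline, here is the simple arithmetic.  Treat it as an upper bound only; your actual rate is limited by how quickly the processors finish each batch:

```python
# Back-of-the-envelope ceiling implied by the baseline settings above.
# Illustration only; real throughput depends on how quickly your
# processors actually finish each batch.

max_per_pass = 300       # Max documents processed in a single pass
pass_interval_min = 3    # Check for updated documents every (minutes)

docs_per_hour = max_per_pass * (60 / pass_interval_min)
docs_per_day = docs_per_hour * 24

print(docs_per_hour)  # 6000.0   documents per hour, at most
print(docs_per_day)   # 144000.0 documents per day, at most
```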
After you have this baseline, the best way to adjust from there is to look at your Orchestration Framework and see whether the queues are reaching zero.  If the queues reach zero, your jobs are finishing before the next batch arrives.  Turn up Max documents processed in a single pass by 50, check the queues, and if they reach zero again, turn it up by another 50.  Follow this process until the queues always have documents in them (a small sketch of this adjustment loop is at the end of this post).  This monitoring process may take a week or two to get right, but it is possible.

After giving these suggestions to a user who had the file processing offloaded to a tail, they went from processing ~2,000 documents a day to processing around the clock at ~60,000 documents per day, and they processed over a million documents in a couple of weeks' time.  Mind you, this took several weeks of adjusting the numbers above, with the processing offloaded to a tail whose only job was to process documents.  Most users will not get these same results if they do not have the resources for this, but it is possible.  If you have any further questions, please call into TSG and we will be happy to explain this further or assist you in making these changes.
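If it helps to picture that adjustment loop, here is a rough sketch.  Again, this is plain Python for illustration only, not a ProjectWise tool, and the names are of our own choosing:

```python
# Sketch of the adjustment loop described above: raise the batch size by 50
# whenever the queues are still draining to zero, and stop once they always
# have documents waiting.

STEP = 50


def next_batch_size(current_size, queues_reached_zero):
    """One monitoring period: raise the setting only while queues still empty out."""
    if queues_reached_zero:
        return current_size + STEP  # processors are keeping up; feed them more
    return current_size             # queues stay populated; leave the setting alone


# Example: starting at 300, with observations from four monitoring periods
size = 300
for queues_hit_zero in [True, True, True, False]:
    size = next_batch_size(size, queues_hit_zero)
print(size)  # 450 -> the point where the queues stopped draining to zero
```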
Anonymous