The ATLAS implementation has a single ‘Clients’ Site Collection containing over 30,000 sub-sites, each corresponding to a particular Client or Customer. Each Client sub-site has between 2 and 10 Document Libraries, each corresponding to a particular Project, Matter or Assignment, so that in total there are over 165,000 Document Libraries. Documents and emails were loaded into all 165,000 of these Document Libraries: in total over 3.2 million documents and emails (approximately 100 GB) were loaded to ATLAS.
The resulting ATLAS implementation successfully demonstrated that a document management solution created using MacroView DMF + SharePoint 2010 can perform well when used with a large-scale SharePoint document store. It also provided a number of valuable insights that are relevant to organizations looking to migrate their existing documents and emails to SharePoint.
The upload of documents and recording of metadata in the ATLAS SharePoint environment was handled by custom-developed scripts, which in turn relied on reusable code from the MacroView Document Management Framework. Some of this code is also used by the MacroView DMF product – e.g. to record metadata for uploaded files. Other parts of the code are specific to migration – e.g. dynamically provisioning new Sites and Libraries in the SharePoint document store.
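By way of illustration, the sketch below shows the general shape of that provisioning step. It is not the ATLAS code: the actual scripts were .NET code built on the MacroView Document Management Framework and the SharePoint 2010 server object model, whereas this Python sketch uses the REST endpoints (_api/web/webinfos/add and _api/web/lists) introduced in later versions of SharePoint, with authentication and request-digest handling assumed to be configured on the requests session.

```python
import requests

# Illustrative sketch only. The real ATLAS scripts used MacroView DMF code and the
# SharePoint 2010 server object model; the _api REST endpoints below exist only in
# later SharePoint versions. Authentication / form digest handling is assumed to be
# configured on the `session` object passed in.

HEADERS = {
    "Accept": "application/json;odata=verbose",
    "Content-Type": "application/json;odata=verbose",
}

def provision_client_site(session, site_url, client_code, client_name):
    """Create one Client sub-site under the 'Clients' Site Collection."""
    payload = {
        "parameters": {
            "__metadata": {"type": "SP.WebInfoCreationInformation"},
            "Url": client_code,            # e.g. "client-00123" (hypothetical naming)
            "Title": client_name,
            "Language": 1033,
            "WebTemplate": "STS#0",        # assumed team-site template
            "UseUniquePermissions": False,
        }
    }
    r = session.post(f"{site_url}/_api/web/webinfos/add",
                     json=payload, headers=HEADERS)
    r.raise_for_status()
    return f"{site_url}/{client_code}"

def provision_matter_library(session, web_url, matter_title):
    """Create one Document Library (a Project / Matter / Assignment) in a Client sub-site."""
    payload = {
        "__metadata": {"type": "SP.List"},
        "BaseTemplate": 101,               # 101 = Document Library
        "Title": matter_title,
    }
    r = session.post(f"{web_url}/_api/web/lists", json=payload, headers=HEADERS)
    r.raise_for_status()
    return r.json()["d"]["Id"]
```

In the actual migration the full Site / Library tree was provisioned by looping over the Client and Matter lists before any documents were uploaded, so the parallel upload jobs described below could assume the target Libraries already existed.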
To populate the ATLAS document store with realistic email messages, we loaded the Enron Email Data Set (ZL Technologies, Inc. - http://www.zlti.com). The MacroView Standardiser utility handled the naming of the resulting MSG files (so as to prevent duplicates) and the recording of email attributes (such as To, From, Subject, SentOn, ReceivedTime, etc.) in like-named metadata columns.
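A minimal sketch of that standardisation step is shown below. It is hypothetical, not the MacroView Standardiser itself: it assumes the messages are the RFC 822 text files in which the Enron corpus is distributed, derives a duplicate-proof file name from a short hash of the Message-ID, and collects the attributes destined for the like-named metadata columns.

```python
import email
import hashlib
from email.utils import parsedate_to_datetime

# Hypothetical sketch of the kind of work the Standardiser utility performs:
# give each message a unique, duplicate-proof file name and extract the email
# attributes that are recorded in like-named SharePoint metadata columns.

def standardise(raw_message: bytes):
    msg = email.message_from_bytes(raw_message)
    sent = parsedate_to_datetime(msg["Date"]) if msg["Date"] else None

    # Attributes destined for the metadata columns of the target Document Library.
    metadata = {
        "From": msg.get("From", ""),
        "To": msg.get("To", ""),
        "Subject": msg.get("Subject", ""),
        "SentOn": sent.isoformat() if sent else "",
        # ReceivedTime would come from the originating mail store; the Enron text
        # dumps do not carry it, so this minimal sketch leaves it blank.
        "ReceivedTime": "",
    }

    # A short hash of the Message-ID (or the raw message as a fallback) keeps file
    # names unique even when many messages share the same Subject.
    key = msg.get("Message-ID") or raw_message.decode(errors="ignore")
    digest = hashlib.sha1(key.encode()).hexdigest()[:8]
    safe_subject = "".join(c for c in metadata["Subject"]
                           if c.isalnum() or c in " -_")[:60]
    file_name = f"{safe_subject or 'No Subject'} [{digest}].msg"

    return file_name, metadata
```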
Uploading a document or email and recording its metadata takes time, so to shorten the overall load time we used parallel upload threads, with multiple upload jobs running at any one time. We provisioned the SharePoint Site / Library tree before running the parallel jobs that uploaded documents and recorded metadata. After some experimentation we found that there was a ‘sweet spot’ for the number of upload processes run simultaneously: too many led to contention for key resources, notably the SQL Server, and actually reduced total throughput. The best total upload rate we achieved was approximately 40,000 documents per hour, meaning a total of 25 hours to upload 1 million documents.
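The pattern is sketched below, assuming a hypothetical upload_document() function that pushes one file and its metadata into an already-provisioned Document Library; the real ATLAS jobs were separate processes built on MacroView DMF code rather than Python threads, and the worker count shown is only a placeholder, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Minimal sketch of the parallel-upload pattern. MAX_WORKERS is the tuning knob:
# past the 'sweet spot' the extra workers simply contend for SQL Server and other
# shared resources, and total throughput drops.

MAX_WORKERS = 8  # placeholder value; the right number is found by experiment

def upload_all(batches, upload_document):
    """batches: iterable of (library_url, local_path, metadata) tuples."""
    failures = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {
            pool.submit(upload_document, library_url, path, metadata): path
            for library_url, path, metadata in batches
        }
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:   # record the failure and keep uploading
                failures.append((futures[future], exc))
    return failures
```

Re-running a fixed batch while varying the worker count, and measuring documents per hour each time, is essentially how such a sweet spot is located.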
One of the more valuable insights that we gained during the upload process related to the Change Log. As each document was uploaded, SharePoint 2010 wrote multiple entries to its Change Log, and as the volume of entries grew, the upload rate progressively declined. Indexing of the uploaded documents for search was also slowed by the presence of so many Change Log entries. We addressed this by clearing the Change Log frequently; in future large-scale uploads we would look at preventing Change Log entries from being written at all.
Other valuable insights related to Unique Document Numbering. Initially we opted to upload the documents to SharePoint and then run the Document ID assignment job to process the already-uploaded files. In the course of the upload project it became clear that the best approach was to assign Unique Document IDs as the files were being uploaded.
Using the Unique Numbering Provider mechanism of SharePoint 2010 drives up the number of entries in the Event Receiver table in the underlying SQL content database. This in turn can lead to a significant degradation in the performance of operations such as Check In / Check Out when the SharePoint document store contains a large number of Document Libraries. This was the case with the ATLAS implementation, which was designed to reflect a large Professional Services organisation. We were able to address this degradation with relatively simple tuning of the SQL used by SharePoint.