Automate comparison of lacs of XML files
I have a requirement where I need to compare lacs of XML files, in the range of 3,00,000 to 10,00,000, with their corresponding pairs.
Please suggest which tool / setup / OS would be the best way to achieve this.
I have a working model ready in Excel VBA which compares the files and reports the results in Excel, but that is currently being used for only 100-200 files.
I want to know if VBA can scale up to take this load. My questions are:
1. Which tool can scale up to compare lacs of files?
2. Speed is not a constraint; even if it takes a day, that's fine as long as it can scale up, stay stable, and report the results.
3. Reporting mechanism (I have Excel reporting in place; I don't know if that would survive at this scale).
Every XML node in each file needs to be reported, and we have around 60 nodes per file.
I can manipulate Excel to report across different sheets / different files etc., but I don't know if that is a good approach.
Also, if I need to test this, what is the best way to generate a source of 3,00,000 pairs of source and destination files? That is something burning inside my head as well.
Please provide your suggestions on the same
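On the test-data question, one option is to script the generation. The sketch below (in Java, since that came up later in this thread) writes N pairs of small flat XML files into two folders; the node names, folder names, and the "every 10th file gets one changed node" rule are all made-up assumptions purely for illustration:

```java
import java.io.IOException;
import java.nio.file.*;

// Generates N pairs of small flat XML files into src_xml/ and dst_xml/.
// Node names and the mismatch rule are invented for illustration only.
public class GenPairs {
    static String xml(int fileId, boolean mutate) {
        StringBuilder sb = new StringBuilder("<record>\n");
        for (int n = 0; n < 60; n++) {            // ~60 nodes per file, as in the question
            String value = "v" + fileId + "_" + n;
            // Introduce a difference in one node of every 10th destination file
            if (mutate && fileId % 10 == 0 && n == 5) value = "CHANGED";
            sb.append("  <node").append(n).append(">")
              .append(value)
              .append("</node").append(n).append(">\n");
        }
        return sb.append("</record>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        int pairs = Integer.parseInt(args.length > 0 ? args[0] : "1000");
        Path src = Files.createDirectories(Paths.get("src_xml"));
        Path dst = Files.createDirectories(Paths.get("dst_xml"));
        for (int i = 0; i < pairs; i++) {
            Files.writeString(src.resolve("file" + i + ".xml"), xml(i, false));
            Files.writeString(dst.resolve("file" + i + ".xml"), xml(i, true));
        }
        System.out.println("Wrote " + pairs + " pairs");
    }
}
```

Generating 3,00,000 pairs of files this size is mostly disk I/O and should finish in minutes on a modern machine.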
A few options that I could think of:
1. If you already have Excel macro code which does it for 200 file pairs, then you can possibly divide the 3 lacs of files into groups and use the same script, assuming your macro code generates the reports in the format you actually need.
2. XML comparison can be done in various ways, depending on how you would like to compare: will the case of element values matter, will you support threshold comparison, are duplicate node elements fine, does child order matter, and so on.
There are several ready-made tools/technologies that do XML compare, like BeyondCompare, Araxis, and Microsoft's XML Diff tool; Java and other programming languages have APIs to work with XML which you can use to parse and compare the files.
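As a bare-bones example of the Java API route, a node-by-node compare of two flat documents with the standard javax.xml.parsers DOM API might look like the sketch below. The flat-document assumption and the plain string comparison are placeholders; the real business rules (case sensitivity, thresholds, child order) would go where the values are compared:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Compares the leaf elements of two flat XML documents and returns a list
// of "nodeName: sourceValue <> destValue" strings for the mismatches.
public class XmlCompare {
    static Map<String, String> leafValues(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> values = new LinkedHashMap<>();
        NodeList nodes = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                values.put(n.getNodeName(), n.getTextContent());
            }
        }
        return values;
    }

    static List<String> diff(String srcXml, String dstXml) throws Exception {
        Map<String, String> src = leafValues(srcXml), dst = leafValues(dstXml);
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, String> e : src.entrySet()) {
            String other = dst.get(e.getKey());
            if (!e.getValue().equals(other)) {
                mismatches.add(e.getKey() + ": " + e.getValue() + " <> " + other);
            }
        }
        return mismatches;
    }
}
```

Parsing two ~60-node documents per pair is cheap; a single-threaded loop over lacs of pairs is entirely feasible in Java, and the outer loop parallelizes trivially if needed.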
1. With regards to point number 1, I can try that out, but before I do I had the following questions in mind:
a. Is there a way to generate such a huge volume of files in 2 folders in the first place, for testing?
b. Any speed-up tricks in Excel I should keep in mind? I feel this might take days with a single-threaded approach using Excel.
2. For point number 2, I am not worried about the logic, since we already have what the business expects of us and we have it signed off.
My real concern over Excel is:
1. How much time would it take for the report to generate? (My 100 files take around 25 minutes with no performance tuning added.)
2. Will Excel even open once I put, say, 10,00,000 rows of data in Sheet1, use 20 sheets in total, and move on to the next workbook?
What is the maximum load one has seen for Excel files?
10,00,000 x 20 = total rows of data I am expecting.
Can I put these in 20 worksheets in 1 workbook and feel safe about being able to retrieve this data?
Excel's maximum limit is 65,536 rows by 256 columns (up to Excel 2003). It could work with a straight .csv format though.
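To illustrate the .csv route: streaming mismatches straight to a text file sidesteps the worksheet row limit entirely, and the file can be opened (or split) later. The file name and column layout below are just assumptions:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.*;

// Appends comparison results as rows of a plain .csv file;
// a text file has no practical row limit, unlike an Excel sheet.
public class CsvReport {
    public static void write(Path out, String[][] rows) throws IOException {
        try (PrintWriter pw = new PrintWriter(Files.newBufferedWriter(out))) {
            pw.println("file,node,source_value,dest_value"); // header row
            for (String[] r : rows) {
                pw.println(String.join(",", r));
            }
        }
    }
}
```

(If node values can themselves contain commas or quotes, they would need CSV quoting; a library such as a CSV writer would handle that, but the plain version above shows the idea.)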
You'd definitely want to do this with a "real" programming language like Java, not VBA.
Thanks, sure, I am going to try porting the VB code to Java.
Excel 2007 has 10,48,576 rows.
I think Excel 2007 also increased the column limit (to 16,384).
What is a lac?
I hear it used as a term for a monetary unit. Are you using it the same way here?
Ohhhh Kevin, 10 lacs becomes a million...
A very frequently used numbering unit in India... apologies if it created any confusion.
My friends from India use the word. I never knew it meant 100,000.
Yes, that's perfect, Kevin.