I work in an eLearning organization. We receive the print version of each book as PDFs and develop the eBook from them: we convert the PDFs to HTML pages and then make a build of the eBook that can run on the LMS the client has provided.
So far we have been verifying the correctness and completeness of the content in the HTML pages manually.
Is there any tool that would take a PDF page as input and map its entire content against the HTML page the developer has created?
There are many tools that help you convert from PDF to HTML. Convert Doc is a pretty good tool that is available as a trial version. I haven't found a good open-source one. Google for other tools and pick the one that suits your budget best.
Thank you for your reply. However, could you please let me know how this tool would help me in testing? It might help the developer in changing the PDF to HTML.
But the testing team here only wants to know whether all the content from the provided PDF has been copied into the HTML page.
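A completeness check of this kind can be sketched in a few lines of Python. This is only a rough illustration, assuming the PDF text has already been extracted to a plain string by some other tool; the names `TextCollector`, `html_to_text`, `normalize` and `missing_lines` are made up for the example. It reports every non-empty line of the PDF text that cannot be found in the visible text of the HTML page, ignoring case and whitespace differences.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content of an HTML page and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        # (this also includes <script>/<style> bodies, if any).
        self.chunks.append(data)


def html_to_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize(text):
    # Collapse all whitespace runs and ignore case, so that line wrapping
    # and indentation differences do not count as missing content.
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_lines(pdf_text, html):
    """Return the PDF lines whose text does not appear in the HTML page."""
    page = normalize(html_to_text(html))
    return [line.strip() for line in pdf_text.splitlines()
            if line.strip() and normalize(line) not in page]


pdf_text = "Chapter 1\nThe quick brown fox.\nA line the developer forgot."
html = "<h1>Chapter 1</h1><p>The quick   brown\nfox.</p>"
print(missing_lines(pdf_text, html))
```

A line-by-line check like this will raise false alarms when the PDF extractor breaks a sentence at a different place than the HTML does, so in practice the unit of comparison (line, sentence, paragraph) would need tuning.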
[ QUOTE ]
Also, you need to remove any <img> tags from the text file.
[/ QUOTE ]
If you see a lot of such tags and other formatting-related tags in any of the .txt files, and it is too troublesome to remove them before starting the comparison, you can try converting both the PDF and the HTML to .doc and using a .doc file comparison program.
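If removing the tags by hand is the main obstacle, a small script can do the cleanup and the comparison together. The following is a minimal sketch, assuming both documents have already been saved as text and read into strings; `clean_lines` and `text_diff` are hypothetical helper names, and the tag pattern is deliberately naive (it removes anything that looks like `<...>`).

```python
import difflib
import re

TAG = re.compile(r"<[^>]+>")  # naive: matches anything that looks like a tag


def clean_lines(text):
    """Strip leftover tags, collapse whitespace and drop empty lines."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", TAG.sub("", line)).strip()
        if line:
            lines.append(line)
    return lines


def text_diff(pdf_text, html_text):
    """Return a unified diff of the two cleaned texts (empty if identical)."""
    return list(difflib.unified_diff(
        clean_lines(pdf_text), clean_lines(html_text),
        fromfile="pdf.txt", tofile="html.txt", lineterm=""))


pdf_text = "Figure 1\nSome text here"
html_text = 'Figure 1 <img src="f1.png">\nSome text'
for line in text_diff(pdf_text, html_text):
    print(line)
```

Lines prefixed with `-` appear only in the PDF text and lines prefixed with `+` only in the HTML text, which gives the testers a short list to review instead of the whole page.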
Instead of "comparing" the HTML and the PDF in their totality, you may consider breaking down the convertor in terms of functionality:
1. How does the convertor handle paragraphs
2. How does the convertor handle single space/double spaces
3. How does the convertor handle text like "bush hid the facts" (you might want to google this one, in case you are not aware of this Notepad defect)
4. How does the convertor handle formatting
5. How does the convertor handle page breaks
6. How does the convertor handle font size, color
Look at it this way:
INPUT -> Convertor -> OUTPUT
You have to change the input parameters and observe the behaviour in the output.
The inputs would be the possible attributes of a PDF document. A mapping table might assist.
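Such a mapping table can be written down as data and run against the convertor. The sketch below is illustrative only: `toy_convert` is a stand-in for the real convertor (which is not specified in this thread), the expected outputs in `MAPPING` are assumptions about how that stand-in behaves, and `run_mapping` is a hypothetical helper name.

```python
import html


def toy_convert(text):
    """Stand-in convertor: wraps blank-line-separated blocks in <p> tags."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return "".join("<p>" + html.escape(b) + "</p>" for b in blocks)


# Mapping table: (attribute under test, input, expected output).
MAPPING = [
    ("paragraphs", "One.\n\nTwo.", "<p>One.</p><p>Two.</p>"),
    ("double spaces", "a  b", "<p>a  b</p>"),
    ("tricky text", "bush hid the facts", "<p>bush hid the facts</p>"),
]


def run_mapping(convert, mapping):
    """Feed every input through the convertor and collect the mismatches."""
    failures = []
    for attribute, source, expected in mapping:
        actual = convert(source)
        if actual != expected:
            failures.append((attribute, expected, actual))
    return failures


print(run_mapping(toy_convert, MAPPING))
```

Each new attribute (page breaks, font size, color and so on) becomes one more row in the table, so the test set grows without changing the checking code.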