From 7c97a1c7ce31ed068fd38199118f8703d990ab97 Mon Sep 17 00:00:00 2001
From: Tyler Danstrom
Date: Wed, 3 May 2017 14:47:48 -0500
Subject: [PATCH] issue #1; added notes for how the problem was solved to README.md

---
 README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/README.md b/README.md
index 440f66e..dd62089 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,23 @@
 # mamluk-knowledgespace-import
 This is source code for transforming PDFs from the Mamluk journal project to Simple Archive Format import objects for knowledgespace.uchicago.edu
+
+Step 1
+======
+
+The first step in this project was to extract the useful metadata from the PDFs retrieved from the primary stakeholder. After extraction occurred, the data needed to be entered into a report for all stakeholders to review.
+
+How I solved the first requirement:
+
+I used the third-party Python library PyPDF2 after a quick Google search turned up several StackOverflow discussions pointing to that library. After checking the [project GitHub](https://github.com/mstamy2/PyPDF2), I am comfortable stating that the project is still active and therefore still safe to use for this task.
+
+- https://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/
+- https://pythonhosted.org/PyPDF2/
+- http://stackoverflow.com/questions/32667398/best-tool-for-text-extraction-from-pdf-in-python-3-4
+
+How I solved the second requirement:
+
+I used the Python standard library module csv to write each metadata dict to a CSV file.
+
+The output is available at
+
+https://docs.google.com/spreadsheets/d/1SMuorHqBHLjXySrj4kJqf-Tzo8b-K-W4cEwWFkdvuaQ/edit#gid=1327477525
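
The two requirements described in the notes can be sketched roughly as follows. This is an illustration rather than the code in this repository: the function names `extract_metadata` and `write_report` and the column list `FIELDS` are hypothetical, and the PyPDF2 calls assume the 1.x API (`PdfFileReader`, `getDocumentInfo()`) that was current when the notes were written. Later PyPDF2 releases renamed these to `PdfReader` and `.metadata`.

```python
import csv

# Hypothetical report columns; the real report may use different ones.
FIELDS = ["file", "title", "author", "subject", "creator", "producer"]


def extract_metadata(pdf_path):
    """Return the document-information fields of one PDF as a plain dict."""
    # Imported lazily so write_report() works without PyPDF2 installed.
    # Assumes the PyPDF2 1.x API.
    from PyPDF2 import PdfFileReader

    record = {"file": pdf_path}
    with open(pdf_path, "rb") as handle:
        info = PdfFileReader(handle).getDocumentInfo()
        # getDocumentInfo() returns None when the PDF has no info dictionary.
        if info is not None:
            record.update(
                title=info.title,
                author=info.author,
                subject=info.subject,
                creator=info.creator,
                producer=info.producer,
            )
    return record


def write_report(records, csv_path, fieldnames=FIELDS):
    """Write one CSV row per metadata dict, with a header row first."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # Keys outside fieldnames are skipped; missing keys become empty cells.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            writer.writerow(record)
```

A call such as `write_report([extract_metadata(p) for p in pdf_paths], "report.csv")` would then produce a file that can be imported into a spreadsheet for review, where `pdf_paths` is whatever list of PDF files is being processed.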