Moving PDF to text on VMS requires super tools

OpenVMS sometimes sticks around long enough to need a fresh set of talons to grab its data. That’s what Wesley Dunnahoo reports from his site.

“One of our divisions sends PDF documents to a secure FTP directory. They want us to scrape data from the PDF into one of our VMS applications in COBOL,” Dunnahoo says.

“If the fields were in ASCII on the PDF, we could just read the raw PDF file looking for the landmarks and grab the data. We’ve told the division we can’t pull the data off the PDF because a lot of the fields they want are graphics. They still insist they want this done.”
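For fields that do exist as plain text, the landmark approach Dunnahoo describes might look like the following minimal Python sketch. The labels `Account No:` and `Amount Due:` are hypothetical stand-ins, since the article does not describe the actual form. The sketch also assumes the text has already been pulled out of the PDF, because raw PDF content streams are often compressed and may not be readable by scanning the file directly.

```python
import re

# Hypothetical landmarks: the real form's labels are not given in the article.
LANDMARKS = {
    "account": "Account No:",
    "amount": "Amount Due:",
}


def grab_fields(text, landmarks=LANDMARKS):
    """Return {field: value} for each landmark label found in the text."""
    found = {}
    for field, label in landmarks.items():
        # Capture whatever follows the label, up to the end of that line.
        match = re.search(re.escape(label) + r"[ \t]*(.*)", text)
        if match:
            found[field] = match.group(1).strip()
    return found


# Made-up sample text standing in for text extracted from one form.
sample = "Account No: 12345\nAmount Due: 67.89\n"
print(grab_fields(sample))
```

A landmark that is missing from the text is simply left out of the result, so a caller can tell which fields still need manual keying.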

What Dunnahoo needs is a program to convert a PDF to text, preferably one that runs on OpenVMS. Python offers a prospect, but the tool must make its way to OpenVMS.

Alan Winston points out an all-Python PDF tool, PDFMiner, that can extract the text from PDFs. It is slower than compiled tools, “but also should be fully portable,” Winston says. “It can also extract embedded images/graphics if you decide you want to do it.”
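As a rough illustration of Winston’s suggestion, the sketch below uses the `extract_text` function from the pdfminer.six package, the community-maintained fork of PDFMiner. It assumes that package is installed wherever the script runs; the helper names `text_path_for` and `convert` are made up for this example. It only recovers a PDF’s text layer, so fields stored as pictures of text would come back empty.

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Name the output file after the PDF, swapping the suffix for .txt."""
    return Path(pdf_path).with_suffix(".txt")


def convert(pdf_path):
    """Extract the text layer of one PDF and write it next to the original."""
    # Imported here so the helper above works even without pdfminer.six.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = text_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Example use, with a hypothetical file name:
#     convert("form_0001.pdf")   # writes form_0001.txt
```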

Linux can come to the rescue for the extraction step, although this introduces another OS to the task. A Linux box in the environment can host the secure directory and do the scraping. It could then deliver output files into a directory on the VMS box via FTP.
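The delivery step could be sketched with Python’s standard `ftplib` module, as below. The host, account, and directory names are all parameters here because the article does not give them, and the function names `deliver` and `send_all` are hypothetical. The sketch uses plain FTP as the article states; a site that requires an encrypted transfer would need something different.

```python
from ftplib import FTP
from pathlib import Path


def deliver(ftp, local_dir, remote_dir):
    """Upload every .txt file in local_dir to remote_dir over an open FTP session."""
    ftp.cwd(remote_dir)
    sent = []
    for path in sorted(Path(local_dir).glob("*.txt")):
        # storlines transfers in ASCII mode and expects a binary file object.
        with path.open("rb") as handle:
            ftp.storlines(f"STOR {path.name}", handle)
        sent.append(path.name)
    return sent


def send_all(host, user, password, local_dir, remote_dir):
    """Open a session, log in, and deliver the extracted text files."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return deliver(ftp, local_dir, remote_dir)
```

Keeping the upload loop separate from the login makes it possible to try the loop against a stand-in object before pointing it at a real server.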

Dunnahoo notes, “Some of what looks like text is really a graphic image of the text. If the image was scanned as OCR, it could interpret the pictures as the text it represents. Most of the fields they want are pictures. We want to read the areas that are pictures of text to be read as text as input to our system, which is written in COBOL. If the division bought some OCR that processed the PDF first, we could get the data. Management won’t buy this software — since the VMS systems will be converted to Windows soon.”

“That’s what they’ve been saying for the last decade, anyway. After they port the systems on the VMS boxes to Windows applications, the company will be 100 percent Windows. They don’t have any Unix or Linux systems.”

Dunnahoo was going to check to see if Python is installed on any of his VMS servers. Python could be installed on a Windows server. “The process runs as a .COM job stream, so having minimal outside steps or manual steps would be desirable. Part of the .COM will FTP the PDFs from the Windows server up to the VMS machine.”

He adds, “Seeing as there will be up to 100 forms per day, it’ll take some testing to find a good solution. Until then, they can continue keying the information. It would be great if we could get the source of the data just to provide a fixed or comma-delimited data file instead.”
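To make the two formats Dunnahoo mentions concrete, here is a small Python sketch of what the sending division might produce. The field names and widths in `LAYOUT` are invented for illustration; a real layout would have to match the record definition in the COBOL program.

```python
import csv
import io

# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("account", 10), ("name", 20), ("amount", 9)]


def fixed_record(row, layout=LAYOUT):
    """Build one fixed-width record: each field truncated and space-padded."""
    return "".join(
        str(row.get(name, ""))[:width].ljust(width) for name, width in layout
    )


def csv_record(row, layout=LAYOUT):
    """Build one comma-delimited record, quoting fields that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([row.get(name, "") for name, _ in layout])
    return buffer.getvalue()


# Made-up sample data.
row = {"account": "12345", "name": "DOE, JANE", "amount": "67.89"}
print(repr(fixed_record(row)))
print(csv_record(row), end="")
```

The fixed-width form always yields records of the same length, which is the shape a COBOL record description typically expects, while the comma-delimited form needs quoting whenever a value contains a comma.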
