Using DocumentCloud to Expose Public Documents in OpenPublish

Posted Oct 29, 2010
Jeff:

Most of us who are not professional journalists take the availability of, and access to, public information for granted. We assume that sources and public documents are easy to come by. The truth is that even in a democratic society, it takes work to gather the facts. Digging them up may be an exciting challenge for many journalists, particularly those drawn to the investigative side of the trade. But for newsrooms facing budget cuts and publishers tempted by lower costs, the work of locating, researching, organizing, searching and referencing public documents is not a welcome cost of doing business these days.

Our friends at DocumentCloud have a refreshing and innovative technical approach to sharing public documents, and we are now incorporating portions of it into OpenPublish. Conceived over beers and sandwiches by journalists Aron Pilhofer (The New York Times), Eric Umansky (ProPublica) and Scott Klein (ProPublica), DocumentCloud is an independent nonprofit funded in 2009 by a two-year grant from the Knight News Challenge.

DocumentCloud is an index of primary source documents and a tool for annotating, organizing and publishing them on the web. Essentially, DocumentCloud creates a community and workspace for analyzing, sharing, and publishing public documents, giving journalists a very powerful and free way to make public documents actually public and useful. DocumentCloud relies on three of our favorite principles to do this.

1. Community collaboration - Documents are contributed by journalists, researchers and archivists from organizations invited or accepted into the group. This makes it a bit of a walled garden rather than a completely open community, but out of necessity. If you work for any type of organization doing investigative journalism or reporting, though, access is easy to get. Once you have access to the DocumentCloud workspace, you can upload documents, share them with your team, and conduct structured searches and analyses. This facilitates sharing and collaboration and reduces wasted time. It also preserves the documents for re-use later.

2. Cool tech - There are a few cool technologies in place here, including a nice Ruby on Rails framework and a complete API to access the assets. The service itself is all cloud-based, of course, and therefore light on infrastructure. Our favorite part is the use of OpenCalais to auto-tag the documents and apply rich metadata that can be used in analysis (e.g. recognizing entities such as people, places and organizations).

3. Open source - In keeping with the sense of community and openness, the code is being fully open sourced. Not all components are available yet, but there are plans to release pieces as appropriate.
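To make the entity tagging idea from point 2 a little more concrete, here is a minimal sketch of how that kind of metadata can be used in analysis: index documents by the people, places and organizations they mention, then look up every document that mentions a given entity. This is not DocumentCloud's or OpenCalais's actual data format or API. The field names (`title`, `entities`, `type`, `name`), the function names and the sample records are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: the field names and values are invented
# for illustration and do not reflect DocumentCloud's real data format.
SAMPLE_DOCUMENTS = [
    {
        "title": "Example Budget Memo",
        "entities": [
            {"type": "person", "name": "Example Person"},
            {"type": "organization", "name": "Example Agency"},
        ],
    },
    {
        "title": "Example Inspection Report",
        "entities": [
            {"type": "organization", "name": "Example Agency"},
            {"type": "place", "name": "Example City"},
        ],
    },
]


def group_entities(documents):
    """Build an index of entity type -> entity name -> set of document titles."""
    index = defaultdict(lambda: defaultdict(set))
    for doc in documents:
        for entity in doc["entities"]:
            index[entity["type"]][entity["name"]].add(doc["title"])
    return index


def documents_mentioning(index, entity_type, name):
    """Return a sorted list of titles that mention the given entity."""
    return sorted(index.get(entity_type, {}).get(name, set()))


if __name__ == "__main__":
    index = group_entities(SAMPLE_DOCUMENTS)
    for title in documents_mentioning(index, "organization", "Example Agency"):
        print(title)
```

With an index like this, a reporter's question such as "which documents mention this agency?" becomes a simple lookup instead of a manual read through every file.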

On the OpenPublish project, we are working toward tighter integration with several forms of asset management for embedding information in your OpenPublish site. In our latest release, OpenPublish 2.3 (released yesterday), we have integrated the document viewer technology developed and open sourced by the New York Times and used by the DocumentCloud team to provide two new features.

1. Make it easier for publishers to locate and publish PDF assets from DocumentCloud to augment their content in line within OpenPublish.


DocumentCloud Attached Documents


2. Provide a better user experience for linked documents that includes an integrated document viewer, raw text and other extracted data, annotations from the researcher/journalist and a search interface to locate terms in context.


DocumentCloud Document Viewer


This is just the start of our plans for greater DocumentCloud integration and we look forward to further collaboration with their team, input from users and ideas from the journalism community on how to expand upon this first step in future releases.

As a final note, I am joining Aron Pilhofer in his session today at the Online News Association (ONA) 2010 conference here in DC to demonstrate how these features work within the context of OpenPublish.

About Jeff

CEO and co-founder Jeff Walpole leads strategy and firm development efforts for Phase2. Jeff has been instrumental in recruiting and managing staff, the acquisition of new clients, overseeing client engagements and leading process improvement ...
