A bird’s eye tech view of Docxpresso

Docxpresso is a web app that, once installed and correctly configured, allows any of its backend users to gather data from third parties (humans or machines) and generate dynamic documents based on Office (Word, LibreOffice, OpenOffice, …) templates without additional plugins or help from an IT department.

In a few words Docxpresso converts a “Word document” into a web app that:

  • Can be immediately shared with end users via a link (sent by email or embedded in a web page).
  • Allows for the interaction of end users with the document via a web browser.
  • Integrates external data from different data sources: directly provided by end users, databases, external services, CSV files, …
  • Generates dynamic documents integrating all gathered data within the original Office template.
  • Separates presentation (final document) from data so the latter can be parsed, stored and forwarded, if required, to a third party service.

Most importantly, Docxpresso is architected as a service, so it is very simple to integrate into a third party tool such as an ERP, CRM or BPM, or it can simply be used as a “web form interface” for any other web app, website or corporate portal.

The Docxpresso platform rests on four pillars:

  • Web app: from the point of view of its end users, Docxpresso is just a responsive web app that runs in a browser regardless of the supporting device: desktop computer, tablet or smartphone.
  • API Core: this is the software library that, working behind the scenes, generates all the required documents in Open Document Format.
  • Conversion Engine: this subpackage is in charge of all necessary conversions among the allowed document formats, namely: Word (docx and doc), Open Document Format (odt), Portable Document Format (pdf), Rich Text Format (rtf) and HTML5 + CSS (html).
  • Integrated REST Services and SDK: the Docxpresso Software Development Kit is designed to permit secure, remote programmatic interaction with Docxpresso databases and document repositories through the integrated Docxpresso REST Services.

In the following sections we offer a concise technical presentation of each of these.

Web interface

Docxpresso is a web app developed in PHP, using Symfony as its MVC framework and Doctrine as its ORM.

Symfony and Doctrine incorporate out of the box the following security standards:

  1. HTTPS: Docxpresso is compatible with the standard Secure Sockets Layer / Transport Layer Security (SSL/TLS) protocols, allowing for secure communications with clients and third party applications.
  2. Authentication: Symfony integrates all standard authentication features:

    1. LDAP: users in corporate environments are able to authenticate via the LDAP or Active Directory service.
    2. Password encryption: all passwords are stored encrypted to ensure identity protection.

  3. Access control: Symfony incorporates a complete hierarchy of roles that may be customized to be adapted to specific corporate requirements.
  4. SQL injections: Doctrine internally uses prepared statements, which protect against SQL injection.
  5. CSRF protection: Symfony forms are automatically protected with tokens against CSRF attacks.
  6. XSS: all input data in public interfaces is properly escaped to avoid Cross Site Scripting attacks.
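The principle behind items 4 and 6 can be illustrated with a minimal sketch. It uses Python's standard sqlite3 and html modules as stand-ins; it is not Doctrine or Symfony code, and the table and function names (users, find_user, render_greeting) are hypothetical.

```python
import html
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(name):
    """Look up a user with a bound parameter instead of string concatenation."""
    # The "?" placeholder makes the driver treat `name` strictly as data,
    # so quotes inside it cannot change the structure of the SQL statement.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


def render_greeting(name):
    """Escape user input before placing it inside HTML output."""
    return "<p>Hello, " + html.escape(name) + "</p>"
```

A classic injection string such as alice' OR '1'='1 simply matches no row here, and markup supplied as a name is rendered as inert text.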

Besides the standard security measures described above, Docxpresso incorporates its own security measures for the exchange of data with end users and/or external data sources.

Each shared link or data exchange is identified with a unique token that is generated by HMAC from a string that incorporates:

  • A random unique id (public)
  • A timestamp (public)
  • A base64 encoded string of all the URL parameters (public)
  • A private key that is unique to each installation

This ensures that the token is uniquely associated with a Docxpresso transaction of a given installation and cannot feasibly be forged unless the private key is exposed to third parties.
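A token of this kind could be computed roughly as follows. This is only a sketch under stated assumptions: the hash function (SHA-256), the "." separator, the field names in the returned dictionary and the function name make_token are all hypothetical, since the text above does not specify the exact layout Docxpresso uses.

```python
import base64
import hashlib
import hmac
import time
import uuid
from typing import Optional
from urllib.parse import urlencode


def make_token(params: dict, private_key: bytes,
               uid: Optional[str] = None,
               timestamp: Optional[int] = None) -> dict:
    """Return the three public fields plus the HMAC that binds them together."""
    if uid is None:
        uid = uuid.uuid4().hex            # random unique id (public)
    if timestamp is None:
        timestamp = int(time.time())      # timestamp (public)
    # Base64 encoded string of all the URL parameters (public).
    # Keys are sorted so the same parameters always give the same string.
    options = base64.urlsafe_b64encode(
        urlencode(sorted(params.items())).encode("utf-8")
    ).decode("ascii")
    # Assumed message layout: the public parts joined with ".".
    message = f"{uid}.{timestamp}.{options}".encode("utf-8")
    # The private key never travels with the link; only the digest does.
    token = hmac.new(private_key, message, hashlib.sha256).hexdigest()
    return {"uniqid": uid, "timestamp": timestamp,
            "options": options, "token": token}
```

Because the digest depends on every public field, changing any parameter in the link invalidates the token, while a party without the private key cannot compute a matching one.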

Docxpresso incorporates five different security levels associated with this token protection scheme:

  • Level 0: public access granted.
  • Level 1: APIKEY or token required. The APIKEY is generated from the request (URL) parameters, a timestamp, a custom unique id and a secret private key. The offered link may be reused at any time.
  • Level 1.5: the same as level 1 but with expiring timestamps (default 24 hours). The offered link may be reused at any time before expiration.
  • Level 2: The APIKEY expires after its first use so the link may only be used once.
  • Level 2.5: The APIKEY expires after its first use or by an expiring timestamp, whichever happens first.

Depending on your particular needs, you can set the most appropriate security level to guarantee that only authorized users, of both the front office and the back office, can access the app contents.
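The five levels can be read as a small decision procedure. The sketch below is one possible reading, not Docxpresso's actual code: the class name LinkPolicy, its fields and the in-memory set of used tokens are hypothetical, and the token signature check itself is assumed to have been done elsewhere and passed in as token_is_valid.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LinkPolicy:
    level: float                       # 0, 1, 1.5, 2 or 2.5
    max_age_seconds: int = 24 * 3600   # default expiration of 24 hours
    # Stand-in for a persistent store of tokens that were already consumed.
    used_tokens: set = field(default_factory=set)

    def allow(self, token, token_is_valid, issued_at, now=None):
        """Decide whether a request carrying `token` may proceed."""
        now = time.time() if now is None else now
        if self.level == 0:
            return True                            # public access
        if not token_is_valid:
            return False                           # levels 1 and above need a valid token
        if self.level in (1.5, 2.5) and now - issued_at > self.max_age_seconds:
            return False                           # expired timestamp
        if self.level in (2, 2.5):
            if token in self.used_tokens:
                return False                       # link was already used once
            self.used_tokens.add(token)
        return True
```

For example, with LinkPolicy(level=2.5) the first request with a fresh token is accepted and any later request with the same token is refused, whether or not the 24 hours have elapsed.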

Docxpresso API Core

The Docxpresso API Core is a standalone PHP package responsible for generating documents in Open Document format from the provided Office templates and the required data.

One could say that the Docxpresso app is just a web interface to this API Core package that eliminates the need for PHP programmers.

Among many other things the API Core package:

  • Transforms back and forth from HTML5 + CSS to Open Document format.
  • Generates all kinds of document elements from scratch: paragraphs, rich text, lists, tables, charts, textboxes, sections, forms, etcetera.
  • Clones and replaces content in a given Office template.
  • Handles variable properties.
  • Merges documents.

This API Core is extremely performant: it has been benchmarked at almost 1,000,000 documents generated in one hour on an 8-core CPU with 8GB of RAM.

The Docxpresso API Core natively incorporates protection against XML External Entity (XXE) attacks, which, if not prevented, may expose local files or exhaust server resources, with the consequent security risks.
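One common way to obtain this kind of protection is to refuse any XML that carries a DOCTYPE declaration, since external entities can only be declared there. The sketch below shows that general technique with Python's standard expat bindings; it is not the mechanism used by the Docxpresso API Core (which is written in PHP), and the names ForbiddenXml and parse_without_dtd are hypothetical.

```python
import xml.parsers.expat


class ForbiddenXml(ValueError):
    """Raised when the XML contains a construct we refuse to process."""


def parse_without_dtd(xml_bytes: bytes) -> list:
    """Parse XML, rejecting any DOCTYPE; return the element names seen."""
    parser = xml.parsers.expat.ParserCreate()
    names = []

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Entities (internal or external) are declared inside the DOCTYPE,
        # so stopping here prevents them from ever being defined or expanded.
        raise ForbiddenXml("DOCTYPE declarations are not allowed")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_bytes, True)   # True marks the final chunk of input
    return names
```

A plain document parses normally, while a payload that declares an entity pointing at a local file is rejected before the entity is resolved.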

Docxpresso Conversion Engine

Most of the time, Open Document format is not the desired output format for the requested documents.

Although documents generated in Open Document format can be opened by Word, many users prefer to work directly with native Word formats and, more often than not, the required output format is PDF because it is “allegedly” non-editable, at least by users who are not highly skilled.

Both cases require a conversion of the final document from ODT to DOCX, DOC (RTF is also supported) or PDF format.

This is achieved by using one or multiple instances of LibreOffice (multiple for high availability, to avoid a single point of failure) that take care of the conversion through a Docxpresso macro, which should be installed via the command line interface.

Docxpresso API Core comes equipped with a daemon that should run in the background and is responsible for managing the conversion queue.

The process is basically as follows:

  • Install LibreOffice on one or multiple servers.
  • Install the Docxpresso macro as a shared extension in each LibreOffice installation.
  • Every time Docxpresso is required to generate a document in PDF, DOCX, DOC or RTF format, the native ODT document is generated and added to a conversion queue (which is handled via the file system, or via a database in HA systems).
  • The Docxpresso daemons manage that queue and sequentially send each document to be converted to the first available LibreOffice instance based on a simple LIFO algorithm.
  • The converted documents are saved into the required location.

The whole process is asynchronous, making it very resilient to peaks in the number of requests.

We have installations that have been running for almost two years and have generated millions of documents with no downtime whatsoever.
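The queue handling described in the steps above can be sketched as follows. This is a simplified, single-process illustration with an in-memory queue rather than the file system or database the text mentions; the class ConversionQueue, the function fake_convert and the worker names are hypothetical, and the real conversion is performed by the Docxpresso macro inside LibreOffice, which is replaced here by a stub.

```python
from collections import deque


def fake_convert(worker, odt_path, target_format):
    """Stub for the LibreOffice macro call: only computes the output name."""
    stem = odt_path.rsplit(".", 1)[0]
    return (worker, stem + "." + target_format)


class ConversionQueue:
    def __init__(self, workers):
        self.pending = deque()      # documents waiting to be converted
        self.idle = list(workers)   # LibreOffice instances that are free
        self.done = []              # results, in processing order

    def submit(self, odt_path, target_format):
        """Called by the document generator; returns immediately."""
        self.pending.append((odt_path, target_format))

    def dispatch(self, convert=fake_convert):
        """Called by the daemon: hand queued documents to free instances."""
        while self.pending and self.idle:
            # LIFO, as stated in the text: the most recently queued document first.
            odt_path, target_format = self.pending.pop()
            worker = self.idle.pop(0)   # first available instance
            self.done.append(convert(worker, odt_path, target_format))
            self.idle.append(worker)    # the instance becomes available again
```

Because submit only appends to the queue, a burst of requests does not block the web app; the daemon drains the backlog at whatever rate the available instances allow.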

Docxpresso SDK and REST Services

Docxpresso incorporates a plethora of integrated web services that greatly simplify its integration with corporate databases and third party services and apps.

To give an idea of the power of the Docxpresso REST interface, here are some of the things that can be easily achieved:

  • Remotely obtain a list of document templates organized by categories in a browsable tree.
  • Request the remote generation of a document based upon a template and data obtained directly from the Docxpresso databases or provided by POST in JSON format.
  • Retrieve final documents (with attachments if available) previously generated via the Docxpresso interface.
  • Remotely open the Docxpresso web interface, load data and forward data to an external service.
  • Search Docxpresso databases for end user data and documents.
  • Get statistical usage data.
  • List users.

All of these exchanges of information with Docxpresso are protected with HMAC generated tokens, which guarantee that only authorized requests are served.

The Docxpresso SDK is designed to simplify all these tasks: a single call to an SDK method generates the link with all the security and request parameters, so one can skip the nitty-gritty details (which are not so difficult anyway).
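To give a feel for what such an SDK call does under the hood, here is a minimal sketch of building a signed link on the client and checking it on the server. It is not the Docxpresso SDK: the function names sign_link and verify_link, the "token" parameter name, the example URL and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode, urlsplit


def sign_link(base_url: str, params: dict, private_key: bytes) -> str:
    """Client side: append an HMAC token computed over the sorted parameters."""
    query = urlencode(sorted(params.items()))
    token = hmac.new(private_key, query.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{base_url}?{query}&token={token}"


def verify_link(url: str, private_key: bytes) -> bool:
    """Server side: recompute the token from the received parameters."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    received = dict(pairs).get("token", "")
    query = urlencode(sorted((k, v) for k, v in pairs if k != "token"))
    expected = hmac.new(private_key, query.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))
```

A caller would write something like sign_link("https://docxpresso.example/documents/preview/7", {"uniqid": "abc", "timestamp": "1700000000"}, key) and share the resulting URL; any later change to the parameters makes verify_link return False.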

Software Requirements

On the server side, Docxpresso requires:

  • PHP > 5.4.*, although PHP 7.* is highly recommended.
  • Besides the standard PHP modules, Docxpresso requires the Tidy extension and the LDAP extension (the latter only if LDAP authentication is required).
  • Database: Docxpresso has been most thoroughly tested with MySQL and MariaDB but, thanks to its integration with Doctrine ORM, it is also compatible with PostgreSQL, Oracle, SQL Server and SQLite.
  • Database drivers: these depend on the chosen database (PDO_MySQL, OCI8, …).
  • Web server: Apache or nginx.
  • LibreOffice 5.* for the Docxpresso conversion engine.

Docxpresso has been thoroughly tested on different Linux flavors (Ubuntu, RedHat, CentOS, Debian, …) and Windows systems and, in principle, any OS that allows for the installation of the software described above will do.

The client side only requires a reasonably recent version of any of the most widely used Internet browsers: Chrome, IE, Firefox, Opera, etcetera.

The Docxpresso backoffice and end user interfaces do not require the installation of any plugin or browser add-on.

Hardware requirements

The minimum hardware requirements are:

  • Reasonably recent Intel CPU (at least one core should be dedicated to the Docxpresso conversion engine).
  • 2GB of RAM.
  • 10GB of storage space.

This should be enough for a few editors and up to a thousand documents per hour (of course, the required storage space depends on the total number of documents generated).

In the case of High Availability environments the Docxpresso app is easily scalable using the standard methods of HA websites, namely:

  • Load balancers
  • Reverse proxies
  • HA proxies
  • DB clusters
  • Etcetera

Format conversions

The most CPU-intensive task is the format conversion of final documents, i.e. the generation of PDF and Word files (the generation of the native ODT file only takes a few hundredths of a second).

Docxpresso first generates an Open Document file (which may be opened directly with Word) that, if requested, is then transformed to PDF or any other available format. As explained before, this is done via a daemon or service that calls a conversion macro on a LibreOffice instance.

The architecture of the application allows for multiple instances that share a queue of documents to be converted. In order to size correctly the number of LibreOffice instances required (which is tantamount to the number of cores), we should take into consideration:

  • The number of documents per hour
  • The throttling or the peak number of documents per second

Taking into account that the conversion of a document can take from 0.1 to 1 second, depending on its number of pages and complexity, a single core can easily accommodate the generation of a thousand documents per hour with a peak concurrency of fewer than four or five documents per second.

Although the details depend on the particular needs of the client (for example, documents with multiple charts or many large images may take longer than others to convert), we may say that a system with two cores dedicated to the conversion engine can easily handle:

  • More than 500,000 documents per month
  • Peaks of up to 8-10 documents per second
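The sizing reasoning above can be turned into a rough back-of-the-envelope calculation. The sketch below is an assumption-laden simplification, not a Docxpresso tool: the function name cores_needed is hypothetical, it assumes one LibreOffice instance per core and a fixed average conversion time, and it sizes for the peak rate without counting on the queue to smooth short bursts.

```python
import math


def cores_needed(docs_per_hour: float, peak_docs_per_second: float,
                 seconds_per_doc: float) -> int:
    """Estimate the cores (LibreOffice instances) to dedicate to conversions."""
    per_core_rate = 1.0 / seconds_per_doc              # documents per second per core
    sustained = (docs_per_hour / 3600.0) / per_core_rate
    peak = peak_docs_per_second / per_core_rate
    # Whichever constraint is tighter decides; always reserve at least one core.
    return max(1, math.ceil(max(sustained, peak)))
```

With an average conversion time of 0.25 seconds, a core sustains 4 documents per second, so cores_needed(1000, 4, 0.25) gives 1 and cores_needed(700, 8, 0.25) gives 2. The second case roughly matches the two-core figure above, since 500,000 documents per month averages to about 694 documents per hour over a 30-day month.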