Archive for the ‘Programming’ Category
Using exec to reduce PHP memory use
As mentioned in a previous post, I’ve recently started to use a library called dompdf to convert HTML to PDF using PHP. One of the major problems I came across was the amount of memory used by this library, which was over 16MB for some files. Initially I generated each file using a call to a function with arguments which described the type of PDF file I wanted to create. The function worked broadly as follows:
- Select information from database.
- Assign information to Smarty template.
- Parse Smarty template to produce a string of HTML.
- Create a dompdf object with the HTML string as its input.
- Generate the PDF and save the file to disk.
- Exit the function.
I erroneously assumed that PHP would perform some garbage collection when exiting the function (i.e. when the object fell out of scope), thus freeing up the memory used by the dompdf object. Unfortunately, PHP didn’t do this, resulting in the script using more and more memory each time I called the function, which eventually broke through the limit set for individual PHP scripts and so execution was halted. As the PDF files differed in size, I couldn’t guarantee when this problem would happen and allow for it.
In order to get round the memory use problem, I tried using the unset function to free up the memory used by the dompdf object. However, unset merely removes the reference to an object, and does not force the garbage collector to free up the memory immediately. As a result, the memory barrier was still being broken at an undetermined point.
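One way this kind of growth can arise is through circular references, which reference counting alone cannot release. The sketch below is illustrative only: the Node class and buildCycle function are hypothetical stand-ins rather than dompdf code, and whether cycles are what caused the problem in dompdf is an assumption. It also relies on gc_collect_cycles, which is only available from PHP 5.3 onwards.

```php
<?php
// Hypothetical stand-in for an object that holds a lot of data.
class Node
{
    public $other;
    public $payload;

    public function __construct()
    {
        // Roughly 1MB of data per object.
        $this->payload = str_repeat('x', 1024 * 1024);
    }
}

// Hypothetical function: builds two objects that reference each other.
// When the function returns, $a and $b go out of scope, but each object
// is still referenced by the other, so neither is freed.
function buildCycle()
{
    $a = new Node();
    $b = new Node();
    $a->other = $b;
    $b->other = $a;
}

gc_enable();

$before = memory_get_usage();
buildCycle();
$afterCall = memory_get_usage();

// Ask the cycle collector (PHP 5.3+) to run now.
$collected = gc_collect_cycles();
$afterGc = memory_get_usage();

echo "Retained after call: " . ($afterCall - $before) . " bytes\n";
echo "Collected by gc_collect_cycles: " . $collected . "\n";
echo "Freed by collection: " . ($afterCall - $afterGc) . " bytes\n";
```

On versions of PHP without a cycle collector there is no equivalent call, which is one reason running the work in a separate process is attractive: the operating system reclaims everything when that process exits.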
Finally, I came across a somewhat forceful but successful method of getting around the problem. Instead of putting the PDF generation code in a function, I moved it into a separate PHP file and then executed this file within my main script using the exec function like so:
exec('/usr/bin/php /path/to/GeneratePDF.php');
Rather than holding on to 16MB after each function call, and therefore using 160MB after 10 calls, PHP now freed all the memory used by the separate script when it finished executing, so no more than 16MB in total was in use at any one point. I benchmarked both methods (function calls and exec) by trying to create a number of PDF files, and the results were quite impressive: using function calls took over 60 seconds and several of the files failed, whereas using exec took 30-40 seconds and every file was generated successfully.
Normally I would not recommend using functions which execute external programs, due to the security implications involved. However, in this case no user-supplied data is passed to exec, and the huge improvement in speed and memory use makes it a no-brainer for this particular example.
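If arguments ever do need to be passed to the separate script, a cautious approach is to escape each one and to check the exit status. The following is a minimal sketch, not the code I actually used: runInSeparateProcess is a hypothetical helper, and it assumes PHP_BINARY (PHP 5.4+) points at the command-line binary. Under a web server that may not be the case, so a fixed path such as /usr/bin/php can be passed instead.

```php
<?php
// Hypothetical helper: run a PHP script in its own process so that all of
// its memory is released when it exits.
// Returns array($exitCode, $outputLines).
function runInSeparateProcess($script, array $args = array(), $phpBinary = PHP_BINARY)
{
    // Escape every part of the command so that spaces or shell
    // metacharacters in a value cannot change what is executed.
    $command = escapeshellarg($phpBinary) . ' ' . escapeshellarg($script);
    foreach ($args as $arg) {
        $command .= ' ' . escapeshellarg($arg);
    }

    $output = array();
    $exitCode = 0;
    exec($command, $output, $exitCode);

    return array($exitCode, $output);
}
```

A call would then look like runInSeparateProcess('/path/to/GeneratePDF.php', array('invoice')), where 'invoice' is a made-up argument. A non-zero exit code indicates that the script failed, which the bare exec call above would not reveal.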
Further information
- Batch processing slows down PDF generation—mailing list thread on Google Groups.
- Garbage collection on Wikipedia.
Creating PDF files in PHP with Smarty and dompdf
For some time I’ve been looking for a way to generate PDF files dynamically on a Web site for a variety of reasons, though mainly to create files which can be printed or emailed and will work on any computer (yes, I know you can create printer-friendly pages with CSS, but you can’t control footers etc.). The requirements included:
- A sensible layout system which would automatically put elements in the correct place—I don’t want to have to manually specify where the text should begin on the page, align each paragraph, trigger a new page etc.
- The ability to transform HTML into PDF, so that I can use Smarty to produce the PDF files as well as the Web pages.
- Support for as much CSS as possible.
- Active mailing list or other method of getting help from the community.
- Software under active development, i.e. not one of these Sourceforge projects which were started in 2005 and haven’t been updated since.
- Uses a licence which enables the library to be integrated with closed-source software (e.g. BSD or LGPL).
There are several HTML to PDF tools out there, but most of them are not up to the task. HTMLDOC looks promising at first, until you realise that it only supports part of HTML 4 and doesn’t support CSS at all—at least not according to the FAQ. ReportLab also raised my hopes, despite being written in Python (which I’m not familiar with), but a markup language is only available with the commercial version. HTML 2 PDF didn’t even get off the drawing board by the looks of things, and hasn’t been updated since 2005. Finally I came across dompdf, which seems to fit the bill with the following features:
- Supports conversion from HTML to PDF.
- Support for a large chunk of CSS (imperfect but improving).
- Active mailing list, which includes the developers of the software.
- Licensed under the LGPL.
Using dompdf at its most basic is a doddle: you simply pass the HTML as a parameter to the load_html function and choose whether to stream the result to the browser or output it to a file. The bare minimum code is shown below:
$html = $smarty->fetch('template.tpl');
$filename = '/path/to/file.pdf';
$dompdf = new DOMPDF();
$dompdf->load_html($html);
$dompdf->set_paper('a4', 'portrait');
$dompdf->render();
file_put_contents($filename, $dompdf->output());
Assuming $smarty is a reference to an existing Smarty object, this will create a PDF which should look more or less the same as the Web page generated by $smarty->display('template.tpl'). There are other options which allow you greater control over the final PDF, such as altering the page size, loading extra fonts etc., but the above code will produce a working PDF which you can email or print. The size of the PDF is also extremely small: the invoices which I have been working on come in at around 2-3KB each. Try getting that sort of size using Microsoft Word and Adobe Acrobat.
Things to watch out for include:
- File permissions—as you’re saving files to disk the Web server user (www-data if you’re running Apache on Debian) will need write access to the relevant directory.
- Databases—do not store the files in a database. Seriously. By all means store the meta data, such as filename, last modified time etc., in a database (I do this for ease of management), but don’t store the file itself. I’ll write another post later this week about why storing files in a database is a bad idea.
- Unicode—unfortunately dompdf doesn’t have full Unicode support yet, so if you want to create documents with this character set you will have to wait a while. I believe it’s possible to make dompdf work with Unicode by using the commercial version of pdflib, but I haven’t tried that myself.
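The file permissions point can be checked in code before writing. This is a rough sketch under stated assumptions: savePdf is a hypothetical helper, and $pdfData stands for whatever $dompdf->output() returned.

```php
<?php
// Hypothetical helper: write PDF data to disk, but only if the target
// directory exists and is writable by the current user.
// Returns true on success, false otherwise.
function savePdf($filename, $pdfData)
{
    $directory = dirname($filename);

    if (!is_dir($directory) || !is_writable($directory)) {
        return false;
    }

    // file_put_contents returns the number of bytes written, or false.
    return file_put_contents($filename, $pdfData) !== false;
}
```

Checking the directory rather than the file is a simplification: it covers the common case of creating a new file, but not that of overwriting an existing file in a directory which is itself read-only.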
dompdf isn’t perfect of course: it takes a long time to generate PDF files with tables, some CSS rules don’t work and ordered lists are currently unsupported. However, it is under active development, and there have been performance improvements in recent versions. At the moment I have to generate PDFs as part of a cron job instead of on the fly, and employ a bit of a hack to get round memory usage problems, but I expect those problems to gradually diminish over time. Even with these minor blemishes, dompdf is still the best library I’ve found for converting HTML to PDF in PHP.
Support for PHP 4 to be dropped
There is an announcement on the official PHP website about PHP 4 reaching end of life at the end of this year, with no more development beyond the 31st December 2007. Security patches will continue for a further eight months after this date, but after August 2008 PHP 4 will be no more.
As one can imagine, this decision has caused a strong split of opinion within the PHP community. I suspect that many users will be extremely annoyed by the announcement, even though their applications should work under PHP 5 without any changes. If they don’t, chances are that it’s a problem with their script, although I doubt they will see it like that. On the other hand, GoPHP5 is waving the flag for the ‘move to PHP 5’ group, listing projects and hosts which have pledged to move to PHP 5 by February 2008—although this is something of a moot point now that we know that PHP 4 will be discontinued anyway.
My personal thought on all this is that both the PHP team and web hosting companies are to blame for the slow transition to PHP 5. The latest major version of the software has been out for three years now, yet the PHP team has done very little to push system administrators to upgrade, nor have they provided any major incentives for users to want to move to the latest version and put pressure on their providers to upgrade. Web hosting companies have also failed to even offer PHP 5 in many instances, leaving users with no option but to continue developing for version 4.
Matt has also weighed in to the debate with his recent post, On PHP. I agree wholeheartedly with his point about the PHP core team killing off a popular product (for all its faults, PHP 4 has undeniably been a success) without thinking about why people haven’t upgraded to PHP 5. This could be laziness, in which case the PHP team needs to take steps to ensure that there are incentives to overcome this, e.g. new features, improved security model and smoother updates.
I think what amazes me the most about all this though is that we really should be looking towards PHP 6 by now, yet PHP 5 still isn’t adopted by the majority of hosting companies and end users. If it’s taken this long to move from 4 to 5, how long will moving from 5 to 6 (which has lots of useful improvements according to Jero) take?
Ten things you might not know about PHP
10 things you (probably) didn’t know about PHP
I’ve just come across this useful and somewhat insightful article on Yet Another Web Development Blog. Most of the tips are things that I’ve already used or knew about, but I hadn’t thought about storing IP addresses as integers or checking the DNS record of the domain of an email address before verifying it. The tips are definitely worth a look for anyone who regularly programs in PHP, especially if you write a lot of your own code for running websites (as I do).
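The two tips I mentioned can be sketched roughly as follows. This is my own illustration rather than code from the article: emailDomainHasMailRecord is a hypothetical helper, the example address is arbitrary, and the integer conversion only covers IPv4.

```php
<?php
// Store an IPv4 address as an integer and convert it back.
$ip = '192.168.0.1';
$asInteger = ip2long($ip);              // false if $ip is not a valid IPv4 address
$unsigned = sprintf('%u', $asInteger);  // unsigned form, also correct on 32-bit builds
$roundTrip = long2ip($asInteger);       // back to dotted notation

// Hypothetical helper: check that the domain part of an email address
// has an MX record before going any further with verification.
function emailDomainHasMailRecord($email)
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }

    $domain = substr($email, $at + 1);
    if ($domain === false || $domain === '') {
        return false;
    }

    // checkdnsrr performs a DNS lookup, so this needs network access.
    return checkdnsrr($domain, 'MX');
}
```

The integer form fits in a 4-byte unsigned column, which is smaller and quicker to index than a string of up to 15 characters.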
PHP Easter Eggs
PHP’s “doggie” easter egg via SitePoint
Apparently by appending different query strings to PHP scripts, you can get various “easter egg” images to appear. Rather amusing, although you do wonder why developers bother putting features like these into what is supposedly a serious scripting language.
Top seven PHP security blunders
There is an interesting article on SitePoint at the moment, entitled Top 7 PHP Security Blunders. It’s lacking detail for most of the security issues raised, but it’s a useful article nevertheless. If it stops just one newbie PHP developer from making a major security blunder then it will have been worth the time spent writing the article, in my opinion.