Data Circle

Technology news, tips and tricks

Using exec to reduce PHP memory use

without comments

As mentioned in a previous post, I’ve recently started to use a library called dompdf to convert HTML to PDF using PHP. One of the major problems I came across was the amount of memory used by this library, which was over 16MB for some files. Initially I generated each file using a call to a function with arguments which described the type of PDF file I wanted to create. The function broadly worked as follows:

  1. Select information from database.
  2. Assign information to Smarty template.
  3. Parse Smarty template to produce a string of HTML.
  4. Create a dompdf object with the HTML string as its input.
  5. Generate the PDF and save the file to disk.
  6. Exit the function.

I erroneously assumed that PHP would perform some garbage collection when exiting the function (i.e. when the object fell out of scope), thus freeing up the memory used by the dompdf object. Unfortunately, PHP didn’t do this, resulting in the script using more and more memory each time I called the function, which eventually broke through the limit set for individual PHP scripts and so execution was halted. As the PDF files differed in size, I couldn’t guarantee when this problem would happen and allow for it.

In order to get round the memory use problem, I tried using the unset function to free up the memory used by the dompdf object. However, unset merely removes the reference to an object, and does not force the garbage collector to free up the memory immediately. As a result, the memory barrier was still being broken at an undetermined point.
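One plausible reason why unset did not help is circular references: objects which point at each other keep their reference counts above zero even after the last variable has gone. This is an assumption rather than something confirmed for dompdf. The sketch below uses a hypothetical Node class (not part of dompdf) to show the effect, and relies on gc_enable and gc_collect_cycles, which are only available from PHP 5.3 onwards.

```php
<?php
// Hypothetical class for illustration only; it is not part of dompdf.
class Node
{
    public $parent = null;
    public $children = array();
    public $payload;

    public function __construct()
    {
        // Roughly 1MB of data so that the effect is easy to see.
        $this->payload = str_repeat('x', 1024 * 1024);
    }
}

// Builds a parent and a child which reference each other, then returns
// without handing either object back to the caller.
function build_cycle()
{
    $parent = new Node();
    $child = new Node();
    $child->parent = $parent;
    $parent->children[] = $child;
    unset($parent, $child); // removes the variables, but not the cycle
}

// Reports how many bytes stay allocated after build_cycle() returns, and
// how many are released by an explicit cycle collection.
function measure_cycle_leak()
{
    gc_enable();
    gc_collect_cycles(); // start from a clean state
    $before = memory_get_usage();
    build_cycle();
    $afterCall = memory_get_usage();
    $collected = gc_collect_cycles();
    $afterGc = memory_get_usage();

    return array(
        'leaked'    => $afterCall - $before,
        'reclaimed' => $afterCall - $afterGc,
        'collected' => $collected,
    );
}
```

Calling print_r(measure_cycle_leak()) shows the memory held after the function returns and the amount released by the collector. PHP 5.2 and earlier have no cycle collector at all, so memory held this way stays allocated until the process ends.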

Finally, I came across a somewhat forceful but successful method of getting around the problem. Instead of putting the PDF generation code in a function, I moved it into a separate PHP file and then executed this file within my main script using the exec function like so:

exec('/usr/bin/php /path/to/GeneratePDF.php');

With function calls, each call used 16MB which was never freed, so 10 calls used 160MB. With exec, PHP freed all the memory used by the separate script when it finished executing, so there was never more than 16MB in use at any one point. I benchmarked both methods (function calls and exec) by trying to create a number of PDF files, and the results were quite impressive: the function calls took over 60 seconds and several of the files failed, whereas exec took 30-40 seconds and every file was generated successfully.

Normally I would not recommend using functions which execute external programs, due to the security implications involved. However, in this case no user-supplied data is passed to exec, and the huge improvement in speed and memory use makes it a no-brainer for this particular example.
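If arguments ever do need to be passed to the separate script, a slightly more defensive wrapper around exec might look like the sketch below. The run_isolated function and its parameters are hypothetical names for illustration. It quotes every part of the command with escapeshellarg, captures the output, and checks the exit status so that a failed generation does not go unnoticed.

```php
<?php
// Hypothetical helper: runs a PHP script in a separate process and reports
// whether it succeeded. The binary and script paths are supplied by the caller.
function run_isolated($phpBinary, $script, array $args = array())
{
    // Quote each part so that spaces or shell metacharacters are harmless.
    $command = escapeshellarg($phpBinary) . ' ' . escapeshellarg($script);
    foreach ($args as $arg) {
        $command .= ' ' . escapeshellarg($arg);
    }

    $output = array();
    $status = 0;
    // Redirect stderr to stdout so that error messages are captured too.
    exec($command . ' 2>&1', $output, $status);

    return array(
        'ok'     => ($status === 0),
        'status' => $status,
        'output' => $output,
    );
}
```

For example, run_isolated('/usr/bin/php', '/path/to/GeneratePDF.php', array('invoice')) would return an array whose 'ok' element is false if the script exited with a non-zero status, assuming GeneratePDF.php reads its arguments from $argv.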

Written by Paul

January 9th, 2010 at 6:50 pm

Posted in PHP

Commercial work for mySociety

without comments

Ever wanted to work for mySociety, but on commercial projects rather than that fluffy democracy stuff? Well now you can, as the commercial arm (mySociety Ltd) currently has a job advert for “a web developer to work onsite with a commercial project with a corporate partner in London, UK”. You’ll have to move quickly though, as submissions need to be in by 11 January and work starts on 1 February, which probably rules out anyone looking to change jobs.

Written by Paul

January 5th, 2010 at 10:10 am

Posted in Jobs

Creating PDF files in PHP with Smarty and dompdf

with one comment

For some time I’ve been looking for a way to generate PDF files dynamically on a Web site for a variety of reasons, though mainly to create files which can be printed or emailed and will work on any computer (yes, I know you can create printer-friendly pages with CSS, but you can’t control footers etc.). The requirements included:

  1. A sensible layout system which would automatically put elements in the correct place—I don’t want to have to manually specify where the text should begin on the page, align each paragraph, trigger a new page etc.
  2. The ability to transform HTML into PDF, so that I can use Smarty to produce the PDF files as well as the Web pages.
  3. Support for as much CSS as possible.
  4. Active mailing list or other method of getting help from the community.
  5. Software under active development, i.e. not one of these Sourceforge projects which were started in 2005 and haven’t been updated since.
  6. Uses a licence which enables the library to be integrated with closed-source software (e.g. BSD or LGPL).

There are several HTML to PDF tools out there, but most of them are not up to the task. HTMLDOC looks promising at first, until you realise that it only supports part of HTML 4 and doesn’t support CSS at all—at least not according to the FAQ. ReportLab also raised my hopes, despite being written in Python (which I’m not familiar with), but a markup language is only available with the commercial version. HTML 2 PDF didn’t even get off the drawing board by the looks of things, and hasn’t been updated since 2005. Finally I came across dompdf, which seems to fit the bill with the following features:

  1. Supports conversion from HTML to PDF.
  2. Support for a large chunk of CSS (imperfect but improving).
  3. Active mailing list, which includes the developers of the software.
  4. Licensed under the LGPL.

Using dompdf at its most basic is a doddle: you simply pass the HTML in as a parameter to the load_html function and choose whether you wish to stream the result to the browser or output to a file. The bare minimum code is shown below:

// Render the Smarty template to a string of HTML
$html = $smarty->fetch('template.tpl');
$filename = '/path/to/file.pdf';

// Convert the HTML to a PDF and save it to disk
$dompdf = new DOMPDF();
$dompdf->load_html($html);
$dompdf->set_paper('a4', 'portrait');
$dompdf->render();
file_put_contents($filename, $dompdf->output());

Assuming $smarty is a reference to an existing Smarty object, this will create a PDF which should look more or less the same as the Web page generated by $smarty->display('template.tpl'). There are other options which allow you greater control over the final PDF, such as altering the page size, loading extra fonts etc., but the above code will produce a working PDF which you can email or print. The PDF files are also extremely small: the invoices which I have been working on come in at around 2-3KB each. Try getting that sort of size using Microsoft Word and Adobe Acrobat.

Things to watch out for include:

  1. File permissions—as you’re saving files to disk the Web server user (www-data if you’re running Apache on Debian) will need write access to the relevant directory.
  2. Databases—do not store the files in a database. Seriously. By all means store the meta data, such as filename, last modified time etc., in a database (I do this for ease of management), but don’t store the file itself. I’ll write another post later this week about why storing files in a database is a bad idea.
  3. Unicode—unfortunately dompdf doesn’t have full Unicode support yet, so if you want to create documents with this character set you will have to wait a while. I believe it’s possible to make dompdf work with Unicode by using the commercial version of pdflib, but I haven’t tried that myself.
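The file permissions point can be checked in code before any data is written. The sketch below is a minimal example, assuming the PDF data is already available as a string; save_pdf is a hypothetical helper name and not part of dompdf.

```php
<?php
// Hypothetical helper: writes already-rendered PDF data to disk and fails
// loudly if the target directory is missing or not writable.
function save_pdf($pdfData, $filename)
{
    $dir = dirname($filename);
    if (!is_dir($dir)) {
        throw new RuntimeException("Directory does not exist: $dir");
    }
    if (!is_writable($dir)) {
        throw new RuntimeException("Directory is not writable: $dir");
    }

    // file_put_contents returns the number of bytes written, or false.
    $written = file_put_contents($filename, $pdfData);
    if ($written === false || $written !== strlen($pdfData)) {
        throw new RuntimeException("Could not write all data to $filename");
    }

    return $written;
}
```

With dompdf this would be called as save_pdf($dompdf->output(), $filename) in place of the bare file_put_contents call, so that a permissions problem produces a clear error rather than a missing file.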

dompdf isn’t perfect of course: it takes a long time to generate PDF files with tables, some CSS rules don’t work and ordered lists are currently unsupported. However, it is under active development, and there have been performance improvements in recent versions. At the moment I have to generate PDFs as part of a cron job instead of on the fly, and employ a bit of a hack to get round memory usage problems, but I expect those problems to gradually diminish over time. Even with these minor blemishes, dompdf is still the best library I’ve found for converting HTML to PDF in PHP.

Written by Paul

January 4th, 2010 at 8:00 am

Posted in PHP

UKUUG Spring 2010 conference bookings

without comments

I forgot to mention this earlier, but bookings for the UKUUG Spring 2010 conference are now open. Taking place in Manchester on 23-25 March, the programme includes such talks as:

  • The Perdition and nginx IMAP Proxies (Jan-Piet Mens)
  • Don’t be scared of SELinux (James Firth)
  • Hudson hit my puppet with a cucumber (Patrick Debois and Julian Simpson)
  • MySQL HA with pacemaker (Kris Buytaert)

A full list can be found on the provisional programme. There will also be a social event on the Tuesday night which is open to all, including those who cannot attend the conference itself.

Update: a correction on the dates. I originally said the conference would be 24-26 March, but it will actually take place on 23-25 March.

Written by Paul

January 3rd, 2010 at 8:00 am

Posted in Conferences, UKUUG

Urgent SpamAssassin update

without comments

If you’re running SpamAssassin on your servers you might want to check out this critical bug: FH_DATE_PAST_20XX scores on all mails dated 2010 or later. Broadly speaking, any email with a Date: header in 2010 or later will trigger this SpamAssassin rule, which makes the mail more likely (though not certain) to be marked as spam. Running sa-update as root fixed the problem on my system, as this updates the SpamAssassin rules. The fix is only temporary, as the rule will flag up mails from 2020 onwards, but hopefully the developers will have implemented a more permanent fix by then.

Written by Paul

January 2nd, 2010 at 5:00 pm

Posted in Security, Spam

Securing ssh on Linux

without comments

After some comments about security recently on the Bytemark IRC channel, I started to look through the logs for the various services which I run on my servers to see if there were any obvious security holes which needed plugging. Most services seemed secure, and I’d already cracked down on spam, but I did notice a lot of failed login attempts for ssh in the authentication log (/var/log/auth.log on Debian systems if you want to check your own). Every so often, there would be a burst of login attempts from a given IP address, which would either attempt to brute force the root password or try to log in to common account names (e.g. guest). I only have a small number of accounts on my servers anyway, and the root passwords are all strong, so I wasn’t too worried that any of these attempts would succeed, but nevertheless I wanted to take some steps to reduce or eliminate them.
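To get a rough idea of which addresses are responsible for these bursts, the log can be summarised with a few lines of PHP. This is a sketch only: count_failed_logins is a hypothetical helper, and it assumes the usual OpenSSH wording of "Failed password for ... from IP port N", which may differ on other systems or versions.

```php
<?php
// Counts failed ssh password attempts per source IP address.
// Assumes OpenSSH's usual "Failed password for ... from IP port N" wording.
function count_failed_logins(array $lines)
{
    $counts = array();
    $pattern = '/Failed password for (?:invalid user )?\S+ from (\S+) port \d+/';

    foreach ($lines as $line) {
        if (preg_match($pattern, $line, $matches)) {
            $ip = $matches[1];
            $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
        }
    }

    arsort($counts); // busiest addresses first
    return $counts;
}
```

On a Debian system this could be run as print_r(count_failed_logins(file('/var/log/auth.log'))), provided the user running the script is allowed to read the log.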

First, I considered reporting the offending IP addresses to their netblock abuse contacts. Unfortunately many of the connections came from Brazil or China, where I don’t expect to get a response. Even when I contacted European ISPs with logs of what machines on their netblock were doing, I didn’t receive any replies, though it’s possible they took action and couldn’t tell me about it.

Since trying to get the netblock admins to tackle the problem at its source didn’t do much good, I decided to deploy some technical measures instead, namely:

  1. Disabling root logins over ssh.
  2. Moving ssh to a different port.

Disabling root logins does what it says on the tin, i.e. you can prevent anyone from logging in as root over ssh, even if they provide the correct password. The response from the server is the same as if an incorrect password was entered, so they will not know that you have disabled this feature. The relevant setting can be found in the ssh config file (/etc/ssh/sshd_config on Debian):

PermitRootLogin no

If this is currently set to PermitRootLogin yes, simply change the yes to no and reload ssh (/etc/init.d/ssh reload). You’ve now effectively blocked anyone from brute forcing your root password over ssh, although the login attempt messages will still appear in your logs.

Moving ssh to another port is the second effective method of stopping people from trying to brute force their way into your machine. By default ssh runs on port 22, but there is no reason why it needs to, other than convention. You can easily change the port using the following directive in the ssh config file:

Port 22

You can change the value to anything you want, so long as another service is not running on the same port—e.g. don’t change to port 80 if you’re already running a web server. Most brute force login attempts will come from automated programs which assume that ssh is running on port 22, so if you move the service to another port they will simply receive a ‘Connection refused’ error message. Making this change has, as of the time of writing, completely eliminated all the failed login attempt messages in my authentication logs.

You could of course go one step further and only permit ssh logins from users with keys or from specific IP addresses, but that creates a degree of management overhead which you may not wish to bear. In my case, I access four different servers from various locations and machines, so it’s simply not practical to use such a tight restriction.

Other ssh security tips which you might want to try if you are really paranoid include:

  1. Use fail2ban to scan your log files and automatically block any IP addresses which produce too many failed login attempts.
  2. If you have multiple IP addresses, limit ssh so that it only runs on one of them, using ListenAddress.
  3. Throttle ssh connections so that only one connection per IP address can be made in a given length of time – e.g. every minute. This will significantly slow down any brute force attempts.
  4. If you don’t need ssh, simply disable it entirely (this is probably not an option for most administrators though).
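Several of the settings discussed above live in the same ssh config file, so they can be combined. The fragment below is a sketch only: the port number, IP addresses and user name are placeholders, and the last two directives apply the stricter key-only and per-address restrictions, which may not suit every setup.

```text
# /etc/ssh/sshd_config (fragment; all values are placeholders)
Port 2222
PermitRootLogin no
ListenAddress 192.0.2.10

# Optional, stricter settings: keys only, and one named user from one network
PasswordAuthentication no
AllowUsers someuser@192.0.2.*
```

Running sshd -t as root checks the file for errors before you reload ssh, which is worth doing while you still have a session open on the machine.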

Hopefully these tips will help you to lock down ssh, as it’s in all our interests for machines on the Internet to be as secure as possible.

Written by Paul

January 1st, 2010 at 12:58 pm

Posted in Security

WordPress Q&A videos

without comments

Some interesting WordPress Q&A videos:

Hat tip to Matt for all the links.

Written by Paul

November 29th, 2009 at 8:00 pm

Posted in WordPress

Twitter in DNS

without comments

It was only a matter of time before someone did this, but you can now check Twitter and Identi.ca through DNS requests at Any.IO. Some people have far too much time on their hands…

Hat tip to Jan-Piet Mens for the link.

Written by Paul

November 29th, 2009 at 12:54 pm

Posted in Twitter

UKUUG Spring 2010 CFP

without comments

Just a quick reminder for anyone who I haven’t emailed already that the Call for Papers for the UKUUG Spring 2010 conference is still open—submission deadline is 15 November 2009.

Written by Paul

October 24th, 2009 at 10:29 pm

Posted in Conferences

Hardening WordPress release

without comments

WordPress 2.8.5 has just been released. There are no fancy new features included; this is purely a “hardening” release which is intended to fix a few security issues. As always, the advice is to upgrade as soon as possible, and to check out the Exploit Scanner plugin if you think your blog may have been compromised at any point.

Written by Paul

October 22nd, 2009 at 11:31 am

Posted in WordPress