Wednesday, May 13, 2015

How to retrieve the HTML from an email with Linux?


I'm using a Linux machine (Debian) with fetchmail and procmail installed. Right now, it fetches mail sent to my Gmail address every minute. However, the file fetched by fetchmail is not a regular HTML file; it contains quoted-printable characters.


My goal is to get the HTML source code of the email, just as I could by opening Outlook on Windows, right-clicking on the mail and choosing "Show source code".


It is also important to be able to do it programmatically, since I want to automate the task.


Edit: My initial question lacked precision, so here is more information about my problem. One of my clients sends me an email every day that contains a table with the previous business day's sales data. Right now, every day, I have to copy and paste the data from the email into an Excel file. Since this is a very repetitive task, I want a solution that does it for me automatically. I know how to write data to an Excel file with Python, and I have some idea of how to retrieve information from an HTML document with Python libraries.


So I decided to use crontab, fetchmail and procmail on Linux to do the following: every minute, I check my mailbox for a new email from my client. If there is one, I run a Python script on the fetched file to retrieve the information. After that, I write that information to an Excel file and send it to myself.


The script works fine on my PC with the HTML source code of the email I got from Outlook. However, on my Linux machine, it doesn't work. I opened the source code of the email file from my Linux machine and found out that part of the HTML code was modified.


Answer



You seem to have some invalid assumptions here. Email does not necessarily have a single body part and it might not be in HTML.


Without more information about what you actually want, this is going to be very speculative; but something like


# Conditions match the message headers (the default); "b" pipes only the body.
:0b:
* ^Content-type: text/html
* ^Content-transfer-encoding: quoted-printable
| quoted-printable --decode >>extracted.html

will decode QP and append to a growing file of HTML payloads (assuming you have a command named quoted-printable with the option --decode to decode QP).
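If no such command is installed, Python's standard quopri module can do the same decoding. This is only a minimal sketch: the function name decode_qp and the sample bytes are made up for illustration, and the input is assumed to be the quoted-printable body alone, without the message headers.

```python
import quopri


def decode_qp(raw: bytes) -> bytes:
    """Decode a quoted-printable body back into the original bytes."""
    return quopri.decodestring(raw)


# Hypothetical sample: "=3D" encodes "=", and a trailing "=" marks a soft line break.
sample = b'<td class=3D"amount">1=\n00</td>'
decoded = decode_qp(sample)
print(decoded.decode("ascii"))
```

The result is bytes; which charset to decode them with depends on the Content-Type header of the message, which this sketch does not look at.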


This is probably not useful as such, because most HTML payloads are in MIME multipart containers. The above assumes the message has a single top-level payload which is text/html and encoded with quoted-printable, and will simply no-op if this is not true.


Procmail is not particularly good at traversing MIME structures, but something similar should be easy to write with e.g. Python and the standard email library. There are also standalone tools like ripmime which allow you to extract selected payloads from a MIME message.
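As a rough illustration of the Python route, the sketch below parses a raw message with the standard email library and returns the first text/html body part, already decoded from quoted-printable or base64. The function name extract_html is hypothetical, and the sketch assumes the raw message (headers included) is available as bytes, for example from a file that procmail delivered.

```python
from email import message_from_bytes, policy
from typing import Optional


def extract_html(raw: bytes) -> Optional[str]:
    """Return the decoded text/html body of a raw message, or None if it has none."""
    # policy.default yields EmailMessage objects, which provide get_body().
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() walks multipart containers and skips attachments.
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        return None
    # get_content() undoes the transfer encoding and applies the declared charset.
    return part.get_content()
```

Reading a saved message with open(path, "rb").read() and passing the bytes to extract_html should give roughly the same HTML that a mail client shows as the message source. A message with no HTML part yields None, so the calling script can skip it.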

