Issue 1974: email.MIMEText.MIMEText.as_string incorrectly folding long subject header

classification

Title:	email.MIMEText.MIMEText.as_string incorrectly folding long subject header
Type:
Components:	Library (Lib)	Versions:	Python 2.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	barry	Nosy List:	aalbrecht, barry, cjw296, salty-horse, splorgdar
Priority:		Keywords:	patch

Created on 2008-01-30 14:48 by cjw296, last changed 2008-06-26 08:01 by cjw296.

Files
File name	Uploaded	Description	Edit	Remove
test.py	splorgdar, 2008-05-23 21:49	reproduction case
py_issue1974.diff	aalbrecht, 2008-06-22 12:06	Init Header class in test case without continuation_ws keyword

Messages
msg61866 (view)	Author: Chris Withers (cjw296)	Date: 2008-01-30 14:48
Somewhere in email.MIMEText.MIMEText.as_string (I'm not sure where) long subject headers are folded using a newline followed by a tab: >>> from email.MIMEText import MIMEText >>> m = MIMEText('foo') >>> m['Subject']='AA '*40 >>> m.as_string() 'Content-Type: text/plain; charset="us-ascii"\nMIME-Version: 1.0\nContent-Transfer-Encoding: 7bit\nSubject: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA\n\tAA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA \n\nfoo' While RFC 822 section 3.1.1 doesn't forbid this type of folding, the current behaviour mis-renders in both Thunderbird and Outlook. Messages generated by Thunderbird and Outlook represent the above subject as follows: 'Content-Type: text/plain; charset="us-ascii"\nMIME-Version: 1.0\nContent-Transfer-Encoding: 7bit\nSubject: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA\n AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA \n\nfoo' Could the email library be adjusted to do the same? Years of wondering why mails re-sent by mailman, mails from any number of python web systems and recent complaints from customers of mine of mis-rendered emails could be solved by doing so.
msg67270 (view)	Author: Steve Baker (splorgdar)	Date: 2008-05-23 21:33
I'm also hitting this problem. I'm including a demo script. The minimal condition for me to reproduce this bug is that the subject contain at least 76 characters of any kind, followed by a space, followed by at least 3 characters of any kind.
msg68562 (view)	Author: Andi Albrecht (aalbrecht)	Date: 2008-06-22 12:06
For me this issue seems to be a duplicate of issue1645148. In my opinion the test case that checks if headers created by strings or Header instances are equal is incorrect. It shouldn't set the continuation whitespace explicitly when creating a Headers instance. I would expect that A) msg['aheader'] = 'a long string that breaks, but shortened here' and B) msg['aheader'] = Header('a long string that breaks, but shortened here') result in the same output. But for A) a Header instance is initialized in Generator._write_headers() with continuation_ws set to `\t` and for B) the default ' ' is used (see http://bugs.python.org/msg31102). I'm uploading a patch that modifies test_string_headerinst_eq() to what I think it should look like with this message. Of course, this test will fail at the moment, but maybe you agree that the Header instance in this test should be initialized without explicitly setting continuation_ws.
msg68568 (view)	Author: Chris Withers (cjw296)	Date: 2008-06-22 16:06
Andi, I'm in total agreement with you :-) (so if this bug could get fixed, both issues could get closed)
msg68630 (view)	Author: Andi Albrecht (aalbrecht)	Date: 2008-06-23 14:45
IMO, the best way to resolve this is to use the default continuation whitespace when a Header class is created from a string. But I'm pretty unsure about the default whitespace character. After having a look at a random set of mails in my inbox it seems that in most cases a tab is used for all headers except for the subject (excluding mails sent from Python scripts of course...). I would prefer '\t'; it's allowed and results in a cleaner output - but it won't solve the problem with some email clients which was the starting point for this issue (a short note in the docs should resolve this anyway... ;-) But I wonder if for some reason setting headers from strings intentionally differs from using the Header class? There are some tests that explicitly test both cases and expect different output. In TestLongHeaders it's test_long_unbreakable_lines_with_continuation() test_long_lines_with_different_header(self) test_long_received_header() [continuation_ws set to tab here] So, what I would do is * remove continuation_ws keyword from Generator._write_headers() * make '\t' the default cont. whitespace for headers (Header class) * update test cases and the documentation accordingly Do you think that this is the right way to go? Or would it break too much code outside stdlib if the default ws changes?
msg68634 (view)	Author: Barry A. Warsaw (barry)	Date: 2008-06-23 14:59
I'm pretty convinced that this stuff is broken and needs serious attention. Unfortunately, I won't have time to address the email package before 2.6 and 3.0 get released. My current work lives at bzr+ssh://pythonbzr@code.python.org/python/users/barry/30email but it hasn't been updated for a while. I highly suggest taking this to the email-sig to see if anybody else has spare cycles to work in this for now.
msg68658 (view)	Author: Chris Withers (cjw296)	Date: 2008-06-23 23:06
Again, in total agreement with Andi EXCEPT: Please use ' ' instead of '\t' for the continuation character. It's the \t that gets mis-rendered by Outlook and Thunderbird (at the very least!) and since ' ' is also valid according to the RFC, we make life much easier for ourselves if we just stick to that...
msg68665 (view)	Author: Andi Albrecht (aalbrecht)	Date: 2008-06-24 04:33
FWIW, I've uploaded a patch to codereview: http://codereview.appspot.com/2407 It uses a space character as Chris suggested.
msg68750 (view)	Author: Ori Avtalion (salty-horse)	Date: 2008-06-25 19:42
I think there's been a little misinterpretation of the standard in the comments above. It's important to note that RFC 2822 basically defines folding as "adding a CRLF before an existing whitespace in the original message". See http://tools.ietf.org/html/rfc2822#section-2.2.3 It does not allow prepending folded lines with extra characters that were not in the original message such as '\t' or ' '. This is exactly what _encode_chunks does in header.py: joiner = NL + self._continuation_ws (Note that the email package docs and Header docstring use the word 'prepend' which is reflects the error in the code). With a correct implementation, why would I want to choice of which type of character to line-break on when folding? The whole notion of controlling the value of continuation_ws seems wrong. However, changing the default continuation_ws to ' ', as the patch suggests, will output syntactically correct headers in the majority of cases (due to other bugs that remove trailing whitespace and merge consecutive whitespace into one character). All in all, I agree with the change of the default continuation_ws due to its lucky side-effects, but as Barry hinted, the algorithm needs some serious work to really output valid headers. Some examples of the good and bad behaviors: >>> from email.Header import Header >>> l = ['<%d@dom.ain>' % i for i in range(8)] >>> # this turns out fine >>> Header(' '.join(l), continuation_ws=' ').encode() '<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>' # This does not fold even though it should >>> Header('\t'.join(l), continuation_ws=' ').encode() '<0@dom.ain>\t<1@dom.ain>\t<2@dom.ain>\t<3@dom.ain>\t<4@dom.ain>\t<5@dom.ain>\t<6@dom.ain>\t<7@dom.ain>' # And here the 4-char whitespace is shrinked into one >>> Header(' '.join(l), continuation_ws=' ').encode() '<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>'
msg68767 (view)	Author: Chris Withers (cjw296)	Date: 2008-06-26 08:01
Ori, I do agree with both you and Barry but is there any chance someone could make the one-character change to make the /t a space so we can stop seeing weirdness in common mail clients? Perhaps a separate issue could be raised to refactor this all correctly?

History
Date	User	Action	Args
2008-06-26 08:01:52	cjw296	set	messages: + msg68767
2008-06-25 19:42:05	salty-horse	set	nosy: + salty-horse messages: + msg68750
2008-06-24 04:34:00	aalbrecht	set	messages: + msg68665
2008-06-23 23:06:35	cjw296	set	messages: + msg68658
2008-06-23 14:59:24	barry	set	messages: + msg68634
2008-06-23 14:45:33	aalbrecht	set	messages: + msg68630
2008-06-22 16:06:05	cjw296	set	messages: + msg68568
2008-06-22 12:06:53	aalbrecht	set	files: + py_issue1974.diff keywords: + patch messages: + msg68562 nosy: + aalbrecht
2008-05-23 21:49:11	splorgdar	set	files: - test
2008-05-23 21:49:02	splorgdar	set	files: + test.py
2008-05-23 21:33:33	splorgdar	set	files: + test nosy: + splorgdar messages: + msg67270
2008-02-01 15:52:13	gvanrossum	set	assignee: barry nosy: + barry
2008-01-30 14:48:04	cjw296	create