Email from DevA to David June 18 2012
=====================================

Dear David,

With the implementation of the custom machine translation tool for the Latin American and Caribbean Regional At-Large Organisation (LACRALO) mailing lists since the middle of last year (2011), several issues continue to negatively impact the working of the machine translation tool and, in turn, have led to great difficulty in communication and collaboration between the English and Spanish speaking communities in the LAC region.

To recap, LACRALO has two mailing lists:

LACRALO list in English: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/
LACRALO list in Spanish: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/

Emails in English sent to lac-discuss-en@atlarge-lists.icann.org are machine translated via your custom tool using Google Translate and posted to lac-discuss-es@atlarge-lists.icann.org. Similarly, emails in Spanish sent to lac-discuss-es@atlarge-lists.icann.org are translated and posted to lac-discuss-en@atlarge-lists.icann.org.

To date, several issues have been detected:

1) Attachments in emails sent to a list are not received on the other list.
----------------------------------------------------------------------------

When an email with attachments such as PDFs is sent to one list, the subject line and body of the email are translated and sent to the other list BUT without the attachment.

2) Subject lines of translated emails from ES to EN become garbled.
---------------------------------------------------------------------

The subject lines of emails translated (seemingly) from the lac-discuss-es list to the lac-discuss-en list are often turned into garbled text. Examples abound from a review of the archives.
As one example:

(a) First email posted to lac-discuss-en list:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005932.html
Subject line: [lac-discuss-en] ICANN full list of applied for gTLD strings

(b) which is translated and posted to lac-discuss-es list as:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004552.html
Subject line: Lista completa de la ICANN solicitó cadenas de gTLD

(c) Someone on the lac-discuss-es list responds, posting to the lac-discuss-es list as:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004553.html
Subject line: [lac-discuss-es] Lista completa de la ICANN solicitó cadenas de gTLD

(d) which is translated and posted to the en list as:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005933.html
Subject: [lac-discuss-en] =? Iso-8859-1? Q? Lista_completa_de_la_ICANN_solici? == Iso-8859-1? Q? T = F3_cadenas_de_gTLD? =

Another example:

Email on lac-discuss-es list: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004518.html
Subject line: [lac-discuss-es] RES: Alerta de Noticias de la ICANN - Aviso de Prórroga del período que abarca la ICANN: ICANN FY13 Proyecto de Plan Operativo y Presupuesto

gets translated and posted as an email on lac-discuss-en list: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005897.html
Subject line: [lac-discuss-en] =? Utf-8? Q? RES = 3A_Alerta_de_Noticias_de_la_ICANN_? == Utf-8? Q?-_Aviso_de_Pr = C3 = C3 = ADodo_que_abarca_la_ B3rroga_del_per =? == Utf-8? Q? ICANN 3A_ICANN_FY13_Proyecto_de_Plan_Operativo_y_Presupu =? == utf-8? q? this? =

Note the difference with "Utf-8? Q?" in this example as compared to "Iso-8859-1? Q?" in the previous example.

Such gibberish in the subject lines can get even worse if someone responds on the EN list and the translation further scrambles the subject line on the other list.
Again, examples abound from a review of the archives, but as one example, consider the subject line for an email on the lac-discuss-es list:

Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004039.html
Subject line: [lac-discuss-es] =? Iso-8859-1? Q? Invitación = F3n_a_la_reuni = F3n_/_LAC? == Iso-8859-1? Q? RALO_Costa_Rica_Eventos_rueda_de_prensa_Grupo_de_Tr? == Iso-8859-1? Q? Abajo_el_martes_06_de_marzo_2012_a_las_20 = 3A00_UTC? =

which gets translated and posted to the EN list as:

Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005357.html
Subject line: [lac-discuss-en] =? Iso-8859-1? Q? = 3D = 3F_Iso-8859-1 = 3F_Q = 3F_Invitac? == Iso-8859-1? Q? I = F3n_ = 3D_F3n = 5FA = 5Fla = 5Freuni_ = 3D_F3n = 5F / = 5FLAC = 3F? == iso-8859-1? q? _ = 3D = 3D_Iso-8859-1 = 3F_Q = 3F_RALO = 5FCosta = 5FRica = 5FEv? == iso-8859-1? q ? ents = 5Frueda = 5Fde = 5Fprensa = 5FGrupo = 5Fde = 5FTr = 3F_? == iso-8859-1? q? = 3D = 3D_Iso-8859-1 = 3F_Q = 3F_Abajo = 5Fel 5Fmartes = 5F06 =? == iso-8859-1? q? = 5Fde = 5Fmarzo = 5F2012 = 5FA = 5Flas = 5F20_ = 3D_3A00 = 5FUTC? == iso-8859-1? q? = 3F_ = 3D? =

3) Missing [lac-discuss-es] in subject lines of translated emails posted to the lac-discuss-es list
----------------------------------------------------------------------------------------------------

Consider example #1 again - First email posted to lac-discuss-en list:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005932.html
Subject line: [lac-discuss-en] ICANN full list of applied for gTLD strings

which is translated and posted to lac-discuss-es list as:
Email: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004552.html
Subject line: Lista completa de la ICANN solicitó cadenas de gTLD

The email at http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004552.html shows that the subject line is missing the [lac-discuss-es].
This hampers filtering by ES users and makes it difficult to track threaded conversations.

4) Unusual superscript and other odd characters in translated emails
--------------------------------------------------------------------

There have been numerous complaints about the quality of the translation of the actual body of emails, with strange characters, some of which are superscript characters, appearing in the translated version. Examples abound (repeating phrase, I'm afraid) on the LACRALO list archives; here is one example:

Email to lac-discuss-en: http://atlarge-lists.icann.org/pipermail/lac-discuss-en/2012/005858.html
got translated to this on the lac-discuss-es list: http://atlarge-lists.icann.org/pipermail/lac-discuss-es/2012/004483.html

As you can see,
* a character like a double quote " is translated to "
* a word like "organisation" is translated to organización
* a sentence like "The highest decision making body in any organisation is also subject to rules." is translated to "El más alto órgano de decisión en cualquier la organización también está sujeto a reglas."

This is a summary of the key issues affecting the machine translation of emails in LACRALO. I hope to have the opportunity to chat with you in Prague and trust the identification of the issues in this email is sufficient to clarify the problems so that solutions can be developed.

Kind Regards,
Dev Anand Teelucksingh

------------------------

Response from David Closson June 21 2012
========================================

Greetings Dev,

The developer of the translation bot, Kent Crispin, has provided some detailed responses. I was aware of most of the issues but preferred to have Kent respond since 2 of the issues will indeed require significant development effort. I am waiting to get additional feedback about whether item 2 can be prioritized and resourced this fiscal year. Item 1 would be more time consuming and was not part of the original requirements. Item 3 is likely a quick fix and may be delivered quite soon.
These responses are somewhat technical in nature and perhaps need some translation for general consumption.

Regarding 1, attachments not being carried through the translation process:

That the translation engine would only work on the text part of an email was an explicit design limitation. There are two lists configured in Mailman that are linked through the transbot email address. These lists are configured independently; if one of them is configured to accept attachments, then that list will get attachments sent to all the members. But emails that go through the transbot are stripped of all attachments, and so the translated message forwarded to the other list loses the attachments.

I believe it would be possible to make the transbot carry attachments through, but it would be a significant development effort. MIME-encoded email messages can have quite complex recursive structure, and preserving that structure through the text mangling associated with translation is a potentially very error-prone and complex task. But, since MIME is fundamentally well-specified, I think it could be done correctly, if sufficient effort were deployed. Potentially many person-months.

Regarding 2, subject lines of translated emails from ES to EN become garbled:

Starting out with a little background: The base encoding for all email messages is US ASCII. The various MIME standards describe how to encode other character sets and data types into US ASCII, so that they can be sent over email. RFC 2047, in particular, describes how MIME is applied to email header fields. Basically, a string like "this is some text" that is encoded in, for example, UTF-8 (instead of US ASCII), should not be put in the Subject: header of an email. Instead, it would be encoded as

    Subject: =?utf-8?q?this=20is=20some=20text?=

A mail reader would convert this back to

    Subject: this is some text

in utf-8.
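To make the RFC 2047 round trip concrete, here is a minimal sketch using Python's standard `email.header` module. It is an illustration only, not the transbot's code: the helper names `encode_subject` and `decode_subject` are hypothetical. The last part shows one plausible way garbling can arise, namely when an already-encoded header is treated as plain text and encoded a second time, which produces `=3D` and `=3F` runs similar in shape to those in the examples reported above.

```python
from email.header import Header, decode_header, make_header


def encode_subject(text: str, charset: str = "utf-8") -> str:
    """Encode a subject as RFC 2047 encoded-word(s) in the given charset.

    Hypothetical helper for illustration; not part of the transbot.
    """
    return Header(text, charset).encode()


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a raw header value back to text."""
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    # Decoding the example from the explanation above.
    print(decode_subject("=?utf-8?q?this=20is=20some=20text?="))

    # Round trip of a subject containing a non-ASCII character.
    subject = "Lista completa de la ICANN solicitó cadenas de gTLD"
    wire_form = encode_subject(subject)
    print(wire_form)
    print(decode_subject(wire_form))

    # Assumed failure mode: the encoded form is mistaken for plain text
    # and encoded again, so "=" and "?" themselves get escaped.
    once = "=?iso-8859-1?q?Invitaci=F3n?="
    twice = encode_subject(once, "iso-8859-1")
    print(twice)
    # A single decode now only recovers the inner encoded word;
    # a second decode is needed to reach the readable text.
    print(decode_subject(twice))
    print(decode_subject(decode_subject(twice)))
```

If something like this is what happens, a step that decodes the header to text before translation and encodes the translated text exactly once afterwards would avoid the double encoding, though how individual mail clients and the web archive then render the result is a separate question.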
A mail client running on a system with a base character set of utf-8 would automatically encode email header fields as above, so that the base form of the email message would all be in US ASCII. Mail readers on the recipient side would in theory convert the message back to utf-8; but it may be that the recipient's native character set isn't utf-8 -- it might be iso-8859-1, or something else, in which case the appearance of utf-8 characters would be potentially garbled.

It should be noted that there is another complication. I suspect that the examples presented are from the web archive of the list. However, the presentation on the web archive is filtered through the archiving software, the web server's base character set, and how Apache is configured. It is theoretically possible that the appearance in various mail clients is different (though I suspect just as garbled).

At this point I don't know if this is a solvable problem or not. ("Unsolvable" would mean that there are too many unknowns in how different mail clients are handling this to make it possible to come up with a solution.) If it is solvable, it will take a significant coding effort. In addition, it will take significant research and testing. Overall, the effort is measured in months.

I should point out, however, that this is the kind of problem that might be solved with a clever insight, or an adjustment of requirements that gave a passable solution. Unfortunately, I'm not seeing those.

3. Missing [lac-discuss-es] in subject lines

This is probably something that can be fixed, and I will look into it.

Email from DevA to David on July 20 2012
========================================

Dear David,

The next Technology Taskforce call will be on Monday 23 July 2012 from 17:00 UTC. Sorry for not sending this email earlier regarding our chat in the corridor in Prague to discuss the LACRALO machine translation list problems. Trust you are well. Any further updates you can provide on the 4 issues since Prague?
I think the key additional points in the corridor were:
- polling LACRALO members regarding what email clients they use to send and receive emails
- enabling RSS feeds of email lists as a workaround to deal with the encoding issues of subject lines and bodies of emails

Email from David to DevA on July 20 2012
========================================

I am not available for the call on Monday but Simon Raveh (Director of Software Development) and I would like to attend a meeting at a near future date.

> polling LACRALO members regarding what email clients they use to
> send and receive emails

A) Yes, would be interesting.

> enabling RSS feeds of email lists as a workaround to deal with the
> encoding issues of subject lines and bodies of emails

A) Yes, I think there is a module to allow RSS feeds from Mailman. This is the most promising avenue for us to discuss.

Outcomes from Technology Taskforce Call on July 23 2012
=========================================================

With Kent Crispin attending (audio at nearly 1 hour into the call)

Re: polling LACRALO members regarding what email clients they use to send and receive emails

Not considered because
- the number of factors (email client, operating system, etc.) needed from persons on the LACRALO list would be difficult to provide
- the resulting data would be a large matrix in which it would be difficult to detect any trend.

Re: enabling RSS feeds of email lists as a workaround to deal with the encoding issues of subject lines and bodies of emails
- interesting idea to consider. IT staff will evaluate.