[Mono-dev] mcs patch for default encoding
kornelpal at hotmail.com
Tue Aug 23 06:48:24 EDT 2005
I've tried to compile a 2 GB file using csc.exe: I got an out of memory
error. Then I reduced the size to 500 MB but I still got out of memory.
Finally I was able to compile a 200 MB file.
I got error CS1034: Compiler limit exceeded: Line cannot exceed 2046 characters.
So I added line breaks as well. And added // to the beginning of each line
to add some non-whitespace chars just for fun and to test the compiler.:)
The first non-ASCII character is very near to the end of the file. csc.exe
compiled it correctly. UTF-8 and ACP as well. DétectEncoding was compiled
correctly in both cases. I attached the test cases (about 200 MB each).
So I think csc.exe parses the whole file to detect UTF-8 and has poor memory
management in addition.:) Maybe it caches the source file using its own
----- Original Message -----
From: "Kornél Pál" <kornelpal at hotmail.com>
To: "Atsushi Eno" <atsushi at ximian.com>
Cc: "Marek Safar" <marek.safar at seznam.cz>; "mono-devel mailing list"
<mono-devel-list at lists.ximian.com>
Sent: Tuesday, August 23, 2005 11:53 AM
Subject: Re: [Mono-dev] mcs patch for default encoding
> There is no other solution to detect UTF-8 without BOM so csc.exe has to do
> the same.:) But this test could be done only on the first n bytes of a
> stream then it could be assumed that the rest of the stream has the same
> ----- Original Message -----
> From: "Atsushi Eno" <atsushi at ximian.com>
> To: "Kornél Pál" <kornelpal at hotmail.com>
> Cc: "mono-devel mailing list" <mono-devel-list at lists.ximian.com>; "Marek
> Safar" <marek.safar at seznam.cz>
> Sent: Tuesday, August 23, 2005 11:50 AM
> Subject: Re: [Mono-dev] mcs patch for default encoding
>>I don't think this is acceptable because of its significant
>> performance loss (reading the entire stream)...
>> Atsushi Eno
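
The performance concern above, and the "first n bytes" idea raised in reply to it, can be illustrated with a small sketch. It is written in Python rather than C# and is not the mcs patch; the function name `looks_like_utf8` and the `limit` parameter are hypothetical. It assumes that validating only a prefix is acceptable, which the csc.exe test described at the top of this message suggests is not what csc.exe does.

```python
import codecs


def looks_like_utf8(data: bytes, limit: int = 4096) -> bool:
    """Check whether the first `limit` bytes of `data` are valid UTF-8.

    Hypothetical helper: only the prefix is validated, so invalid bytes
    after `limit` go unnoticed.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    prefix = data[:limit]
    # If the prefix is the whole input, an incomplete trailing sequence is an
    # error. Otherwise the cut may fall inside a multi-byte character, so an
    # incomplete tail must be tolerated.
    is_whole_input = len(data) <= limit
    try:
        decoder.decode(prefix, final=is_whole_input)
    except UnicodeDecodeError:
        return False
    return True
```

The trade-off is visible in the last case below: a file whose first non-ASCII byte lies beyond the prefix is reported as UTF-8 even when it is not.

```python
looks_like_utf8("Détect".encode("utf-8"))          # valid UTF-8
looks_like_utf8(b"D\xe9tect")                      # ANSI byte, rejected
looks_like_utf8(b"a" * 100 + b"\xe9t", limit=10)   # invalid byte is past the prefix
```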
>> Kornél Pál wrote:
>>> Character set detection.
>>> This code uses a UTF8Encoding with throwOnInvalidBytes. StreamReader
>>> detects the BOM (UTF-8, Unicode, Unicode (Big-Endian)). UTF-8 is easy to
>>> validate as it has strict rules regarding the byte representation of
>>> characters. So it's safe to assume that a text is UTF-8 if it can be
>>> parsed as UTF-8. UTF8Encoding (with throwOnInvalidBytes) throws
>>> ArgumentException when it is not UTF-8. In this case fall back to
>>> Encoding.Default.
>>> Unicode (16-bit) is not detected by csc.exe without BOM so I think we
>>> shouldn't deal with it.
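
The detection scheme quoted above (accept a UTF-8 BOM, otherwise try a strict UTF-8 decode, and fall back to the system default code page on failure) might look roughly like this. It is a Python sketch, not the C# code from the patch. The name `decode_source` is hypothetical, and the `fallback` default of `cp1252` is only an assumed stand-in for `Encoding.Default`. As in the quoted text, 16-bit Unicode is not handled.

```python
def decode_source(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_name) for the raw bytes of a source file.

    Hypothetical helper: `fallback` plays the role of Encoding.Default.
    """
    try:
        # "utf-8-sig" strips a leading UTF-8 BOM if present and, with the
        # default strict error handling, raises on any invalid byte sequence.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy ANSI code page instead.
        return data.decode(fallback), fallback
```

Note that this reads and validates the entire input, which is exactly the cost objected to elsewhere in the thread.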
>> Mono-devel-list mailing list
>> Mono-devel-list at lists.ximian.com
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 41115 bytes
Desc: not available
Url : http://lists.ximian.com/pipermail/mono-devel-list/attachments/20050823/507dd10b/attachment.obj