German Umlauts break search #18

Open
MaStr opened this issue Nov 27, 2012 · 10 comments

Comments

@MaStr
Contributor

MaStr commented Nov 27, 2012

Hi,
On a remote system, I have a file with an "ö" in its name.
This breaks the search on every connected Forban.

Browsing works.

Error in forban_share_error.log:

--- will be included soon ---

Any idea?

Matthias

@MaStr
Contributor Author

MaStr commented Nov 27, 2012

[21/Nov/2012:08:02:55] HTTP Traceback (most recent call last):
File "/opt/forban/lib/ext/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/opt/forban/lib/ext/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/opt/forban/lib/ext/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/opt/forban/bin/forban_share.py", line 246, in q
html += """%s %s """ % (foundfiles[0].rsplit(",",1)[0],forban_geturl(uuid=foundfiles[1],filename=filename),discoveredloot.getname(foundfiles[1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

[21/Nov/2012:08:02:55] HTTP
Request Headers:
REFERER: http://piratebox.lan:12555/
HOST: piratebox.lan:12555
CONNECTION: keep-alive
CACHE-CONTROL: max-age=0
Remote-Addr: ::ffff:192.168.1.168
ACCEPT: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
ACCEPT-CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
USER-AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip,deflate,sdch
[21/Nov/2012:08:03:01] HTTP Traceback (most recent call last):
File "/opt/forban/lib/ext/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/opt/forban/lib/ext/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/opt/forban/lib/ext/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/opt/forban/bin/forban_share.py", line 246, in q
html += """%s %s """ % (foundfiles[0].rsplit(",",1)[0],forban_geturl(uuid=foundfiles[1],filename=filename),discoveredloot.getname(foundfiles[1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

[21/Nov/2012:08:03:01] HTTP
Request Headers:
REFERER: http://piratebox.lan:12555/
HOST: piratebox.lan:12555
CONNECTION: keep-alive
CACHE-CONTROL: max-age=0
Remote-Addr: ::ffff:192.168.1.168
ACCEPT: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
ACCEPT-CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
USER-AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip,deflate,sdch
[21/Nov/2012:08:03:08] HTTP Traceback (most recent call last):
File "/opt/forban/lib/ext/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/opt/forban/lib/ext/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/opt/forban/lib/ext/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/opt/forban/bin/forban_share.py", line 246, in q
html += """%s %s """ % (foundfiles[0].rsplit(",",1)[0],forban_geturl(uuid=foundfiles[1],filename=filename),discoveredloot.getname(foundfiles[1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

[21/Nov/2012:08:03:08] HTTP
Request Headers:
REFERER: http://piratebox.lan:12555/
HOST: piratebox.lan:12555
CONNECTION: keep-alive
CACHE-CONTROL: max-age=0
Remote-Addr: ::ffff:192.168.1.168
ACCEPT: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
ACCEPT-CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
USER-AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip,deflate,sdch
[21/Nov/2012:08:03:45] HTTP Traceback (most recent call last):
File "/opt/forban/lib/ext/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/opt/forban/lib/ext/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/opt/forban/lib/ext/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/opt/forban/bin/forban_share.py", line 246, in q
html += """%s %s """ % (foundfiles[0].rsplit(",",1)[0],forban_geturl(uuid=foundfiles[1],filename=filename),discoveredloot.getname(foundfiles[1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

[21/Nov/2012:08:03:45] HTTP
Request Headers:
REFERER: http://piratebox.lan:12555/
HOST: piratebox.lan:12555
CONNECTION: keep-alive
CACHE-CONTROL: max-age=0
Remote-Addr: ::ffff:192.168.1.168
ACCEPT: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
ACCEPT-CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
USER-AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip,deflate,sdch

@MaStr
Contributor Author

MaStr commented Nov 27, 2012

By the way: in the browse list, the umlaut looks like a UTF-8 character (two bytes).
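
The two-byte observation matches the traceback: in UTF-8, "ö" is encoded as the bytes 0xc3 0xb6, and 0xc3 is the byte the ascii codec rejects. A minimal sketch of that failure, using an explicit `decode("ascii")` to stand in for the implicit decode that Python 2 performs when a byte string is mixed with a unicode string (that this is what happens at line 246 of forban_share.py is an assumption, and the filename is illustrative):

```python
# Illustrative filename; the real one from the report is not known.
name = u"\u00f6.txt"            # "ö.txt" as text
raw = name.encode("utf-8")      # the bytes as they would travel between Forbans

print(repr(raw))                # one character became two bytes: 0xc3 0xb6

error = None
try:
    # Python 2 does this implicitly (with the default 'ascii' codec)
    # whenever a non-ASCII byte string meets a unicode string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```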

@adulau
Owner

adulau commented Nov 27, 2012

Hi Matthias,

I did a small test writing a file named "ö.txt" in the share directory.

http://127.0.0.1:12555/q/?v=%C3%B6

I didn't get the same exception. Could you start a Python interpreter on the server and check
the default encoding?

import sys
print sys.getdefaultencoding()

Just to be sure.
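
For reference, the `v=%C3%B6` in the test URL is the percent-encoded UTF-8 form of "ö". A small sketch that decodes it explicitly and prints the interpreter's default encoding. It is written to run on both Python 2 and Python 3, which goes beyond the Python 2 setup discussed here, and the variable names are illustrative:

```python
import sys

try:
    from urllib.parse import unquote_to_bytes            # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes       # Python 2: returns a byte str

query_value = "%C3%B6"                    # value of ?v= in the test URL
raw = unquote_to_bytes(query_value)       # the two UTF-8 bytes 0xc3 0xb6
text = raw.decode("utf-8")                # explicit decode, independent of the default

print(sys.getdefaultencoding())           # 'ascii' on a stock Python 2, 'utf-8' on Python 3
print(repr(raw))
```

Decoding explicitly with "utf-8" gives the same text regardless of what `sys.getdefaultencoding()` reports.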

@MaStr
Contributor Author

MaStr commented Dec 26, 2012

root@rPt4WCYo:/# python
Python 2.7.3 (default, Nov 3 2012, 11:37:47)
[GCC 4.6.3 20120201 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import sys
>>> print sys.getdefaultencoding()
ascii

@adulau
Owner

adulau commented Jan 5, 2013

I found the issue(s), but I'm currently struggling with how to fix it properly. The problem comes from the incoming filename value (encoded in UTF-8): the default codec in Python (for forban_share, lines 243 to 250) is usually ascii, and the filename is then passed back into Python's base64 library, where UTF-8 is not appreciated...

I tested with some ".decode("utf-8").encode("latin-1")", but it doesn't work consistently across Python versions, especially given the site configuration of the encoding. If you have any ideas, let me know. I'll check some other ideas.

@MaStr
Contributor Author

MaStr commented Jan 5, 2013

Is it possible to redefine the default encoding around the base64 decoding and switch it back to ascii afterwards?
import sys; sys.setdefaultencoding('utf-8')
Or what about reducing every filename to ascii everywhere (while hashing, searching, and so on)?

I learned a few things about all this sh**** encoding stuff while doing system administration and user support for IBM WebSphere MQ: you have to know which encoding enters your system and which one you use internally (i.e. during modification).
I think one problem may be a filename on disk that is not encoded in UTF-8 but contains special characters in, e.g., ISO...-15.

The complete platform-independent steps should be something like this:

  1. Get Filename
  2. Convert Filename to UTF (if it already is, this shouldn't change anything)
  3. encode to base64
  4. decode to string in UTF (assuming you can accept UTF encoding while decoding)
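
The four steps above can be sketched as follows. This is a minimal sketch, not Forban's actual code: the helper names `to_text`, `encode_filename` and `decode_filename` are hypothetical, and the ISO-8859-15 fallback is an assumption for filenames on disk that are not valid UTF-8.

```python
import base64


def to_text(raw, fallback="iso-8859-15"):
    """Steps 1-2: turn a filename read from disk into text.

    Tries UTF-8 first; the fallback codec is an assumption for legacy names.
    """
    if not isinstance(raw, bytes):
        return raw                      # already text, nothing to do
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def encode_filename(name):
    """Step 3: text -> UTF-8 bytes -> base64 token (pure ASCII text)."""
    return base64.b64encode(name.encode("utf-8")).decode("ascii")


def decode_filename(token):
    """Step 4: base64 token -> UTF-8 bytes -> text, decoded explicitly."""
    return base64.b64decode(token.encode("ascii")).decode("utf-8")


name = to_text(b"\xc3\xb6.txt")         # UTF-8 bytes for "ö.txt"
token = encode_filename(name)
print(token)
print(decode_filename(token) == name)
```

The point of the design is that every conversion names its codec, so the result does not depend on the interpreter's default encoding.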

If the normal base64.decode can't handle this well, you could try this library for encoding and decoding: http://docs.python.org/2.7/library/binascii.html?highlight=binascii#binascii

At a glance, it looks like "convert any byte array to hex" functionality. This should work like the default base64 function, with the drawback that you have to convert back to a string again.
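
A minimal sketch of the hex variant, using `binascii.hexlify` and `binascii.unhexlify` from the standard library. The helper names `hex_encode` and `hex_decode` are hypothetical, and the explicit UTF-8 step is the "convert back to string" part mentioned above:

```python
import binascii


def hex_encode(name):
    """Text filename -> UTF-8 bytes -> hex token (pure ASCII text)."""
    return binascii.hexlify(name.encode("utf-8")).decode("ascii")


def hex_decode(token):
    """Hex token -> UTF-8 bytes -> text filename."""
    return binascii.unhexlify(token.encode("ascii")).decode("utf-8")


token = hex_encode(u"\u00f6.txt")       # "ö.txt"
print(token)
print(hex_decode(token) == u"\u00f6.txt")
```

The hex token is twice as long as the UTF-8 bytes, whereas base64 grows them by about a third.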

@adulau
Owner

adulau commented Jan 6, 2013

Thanks for the feedback.

Step 4 is exactly where the issue is. Python's base64 module also relies on the binascii module. I'll give it another try.

@MaStr
Contributor Author

MaStr commented Jan 31, 2013

Hi,
I just found out that this issue also breaks the "remote download" functionality.
When you visit Forban on your box, click "browse" in the line of another Forban and then "get", you receive a 404 error saying that /s/ is not available.

:(
Matthias

@toebbel

toebbel commented Feb 8, 2013

Try this: Add the following lines to your app config.

tools.decode.on = True
tools.encode.on = True
tools.encode.encoding = "utf-8"
tools.decode.encoding = "utf-8"

via http://stackoverflow.com/a/4915497/359326
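
The same four settings can also be written as a Python dict and handed to CherryPy when the application is mounted. This is only a sketch: applying them to the "/" path is an assumption, and `Root` in the comment is a hypothetical handler class, not Forban's.

```python
# Per-path CherryPy configuration as a plain dict; "/" is an assumed mount path.
config = {
    "/": {
        "tools.decode.on": True,             # decode incoming request parameters
        "tools.decode.encoding": "utf-8",
        "tools.encode.on": True,             # encode outgoing response bodies
        "tools.encode.encoding": "utf-8",
    }
}

# With CherryPy installed, the dict would be passed at mount time, e.g.:
#   cherrypy.quickstart(Root(), "/", config)   # Root is hypothetical
```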

@adulau
Owner

adulau commented Feb 8, 2013

Yep, I tried that some time ago, but the result varies depending on the Python 2 version and the platform it's running on. I'll build a set of test cases to find the origin of the issue. Thank you.


3 participants