I need to generate file names from user-entered names. These names could be in any language. For example:
These are user-entered values, so I have no guarantee that the names don't contain characters that are invalid in file names. Users will be downloading these files from their browser, so I need to ensure the file names are valid on all operating systems in all configurations. I am currently doing this for English-speaking countries by removing all non-alphanumeric characters with a simple regex:
Some example conversions:
Obviously this does not work internationally. I've considered finding or generating a blacklist of all characters that are invalid on any file system and stripping those from the names, but I've been unable to find a comprehensive list. I'd prefer to use existing code in a common library if possible. I imagine this is an already solved problem, but I can't find a solution that works internationally.

The filename is for the user downloading the file, not for me. I'm not going to be storing these files. They are generated dynamically by the server, on request, from data in a database. The filenames are for the convenience of the person downloading the file.
Regex

Assuming that you want to filter user input for valid file names by replacing invalid file-name characters such as
The output:
The valid filenames, even those with Unicode characters, will be displayable on any webpage that supports UTF-8 encoding with a suitable Unicode font. In addition, each will be the correct name for its file on any OS file system that supports Unicode (tested on Windows XP and Windows 7). But if you want to pass each valid filename as a URL string, make sure to encode it properly using
Encode the filename as UTF-8, and then URL-encode the result.
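
A minimal sketch of that suggestion in Java, assuming the standard `java.net.URLEncoder` is acceptable. The class and method names are made up for illustration. Note that `URLEncoder` produces form-style encoding, where a space becomes `+`, so the sketch rewrites that to `%20`; whether you need this depends on where the encoded name ends up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of any library.
class FileNameEncoder {
    // Percent-encodes the UTF-8 bytes of the name.
    static String encode(String name) {
        try {
            // A literal '+' in the input is already encoded as %2B at this point,
            // so any '+' left in the result stands for a space.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `FileNameEncoder.encode("\u00e9t\u00e9 2010.pdf")` gives `%C3%A9t%C3%A9%202010.pdf`.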
Windows appears to support Unicode file names, I know Linux does, and apparently OS X does too. Presumably a well-written browser would fix invalid characters in a file name before saving it. It seems like you should be able to just use Unicode file names. Is there some OS or browser that this doesn't work on?
My advice would be to make it a requirement that your application runs on a platform that supports Unicode filenames. Most do these days. I don't think it is feasible to map from Unicode to an (unspecified) restricted character set while still retaining human readability AND the original meaning AND avoiding collisions. Indeed, it is not even possible to do this mapping from Latin-1 to ASCII. If your application has to run on platforms that don't support Unicode filenames, then you will need to sacrifice human readability and/or meaning in the filenames in some cases. Besides, consider whether (for example) ASCII-ized Chinese characters, Cyrillic letters, or letters with their accents stripped off are going to be acceptable to your end users. What I'd do is offer the user two options to select from:
In reality, if the user's machine doesn't support Unicode, they are going to have huge problems dealing with textual names that are not encoded using the machine's native encoding. There's no completely reliable way to find out what that encoding is. Even if you have a semi-reliable way of figuring it out on the server side, the problem of mapping all of Unicode to that encoding is intractable. It is better to encourage the user to upgrade their operating system to a Unicode-capable one.
Summarizing and paraphrasing @eee's answer...
(not joining multiple spaces into one!)
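
The original code for this summary is not shown here, so the following is only a sketch of what such a filter might look like, not the answer's actual regex. It assumes the characters to replace are the ones Windows rejects in file names (`\ / : * ? " < > |`) plus ASCII control characters, and that `_` is an acceptable replacement. The class name is hypothetical. Everything else, including non-Latin letters and runs of spaces, passes through unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FileNameSanitizer {
    // Assumed blacklist: characters Windows rejects, plus ASCII control characters.
    private static final Pattern INVALID =
            Pattern.compile("[\\\\/:*?\"<>|\\p{Cntrl}]");

    // Replaces each invalid character with one underscore; nothing is collapsed.
    static String sanitize(String input) {
        return INVALID.matcher(input).replaceAll("_");
    }
}
```

This does not cover every rule: Windows also reserves names such as `CON` and `NUL` and dislikes trailing dots or spaces, so a production filter would need extra checks.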
Letting the input determine a file name without proper sanitizing seems open to attack. You can use a hash function (SHA-1, MD5) to generate a valid filename. Just be aware that you can't derive the original name from the hash. Alternatively, if you can keep a simple lookup table, you can assign special identifiers to the names (such as sequential numbers or GUIDs) and use the identifier as the filename. One more thing: have you thought about homonyms?
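
A rough sketch of the hash option, assuming SHA-1 over the UTF-8 bytes of the name, rendered as lowercase hex. The class and method names are invented for the example. The result contains only `0-9` and `a-f`, so it is safe as a file name on any system, at the cost of being meaningless to the user.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
class HashedFileName {
    // Returns the 40-character hex SHA-1 digest of the name's UTF-8 bytes.
    static String fromName(String name) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Identical names always produce the same hash, so this does nothing for the homonym problem on its own.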