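Removing illegal characters from file names in Java (How to convert strings in any language and character set to valid filenames in Java?)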

Question

I need to generate file names from user inputted names. These names could be in any language. For example:

"John Smith"
"高岡和子"
"محمد سعيد بن عبد العزيز الفلسطيني"

These are user-inputted values, so I have no guarantee that the names don't contain characters that are invalid in file names.

Users will be downloading these files from their browser, so I need to ensure the file names are valid on all operating systems in all configurations.

I am currently doing this for English speaking countries by simply removing all non-alphanumeric characters with a simple regex:

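string = string.replaceAll("[^a-zA-Z0-9\\s]", ""); // drop everything except ASCII letters, digits and whitespace
string = string.replaceAll("\\s+", "_");           // then turn each whitespace run into one underscore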

Some example conversions:

"John Smith" -> "John_Smith.ext"
"John O'Henry" -> "John_OHenry.ext"
"John van Smith III" -> "John_van_Smith_III.ext"

Obviously this does not work internationally.

I've considered finding/generating a blacklist of all characters that are invalid on all file systems and stripping those from the names. I've been unable to find a comprehensive list.

I'd prefer to use existing code in a common library if possible. I imagine this is an already solved problem, however I can't find a solution that works internationally.

The filename is for the user downloading the file, not for me. I'm not going to be storing these files. These files are dynamically generated by the server upon request from data in a database. The filenames are for the convenience of the person downloading the file.

What happens if you just return a unicode file name? I would assume the operating system could figure this sort of thing out (but I wouldn't be surprised if some don't).
I know spaces are valid, I just prefer underscores to spaces.
Whitelists are better than blacklists.... It's hard to enumerate all evil...
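
As a rough illustration of the whitelist idea in a form that is not limited to ASCII, the sketch below keeps letters, combining marks and digits from any script by using Unicode general categories in the regex. The class and method names (UnicodeWhitelist, toFileName) are made up for this example, and the choice of allowed categories is an assumption, not something taken from the thread. The test name is written with \u escapes (it is the Japanese name from the question) so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class UnicodeWhitelist {
    // Matches anything that is NOT a letter (\p{L}), combining mark (\p{M}),
    // digit (\p{N}) or ASCII whitespace. Works for any script.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\s]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: drop non-whitelisted characters, then
    // replace each whitespace run with a single underscore.
    static String toFileName(String name) {
        String kept = NOT_ALLOWED.matcher(name).replaceAll("");
        return WHITESPACE.matcher(kept.trim()).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(toFileName("John O'Henry"));
        System.out.println(toFileName("\u9AD8?\u5CA1\u548C\u5B50*"));
    }
}
```

Note that this still removes punctuation such as apostrophes and hyphens, so it trades some fidelity for safety.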

eee · Accepted Answer · 2012-04-14 10:42:44Z

The regex [^a-zA-Z0-9] removes every character that is not an ASCII letter or digit, so it also strips all non-ASCII characters (code points above 127), which is why names in other scripts come out empty.

Assuming that you want to filter user input for valid file-names by replacing invalid file-name characters such as ? \ / : | < > * with underscore (_):

public class ReplaceI18N {

    public static void main(String[] args) {
        String[] names = {
                "John Smith",
                "高岡和子",
                "محمد سعيد بن عبد العزيز الفلسطيني",
                "|J:o<h>n?Sm\\it/h*",
                "高?岡和\\子*",
                "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\"
                };

        for (String s : names) {
            // Java strings are already Unicode (UTF-16), so no re-encoding is needed.
            // Round-tripping through getBytes() can corrupt text when the platform
            // default charset is not UTF-8.
            String u = s.replaceAll("[\\?\\\\/:|<>\\*]", " "); // replace ? \ / : | < > * with a space
            u = u.replaceAll("\\s+", "_");                      // collapse each whitespace run into one underscore
            System.out.println(s + " = " + u);
        }
    }
}

The output:

John Smith = John_Smith
高岡和子 = 高岡和子
محمد سعيد بن عبد العزيز الفلسطيني = محمد_سعيد_بن_عبد_العزيز_الفلسطيني
|J:o<h>n?Sm\it/h* = _J_o_h_n_Sm_it_h_
高?岡和\子* = 高_岡和_子_
محمد /سعيد بن عبد ?العزيز :الفلسطيني\ = محمد_سعيد_بن_عبد_العزيز_الفلسطيني_

The resulting filenames, including those with Unicode characters, can be displayed on any web page that uses UTF-8 encoding, provided a font covering those characters is available.

In addition, each one is a valid file name on any OS file system that supports Unicode (tested OK on Windows XP and Windows 7).


But if you want to pass a valid filename as part of a URL, make sure to encode it properly using URLEncoder, and later decode the encoded string using URLDecoder.
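
A minimal sketch of that round trip, assuming UTF-8 is the agreed encoding on both sides. The class and method names (UrlRoundTrip, encode, decode) are only illustrative. Keep in mind that URLEncoder produces form encoding, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlRoundTrip {
    // Percent-encode a file name as UTF-8 (form encoding: space becomes '+').
    static String encode(String fileName) {
        try {
            return URLEncoder.encode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(): turn the percent-encoded form back into the name.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u9AD8\u5CA1\u548C\u5B50"; // the Japanese name from the question
        String encoded = encode(name);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(name));
    }
}
```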

@IgnacioVazquez-Abrams If we want to pass the valid filename as a URL string, then convert the string using URLEncoder like what you have suggested.

Ignacio Vazquez-Abrams · Answer 2 · 2012-04-14 04:02:34Z


Encode the filename as UTF-8, and then URL-encode the result.

'高岡和子' -> '%E9%AB%98%E5%B2%A1%E5%92%8C%E5%AD%90'


	The filename needs to contain the name in human readable fashion. In your example the file name needs to be "高岡和子.ext" – leros Apr 14 '12 at 4:03
	You encode on download. You store it normally locally. – Ignacio Vazquez-Abrams Apr 14 '12 at 4:04
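
One way "encode on download" is commonly done is through the Content-Disposition response header, using the filename* parameter described in RFC 6266 and RFC 5987. The sketch below is an assumption about how that could look here, not something stated in the thread: the class and method names are invented, and it presumes the user's browser understands filename*. URLEncoder output is adjusted because that header syntax wants "%20" for spaces and does not allow a raw "*".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ContentDispositionSketch {
    // Build a Content-Disposition value carrying a UTF-8 file name.
    static String headerValue(String fileName) {
        String encoded;
        try {
            encoded = URLEncoder.encode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        // URLEncoder emits '+' for spaces and leaves '*' untouched;
        // literal '+' in the input is already encoded as %2B, so this is safe.
        encoded = encoded.replace("+", "%20").replace("*", "%2A");
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(headerValue("\u9AD8\u5CA1\u548C\u5B50.ext"));
    }
}
```

In a servlet, the value would typically be passed to response.setHeader("Content-Disposition", ...); the browser then saves the file under the decoded, human readable name.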

Brendan Long · Answer 3 · 2012-04-14 04:03:47Z


Windows appears to support Unicode file names, I know Linux does, and apparently OS X does too. Presumably a well-written browser would fix invalid characters in a file name before saving it.

It seems like you should be able to just use unicode file names. Is there some OS or browser that this doesn't work on?


OS X supports them in a... non-conventional way though. – Ignacio Vazquez-Abrams Apr 14 '12 at 4:06

Stephen C · Answer 4 · 2012-04-14 11:54:03Z

My advice would be to make it a requirement that your application runs on a platform that supports Unicode filenames. Most do these days.

I don't think it is feasible to map from Unicode to an (unspecified) restricted character set, while still retaining human readability AND the original meaning AND avoiding collisions. Indeed, it is not even possible to do this mapping from Latin-1 to ASCII.

If your application has to run on platforms that don't support Unicode filenames, then you will need to sacrifice human readability and/or meaning in the filenames in some cases. Besides, consider whether (for example) ASCII-ized Chinese characters or Cyrillic letters or letters with accents stripped off are going to be acceptable to your end users.
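
To make the collision and readability problem concrete, here is a small sketch that strips accents with java.text.Normalizer and then drops whatever is still not ASCII. The class and method names are invented for the example. Two different names can end up identical, and a name written entirely in a non-Latin script disappears altogether.

```java
import java.text.Normalizer;

class AsciiFoldingDemo {
    // Decompose accented letters (NFD) and remove the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Strip accents, then discard every remaining non-ASCII character.
    static String toAscii(String s) {
        return stripAccents(s).replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "Jos\u00E9" (with e-acute) and "Jose" become indistinguishable.
        System.out.println(toAscii("Jos\u00E9").equals(toAscii("Jose")));
        // A name written in kanji has no ASCII form left at all.
        System.out.println(toAscii("\u9AD8\u5CA1\u548C\u5B50").isEmpty());
    }
}
```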

What I'd do is offer the user two options to select from:

An option that uses Unicode filenames for uploaded files. This should be the default, since most users' machines will support this.
A fallback option that uses generated names which are not related to the original strings / text.

In reality, if the user's machine doesn't support Unicode, they are going to have huge problems dealing with textual names that are not encoded using the machine's native encoding. There's no completely reliable way to find out what that is. Even if you have a semi-reliable way of figuring that out ... on the server side ... the problem of mapping all of Unicode to that encoding is intractable.

It is better to encourage the user to upgrade his / her operating system to a Unicode capable one.

"These files are not going to be saved server side. I'll be generating them on requests from data in a database."
That doesn't affect the point I'm making. If the user's platform can't represent the filenames, then attempting to map them to the user's machine's native character set is not going to give acceptable names ... if the user is a native speaker.
But... it's a web app. You can't force your users to only run Linux.
1) Most Windows systems support Unicode too. 2) If they don't then just map to non-meaningful names. One should not spend lots of effort in supporting users who haven't upgraded their OS for N years.
"You can't force your users to only run Linux.". Apart from being misleading, it is also illogical. It is like saying "you can't force your users to run a Windows browser" or "you can't force your users to run MS Word". In fact, many websites / organizations try to do exactly this kind of thing ... and get away with it.

einpoklum · Answer 5 · 2013-03-20 09:54:12Z

Summarizing and paraphrasing @eee's answer...

String sanitizeFilename(String unsanitized) {
     return unsanitized
                .replaceAll("[\\?\\\\/:|<>\\*]", " ") // filter out ? \ / : | < > *
                .replaceAll("\\s", "_");              // white space as underscores
}

(not joining multiple spaces into one!)
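
For what it's worth, a somewhat stricter variant is sketched below. It goes beyond what the answers above cover, so treat the rules as assumptions based on the Windows file naming restrictions: it also replaces the double quote and control characters, joins whitespace runs, trims leading whitespace and trailing dots/whitespace, and prefixes reserved device names such as CON or NUL. The class name, the "unnamed" fallback and the underscore prefix are arbitrary choices for illustration.

```java
class FilenameSanitizerSketch {
    static String sanitize(String unsanitized) {
        String s = unsanitized
                .replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", " ") // \ / : * ? " < > | and control chars
                .replaceAll("[\\s.]+$", "")                   // no trailing whitespace or dots
                .replaceAll("^\\s+", "")                      // no leading whitespace
                .replaceAll("\\s+", "_");                     // whitespace runs become one underscore
        if (s.isEmpty()) {
            return "unnamed"; // arbitrary fallback when nothing usable is left
        }
        // Device names are reserved on Windows, with or without an extension.
        if (s.matches("(?i)(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)?")) {
            s = "_" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("John \"Johnny\" Smith"));
        System.out.println(sanitize("|J:o<h>n?Sm\\it/h*"));
    }
}
```

Unlike the snippets above, this drops leading and trailing replacements instead of turning them into underscores.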

Jordão · Answer 6 · 2012-04-14 03:59:13Z


Letting the input determine a file name without proper sanitizing seems prone to security attacks. You can use a hash function (SHA-1, MD5) to generate a valid filename. Just be aware that you can't derive the original name from the hash.
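
A minimal sketch of the hash idea, assuming SHA-1 over the UTF-8 bytes of the name and a lowercase hex result (both choices are assumptions; the class and method names are illustrative). The output is always 40 ASCII hex characters, so it is a safe file name on any system, but as noted it is not human readable.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedFileName {
    // Return the SHA-1 digest of the name as 40 lowercase hex characters.
    static String sha1Hex(String name) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm every Java platform must provide.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("John Smith") + ".ext");
    }
}
```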

Also, if you can have a simple lookup table, you can assign special identifiers to the names (like sequential numbers or GUIDs), and use the identifier as the filename.

Another thing, have you thought about homonyms?


Or you could just sanitize the filename. – Ignacio Vazquez-Abrams Apr 14 '12 at 4:00

Sure............ – Jordão Apr 14 '12 at 4:00

The filename needs to contain the name in human readable form, which is why I can't just generate something. – leros Apr 14 '12 at 4:01

Very stringent requirements ... just be careful with file-system equivalents to Little Bobby Tables – Jordão Apr 14 '12 at 4:06

Its going to be a very common use case that a user would download 10-100+ files for various people. It's absolutely a requirement that the user be able to quickly find the file that corresponds to a person. – leros Apr 14 '12 at 4:10
