Show HN: Big List of Naughty Strings for testing user-input data

rspeer · on Aug 10, 2015

Most of what I do involves the messy world of text, and I think this is a great resource. I wish the software I depended on tested against it.

I can think of a few more cases that I've seen cause havoc:

- U+FEFF in the middle of a string (people are used to seeing it at the beginning of a string, because Microsoft, but elsewhere it may be more surprising)

- U+0 (it's encoded as the null byte!)

- U+1B (the codepoint for "escape")

- U+85 (Python's "codecs" module thinks this is a newline, while the "io" module and the Python 3 standard library don't)

- U+2028 and U+2029 (even weirder linebreaks that cause disagreement when used in JSON literals)

- A glyph with a million combining marks on it, but not in NFC order (do your Unicode algorithms use insertion sort?)

- The sequence U+100000 U+010000 (triggers a weird bug in Python 3.2 only)

- "Forbidden" strings that are still encodable, such as U+FFFF, U+1FFFF, and for some reason U+FDD0

People should also test what happens with isolated surrogate codepoints, such as U+D800. But these can't properly be encoded in UTF-8, so I guess don't put them in the BLNS. (If you put the fake UTF-8 for them in a file, the best thing for a program to do would be to give up on reading the file.)

grapeshot · on Aug 10, 2015

Isolated UTF-16 surrogate code points definitely crash Unity when it tries to display them. (Seen when I pasted some emoji in a text box in TIS-100 and tried to backspace.)

zuzun · on Aug 10, 2015

BOMs have already caught me off guard at the start of strings.

gsnedders · on Aug 10, 2015

Bi-directional text is probably another one. All the bidi control characters, especially. Probably really all Unicode control characters in general.

rspeer · on Aug 10, 2015

Sure, but there's already a lot of bidi text in the file.

gsnedders · on Aug 10, 2015

Bah, I only saw mono-directional text. Looking closely I only see one line of with bi-directional text, "הָיְתָהtestالصفحات التّحول"?

minimaxir · on Aug 10, 2015

There are Bidi controller characters in the Trick Unicode. (Doesn't appear in Github rendering, oddly)

gizmo686 · on Aug 10, 2015

The range U+FDD0..U+FDEF is reserved for internal use by applications.

duskwuff · on Aug 10, 2015

It's supposed to be reserved for applications. In practice... you may see them in the wild anyway. So it's important that they show up in test vectors!

pwenzel · on Aug 10, 2015

This. Hidden white space has ruined my day before!

jsat · on Aug 10, 2015

" # Server Code Injection # # Strings which can cause user to run code on server as a privileged user (c.f. https://news.ycombinator.com/item?id=7665153)

/dev/null; rm -rf /*; echo " That's a little aggressive for testing no?

jleader · on Aug 10, 2015

Some would argue that if you're testing on a system you can't recreate easily/quickly, you're doing devops wrong.

pavel_lishin · on Aug 10, 2015

And I'd agree, but this would be a pretty disproportionate punishment for the crime of doing devops wrong :P

DonHopkins · on Aug 11, 2015

It's two crimes: doing devops wrong, and having a huge security hole.

rattray · on Aug 10, 2015

A lot of people do devops wrong... and I don't want to make those people even more scared to test things.

xtreme · on Aug 10, 2015

The problem is not restoring the system, but the time lost in figuring out what caused it without any logs.

minimaxir · on Aug 10, 2015

Accepted a pull request which is nicer.

cowls · on Aug 11, 2015

Likewise:

1;DROP TABLE users 1'; DROP TABLE users--

Seems a bit hairy to have that in there in case someone tries to run these tests on their prod environment

ckv428 · on Aug 12, 2015

Why would you test in a prod environment?

cowls · on Aug 12, 2015

Well the sample was testing against twitter (obviously prod).

So someone may test against prod if they didn't really know what these things could do.

MertsA · on Aug 13, 2015

it also won't work as most systems are going to require --no-preserve-root for that to do anything.

afandian · on Aug 10, 2015

One fun (and very interesting) string is EICAR[0]. I worked for an antivirus company once and we had the EICAR string for testing but couldn't check it into source control because it triggered the AV software which we dogfooded...

Is it naughty to include it here?

    X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

[0] https://en.wikipedia.org/wiki/EICAR_test_file

lamby · on Aug 11, 2015

I've submitted a PR [0] that also includes Spamassassin's GTUBE [1] which is designed to do a similar thing.

[0] https://github.com/minimaxir/big-list-of-naughty-strings/pul... [1] http://spamassassin.apache.org/gtube/

girvo · on Aug 11, 2015

Aw, Sophos on OS X doesn't think it's a threat.

afandian · on Aug 11, 2015

Without giving too much away, I was sufficiently surprised by that that I downloaded the Sophos for Mac Home Edition. It does recognise it.

Here's what I get: http://i.imgur.com/JQzVsQf.png

This was picked up by the on-access scanner and a manual scan. The Web Protection doesn't complain about the text in a page (rightly or wrongly).

Are you using a centrally managed version (i.e. not Home Edition)?

girvo · on Aug 13, 2015

Interestingly, I found what caused the false-negative. If I used Vim to create the file, it was picked up. If I "echo ...EICAR > text.txt" it doesn't get picked up, at least not immediately!

afandian · on Aug 13, 2015

The on-access scanner intercepts requests to open files, and scans them. Echo just writes to the file and closes it. It doesn't try to open it again once the EICAR string is in there. I'm speculating here, but Vim probably writes the file/buffer, flushes, and then tries to obtain a file handle to it. At that point an on-access scan will occur, and it will find the EICAR string.

A scheduled scan would pick this up eventually.

voltagex_ · on Aug 10, 2015

Fun times indeed. Windows defender picks up a test.txt with those contents as malicious (and closes the file handle causing Notepad to misbehave) but if you add a space between EI and CAR it doesn't see anything.

Edit: Seriously, Microsoft?

Category: Virus

Description: This program is dangerous and replicates by infecting other files.

Recommended action: Remove this software immediately.

Items: file:C:\Users\Adam\Desktop\test.txt

lmm · on Aug 10, 2015

Microsoft is doing the right thing. The whole point of that string is to trigger such behaviour. It's so you can use it to test that your antivirus is working.

voltagex_ · on Aug 10, 2015

Should it really tell the user that it's a dangerous file? Although if it didn't, malicious files could use that to their advantage.

rickmode · on Aug 11, 2015

Yes. Otherwise the only way to verify an anti malware system is working is with something actually malicious. So, you know, that's a bad plan. Think of system administrators deploying and validating a security package.

provemewrong · on Aug 11, 2015

From the same Wikipedia article:

>Anti-virus programmers set the EICAR string as a verified virus, similar to other identified signatures. A compliant virus scanner, when detecting the file, will respond in exactly the same manner as if it found a harmful virus. Not all virus scanners are compliant, and may not detect the file even when they are correctly configured.

efriese · on Aug 10, 2015

Yeah, I would make the SQL injection and command injections test a little less kinetic =). Using a simple SELECT test, like SELECT @@VERSION, would be a little safer... Edit: Forget to say thanks! This is a pretty cool list.

bryanlarsen · on Aug 10, 2015

You want something that modifies so that you can detect that the SQL executed. But an INSERT would be a much friendlier than a DROP TABLE. :)

efriese · on Aug 10, 2015

Not necessarily. If you do a test with good SQL and a second test with SQL Injection and compare the responses that can show SQL Injection exists without having to change the database. This won't work for all SQL injection tests, but I would rather take this approach first.

tptacek · on Aug 10, 2015

This is good. There are lots of lists like this; you might find additional strings to add to it here:

https://code.google.com/p/fuzzdb/

Fuzz lists are to web pentesters what drain snakes are to plumbers.

janfry · on Aug 10, 2015

Another good list that incorporates FuzzDB: https://github.com/danielmiessler/SecLists

As other commenters noted, strings like DROP TABLES should be used with caution!

simonw · on Aug 10, 2015

It's not completely clear to me which encoding the blns.txt file uses. Since this project is all about weird/evil bytestrings, the encoding of the file itself is very important.

Using a newline as a delimiter in that file excludes newlines from being part of the strings you are testing - but newlines are an important "naughty" character to consider. Unfortunately the same is true of basically any other common delimiter character.

Maybe base64-encoding the strings would be one way to solve for this? You could use base64-encoded values in JSON, for example.

minimaxir · on Aug 10, 2015

Fair question. Encoding is UTF-8. This is fine for time being since UTF-8 is ubiquitous.

I had it set as UTF-16 for the two-byte characters when first writing it, but that had caused issues. If there is a demand, a second list can be added.

adzicg · on Aug 10, 2015

for anyone testing web sites, I built a chrome extension that makes things like this available in the right-click menu [1] the code is on github, so it can be easily extended [2]

[1] - https://chrome.google.com/webstore/detail/bug-magnet/efhedld...

[2] - https://github.com/gojko/bugmagnet

acehyzer · on Aug 10, 2015

If I put this into my company's tests, we'd end up with no users... I have a lot of work ahead of me. :/

sanderjd · on Aug 10, 2015

Yeah, the other exploit strings do innocuous stuff like putting up javascript alerts or touching files, but the SQL injection ones aren't innocuous at all. I wonder if there's something better to replace those with. Something like `1'; CREATE TABLE blns ...--` would be more akin to what the shell exploits do.

reitanqild · on Aug 11, 2015

Anyone knows if anything similar exists for telephone numbers?

Edit: Found this two minutes later: https://github.com/googlei18n/libphonenumber, seems to be an official Google product and Apache licensed.

financequoll · on Aug 10, 2015

Unintentionally, this also shows that GitHub is going pretty well when it comes it sanitising user inputs.

duncans · on Aug 10, 2015

Thankfully they're not sanitising inputs but correctly encoding outputs.

orf · on Aug 10, 2015

Looks interesting, but the Script Injection, SQL Injection and Server Code Injection sections need a lot more samples to be remotely useful.

minimaxir · on Aug 10, 2015

I definitely agree; hence the open-sourceness. :)

I only added what was off the top of my head for those sections; this list will consistently be updated.

vog · on Aug 10, 2015

Wouldn't it make more sense to define building blocks and automatically generate all sensible combinations? Otherwise I don't think this list can be managed by hand, especially not in a volunteer project.

siculars · on Aug 10, 2015

Nice "in the beginning..." hebrew string:

בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ

gizmo686 · on Aug 10, 2015

Full context, this is the beggining of the bible.

minimaxir · on Aug 10, 2015

Yes, I'm lazy. :p

sam_goody · on Aug 11, 2015

You'd do better with "הבה נרדה ונבלה שם שפתם אשר לא ישמעו איש שפת ראהו" (Genesis 11:7) That's God saying he will make multiple languages to confuse everyone...

itaibn · on Aug 10, 2015

The list seems to be missing the simplest naughty string of all: The empty string!

(Well, the text file has empty lines separating the comments and example strings so it technically includes the empty string, but it's not in the JSON file.)

minimaxir · on Aug 10, 2015

There is a pull request pending that fixes this.

DonHopkins · on Aug 11, 2015

I also submitted a pull request with an infinitely long string, but it's still pending...

jl6 · on Aug 10, 2015

Is the scope just well-formed strings or would you consider adding binary nasties like null bytes, mal-encoded characters, or even just newlines on their own?

What about XML billion laughs strings, or parser-busting very long runs of parentheses?

eli · on Aug 10, 2015

I've definitely seen NUL bytes in what's supposed to be a text string break many tools.

hoprocker · on Aug 10, 2015

Nice; sort of a programming complement to Shutterstock's _List of Dirty, Naughty, Obscene, and Otherwise Bad Words_[0]. So helpful to have a bunch of minds working on useful lists like this. Good to see that GitHub passes this test!

[0] https://github.com/shutterstock/List-of-Dirty-Naughty-Obscen...

deanc · on Aug 10, 2015

I worked on a swear filter at a previous job. Not quite sure how this list could benefit anyone really unless you are matching a whole string e.g. title of a photo rather than words in the title of a photo.

There are so many creative ways to get around swearing. Replace letters with numbers, drop consonants and vowels. And you almost always need to check for word boundaries otherwise somebody from Scunthorpe might be upset you banned them. And then there are cases where word boundaries aren't enough. Good luck ;-)

brohee · on Aug 10, 2015

I doubt the value of this repository. The first naughty French word "allumé" can't be considered naughty, dirty, or bad, like, at all. And many others are not naughty under too many circumstances...

Except very few swear words, word filtering is pretty much useless.

joosters · on Aug 10, 2015

Alternative source: list of reddit usernames...

joelcollinsdc · on Aug 10, 2015

Great list. A few questions:

* How could this be used to test 'corrupt' characters? Isn't the process of savign the file itself as UTF-8 un-corrupt...the file?

* Is there some recommended way to group these into "strings that should pass validation" versus "strings that should fail"... or is that too application-specific?

pbnjay · on Aug 10, 2015

If you really intend this for use in testing, I'd suggest making the injections less nasty. I could easily see a junior dev slapping this in and deleting some important stuff.

I'd also add more invalid UTF encodings and embedded null bytes, etc. The JSON format would be preferable to plain text for that though.

ph0rque · on Aug 10, 2015

Thankfully, there are no strings invoking Cthulhu :)

tsemple · on Aug 10, 2015

lol! You must be referring to the ICFP contest 2015. http://icfpcontest.org/

ph0rque · on Aug 10, 2015

I was actually inspired by the concept that lovecraftian horrors can be accessed and interacted with programmatically, prominently featured in Stross's Atrocity Archives: https://en.wikipedia.org/wiki/The_Atrocity_Archives

userbinator · on Aug 11, 2015

/dev/urandom can also be used as a source of random and unusual input data, as it contains by definition all 256 byte values and 65536 2-byte values, 16M 3-byte values, etc., and should eventually output every possible string.

RandomBK · on Aug 11, 2015

> and should eventually output every possible string.

"Eventually" being the key word here. Fuzzing with purely random inputs will take eons to actually reveal non-trivial bugs...

x0 · on Aug 10, 2015

I absolutely love strange unicode strings. It's handy if you ever want to find out what a server's running. One time, I put a bunch of emoji's in a GET param of a Google site, then got a big Java error page. I had no idea Google ran Java.

Edit: Another one that tends to be fun is [] in the param, like http://example.com/?get[]=[].

And you can things inside, like http://example.com/?get['"%05<!]=[%FE%FF]

nradov · on Aug 10, 2015

For more great examples of "naughty" strings see the Twitter @glitchr_ account. https://twitter.com/glitchr_

ivanca · on Aug 10, 2015

Complete AI is no the hardest problem in CS, parsing text is. Joking aside this reminded me of that CSS vulnerability that allowed attackers to read peoples mails: http://scarybeastsecurity.blogspot.com/2009/12/generic-cross...

webo · on Aug 10, 2015

I don't deal with user input validation, but any resources for reading about handling various inputs like the ones in blns?

TallGuyShort · on Aug 10, 2015

I don't recall exactly where this was, but I know I've worked with an API before that sometimes dropped requests, and it was because some randomly generated data included 'naughty text' like 'xxx', or profanity. I was expecting a dataset intended to catch this problem...

ck2 · on Aug 10, 2015

OT but is there a way to see projects with the most stars on github?

This one seems to be skyrocketing.

Oh here we go, and lookie who is at the top: https://github.com/trending

tlrobinson · on Aug 10, 2015

https://github.com/stars?direction=desc&sort=stars

ck2 · on Aug 11, 2015

Well that is for your account, I meant overall for the system and the trending report does that...

homakov · on Aug 11, 2015

Should be 1 long string, then if something fails use bsection

rectangletangle · on Aug 10, 2015

This should be really handy for fuzz testing, nice work!

iopuy · on Aug 12, 2015

Bookmark