What 127.5 million forms can tell you about the state of front-end input validation on the Web

tags: string solving web security program analysis research
Posted on .

In which yours truly churns through the September/October 2023 archive of CommonCrawl to find all HTML forms validating inputs using regular expressions.

Note: This blog post started just after we wrapped up the work on Black Ostrich, and then there was a lot of babying that didn't let me finish it. That's why it's using 2023 data. Redoing the work is mostly automation, but takes many, many weeks, and I suspect the results will not have changed much.

One of my first research publications was about Black Ostrich, a web crawler that can fill out forms while scanning websites. It relies on the Ostrich string constraint solver (online version, be sure to set OSTRICH as the back-end for string solving!).

To test the crawler we needed, well, forms to fill out. We were particularly interested in the HTML5 pattern attribute that allows validating input with arbitrary regular expressions. This led me to the CommonCrawl dataset which, for our purposes here, is a snapshot of the web. However, I didn't have the means to handle the full data set at that time.

After the publication of the Black Ostrich paper I continued to tinker with our data gathering and eventually built a pipeline that fetches data from CommonCrawl, parses the web pages using the tl HTML parser, and extracts any form that contains <input> elements with either the pattern, data-val-regex-pattern, or ng-pattern attribute. I also normalise the encoding to UTF-8.

Despite not technically being spec-compliant, tl was able to parse most of CommonCrawl's CC-MAIN-2023-40 (September/October 2023) archive. The archive contains 3.40 billion web pages (3 384 335 454 to be exact) totalling 98.38 TiB of compressed material, though that includes the entire raw HTTP conversation between the crawler and the server. By comparison, the resulting set of forms plus metadata is 54 GB compressed, large enough that just summarising the data takes considerable time. 51 152 471 (1.51%) web pages in the dataset could not be parsed at all due to invalid HTML, invalid character encodings, or bugs in the parser.

In total, 97 105 071 (2.9%) of the URLs and 127 523 211 (27.8%) of the forms in the set contained <input> elements with one of the pattern attributes. So in general, forms are not very common on web pages, and pattern attributes are only moderately common in forms. Note that a web page may contain any number of forms and that forms may repeat across URLs. A typical example would be an email registration form in a page footer or a search bar in a header. To avoid a potentially complex discussion about what constitutes a web page (everything across a domain? What about subdomains?), I used the URL as my base unit. So each instance of the same search bar counts.

Another wrench in the works is of course dynamism. Web developers love gratuitous JavaScript which means that any site that loads at least one script could arbitrarily manipulate the underlying HTML document before or during rendering. This is simply the nature of the Web. A different and equally interesting investigation would involve instrumenting a Web renderer to extract actually evaluated regexes, either in JavaScript or through the pattern attribute.

That would require a ludicrous amount of computer power and bandwidth far beyond my means. That's not this study.

Regexes on the web are boring, redundant, and massively duplicated

Regexes display a typical long-tail distribution, with boring regexes dominating. The two most common regexes, together representing 62 962 778 (42%) of all 150 012 734 collected patterns, are equivalent to each other and to <input type="number">: [0-9]* and \d*. In fact, the entire top 10 is redundant with existing input types:

$ tac toplist.txt | head -10
51156392 [0-9]*
11806386 \d*
10595642 ^.+@.+\.[a-zA-Z]{2,63}$
8080740 .{3,50}
6878813 [0-9()#&amp;+*-=.]+
3228847 [0-9\-]*
2468792 .{5,}
2013633 ^[\w\.=-]+@[\w\.-]+\.[\w]{2,10}$
1904323 .{3,}
1715505 .{3,100}

This list has, in order:

  1. a number
  2. also a number
  3. an email address
  4. anything between 3 and 50 characters
  5. one or more of the following:
    • digits
    • any of the characters &, a, m, p, ; (likely a bug by them or a misparse by me; meant to represent &)
    • a plus (+)
    • any of the characters between * and = (most of the programmer punctuation and, redundantly enough, the digits)
    • a period (which is already in the range above)
  6. numbers or dashes
  7. at least five characters
  8. (with redundant anchors) also an email address (somewhat more strict)
  9. at least three characters
  10. between three and a hundred characters

Note that number 5 seems to have at least one bug; the - is unescaped and so gives a range rather than, presumably, just adding the - character. This is particularly bad, because that range includes the HTML open bracket <. In at least some cases, the &amp; correctly turns into an & in my browser (Firefox), but it would not surprise me if that differs between browsers. A few samples suggest that these validations actually occur like this in the wild, and still appear at the time of writing. The following is a typical use:

<input size="1"
    type="tel"
    name="form_fields[email]"
    id="form-field-email"
    class="elementor-field elementor-size-xs  elementor-field-textual" placeholder="+7 (___) ___-__-__"
    required="required"
    aria-required="true"
    pattern="[0-9()#&amp;+*-=.]+"
    title="Only numbers and phone characters (#, -, *, etc) are accepted.">

Apparently, it's supposed to recognise phone numbers and seems to come from the Elementor WordPress page builder, which would explain why the same regex occurs so frequently. It also seems that the code is embedded as hidden on many, many pages, further boosting the number of hits, since my analysis cannot tell the difference between a visible and an invisible element. Elementor offered no straightforward way to report this bug, so I gave up.
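The unescaped dash is easy to demonstrate in Node (with &amp; decoded to &, as the browser would decode it before compiling the pattern; browsers may apply different regex flags, so consider this a sketch):

```javascript
// The character class from the Elementor pattern, anchored the way
// the pattern attribute anchors it:
const phoneClass = /^[0-9()#&+*-=.]+$/;

// The intended phone characters are accepted...
console.log(phoneClass.test("+7(999)123-45-67")); // true

// ...but so is anything in the accidental *-= range (ASCII 42-61),
// including : ; / and, worst of all, the HTML open bracket <.
console.log(phoneClass.test("<;:/="));            // true
```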

The top ten accounts for just shy of a hundred million patterns, 67% of all of them; the top 100 covers 83%. In total, there are only 64 296 unique patterns, which means that each pattern occurs 2 333 times on average. 62 327 of the unique patterns were valid in the sense that Node.js could parse them as a regex without errors.

78 390 URLs (290 792 input elements) contain a pattern attribute that is completely empty and thus accepts no input for that element. Note that in this case and in all others, this is a static analysis. There may be any amount of JavaScript manipulating the nodes in the DOM to add other kinds of validation. That seems like putting the cart before the horse to me, but then again so does almost all of modern web development.

How many programmers are using the wrong semantics in pattern attributes?

There is a semantic difference between the pattern attribute and regexes as experienced in regular programming languages. Because pattern attributes are clearly meant for validation, they default to matching the entire input string. Normal regex engines, by contrast, usually assume you are searching and will match any substring, though the details are confusing and vary between implementations. That means that unless you tell the engine to match the entire string, your regular expression for emails will happily accept any garbage as long as it contains an email address somewhere in there. Since this would be a dangerous default for validation, pattern attributes always match the entire input.
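A quick Node sketch makes the difference concrete (the naive email regex here is my own illustration, not one from the dataset; per the HTML spec, browsers compile pattern roughly as ^(?:…)$):

```javascript
// A naive email regex, as a back-end programmer might write it:
const email = /[a-z]+@[a-z]+\.[a-z]{2,}/;

// Default engine semantics: search, i.e. match any substring...
console.log(email.test("<script>alert(1)</script> a@b.com")); // true!

// ...while the pattern attribute implicitly anchors the regex so it
// must match the whole input, roughly like this:
const asPattern = new RegExp("^(?:" + email.source + ")$");
console.log(asPattern.test("<script>alert(1)</script> a@b.com")); // false
console.log(asPattern.test("a@b.com"));                           // true
```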

This made me curious about how often programmers get this wrong. For anchors, this question can be answered by looking for regexes in pattern attributes that unnecessarily use anchors to match entire strings. This is a question that can be answered with, drumroll, a regular expression! Well, assuming we ignore escaping. If the pattern matches the regex /^\^.*\$$/, it's a sure sign that the author wanted to be extra careful, didn't know about the semantics of the attribute, or was reusing back-end code for front-end validation.

$ rg --text '^\^.*\$$' --count  patterns.txt.valid
25175
$ sd '^\^(.*)\$$' '$1' patterns.txt.valid # This removes the anchors in-place

That's 25 175 out of the 62 327 valid and unique patterns, so about 40% of them. For the rest of the investigation, the anchors are simply removed.

And now, let's bring out the power tools!

Breaking out the constraint solver

One of the nice things about Ostrich is that it contains an extension to the SMT-LIB constraint standard to parse and handle ECMA regular expressions. Well, not all of them; ECMA and other PCRE-derived regexes are in fact not regular (2022 paper) and cannot in theory be represented accurately for string-solving purposes. In practice, though, that's rarely a problem. Ostrich also contains cool tricks to handle some of the traditionally difficult/impossible regex features, developed for Black Ostrich. You can read about them in the director's cut version of our paper!

We can use this SMT-LIB code to have Ostrich ask "find us a string s that matches the literal regex R":

(declare-const s String)
(assert (str.in.re s (re.from_ecma2020 'R')))
(check-sat)

(Yes, SMT stems from early AI research, how can you tell?)

If you are familiar with SMT-LIB you may notice that both the re.from_ecma2020 function and the single-quote string notation ('regex') are nonstandard additions in Ostrich. That's because the regexes in SMT-LIB are textbook regexes without most of the features of full perl-compatible regular expressions, and because they end up needing a lot of escaping in practical use, some of which is nontrivial.

This doesn't do anything interesting though. The cool features of SMT only come out when we add other boolean connectives, allowing us to ask things like "get me a string that matches this regex but not that one". Now, let's ask some questions!

Note that in the following sections, whenever I count something I count the deduplicated, valid number of regexes, not the number of times they occur generally. There's a limit to how much machine power I'm interested in lighting on fire after all.

How many regexes would accept a trivial cross-site scripting attack?

XSS attacks (and other injections) are only dangerous if they get past the candy-floss security of front-end validation. The browser, after all, isn't real and can't hurt you. However, if we assume that the front-end programmers spoke to the back-end programmers and shared ideas about valid input, weak front-end validation may reflect back-end validation. If the programmers took the Node bait and wrote their front-ends and back-ends in the same language, they may even reuse the same regex for validation on both ends.

I wouldn't recommend this approach, since regexes are fidgety to write and worse to read (believe me, I've seen millions). They're easy to get wrong, and engines may differ in subtle but important ways. The fact that the top ten regexes contains at least one such subtle bug should suggest that regexes are, in fact, not a good tool for input validation.

We can use the following SMT-LIB (where we dump in the regex under test) to see if we could get a <script> tag through the regex:

(declare-const w String)
(assert (str.in.re w (re.from_ecma2020 {REGEX})))
(assert (str.contains w "<script>"))
(check-sat)

Here we use pattern semantics, the default in Ostrich since its inputs are constraints rather than searches: the regex is assumed to match the entire string. This means that we are using stronger semantics than what a typical back end would have applied for the same regex. Our Black Ostrich paper mentioned above does a more involved comparison of anchor handling.

After filtering out errors and timeouts, we have 6 707 true positives and 2 065 false positives. The false positives are caused by bugs or partial implementations in Ostrich, or possibly cases where regex engines do not agree.

That means that about 11% of the validation regexes used on websites would, if reused on the back end with the stricter semantics where the regex must match the entire string, let through the most basic XSS attack imaginable. This shouldn't be entirely surprising; front-end validation is meant to help the user, not to implement security features.
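It is easy to reproduce this for the catch-all entries in the top ten with Node, anchoring the regexes the way the pattern attribute does:

```javascript
const payload = "<script>alert(1)</script>";

// "anything between 3 and 50 characters" (top-ten entry number 4):
console.log(/^(?:.{3,50})$/.test(payload)); // true: sails right through

// "a number" (top-ten entry number 1) at least keeps it out:
console.log(/^(?:[0-9]*)$/.test(payload)); // false
```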

Not a lot of variety in here:

$ xsv search --select 3 '^sat$' results.validated.csv | \
  xsv select 5 | sort | uniq | wc -l
  748

Most of them are only a <script> tag, but some of them reveal more interesting constraints, like ftp://youtube.com/<script> or const pattern = /<script>/; // regular expression for a minimum of 2 characters, or twitter.com/0/status/0?<script>.

How many regexes accept no or one string?

Similarly, we can ask if the regex accepts no strings or one single static string. To do this, we ask Ostrich to find two distinct strings matching the regex and require that neither be empty:

(declare-const w1 String)
(declare-const w2 String)
(declare-const re RegLan)
(assert (= re (re.from_ecma2020 '{REGEX}')))

(assert (str.in.re w1 re))
(assert (str.in.re w2 re))

(assert (> (str.len w1) 0))
(assert (> (str.len w2) 0))

(assert (not (= w1 w2)))
(check-sat)

Many of the initial hits were actually triggering shortcomings in Ostrich's handling of regexes. Notably, support for capture groups and non-greedy matching is missing or incomplete. I manually rewrote some of those regexes, which is straightforward when we only care about Boolean yes/no matching. The transformation probably should be done in Ostrich itself, but isn't there yet. That's research software for you! After my rewriting, 2 065 regexes (out of 62 294) accepted one or fewer strings. Far more (16 833) returned errors.
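The rewrites are mechanical. A sketch of the idea (with my own toy regexes, not ones from the dataset): capture groups become non-capturing groups and lazy quantifiers become greedy ones, neither of which changes the accepted language when all we want is a yes/no answer:

```javascript
// Before: capture groups and a lazy quantifier, which Ostrich
// struggles with. After: the same language, solver-friendly.
const before = /^(\d+)-(\d+?)$/;
const after  = /^(?:\d+)-(?:\d+)$/;

// Both accept and reject exactly the same strings:
for (const s of ["12-34", "1-2", "a-b", "12-"]) {
  console.log(before.test(s) === after.test(s)); // true every time
}
```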

For the record, running this experiment turned my supervisor's desktop into a space heater for ca 40 days, and may or may not have at some point crashed it. I'm very popular at my lab.

In essence, this set is mostly a number of false positives, mis-parses, etc.

Examples

Some of these are just static strings, including the Korean string "국가유공자 장례는 보훈상조" ("Patriots' funeral services are provided by Bohun Sangjo.", Via Kagi) and a number of double-escapes like \\[, along with my personal favourite email validator @. With the substring matching semantics outside of pattern= this would actually be pretty decent email validation, but for patterns this would match only one string: a single @.

A few of them are almost certainly shortcomings in Ostrich, while others seem to intend to match literal full strings, like ^(로그인하지 않고)$ ("without logging in"). Presumably they're from internal use.

A whole bunch of them also seem to be filled in by JavaScript later, since they contain placeholders like $ctrl.model.email or $ctrl.commonConstants.EMAIL_REGEX. Working on an archive makes the kind of symbolic execution I'd need to resolve that to work difficult.

Falsehoods programmers (in the wild, visibly) believe about email addresses

I was particularly interested in seeing how email inputs were validated. Validating email addresses is notoriously difficult (see for example this blog post, this FOSDEM 2018 talk, and another blog post). I should know. I recently switched all my email to an address @stjerna.space. It would surprise you how many web designers assume email goes to a TLD of at most 4 letters, and that's not even the difficult part of validating email addresses.

Generally, the best way to do validation in a Web form is to use type="email". That's still equivalent to a regular expression that isn't technically correct but which is usually good enough:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

This doesn't allow the RFC-compliant email address "><script>alert(1)</script>"@amanda.systems, but, err, I can forgive that. You cannot email me there by the way. For some reason Fastmail wouldn't let me use that alias.

To get inputs containing email, I selected any form input with type, class, or id equal to email (the CSS selector input[type=email],input[class=email],input[id=email], which is what I fed the HTML parser). This is a rather conservative estimate, since it rules out e.g. class="my_email", but it was much easier to implement and can be assumed to have few false positives.

$ echo "input[type=email],input[class=email],input[id=email]"  | cargo run --release --bin cc-analyse find-input > email-patterns.txt
$ sort email-patterns.txt | uniq --count | sort -nr | head -10

Gives us the following top 10:

  1. ^.+@.+\.[a-zA-Z]{2,63}$ (10 586 530)
  2. ^[\w.%+-]+@[\w.-]+\.[\w]{2,6}$ (487 781)
  3. [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$ (446 239)
  4. [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,63} (422 800)
  5. ^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+$ (394 474)
  6. ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ (307 597)
  7. [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$ (280 256)
  8. ^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[a-zA-Z0-9-]+)+$ (263 291)
  9. [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$ (244 813)
  10. ^([^\x00-\x20\x22\x28\x29\x2c\x2e\x3a-\x3c\x3e\x40\x5b-\x5d\x7f-\xff]+|\x22([^\x0d\x22\x5c\x80-\xff]|\x5c[\x00-\x7f])*\x22)(\x2e([^\x00-\x20\x22\x28\x29\x2c\x2e\x3a-\x3c\x3e\x40\x5b-\x5d\x7f-\xff]+|\x22([^\x0d\x22\x5c\x80-\xff]|\x5c[\x00-\x7f])*\x22))*\x40([^\x00-\x20\x22\x28\x29\x2c\x2e\x3a-\x3c\x3e\x40\x5b-\x5d\x7f-\xff]+|\x5b([^\x0d\x5b-\x5d\x80-\xff]|\x5c[\x00-\x7f])*\x5d)(\x2e([^\x00-\x20\x22\x28\x29\x2c\x2e\x3a-\x3c\x3e\x40\x5b-\x5d\x7f-\xff]+|\x5b([^\x0d\x5b-\x5d\x80-\xff]|\x5c[\x00-\x7f])*\x5d))*(\.\w{2,})+$ (222 973)

After deduplication we go from 19 824 352 to just 6 250 regexes, out of which 6 057 were valid when parsed by Node.js. That's some duplication! It might be stemming from the same form occurring in many places (say, a footer with a subscription form for a mailing list), and it's probably aggravated slightly by the fact that I count multiple occurrences in the same tag.

How many of them can send email to stjerna.space?

Let's grind my email validation axe a bit more and ask Ostrich how common it is in the dataset to not be able to send email to me or my family. Let’s also take this opportunity to get fancy with the constraint solving.

We want to ask for an email address address that:

  • consists of a local part and a domain part, where the domain part is just @stjerna.space
  • where the local part has no @ signs and is not empty
  • where the address does not match the regex we are evaluating

(declare-const email String)
(declare-const re RegLan)

;; Find a string matching the regex...
(assert (= re (re.from_ecma2020 {REGEX})))
(assert (str.in.re email re))

;; ...With TLD stjerna.space:

(declare-fun local-part () String)

(assert (= email (str.++ local-part "@stjerna.space")))
(assert (> (str.len local-part) 0))
(assert (not (str.contains local-part "@")))

(check-sat)

This version was a lot more taxing for Ostrich judging by the larger number of timeouts. Most of the other experiments had none or a handful of timeouts, but this one is hitting the deep double digits. That’s probably in part due to the length constraint being rather expensive even for this trivial check for an empty string. Someone should really look into that! In hindsight, rewriting it as a regex might have been a better idea.

The failing results include my favourite enumeration of bad actors:

(?!gmail.com)(?!protonmail.com)(?!google.com)(?!deneme.com)(?!www.com)(?!test.com)(?!joomla.com)(?!wordpress.com)(?!pm.me)(?!mail.com)(?!zoho.com)(?!zohomail.com)(?!gmail.co)(?!fastmail.com)(?!yahoo.com)(?!hotmail.com)(?!yandex.com)(?!outlook)(?!icloud.com)(?!yandex)(?!icloud)(?!windowslive.com)(?!live.com)(?!aol.com)(?!me.com)(?!mail2world.com)(?!msn.com)(?!helpservicemail.com)(?!igsecurityemail.com)(?!igmail.support)(?!lightning-crypto.com)(?!igsécurity.com)(?![a-z]+\\.[(edu)(yandex)(icloud)]+\\.[a-z.])[a-z0-9.-]+\\.(?!edu)[a-z]{2,6}$

and the "no work email please": ^(?!info@|help@|sales@). These fail because negative look-aheads are unsupported in Ostrich; they are complicated to implement in a constraint solver. They can be rewritten in some cases, but nobody's bothered because that's one of the many, many things that "aren't research".

A sloppy grep inverse match to filter out negative look-aheads and anchors (i.e. any input containing ^) leaves 759 regexes that didn't match. Many of those are placeholders that are presumably filled in by JavaScript (like validationPatterns.email). [object Object] is also on the list (but would only match a single character), as is the delightfully broken "no XSS please" [^&lt;&gt;]* and Your Email Here*.

A couple of these assume the wrong validation semantics, like @ (which would be a decent choice outside of pattern, which always matches the full input), or similar, longer versions. Others, like [@.]*, would allow for empty email addresses, which is probably not what you want. A large chunk seem to actually be looking for phone numbers or numeric values, all of which have their own input types that are far better.

Finally, there is (of course!) a large set of regexes that assume a 2-3 characters long TLD, like [a-z0-9._%+-]{1,40}[@]{1}[a-z]{1,10}[.]{1}[a-z]{2,3}. Some of them take the safer path and assume 2-4, for some reason.
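For illustration, anchoring that last pattern the way the pattern attribute does shows exactly how my address gets locked out:

```javascript
// The "TLD is 2-3 letters" pattern quoted above, anchored as the
// pattern attribute anchors it:
const shortTld = /^(?:[a-z0-9._%+-]{1,40}[@]{1}[a-z]{1,10}[.]{1}[a-z]{2,3})$/;

console.log(shortTld.test("me@example.com"));   // true
console.log(shortTld.test("me@stjerna.space")); // false: "space" is five letters
```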

Among the matching emails (most of which are a single character local part), the more interesting ones include:

  • hotmhotmhotmhotmhotmhotmhotmhotmhotmhotmhotma@stjerna.space (matching ^(?!.*hotmails)(?!.*cleardex).*$)
  • biznismbiznismbizna@stjerna.space (matching ^(?!.*biznismap).*$)
  • ac-grenobleafr@stjerna.space (matching ^(?=.*?\\bac-grenoble.fr\\b).*$)

Note that every single one of these misuses the pattern field by adding beginning-of-string and end-of-string anchors. Presumably the authors don't know that they aren't needed.

Conclusion: About 1 559 out of the ca-six-and-a-half-thousand can't email anyone on my TLD, while about 1 173 can.

How many of them could be replaced with type=email?

It's perfectly valid to add input filtering on top of type=email if you want to further restrict the input set. Maybe you hate people with gmail addresses or only want to allow email addresses at your local firm (both examples from before!). In that case, having a more restrictive pattern makes sense.

According to the documentation, type=email should be equivalent to this regex:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
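Node parses that regex just fine, and a few probes (the probe strings here are my own) show how permissive it is:

```javascript
// The type=email reference regex from above:
const typeEmail = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

console.log(typeEmail.test("me@stjerna.space")); // true: long TLDs are fine
console.log(typeEmail.test("!@reply"));          // true: no dot required
console.log(typeEmail.test("a@b_c.com"));        // false: _ is invalid in domains
```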

Let's see how often that is actually the case by asking Ostrich:

(declare-const w String)
(assert (not (str.in.re w (re.from_ecma2020 {REGEX}))))
(assert (str.in.re w mdn-regex))
(check-sat)

This will ask Ostrich for a string that does not match our regex but does match type="email", meaning that the regex is in some sense more restrictive than the default browser validation of email addresses. These regexes would actually do something if attached to an input element with type email, since they exclude at least one string that the built-in validation accepts.

It's also possible to have laxer validation by inverting the constraints (though not in combination with the semantic input type) but that's less interesting so I am leaving that experiment out in the interest of not needlessly burning compute resources.

This results in 5 358 satisfiable (at least one input rejected by the regex but accepted by the email input type) and 271 unsat (the regex accepts everything the email input type accepts, so it adds nothing). Examples, after validation against the Node.js regex engine, include:

  • !@reply
  • {@0
  • !@yahoo.0
  • ericjonesmyemail@gmail-com

The final one is because apparently Eric Jones is a popular spammer identity if you search the Web for it!

The unsatisfiable cases notably include a bunch of very lax regexes like (.)+ or (.*)@(.*). The first could have been an attribute (minlength="1"), and the second is obviously redundant (and less semantic) next to the email input type, since it just requires an at sign somewhere in the input.

Unexpected findings and personal favourites

  • The full-text correct answers for a quiz
  • A 400 000-character list of place names, including some capitalised ones, separated by the OR operator (|). This pattern was surprisingly common and constituted most of the ludicrously long regexes, with topics ranging from nouns to TLDs or entire domains. I pity the poor web browser that has to parse them. That regex is long enough to smash Python's default maximum limit on column widths in CSV, which unlocked a nice detour for me. Python, why are you like this?
  • Lists of a small number (2-3) of specific email addresses not to match
  • "(?=[^.].{0,63}@) # Does not start with &#39;.&#39; and limit length", which has some sort of weird escaped comment

Because I use my .space email address, I have had to report a number of input validation bugs to sites where my address doesn't work. So far, my favourite response has been "our programmers are aware that the number of characters has increased, and are working on it. However, the change affects large parts of our system so we cannot give an estimate for when it is ready".

Who said web development was easy. Suddenly they increase the number of characters on you! (They did, in fact, not increase the number of characters, and you probably should not roll your own email validation code).

Data availability

My dataset of all the extracted HTML forms plus metadata is 54 GB compressed, and that's too large for Zenodo, so I don't have a way to make it available. However, it can be reproduced (including from newer data) using my data collection tool. In the future, if I find a better compression option or Zenodo expands their available storage, I will upload the dataset and link it here.