You've successfully subscribed to The Poor Coder | Hackerrank Solutions
Great! Next, complete checkout for full access to The Poor Coder | Hackerrank Solutions
Welcome back! You've successfully signed in.
Success! Your account is fully activated, you now have access to all content.
Hackerrank - Detect HTML Tags Solution

Hackerrank - Detect HTML Tags Solution

In this challenge, we're using regular expressions to detect the various tags used in an HTML document.

Beeze Aal
Beeze Aal

In this challenge, we're using regular expressions to detect the various tags used in an HTML document.

  • Tags come in pairs. Some tag name, , will have an opening tag, , followed by some intermediate text, followed by a closing tag, . The forward slash in a closing tag will always come before the tag name.
  • The exception to this is self-closing tags, which consist of a single tag (not a pair) with a forward slash after the tag name:

Here are a few examples of tags:

  • The  tag is for paragraphs:
  • There may be  or more spaces before or after a tag name:
  • A void or empty tag involves an opening and closing tag with no intermediate characters:

Some tags can also have attributes, such as the  tag, which is used to add a hyperlink to another document:

In the above case,  is the tag name and  is an attribute having the value .

Task
Given  lines of HTML, find the tag names (ignore any attributes) and print them as a single line of lexicographically ordered semicolon-separated values (e.g.: ).

Input Format

The first line contains an integer, , the number of HTML fragments.
Each of the  subsequent lines contains a fragment of an HTML document.

Constraints

  • Each fragment contains  ASCII characters.
  • The fragments are chosen from Wikipedia, so analyzing and observing their markup structure may help.
  • Leading and trailing spaces/indentation have been trimmed from the HTML fragments.

Output Format

Print a single line containing all of the unique tag names found in the input. Your output tags should be semicolon-separated and ordered lexicographically (i.e.: alphabetically). Do not print the same tag name more than once.

Sample Input

2
<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>
<div class="more-info"><a href="http://www.quackit.com/html/examples/html_links_examples.cfm">More Link Examples...</a></div>

Sample Output

a;div;p

Explanation

The first line contains  tag names: .
The second line contains  tag names: .
Our set of unique tag names is .
When we order these alphabetically and print them as semicolon-separated values, we get "".

Solution in python

import sys, re

input()
html = sys.stdin.read()
tags =  set(re.findall("<\s*([a-z0-9]+)", html))
tags = list(tags)
tags.sort()
print(";".join(tags))