HDU 4782: simulation

Beautiful Soup

Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)
Total Submission(s): 1342 Accepted Submission(s): 260


Problem Description
  Coach Pang has a lot of hobbies. One of them is playing with “tag soup” with the help of Beautiful Soup. Coach Pang is satisfied with Beautiful Soup in every respect, except the prettify() method, which attempts to turn a soup into a nicely formatted string. He decides to rewrite the method to prettify an HTML document according to his personal preference. But Coach Pang is always very busy, so he gives this task to you. Considering that you do not know anything about “tag soup” or Beautiful Soup, Coach Pang kindly left some information with you:
  In Web development, “tag soup” refers to formatted markup written for a web page that is very much like HTML but does not consist of correct HTML syntax and document structure. In short, “tag soup” refers to messy HTML code.
  Beautiful Soup is a library for parsing HTML documents (including “tag soup”). It parses “tag soup” into regular HTML documents, and creates parse trees for the parsed pages.
  The parsed HTML documents obey the rules below.

HTML
  HTML stands for HyperText Markup Language.
  HTML is a markup language.
  A markup language is a set of markup tags.
  The tags describe document content.
  HTML documents consist of tags and texts.

Tags
  HTML uses tags for its syntax.
  A tag is composed of special characters: ‘<’, ‘>’ and ‘/’.
  Tags usually come in pairs, the opening tag and the closing tag.
  The opening tag starts with “<” and the tagname. It usually ends with a “>”.
  The closing tag starts with “</” and the same tagname as the corresponding opening tag. It ends with a “>”.
  There will not be any other angle brackets in the documents.
  Tagnames are strings containing only lowercase letters.
  Tags will contain no line break (‘\n’).
  Apart from tags, anything that occurs in the document is considered text content.

Elements
  An element is everything from an opening tag to the matching closing tag (including the two tags).
  The element content is everything between the opening and the closing tag.
  Some elements may have no content. They’re called empty elements, like <hr></hr>.
  Empty elements can be closed in the opening tag, ending with a “/>” instead of “>”.
  All elements are closed either with a closing tag or in the opening tag.
  Elements can have attributes.
  Elements can be nested (can contain other elements).
  The <html> element is the container for all other elements, it will not have any attributes.
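
The tag and element rules above mean a complete tag can be classified by looking at two characters only. Here is a minimal sketch of that check; the names `tag_kind` and `classify_tag` are made up for illustration and are not part of the problem statement, and the function assumes it is handed one whole tag from `<` to `>`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names, chosen for this sketch only. */
enum tag_kind { TAG_OPEN, TAG_CLOSE, TAG_EMPTY };

/* Classify one complete tag such as "<p>", "</p>" or "<br/>".
   Assumption: tag starts with '<' and ends with '>'. */
enum tag_kind classify_tag(const char *tag)
{
    size_t len = strlen(tag);
    assert(len >= 3 && tag[0] == '<' && tag[len - 1] == '>');
    if (tag[1] == '/')
        return TAG_CLOSE;   /* "</name>" */
    if (tag[len - 2] == '/')
        return TAG_EMPTY;   /* "<name .../>" closed in the opening tag */
    return TAG_OPEN;        /* "<name ...>" */
}
```

Because attribute values are always quoted, a value that ends in `/` (as in the href example of the statement) leaves `"` and not `/` just before the `>`, so it is still seen as an opening tag.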

Attributes
  Attributes provide additional information about an element.
  Attributes are always specified in the opening tag after the tagname.
  The tag name and the attributes are separated by a single space.
  An element may have several attributes.
  Attributes come in name="value" pairs like class="icpc".
  There will not be any space around the '='.
  All attribute names are in lowercase.

A Simple Example
  <a href="http://icpc.baylor.edu/">ACM-ICPC</a>
  The <a> element defines an HTML link with the <a> tag.
  The link address is specified in the href attribute.
  The content of the element is the text “ACM-ICPC”.
  
  You are feeling dizzy after reading all of this, when Coach Pang shows up again. He starts to spout for hours about his personal preference and you catch his main points with difficulty. Coach Pang says:

  Your task is to write a program that will turn parsed HTML documents into formatted parse trees. You should print each tag or text content on its own line, preceded by a number of spaces that indicates its depth in the parse tree. The depth of the root of a parse tree (the <html> tag) is 0. He is satisfied with the tags, so you shouldn’t change anything in any tag. For text content, throw away unnecessary white spaces, including space (ASCII code 32), tab (ASCII code 9) and newline (ASCII code 10), so that words (sequences of characters without white spaces) are separated by a single space. There should not be any trailing space after each line nor any blank line in the output. A line that contains only white spaces is also considered a blank line. You quickly realize that your only job is to deal with the white spaces.
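
The whitespace rule for text content can be sketched on an in-memory string. This is only an illustration: `normalize_text` and `is_blank` are hypothetical names, and the caller is assumed to supply an output buffer at least as large as the input.

```c
#include <assert.h>
#include <stddef.h>

/* The three whitespace characters named in the statement. */
static int is_blank(char c) { return c == ' ' || c == '\t' || c == '\n'; }

/* Copy in to out with every whitespace run collapsed to one space and
   with leading and trailing whitespace dropped. Returns the new length.
   Assumption: out has room for strlen(in) + 1 bytes. */
size_t normalize_text(const char *in, char *out)
{
    size_t n = 0;
    int pending_space = 0;   /* a separator is owed before the next word */
    assert(in != NULL && out != NULL);
    for (; *in; ++in) {
        if (is_blank(*in)) {
            pending_space = (n > 0);   /* never emit a leading space */
        } else {
            if (pending_space)
                out[n++] = ' ';
            pending_space = 0;
            out[n++] = *in;
        }
    }
    out[n] = '\0';           /* a trailing pending space is simply dropped */
    return n;
}
```

Delaying the separator until the next word arrives is what keeps trailing spaces out of the result; a text run made only of whitespace comes back as an empty string, which the caller should not print at all.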

Input
  The first line of the input is an integer T representing the number of test cases.
  Each test case is a valid HTML document that starts with a <html> tag and ends with a </html> tag. See sample below for clarification of the input format.
  The size of the input file will not exceed 20KB.

Output
  For each test case, first output a line “Case #x:”, where x is the case number (starting from 1).
  Then you should write to the output the formatted parse trees as described above. See sample below for clarification of the output format.
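
One way to produce a single output line is to let `snprintf` build the indentation with a `%*s` width. This is a sketch, not the only option; `format_line` is a hypothetical helper and the indentation is assumed to be one space per depth level, as the statement describes.

```c
#include <assert.h>
#include <stdio.h>

/* Write `depth` spaces, then token, then a newline into out.
   Returns the number of characters written (snprintf semantics).
   Assumption: depth >= 0 and cap is large enough for the line. */
int format_line(char *out, size_t cap, int depth, const char *token)
{
    assert(depth >= 0);
    /* "%*s" with an empty string pads to exactly `depth` spaces. */
    return snprintf(out, cap, "%*s%s\n", depth, "", token);
}
```

The same format string works directly with `printf` when the line goes straight to standard output.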

Sample Input
  
  
2
<html><body> <h1>ACM ICPC</h1> <p>Hello<br/>World</p> </body></html>
<html><body><p> Asia Chengdu Regional</p> <p class="icpc"> ACM-ICPC</p></body></html>

Sample Output
  
  
Case #1:
<html>
 <body>
  <h1>
   ACM ICPC
  </h1>
  <p>
   Hello
   <br/>
   World
  </p>
 </body>
</html>
Case #2:
<html>
 <body>
  <p>
   Asia Chengdu Regional
  </p>
  <p class="icpc">
   ACM-ICPC
  </p>
 </body>
</html>

Hint
Please be careful of the number of leading spaces of each line in above sample output.

Source

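/* The idea is to pull every meaningful token out of the input. Tokens fall into four kinds: open, close, Empty and String. Whether a token has <> and where its '/' sits is enough to tell them apart. The most troublesome part of the problem is exactly what its last sentence says: "You quickly realize that your only job is to deal with the white spaces."
Whitespace can be skipped with a single loop:  while((ch=getchar())&&ispace(ch));
The first character decides whether the token is a string or a tag. For a tag, keep reading until the character read is '>'. For a string, keep reading until the first delimiter. Note that a '<' that ends a string has already been consumed but is the first character of the next str, so the flag ok records that it must be stored there.
*/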
#include<stdio.h>
#include<string.h>
#define Empty 1
#define open 2
#define close 3
#define String 4
#define N 1000000
char str[N];
int type,ok;
bool ispace(char ch)
{
    return ch==32||ch==9||ch==10;// note: do not write the tab as a typed literal such as '   '; that gives PE (Presentation Error). Use ASCII code 9.
}
bool judge(char ch)
{
    return ch==32||ch==9||ch==10||ch=='<';
}
void putspace(int n)
{
    int i;
    for(i=1; i<=n; i++)
        printf(" ");
}
void nextline()
{
    int k=0;
    char ch;
    if(ok==1)
    {
        str[k++]='<';
        ok=0;
    }
    else
    {
       while((ch=getchar())&&ispace(ch));
       if(ch=='\n')
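       {// meant for lines that end in spaces or tabs, where nothing should be printed (ispace() already skips '\n', so this branch is not reached)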
           type=-1;
           return;
       }
       str[k++]=ch;
    }
    if(str[0]=='<')
    {
        while((ch=getchar())&&ch!='>')
            str[k++]=ch;
        str[k++]='>';
        if(str[k-2]=='/') type=Empty;
        else if(str[1]=='/') type=close;
        else  type=open;
        str[k]='\0';
    }
    else
    {
        while((ch=getchar())&&!judge(ch))
        {
            str[k++]=ch;
        }
        if(ch=='<')// the loop above has already consumed the '<', so remember it for the next token
            ok=1;
        str[k]='\0';
        type=String;
    }
}
int main()
{
    int t,cnt=0,num,flag,lastype;
    scanf("%d",&t);
    lastype=-1;
    num=0;
    flag=1; ok=0;
    while(1)
    {
        if(flag==1)
        {
            printf("Case #%d:\n",++cnt);
            flag=0; // adding an extra getchar() here gives Wrong Answer; use getchar() with care.
        }
        nextline();
        if(type==open)
        {
            if(lastype==String) printf("\n");
            putspace(num);
            printf("%s\n",str);
            num++;
        }
        else if(type==close)
        {
            //printf("%d\n",lastype);
            if(lastype==String)
            {
              if(strcmp(str,"</html>")!=0)
                printf("\n");
            }
            num--;
            putspace(num);
            printf("%s\n",str);
        }
        else if(type==Empty)
        {
            if(lastype==String)  printf("\n");
            putspace(num);
            printf("%s\n",str);
        }
        else if(type==String)
        {
            if(lastype==String)
                printf(" %s",str);
            else
            {
                putspace(num);
                printf("%s",str);
            }
        }
        else
        {
            lastype=-1;
            continue;
        }
        lastype=type;
        if(strcmp(str,"</html>")==0)
        {
            flag=1;
            if(cnt==t)
                break;
            lastype=-1;
            num=0; ok=0;
        }
    }
    return 0;
}
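
For comparison, the same idea can be sketched on a string that is already in memory, which avoids the `ok` flag because nothing is consumed before it is needed. This is an alternative illustration, not the submitted solution: `prettify`, `indent` and `is_blank` are hypothetical names, the input is assumed to be one valid document, and the caller is assumed to provide an output buffer large enough for the added indentation.

```c
#include <assert.h>
#include <string.h>

static int is_blank(char c) { return c == ' ' || c == '\t' || c == '\n'; }

/* Append `depth` spaces at position n of out; returns the new position. */
static size_t indent(char *out, size_t n, int depth)
{
    while (depth-- > 0)
        out[n++] = ' ';
    return n;
}

/* Prettify one valid document in doc and write the result to out.
   Returns the output length. Assumption: out is large enough. */
size_t prettify(const char *doc, char *out)
{
    size_t n = 0;
    int depth = 0;
    const char *p = doc;
    while (*p) {
        while (is_blank(*p)) ++p;            /* skip whitespace between tokens */
        if (!*p) break;
        if (*p == '<') {                     /* a tag: copy it unchanged */
            const char *end = strchr(p, '>');
            assert(end != NULL);             /* the input is promised to be valid */
            int closing = (p[1] == '/');
            int empty = (end[-1] == '/');
            size_t len = (size_t)(end - p) + 1;
            if (closing) --depth;            /* a closing tag sits one level up */
            n = indent(out, n, depth);
            memcpy(out + n, p, len);
            n += len;
            out[n++] = '\n';
            if (!closing && !empty) ++depth; /* children of an opening tag go deeper */
            p = end + 1;
        } else {                             /* a text run up to the next '<' */
            int first = 1;
            n = indent(out, n, depth);
            while (*p && *p != '<') {
                while (is_blank(*p)) ++p;
                if (!*p || *p == '<') break;
                if (!first) out[n++] = ' ';  /* single space between words */
                while (*p && *p != '<' && !is_blank(*p))
                    out[n++] = *p++;
                first = 0;
            }
            out[n++] = '\n';
        }
    }
    out[n] = '\0';
    return n;
}
```

A full program would still read the whole input, print the "Case #x:" header, and call this once per document.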


