
jsoup + Regex: Parsing HTML and Stripping HTML Tags

Open Source | 2021-4-27

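jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods.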

jsoup implements the HTML5 specification and parses HTML into the same DOM that modern browsers produce. It can:

1) Parse HTML from a URL, a file, or a string

2) Find and extract data using DOM traversal or CSS selectors

3) Manipulate HTML elements, attributes, and text
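
The three capabilities above can be sketched in a few lines. This is a minimal illustration, not code from the original project: the class name `JsoupBasics`, the helper `selectText`, and the sample HTML are made up for the example. It parses from a string; `Jsoup.parse(File, String)` and `Jsoup.connect(url).get()` cover the file and URL cases.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupBasics {

    // Hypothetical helper: text of every element matching the CSS query, joined by '|'
    static String selectText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);        // 1) parse HTML from a string
        Elements hits = doc.select(cssQuery);    // 2) find elements with a CSS selector
        StringBuilder sb = new StringBuilder();
        for (Element e : hits) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(e.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample HTML invented for this sketch
        String html = "<div><p class=\"name\">Alice</p><p class=\"dept\">Cardiology</p></div>";
        System.out.println(selectText(html, "p.name"));

        // 3) manipulate an element's attribute and text
        Document doc = Jsoup.parse(html);
        Element dept = doc.select("p.dept").first();
        dept.attr("title", "department").text("Internal Medicine");
        System.out.println(dept.outerHtml());
    }
}
```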

Note: jsoup is released under the MIT license, so it can be used in commercial projects without concern.


jsoup getting-started example (web crawler)

http://www.javacui.com/opensource/463.html 

Three ways to load HTML with Jsoup

http://www.javacui.com/opensource/464.html 


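The task here was to parse an electronic medical record stored as HTML. Analysis showed that the record is divided into sections by P tags, each holding that section's content. The HTML inside the P tags follows no clear format, however, so parsing it step by step is difficult.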

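A colleague pointed out that, once the HTML of each P tag has been retrieved, a regular expression can remove all the HTML tags and leave only the text. I tried it, and a short piece of code met the requirement.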
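
The idea can be tried on a small fragment before involving jsoup at all. The sketch below uses only the Java standard library; the class name `TagStripSketch`, the helper `stripTags`, and the sample fragment are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class TagStripSketch {

    // Matches from '<' to the nearest '>'; [\s\S] also matches line breaks,
    // so tags whose attributes wrap across lines are removed too
    private static final Pattern TAG = Pattern.compile("<[\\s\\S]*?>");

    // Hypothetical helper: drop every tag, then the &nbsp; entity
    static String stripTags(String html) {
        String text = TAG.matcher(html).replaceAll("");
        // String.replace treats its argument literally, so no regex escaping is needed
        return text.replace("&nbsp;", "");
    }

    public static void main(String[] args) {
        // Sample fragment invented for this sketch
        String fragment = "<span style=\"font-size: 12px;\">Name:"
                + "<span title=\"39\">XXX</span></span>&nbsp;";
        System.out.println(stripTags(fragment));
    }
}
```

One caveat of this pattern: a literal `>` inside an attribute value would end the match early and leave part of the tag behind, so it is worth checking the input for that case.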


Below is the HTML extracted from this medical record:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>电子病历手术记录</title></head>
  <body>
    <mate http-equiv="Content-Type" content="text/html; charset=utf-8" charset="utf-8"></mate>
    <div style="margin-bottom: 0px;">
      <p style="text-align: center;">
        <span style="font-family: 宋体, SimSun; font-size: 21px;">
          <strong>
            <span sde-model="" contenteditable="false" id="id1524386428512" name="name1524386428512" ele-keyname="43">
              <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
              <span title="43" style="color:#808080;" contenteditable="false">XXX医院</span>
              <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>
            <br></strong>
        </span>
      </p>
      <p style="text-align: center;">
        <span style="font-size: 12px;">姓名:
          <span sde-model="" contenteditable="false" id="id1524386542648" name="name1524386542648" ele-keyname="39">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="39" style="color:#808080;" contenteditable="false">XXX</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>&nbsp;性别:
          <span sde-model="" contenteditable="false" id="id1524386566826" name="name1524386566826" ele-keyname="35" keyval="W">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="35" style="color:#808080;" contenteditable="false">女</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>年龄:
          <span sde-model="" contenteditable="false" id="id1524386566827" name="name1524386566827" ele-keyname="34">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="34" style="color:#808080;" contenteditable="false">60</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>
          <span sde-model="{&quot;ID&quot;:&quot;id1569204207166&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;j&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;FF0808&quot;}" contenteditable="false" id="id1569204207166" name="name1569204207166" ele-keyname="3661" keyval="4">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="3661" style="color: rgb(128, 128, 128);" contenteditable="false">岁</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>科室:
          <span sde-model="" contenteditable="false" id="id1524386566828" name="name1524386566828" ele-keyname="30" keyval="18">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="30" style="color:#808080;" contenteditable="false">内科</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>病室:
          <span sde-model="" contenteditable="false" id="id1524386566829" name="name1524386566829" ele-keyname="2678" keyval="7">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="2678" style="color:#808080;" contenteditable="false">1</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>床号:
          <span sde-model="" contenteditable="false" id="id1524386566830" name="name1524386566830" ele-keyname="27" keyval="1298">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="27" style="color:#808080;" contenteditable="false">3</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>住院号:
          <span sde-model="" contenteditable="false" id="id1524386566831" name="name1524386566831" ele-keyname="17">
            <span style="color: rgb(128, 128, 128);" contenteditable="false">[</span>
            <span title="17" style="color:#808080;" contenteditable="false">52525252</span>
            <span style="color: rgb(128, 128, 128);" contenteditable="false">]</span></span>
        </span>
      </p>
      <hr></div>
    <p style="text-align: center;">
      <span style="font-size: 20px;">
        <strong>手术记录</strong></span>
    </p>
    <p>
      <span style="font-size: 16px;">
        <span sde-model="{&quot;ID&quot;:&quot;id1532241260523&quot;,&quot;TYPE&quot;:&quot;date&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;病程记录时间&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;病程记录时间&quot;,&quot;MAX&quot;:&quot;&quot;,&quot;MIN&quot;:&quot;&quot;,&quot;FORMAT&quot;:&quot;Y-m-d H:i:S&quot;,&quot;VALUE&quot;:&quot;&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;FF346A&quot;}" contenteditable="false" id="id1532241260523" name="name1532241260523" ele-keyname="3109">
          <span style="color:#808080;" contenteditable="false">[</span>
          <span title="3109" style="color:#808080;" contenteditable="true">2021-04-19 09:42:15</span>
          <span style="color:#808080;" contenteditable="false">]</span></span>
      </span>
      <span style="font-size: 20px;">
        <strong>
          <br></strong>
      </span>
    </p>
    <p>
      <span style="font-size: 16px; font-family: 宋体, SimSun;">
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">手术开始时间:</strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1532241220356&quot;,&quot;TYPE&quot;:&quot;date&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;日期时间&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;日期时间&quot;,&quot;MAX&quot;:&quot;&quot;,&quot;MIN&quot;:&quot;&quot;,&quot;FORMAT&quot;:&quot;Y-m-d H:i:S&quot;,&quot;VALUE&quot;:&quot;2021-04-19 09:42:17&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;FF346A&quot;}" contenteditable="false" id="id1532241220356" name="name1532241220356" ele-keyname="2564">
          <span style="color:#0000FF;" contenteditable="false">[</span>
          <span title="2564" style="color:rgb(0,0,0);" contenteditable="true">2021-04-19 09:42:17</span>
          <span style="color:#0000FF;" contenteditable="false">]</span></span>
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">手术结束时间:</strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1532241236113&quot;,&quot;TYPE&quot;:&quot;date&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;日期时间&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;日期时间&quot;,&quot;MAX&quot;:&quot;&quot;,&quot;MIN&quot;:&quot;&quot;,&quot;FORMAT&quot;:&quot;Y-m-d H:i:S&quot;,&quot;VALUE&quot;:&quot;2021-04-19 09:42:18&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;FF346A&quot;}" contenteditable="false" id="id1532241236113" name="name1532241236113" ele-keyname="2565">
          <span style="color:#0000FF;" contenteditable="false">[</span>
          <span title="2565" style="color:rgb(0,0,0);" contenteditable="true">2021-04-19 09:42:18</span>
          <span style="color:#0000FF;" contenteditable="false">]</span></span>
      </span>
    </p>
    <p>
      <span style="font-size: 16px; font-family: 宋体, SimSun;">
        <span contenteditable="false" style="font-weight: bold; white-space: nowrap; font-size: 16px; font-family: 宋体, SimSun;" id="98_8326a">术前诊断:</span>
        <span style="color:#0000FF;" contenteditable="false">{</span>
        <span outlineid="98" outlinekeyname="sqzd" style="color:gbk(0,0,0);" contenteditable="true">
          <span sde-model="{&quot;ID&quot;:&quot;id1615423213918&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;术前诊断&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;术前诊断&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;术前诊断&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615423213918" name="name1615423213918" ele-keyname="4368__" title="术前诊断" keycode="A00.900">
            <span style="color:#0000FF" contenteditable="false">[</span>
            <span title="术前诊断" style="color:#000000;" contenteditable="true">霍乱</span>
            <span style="color:#0000FF" contenteditable="false">]</span></span>
        </span>
        <span style="color:#0000FF;" contenteditable="false">}</span></span>
    </p>
      <span>&nbsp;
        <strong>
          <span contenteditable="false" name="name1566953853584">手术类别:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1566953853584&quot;,&quot;TYPE&quot;:&quot;select&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;手术类别&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;手术类别&quot;,&quot;REQUIRED&quot;:0,&quot;FREEINPUT&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;,&quot;VALUE&quot;:&quot;急诊手术&quot;,&quot;TEXT&quot;:&quot;急诊手术&quot;,&quot;REMOTEURL&quot;:&quot;&quot;,&quot;BINDINGDATA&quot;:[{&quot;VALUE&quot;:&quot;日间手术&quot;,&quot;TEXT&quot;:&quot;日间手术&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;急诊手术&quot;,&quot;TEXT&quot;:&quot;急诊手术&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;择期手术&quot;,&quot;TEXT&quot;:&quot;择期手术&quot;,&quot;SELECTED&quot;:0}]}" contenteditable="false" id="id1566953853584" name="name1566953853584" ele-keyname="3643">
          <span style="color:#0000FF;" contenteditable="false">[</span>
          <span title="3643" style="color: rgb(0, 0, 0);" contenteditable="true">急诊手术</span>
          <span style="color:#0000FF;" contenteditable="false">]</span></span>
        <strong>
          <span contenteditable="false" name="name1567158158826">是否微创:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1567158158826&quot;,&quot;TYPE&quot;:&quot;select&quot;,&quot;ISPRINT&quot;:&quot;Y&quot;,&quot;NAME&quot;:&quot;是否微创&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;是否微创&quot;,&quot;REQUIRED&quot;:0,&quot;FREEINPUT&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;,&quot;VALUE&quot;:&quot;是&quot;,&quot;TEXT&quot;:&quot;是&quot;,&quot;REMOTEURL&quot;:&quot;&quot;,&quot;BINDINGDATA&quot;:[{&quot;VALUE&quot;:&quot;是&quot;,&quot;TEXT&quot;:&quot;是&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;否&quot;,&quot;TEXT&quot;:&quot;否&quot;,&quot;SELECTED&quot;:0}]}" contenteditable="false" id="id1567158158826" name="name1567158158826" ele-keyname="3660">
          <span style="color:#0000FF;" contenteditable="false">[</span>
          <span title="3660" style="color: rgb(0, 0, 0);" contenteditable="true">是</span>
          <span style="color:#0000FF;" contenteditable="false">]</span></span>
      </span>
    </p>
    <p>
      <span style="font-size: 16px; font-family: 宋体, SimSun;">
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">
          <span contenteditable="false" name="name1534267325574">手术者指导者:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1615368733633&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;手术者指导者&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;手术者指导者&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;手术者指导者&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615368733633" name="name1615368733633" ele-keyname="3335__" title="手术者指导者" keyval="6">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="手术者指导者" style="color:#000000;" contenteditable="true">XX</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">&nbsp;
          <span contenteditable="false" name="name1532241467065">手术者:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1615368899407&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;手术者&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;手术者&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;手术者&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615368899407" name="name1615368899407" ele-keyname="2572__" title="手术者" isset="true" keyval="5,6">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="手术者" style="color:#000000;" contenteditable="true">WW,XX</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>&nbsp;&nbsp;
        <strong>
          <span contenteditable="false" name="name1562504319639">一助:</span></strong>
        <span contenteditable="false" name="name1562504319639">
          <span sde-model="{&quot;ID&quot;:&quot;id1615368944630&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;一助&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;一助&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;一助&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615368944630" name="name1615368944630" ele-keyname="3410__" title="一助" keyval="5,6">
            <span style="color:#0000FF" contenteditable="false">[</span>
            <span title="一助" style="color:#000000;" contenteditable="true">GG,HH</span>
            <span style="color:#0000FF" contenteditable="false">]</span></span>
        </span>&nbsp; &nbsp;
        <strong>
          <span contenteditable="false" name="name1562504319640">二助:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1615368997929&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;二助&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;二助&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;二助&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615368997929" name="name1615368997929" ele-keyname="3411__" title="二助" keyval="17">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="二助" style="color:#000000;" contenteditable="true">TTT</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>
        <br style="font-size: 16px; font-family: 宋体, SimSun;"></span>
    </p>
    <p>
      <strong>
        <span style="font-size: 16px; font-family: 宋体, SimSun;">手术麻醉方法:</span></strong>
      <span style="font-size: 16px; font-family: 宋体, SimSun;">
        <span sde-model="{&quot;ID&quot;:&quot;id1615430630235&quot;,&quot;TYPE&quot;:&quot;select&quot;,&quot;NAME&quot;:&quot;麻醉方法&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;麻醉方法&quot;,&quot;REQUIRED&quot;:0,&quot;FREEINPUT&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;,&quot;VALUE&quot;:&quot;11&quot;,&quot;TEXT&quot;:&quot;吸入麻醉&quot;,&quot;REMOTEURL&quot;:&quot;&quot;,&quot;BINDINGDATA&quot;:[{&quot;VALUE&quot;:&quot;1&quot;,&quot;TEXT&quot;:&quot;全身麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;11&quot;,&quot;TEXT&quot;:&quot;吸入麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;12&quot;,&quot;TEXT&quot;:&quot;静脉麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;13&quot;,&quot;TEXT&quot;:&quot;基础麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;2&quot;,&quot;TEXT&quot;:&quot;稚管内麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;21&quot;,&quot;TEXT&quot;:&quot;蛛网膜下腔阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;22&quot;,&quot;TEXT&quot;:&quot;硬脊膜外腔阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;3&quot;,&quot;TEXT&quot;:&quot;局部麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;31&quot;,&quot;TEXT&quot;:&quot;神经丛阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;32&quot;,&quot;TEXT&quot;:&quot;神经节阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;33&quot;,&quot;TEXT&quot;:&quot;神经阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;34&quot;,&quot;TEXT&quot;:&quot;区域阻滞麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;35&quot;,&quot;TEXT&quot;:&quot;局部浸润麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;36&quot;,&quot;TEXT&quot;:&quot;表面麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;4&quot;,&quot;TEXT&quot;:&quot;复合麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;41&quot;,&quot;TEXT&quot;:&quot;静吸复合全麻&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;42&quot;,&quot;TEXT&quot;:&quot;针药复合麻醉&quot;,&quot;SELECTED&quot;
:0},{&quot;VALUE&quot;:&quot;43&quot;,&quot;TEXT&quot;:&quot;神经丛与硬膜外阻滞复合麻醉&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;44&quot;,&quot;TEXT&quot;:&quot;全麻复合全身降温&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;45&quot;,&quot;TEXT&quot;:&quot;全麻复合控制性降压&quot;,&quot;SELECTED&quot;:0},{&quot;VALUE&quot;:&quot;9&quot;,&quot;TEXT&quot;:&quot;其他麻醉方法&quot;,&quot;SELECTED&quot;:0}]}" contenteditable="false" id="id1615430630235" name="name1615430630235" ele-keyname="2631" title="麻醉方法" isset="true">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="麻醉方法" style="color: rgb(0, 0, 0);" contenteditable="false">吸入麻醉</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>&nbsp;
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">
          <span contenteditable="false" name="name1534267356013">麻醉指导者:</span></strong>
        <span contenteditable="false" name="name1534267356013">
          <span sde-model="{&quot;ID&quot;:&quot;id1615369061026&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;麻醉指导者&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;麻醉指导者&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;麻醉指导者&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615369061026" name="name1615369061026" ele-keyname="3336__" title="麻醉指导者" keyval="5">
            <span style="color:#0000FF" contenteditable="false">[</span>
            <span title="麻醉指导者" style="color:#000000;" contenteditable="true">WW</span>
            <span style="color:#0000FF" contenteditable="false">]</span></span>
        </span>&nbsp;
        <strong>
          <span contenteditable="false" name="name1532241729677" style="font-size: 16px; font-family: 宋体, SimSun;">麻醉者</span></strong>
        <span contenteditable="false" name="name1532241729677">:</span>
        <span sde-model="{&quot;ID&quot;:&quot;id1615369113136&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;麻醉者&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;麻醉者&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;麻醉者&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615369113136" name="name1615369113136" ele-keyname="2632__" title="麻醉者" isset="true" keyval="6">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="麻醉者" style="color:#000000;" contenteditable="true">BB</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">
          <br style="font-size: 16px; font-family: 宋体, SimSun;"></strong>
        <strong style="font-size: 16px; font-family: 宋体, SimSun;"></strong>
      </span>
    </p>
    <p>
      <span style="font-size: 16px; font-family: 宋体, SimSun;">
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">
          <span contenteditable="false" name="name1615369665198">操作方法描述:</span></strong>
        <span sde-model="{&quot;ID&quot;:&quot;id1615369665198&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;操作方法描述&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;操作方法描述&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;操作方法描述&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615369665198" name="name1615369665198" ele-keyname="98251265" title="操作方法描述">
          <span style="color:#0000FF" contenteditable="false">[</span>
          <span title="操作方法描述" style="color:#000000;" contenteditable="true">操作方法描述</span>
          <span style="color:#0000FF" contenteditable="false">]</span></span>
        <strong style="font-size: 16px; font-family: 宋体, SimSun;">
          <br></strong>
      </span>
    </p>
    <p>
      <span contenteditable="false" name="name1524470891570">
        <span contenteditable="false" name="name1533108118633">
          <strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</strong>&nbsp;</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
        <span contenteditable="false" name="name1539756454252">医生:
          <span sde-model="{&quot;ID&quot;:&quot;id1615369381891&quot;,&quot;TYPE&quot;:&quot;text&quot;,&quot;NAME&quot;:&quot;医生&quot;,&quot;TAG&quot;:&quot;&quot;,&quot;DESCNAME&quot;:&quot;医生&quot;,&quot;VERIFYTYPE&quot;:&quot;text&quot;,&quot;VALUE&quot;:&quot;医生&quot;,&quot;REQUIRED&quot;:0,&quot;READONLY&quot;:0,&quot;COLOR&quot;:&quot;000000&quot;}" contenteditable="false" id="id1615369381891" name="name1615369381891" ele-keyname="3128__" title="医生" keyval="3">
            <span style="color:#0000FF" contenteditable="false">[</span>
            <span title="医生" style="color:#000000;" contenteditable="true">RRR</span>
            <span style="color:#0000FF" contenteditable="false">]</span></span>
        </span>&nbsp;</span>
      <br></p>
  </body>
</html>


Opened in a browser, the page looks like this:

[Screenshot of the rendered page: QQ截图20210427081825.jpg]


We do not need the CSS styling, only the text that this page displays.


The code follows. First, add the dependency to the POM:

<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.12.1</version>
</dependency>

Parsing code:

package com.example.demo;

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Uses jsoup plus a regular expression to strip every tag from an HTML file
 * @author 崔素强
 */
public class HtmlParse {
	public static void main(String[] args) throws Exception {
		File input = new File("D:\\emr.html");
		Document doc = Jsoup.parse(input, "UTF-8");
		Elements ts = doc.getElementsByTag("p");
		for (Element t : ts) {
			// Get the full inner HTML of this <p> element
			String str = t.html();
			// Remove everything that starts with < and ends with > (non-greedy, spans line breaks)
			String regex = "<([\\s\\S]*?)>";
			str = str.replaceAll(regex, "");
			// Remove other leftover characters. | . * [ ] \ { } are regex
			// metacharacters and must be escaped when passed to replaceAll
			str = str.replaceAll("&nbsp;", "");
			str = str.replaceAll("\\[", "");
			str = str.replaceAll("\\]", "");
			str = str.replaceAll("\\{", "");
			str = str.replaceAll("\\}", "");
			System.out.println(str);
		}
	}
}

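Output after parsing. The 手术类别 (surgery category) and 是否微创 (minimally invasive) fields do not appear: in the source they sit after a closing </p> and outside any <p> element, so getElementsByTag("p") never reaches them, and the stray </p> that follows them is parsed as an empty paragraph, which prints the blank line.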

     XXX医院   
姓名:   XXX 性别:   女 年龄:   60    岁 科室:   内科 病室:   1 床号:   3 住院号:   52525252  
 手术记录
   2021-04-19 09:42:15      
 手术开始时间:   2021-04-19 09:42:17  手术结束时间:   2021-04-19 09:42:18  
 术前诊断:     霍乱   

  手术者指导者:   XX   手术者:   WW,XX   一助:    GG,HH     二助:   TTT  
 手术麻醉方法:    吸入麻醉   麻醉指导者:    WW    麻醉者 :   BB     
  操作方法描述:   操作方法描述    
                             医生:   RRR
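
As an alternative to the regex, jsoup's own `Element.text()` returns the combined text of an element and its children with the tags already dropped and entities decoded. The sketch below is a different technique from the one used in this post, shown only for comparison. The class name `TextOnlySketch`, the helper `paragraphsAsText`, and the sample HTML are hypothetical, and the spacing of its result will differ from the output above because whitespace is collapsed and empty paragraphs are skipped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextOnlySketch {

    // Hypothetical helper: one cleaned line of text per non-empty <p> element
    static String paragraphsAsText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        for (Element p : doc.getElementsByTag("p")) {
            String line = p.text()                    // text of the element and its children, tags dropped
                    .replace('\u00A0', ' ')           // defensive: turn any remaining no-break space into a space
                    .replaceAll("[\\[\\]{}]", "")     // drop the [ ] { } field markers in one pass
                    .replaceAll("\\s+", " ")          // collapse runs of whitespace
                    .trim();
            if (!line.isEmpty()) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample HTML invented for this sketch, shaped like one field of the record
        String html = "<p><strong>Surgeon:</strong> <span>[</span><span>WW,XX</span><span>]</span>&nbsp;</p>";
        System.out.print(paragraphsAsText(html));
    }
}
```

The single character class `[\[\]{}]` replaces the four separate `replaceAll` calls, which keeps the escaping in one place.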


END
