我需要从一个相当古老的系统生成的日志文件中提取单个记录,并让它们准备好进行数据库输入。这些平面文件都是我可以提取的(只是格式化查询花了几周时间)。这是一个有两个记录的文件的例子。我看到的唯一分隔符是“/ 11 S11-”,它本身在5个字符的常规位置,但不是在开头或结尾处。
对于那些观看,是的,这与my other newb question有关。我查看了python文档,一些google结果和一些related questions。所以,我的问题是
a)如何使用在记录中开始5个字符的分隔符?
b)如何抓住这些自然语言的大块?
c)如何摆脱换行符后的空格?这可能是最简单的部分:我可以在查询中指定每个字段有多长。现在,accessionDate长度为10个字符,accessionNumber长度为10个字符,patMedicalRecordNum长度为15个字符。所以finalDxText上的空格是35个字符。
01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL
TUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOR
EXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIOR
SEMINAL VESICLE INVASION: PRESENT
MARGINS: UNINVOLVED
LYMPHOVASCULAR INVASION: PRESENT
PERINEURAL INVASION: PRESENT
LYMPH NODES (SPECIMENS B AND C):
NUMBER EXAMINED: 25
NUMBER INVOLVED: 1
DIAMETER OF LARGEST METASTASIS: 1.7 mm
ADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,
ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE
CARCINOMA
PATHOLOGIC STAGE: pT3b N1 MX
B. LYMPH NODES, RIGHT PELVIC, EXCISION:
- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).
C. LYMPH NODES, LEFT PELVIC, EXCISION:
- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).
01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.
TUMOR LOCATION: BILATERAL.
EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.
MARGINS: NEGATIVE.
PERINEURAL INVASION: IDENTIFIED.
LYMPH-VASCULAR INVASION: NOT IDENTIFIED.
SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.
LYMPH NODES: NONE SUBMITTED.
OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.
PATHOLOGIC STAGE (pTNM): pT2c NX.