Abstract
In the pharmaceutical industry, there is an abundance of regulatory documents used to understand the current regulatory landscape and proactively make project decisions. Due to the size of these documents, it is helpful for project teams to have informative summaries. We propose a novel solution, MedicoVerse, to summarize such documents using advanced machine learning techniques.
MedicoVerse uses a multi-stage approach: word embeddings are generated from the regulatory documents with the SapBERT model, these embeddings are put through a critical hierarchical agglomerative clustering step, and the resulting clusters are organized through a custom data structure. Each cluster is summarized using the bart-large-cnn-samsum model, and the cluster summaries are merged to create a comprehensive summary of the original document.
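The end-to-end flow described above can be sketched in a few dozen lines of Python. The snippet below is a minimal illustration, not the authors' implementation: it assumes sentence-level [CLS] embeddings from SapBERT, cosine distance with average linkage, and a fixed cluster count, whereas the paper's custom cluster data structure and its procedure for choosing the number of clusters are not reproduced here.

```python
# Minimal sketch of a MedicoVerse-style pipeline (illustrative assumptions noted below).
# Requires: torch, transformers, scikit-learn >= 1.2 (for metric="cosine").
import re

import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModel, AutoTokenizer, pipeline

EMBED_MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"   # SapBERT checkpoint
SUMM_MODEL = "philschmid/bart-large-cnn-samsum"                 # cluster-level summarizer

tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL)
encoder = AutoModel.from_pretrained(EMBED_MODEL)
summarizer = pipeline("summarization", model=SUMM_MODEL)

def embed_sentences(sentences):
    """Encode sentences with SapBERT; the [CLS] vector is taken as the sentence embedding."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0, :].numpy()

def summarize_document(text, n_clusters=6):
    """Cluster sentence embeddings, summarize each cluster, and merge the partial summaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())          # naive sentence splitter
    n_clusters = min(n_clusters, len(sentences))                  # fixed count; the paper tunes this
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average"
    ).fit_predict(embed_sentences(sentences))
    partial_summaries = []
    for cluster_id in sorted(set(labels)):
        chunk = " ".join(s for s, l in zip(sentences, labels) if l == cluster_id)
        partial_summaries.append(
            summarizer(chunk, max_length=120, min_length=20,
                       do_sample=False, truncation=True)[0]["summary_text"]
        )
    return " ".join(partial_summaries)
```

The average-linkage/cosine choice and the fixed cluster count are placeholders; in practice the number of clusters would be selected per document, for example with a silhouette-style criterion.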
We compare MedicoVerse results with established models T5, Google Pegasus, Facebook BART, and large language models such as Mixtral 8×7b instruct, GPT 3.5, and Llama-2-70b by introducing a scoring system that considers four factors: ROUGE score, BERTScore, business entities, and the Flesch Reading Ease.
Our results show that MedicoVerse outperforms the compared models, thus producing informative summaries of large regulatory documents.
Introduction
The pharmaceutical industry has witnessed a remarkable surge in published literature, encompassing a diverse range of topics within the life sciences. This valuable repository of knowledge includes journals, academic publications, and research papers in fields such as medicine, genetics, epidemiology, and more.
Pharmaceutical science literature serves as a rich source of information, providing comprehensive insights into the latest advancements and discoveries across life sciences domains. Researchers, medical professionals, and others rely on this extensive corpus to gain knowledge, enhance patient care, support drug development, and inform public health policy, thereby playing a vital role in advancing scientific understanding and promoting evidence-based decision-making.
Despite the invaluable knowledge contained in biomedical science literature, researchers often face significant challenges in staying current, and the sheer volume of publications makes it difficult to analyze the text efficiently. Given this scenario, there is a need for an effective approach to streamline the information extraction process and locate the key content of a text.
Text summarization, which has emerged as a promising solution in natural language processing (NLP), addresses the challenges posed by the extensive biomedical science literature.
Text summarization1 is the process of condensing a large amount of text into a concise, informative summary without compromising the underlying meaning of the original text. There are two primary approaches to text summarization: extractive summarization2,3 and abstractive summarization4. In extractive summarization, key phrases and words are extracted from the raw text and then merged to generate a summary.
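As a concrete illustration of the extractive idea (and not a method from this paper), the toy function below scores sentences by average word frequency and keeps the top-ranked ones in their original order.

```python
# Toy extractive summarizer: rank sentences by average word frequency, keep the top k.
import re
from collections import Counter

def extractive_summary(text, k=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```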
Abstractive summarization works by creating new sentences that convey the meaning of the original text. This method involves a two-step approach of first selecting important phrases and then paraphrasing them. Both techniques are typically framed as supervised machine learning problems. Research has demonstrated that neural-network-based abstractive summarization achieves state-of-the-art performance5,6.
These methods often employ encoder-decoder architectures7, which are typical of sequence-to-sequence models. The addition of the attention mechanism in Transformers4,8,9,10,11,12 has significantly enhanced these models. Currently, abstractive text summarization is commonly achieved with Transformer-based models and their variants, as they reduce computational requirements and enable parallel training13,14,15,16,17.
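With current libraries, running one of these pretrained Transformer summarizers takes only a few lines. The example below uses the Hugging Face pipeline API with facebook/bart-large-cnn, one of the baseline models compared later in this paper; the input text is illustrative only.

```python
# Abstractive summarization with a pretrained Transformer via the Hugging Face pipeline API.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = (
    "Diabetic ketoacidosis (DKA) and the hyperosmolar hyperglycemic state (HHS) are acute "
    "complications of diabetes. Initial management focuses on correcting volume depletion "
    "with isotonic saline, followed by insulin administration and electrolyte monitoring."
)
print(summarizer(text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```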
Early Transformer-based models for summarizing text were assessed using ROUGE scores, whereas newer models utilize BERTScore18 for tasks like text simplification19 and correcting grammatical errors20. Lately, there has been a shift towards using reinforcement learning to optimize rewards based on various evaluation criteria21,22, including the ROUGE-L score23.
The BERTSUM model14, which utilizes the BERT model24 for the Transformer encoder-decoder, has attained the highest performance on various datasets. Motivated by th.
Cluster 0 sentences discuss the water deficit in DKA and HHS, showing a thematic focus on the conditions' impact on hydration levels.
Cluster 1 sentences cover the initial fluid therapy, its direction, and factors influencing the choice of fluid, indicating a focus on treatment steps.
Cluster 2 sentences elaborate on the specifics of fluid therapy, such as the type of saline and rate of infusion.
Cluster 3 sentences form the largest group and seem to discuss the management of the patient's condition in a more comprehensive manner, covering various aspects of fluid therapy, insulin administration, and their effects.
Cluster 4 has a single sentence that emphasizes the importance of considering urinary losses, a unique aspect not covered in other clusters.
Cluster 5 contains a defining sentence about DKA and HHS, which is likely quite distinct from the operational treatment discussions in the other clusters.
Figure 3 shows the summary of a part of a regulatory document39 from Fig. 2. This summary indicates that key themes such as volume depletion in DKA and HHS, fluid therapy protocols, and the consideration of electrolytes and hydration status have been maintained. Specific details such as the rate and type of saline administration, the goal of therapy, and precautions with insulin administration in hypotensive patients have also been preserved.
The summary has reduced the text by almost 46%. This suggests that the clustering managed to eliminate redundant information while retaining the essence of the text. Across the 227 pieces of text from 38 regulatory documents, an average reduction of 37% was observed in the summaries. The ability to condense content without sacrificing quality or missing vital information showcases the effectiveness of the clustering approach. The summarizer also retains a high level of readability and presents the information in a manner that is accessible to both medical professionals and readers less familiar with the domain.
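The reduction figures quoted here correspond to a simple length ratio; whether length is measured in words, characters, or tokens is not stated in the text, so the word-count version below is an assumption.

```python
def reduction_percentage(original: str, summary: str) -> float:
    """Percentage by which the summary shortens the original, measured in words (assumed unit)."""
    original_len, summary_len = len(original.split()), len(summary.split())
    return 100.0 * (1.0 - summary_len / original_len)

# e.g. a 1,000-word source condensed to a 540-word summary gives a ~46% reduction.
```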
It demonstrates the potential of advanced natural language processing techniques to support researchers in distilling and communicating complex datasets effectively.
Evaluation metrics
A multifaceted approach was employed for a comprehensive evaluation of MedicoVerse. We utilized the ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L) to assess the quality of the generated summaries.
ROUGE44,45,46 (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for the automatic evaluation of text summarization and machine translation systems. The ROUGE metrics measure the quality of the generated summary or translation by comparing it to one or more reference summaries or translations.
The scores range from 0 to 1, with 1 indicating a perfect match with the reference.
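For reference, these scores can be computed with the open-source rouge-score package; the reference and candidate sentences below are purely illustrative.

```python
# Computing ROUGE-1, ROUGE-2, and ROUGE-L precision/recall/F1 with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Initial fluid therapy in DKA corrects volume depletion with isotonic saline."
candidate = "Isotonic saline is given first in DKA to correct volume depletion."
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```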
$$\text{ROUGE F1 Weighted Average} = 0.2 \times (\text{ROUGE-1 F1 Score}) + 0.2 \times (\text{ROUGE-2 F1 Score}) + 0.6 \times (\text{ROUGE-L F1 Score}) \qquad (1)$$
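Eq. (1) translates directly into a small helper; the F1 values would come from a ROUGE implementation such as the one shown above.

```python
def weighted_rouge_f1(rouge1_f1: float, rouge2_f1: float, rougeL_f1: float) -> float:
    """Weighted ROUGE F1 score as in Eq. (1): weights 0.2, 0.2, 0.6 on ROUGE-1, ROUGE-2, ROUGE-L."""
    return 0.2 * rouge1_f1 + 0.2 * rouge2_f1 + 0.6 * rougeL_f1
```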
This final score from Eq. (1) was used to compare multiple summarization models (bart-large-cnn-samsum, Facebook BART, Google Pegasus47, T5, Mixtral 8×7b instruct, GPT 3.5, and Llama-2-70b) in addition to our own approach. The weights of Eqs. (1) and (2) were determined empirically.
Evaluation
This paper’s primary objective is to assess the effectiveness of various summarization approaches and their ability to generate concise and coherent summaries for regulatory and PubMed documents.
We introduced a novel scoring technique to evaluate the effectiveness of our approach, integrating four key metrics: ROUGE, BERTScore, Unique Business KPIs, and Flesch Reading Ease. A detailed analysis of the results from our evaluation is illustrated through performance Tables 1 and 2, with discussions of key findings, and comparisons to existing approaches.
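A hedged sketch of how such a composite score might be assembled is shown below. The bert-score and textstat packages provide BERTScore and Flesch Reading Ease; the business-entity term is approximated here by simple string coverage of a supplied entity list, and the weights are placeholders, not the empirically determined weights of Eq. (2).

```python
# Illustrative composite score over the four metrics; entity handling and weights are placeholders.
from bert_score import score as bert_score   # pip install bert-score
import textstat                              # pip install textstat

def composite_score(reference, summary, rouge_weighted_f1, business_entities,
                    weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine ROUGE, BERTScore, entity coverage, and readability into a single number."""
    _, _, f1 = bert_score([summary], [reference], lang="en")
    bertscore_f1 = float(f1.mean())
    # Fraction of supplied domain/business entities that survive into the summary (crude proxy).
    covered = sum(e.lower() in summary.lower() for e in business_entities)
    entity_coverage = covered / max(len(business_entities), 1)
    # Flesch Reading Ease clipped to 0-100 and rescaled so every term lies in [0, 1].
    readability = min(max(textstat.flesch_reading_ease(summary), 0.0), 100.0) / 100.0
    w_rouge, w_bert, w_entity, w_read = weights
    return (w_rouge * rouge_weighted_f1 + w_bert * bertscore_f1
            + w_entity * entity_coverage + w_read * readability)
```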
We also delve into the implications of our findings and their potential applications in pharmaceutical and biomedical sciences.
Table 1: ROUGE scores for different models on summaries of a part of regulatory document48.
Table 2: Average score across all the models for ten sampled pieces of text from regulatory documents.
According to Table 1, the MedicoVerse model demonstrates robust performance across all ROUGE metrics, consistently achieving scores exceeding 0.4.
The MedicoVerse model displays recall scores of 0.57, 0.46, and 0.56 and precision scores of 0.86, 0.73, and 0.85, exhibiting a better balance between precision and recall than the other models. The ability to strike this balance ensures that the generated summaries contain essential information while avoiding excessive inclusion of irrelevant terms. By contrast, models such as philschmid/bart-large-cnn-samsum, Facebook/bart-large-cnn, Goo.
$$\text{Flesch Reading Ease} = 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right) \qquad (3)$$
The Reading Ease score of Eq. (3) ranges from 0 to 100, with higher values indicating easier readability.
Conclusion
The MedicoVerse text summarizer stands out as a pioneering solution in the field of pharmaceutical sciences, leveraging hierarchical agglomerative clustering with SapBERT embeddings and the philschmid/bart-large-cnn-samsum model.
Notably, it captures key business entities while maintaining a balance between precision and recall across various ROUGE metrics. Moreover, the summaries produced by MedicoVerse demonstrate a high readability index, falling within the moderate-to-easy range. This ensures that the conveyed information is accessible and understandable to a broad audience.
An advantage of MedicoVerse is its performance, coupled with the accessibility of free-to-use, easy-to-understand summaries enriched with relevant context. Such attributes position MedicoVerse as a promising option for both researchers and practitioners in pharmaceutical sciences. In comparison with the other models evaluated in our analysis, hierarchical clustering with Mixtral 8×7b emerges as the second-best performer, offering concise and domain-specific summaries.
However, it is important to note that models such as GPT 3.5 with hierarchical clustering produce text summaries that are useful for a broader audience. The Llama-2-70b model produces lengthy summaries that may lack the precision and relevance required within the pharmaceutical science domain.
Therefore, by leveraging MedicoVerse or similar state-of-the-art models integrated with hierarchical clustering, researchers and practitioners can efficiently distill vast amounts of information into concise and insightful summaries. As the landscape of large language models is rapidly evolving, there is.
Data availability
The data analyzed in this study may be made available upon request. Contact contributing author Sumit Ranjan for requests around the data.
Code availability
The code used in this study may be made available upon request. Contact contributing author Sumit Ranjan for requests around the code.
References
Bui, D., Del Fiol, G., Hurdle, J. & Jonnalagadda, S. Extractive text summarization system to aid data extraction from full text in systematic review development. J. Biomed. Inform. 64, 265–272 (2016).
Alguliev, R. & Aliguliyev, R. Evolutionary algorithm for extractive text summarization. Intell. Inf. Manag. 1, 128–138 (2009).
Sinha, A., Yadav, A. & Gahlot, A. Extractive Text Summarization Using Neural Networks (2018). Preprint at https://arxiv.org/abs/1802.10137.
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B. & Dos Santos, C. N. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond (2016). Preprint at https://arxiv.org/abs/1602.06023.
Lin, H. & Ng, V. Abstractive summarization: A survey of the state of the art. Proc. AAAI Conf. Artif. Intell. 33, 9815–9822 (2019).
Gupta, S. & Gupta, S. K. Abstractive summarization: An overview of the state of the art. Expert Syst. Appl. 121, 49–65 (2019).
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (2014).
Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473 (2014).
Luong, M.-T., Pham, H. & Manning, C. D. Effective Approaches to Attention-Based Neural Machine Translation. arXiv preprint arXiv:1508.04025 (2015).
See, A., Liu, P. J. & Manning, C. D. Get to the Point: Summarization with Pointer-Generator Networks. arXiv preprint arXiv:1704.04368 (2017).
Cohan, A. et al. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. arXiv preprint arXiv:1804.05685 (2018).
Vaswani, A. et al. Attention Is All You Need (2017). Preprint at https://arxiv.org/abs/1706.03762.
Zhang, H., Xu, J. & Wang, J. Pretraining-Based Natural Language Generation for Text Summarization. arXiv preprint arXiv:1902.09243 (2019).
Liu, Y. & Lapata, M. Text Summarization with Pretrained Encoders. arXiv preprint arXiv:1908.08345 (2019).
You, Y., Jia, W., Liu, T. & Yang, W. Improving abstractive document summarization with salient information modeling. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2132–2141 (2019).
Xu, S. et al. Self-attention guided copy mechanism for abstractive summarization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1355–1362 (2020).
Pilault, J., Li, R., Subramanian, S. & Pal, C. On extractive and abstractive neural document summarization with transformer language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9308–9319 (2020).
Zhang, M. et al. BERTScore: Evaluating Text Generation with BERT.
Bryant, C. et al. Grammatical error correction: A survey of the state of the art. Comput. Linguist. 49, 643–701. https://doi.org/10.1162/coli_a_00478 (2023).
Li, Y. Deep Reinforcement Learning: An Overview. arXiv preprint arXiv:1701.07274 (2017).
Keneshloo, Y., Shi, T., Ramakrishnan, N. & Reddy, C. K. Deep reinforcement learning for sequence-to-sequence models. IEEE Transact. Neural Netw. Learn. Syst. 31, 2469–2489 (2019).
Paulus, R., Xiong, C. & Socher, R. A Deep Reinforced Model for Abstractive Summarization. arXiv preprint arXiv:1705.04304 (2017).
Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). Preprint at https://arxiv.org/abs/1810.04805.
Ward, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).
Gowda, K. C. & Krishna, G. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognit. 10, 105–112. https://doi.org/10.1016/0031-3203(78)90018-3 (1978).
Patel, K., Patel, D., Golakiya, M., Bhattacharyya, P. & Birari, N. Adapting Pre-trained Word Embeddings for Use in Medical Coding. In Cohen, K., Demner-Fushman, D., Ananiadou, S. & Tsujii, J. (eds) BioNLP 2017 (Association for Computational Linguistics, Vancouver, Canada, 2017).
Wang, Y. et al. A comparison of word embeddings for biomedical natural language processing. J. Biomed. Inform. 87, 12–20 (2018).
Ushioda, A. Hierarchical clustering of words and application to NLP tasks. In Scott, D. (ed.) Fourth Workshop on Very Large Corpora (Association for Computational Linguistics, Herstmonceux Castle, Sussex, UK, 1996).
Murtagh, F. & Contreras, P. Methods of Hierarchical Clustering (2011). Preprint at https://arxiv.org/abs/1105.0121.
Lin, C.-Y. Looking for a few good metrics: ROUGE and its evaluation. Proc. of the 4th NTCIR Workshop, Tokyo, Japan (2004).
Crossley, S. et al. A large-scaled corpus for assessing text readability. Behav. Res. Methods 55, 491–507 (2022).
U.S. Food & Drug Administration. Establishment Registration and Device Listing. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfrl/rl.cfm.
National Library of Medicine, National Institutes of Health. PubMed. https://pubmed.ncbi.nlm.nih.gov/.
Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations (2020). Preprint at https://arxiv.org/abs/2010.11784.
Fiorini, N., Lipman, D. & Lu, Z. Cutting edge: Towards PubMed 2.0. eLife (2017).
Williamson, P. & Minter, C. Exploring PubMed as a reliable resource for scholarly communications services. J. Med. Libr. Assoc. 107, 16–29 (2019).
Spasic, I. & Nenadic, G. Clinical text data in machine learning: Systematic review. JMIR Med. Inform. 8 (2020).
Gosmanov, A., E.O., G. & A.E., K. Hyperglycemic crises: Diabetic ketoacidosis and hyperglycemic hyperosmolar state. Endotext [Internet] (2021).
Gliwa, B., Mochol, I., Biesek, M. & Wawer, A. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization (2019). Preprint at https://arxiv.org/abs/1911.12237.
John Snow Labs. Summarize Clinical Notes (augmented). https://nlp.johnsnowlabs.com/2023/03/30/summarizer_clinical_jsl_augmented_en.html.
Neumann, M., King, D., Beltagy, I. & Ammar, W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing (2019). Preprint at https://arxiv.org/abs/1902.07669.
Tarcar, A. et al. Healthcare NER Models Using Language Model Pretraining (2019). Preprint at https://arxiv.org/abs/1910.11241.
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Zaimis, E. (ed.) Text Summarization Branches Out, 74–81 (Association for Computational Linguistics, Barcelona, Spain, 2004).
Ganesan, K. ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks (2018). Preprint at https://arxiv.org/abs/1803.01937.
Cohan, A. & Goharian, N. Revisiting Summarization Evaluation for Scientific Articles (2016). Preprint at https://arxiv.org/abs/1604.00400.
Zhang, J., Zhao, Y., Saleh, M. & Liu, P. PEGASUS: Pre-training with Extracted Gap-Sentences for Abstractive Summarization (2019). Preprint at https://arxiv.org/abs/1912.08777.
Janssen. Rybrevant (amivantamab-vmjw), injection, BLA/NDA number: 761210, product quality review (2021). https://www.accessdata.fda.gov/drugsatfda_docs/nda/2021/761210Orig1s000ChemR.pdf.
Humaira, H. & Rasyidah, R. Determining the Appr.
Januzaj, Y., Beqiri, E. & Luma, A. Determining the optimal number of clusters using silhouette score as a data mining technique. Int. J. Online Biomed. Eng. (iJOE) 19, 174–182 (2023).
Barbella, M. & Tortora, G. ROUGE Metric Evaluation for Text Summarization Techniques (2022). Preprint at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4120317.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. & Artzi, Y. BERTScore: Evaluating Text Generation with BERT (2020). https://openreview.net/forum?id=SkeHuCVFDr.
Stajner, S., Evans, R., Orasan, C. & Mitkov, R. What can readability measures really tell us about text complexity? In Rello, L. & Saggion, H. (eds) Proceedings of the LREC'12 Workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA) (European Language Resources Association (ELRA), Istanbul, Turkey, 2012).
Wrigley Kelly, N., Murray, K., McCarthy, C. & O'Shea, D. An objective analysis of quality and readability of online information on Covid-19. Heal. Technol. 11, 1093–1099 (2021).
Acknowledgements
We would like to acknowledge Allison Lo and Valerie Carothers from Pfizer INC., Biotherapeutics & Pharmaceutical Sciences - Transformational Technology Digital Sciences, for contributions to edits. We would also like to acknowledge the scientists from Pfizer INC., Biotherapeutics & Pharmaceutical Sciences, for their involvement as subject matter experts in reviewing summaries of regulatory documents.
Author information
Authors and Affiliations
Biotherapeutics & Pharmaceutical Sciences, Pfizer INC., 235 E. 42nd Street, New York, NY, 10017, USA: Nick Steiger & Christopher Burns
Decision Sciences, MResult Corporation, 12 Roosevelt Avenue, Mystic, CT, 06355, USA: Sumit Ranjan, Yajna Bopaiah & Divya Chembachere
Applied Sciences, Lumilytics LLC, 436 N. Main St. #1004, Doylestown, PA, 18901, USA: Avinash Dalal & Varsha Daswani
Contributions
All authors contributed to the manuscript in equal ways.
Corresponding authors
Correspondence to Avinash Dalal or Sumit Ranjan.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material.
You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Reprints and permissions
About this article
Cite this article
Dalal, A., Ranjan, S., Bopaiah, Y. et al. Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.
Sci Rep 14, 20149 (2024). https://doi.org/10.1038/s41598-024-70618-w
Received: 19 March 2024
Accepted: 19 August 2024
Published: 30 August 2024
DOI: https://doi.org/10.1038/s41598-024-70618-w
Keywords
Text summarization, Regulatory documents, Hierarchical clustering, SapBERT, bart-large-cnn-samsum, BERTScore, ROUGE, Flesch reading ease, Mixtral 8×7b instruct, GPT 3.5, Llama-2-70b