1 INTRODUCTION
In neuroscience, researchers are fascinated with spatial anatomy, neural distribution, and neuroanatomical connectivity [1], and often use image data to explore these fields. However, image data usually has a specific focus in each experiment, such as a certain type of neuron or region of the brain, and is therefore not entirely comprehensive. In addition, it is very difficult to continuously carry out experimental research at the mesoscopic level across the entire brain. Therefore, more comprehensive knowledge from the literature is required to accurately design sophisticated experiments, which not only reduces the burden of image data collection and processing but also alleviates the problem at its source. However, manually searching large-scale literature for neuroscience-related knowledge is quite a challenge, creating an urgent need for automated knowledge extraction methods. In recent years, named entity recognition has been one of the most discussed technologies, since it can automatically extract name mentions from literature sources [2]. This technology is well suited for the automatic recognition and labeling of brain region entities in large-scale literature.
Named entity recognition is an information extraction technique that has been widely used to extract a variety of entities from the biomedical literature, such as genes, proteins [3–6], diseases [4,5,7–9], chemicals [4,6,8–10], mutations [11,12], species [13], and cell types [3,5,14]. Leaman et al. proposed a semi-Markov model [10] that relied on lexical features such as tokens, part-of-speech (POS) tags, and N-grams to extract two entity types: diseases and chemicals. Wang et al. proposed the MTM-CW model [3], which allows different types of biomedical training datasets, such as genes, diseases, and chemicals, to be combined; single-model errors were reduced by learning relevant knowledge across the field, thereby improving recall. Giorgi et al. proposed a transfer learning method based on Bi-LSTM-CRF [15] that trained the model on many silver standard datasets; by fine-tuning the parameters on gold standard datasets, the model effectively expanded its training data and produced better results. Lee et al. pre-trained the BERT model [16] with additional PubMed abstracts and PMC full texts [17], then fine-tuned it on multiple types of entity datasets, obtaining BioBERT models with state-of-the-art performance in the biomedical field. However, as far as we know, WhiteText [18] is the only available gold standard dataset with annotated brain region entities in the neuroscience field. This project was first proposed by French et al. in 2009: several neuroscientists were invited to manually annotate 1,377 abstracts from the Journal of Comparative Neurology (JCN), labeling 18,242 brain region entities. The WhiteText dataset was therefore used to extract brain region entities from the neuroscience literature. French et al. used dictionary-based and CRF-based methods [19] to evaluate the WhiteText dataset and achieved F1 scores of 44% and 79%, respectively. Richardet et al. [20] added species information and other features to the CRF-based method to reduce false positives, achieving 84.6% precision, 78.8% recall, and an 81.6% F1 score. Shardlow et al. [21] used the Bi-LSTM-CRF deep learning model to obtain the best reported result on the WhiteText dataset, an F1 score of 81.8%.
In this study, we first compared the results of a modified deep learning-based named entity recognition model with other deep learning models on the WhiteText dataset. The modified BioBERT-CRF model achieved an F1 score of 82.6%, the best result among these models. Next, we used the BioBERT-CRF model to extract brain region entities from a large-scale PubMed abstract dataset and normalized all extracted brain region entities to three neuroscience dictionaries to solve the problems of synonyms and different naming systems. Finally, we counted the number of different brain region entities that appeared in the large-scale abstracts and selected the top 30 brain region entities. Our study found a suitable deep learning-based method to extract brain region entities from abstracts, reflecting which anatomical regions researchers are most concerned about.
2 RESULTS
We used a modified named entity recognition model, the BioBERT-CRF model, as well as four benchmark named entity recognition models, the Bi-LSTM-CRF, MTM-CW, BERT, and BioBERT, to extract brain region entities from the WhiteText dataset. Then, we used the BioBERT-CRF model to extract brain region entities from a large-scale neuroscience literature dataset and normalized all extracted brain region entities to three neuroscience dictionaries.
The WhiteText dataset is a gold standard dataset of brain region entities constructed by French et al. in 2009 from 1,377 randomly selected abstracts from the JCN journal. Several neuroscientists were invited to manually label the brain region entities based on the literature content, neuroscience knowledge, reference brain atlases, and dictionaries; a total of 18,242 brain region entities were labeled, with an inter-annotator agreement rate of 90.7%. We divided the WhiteText dataset into a train set (1,000 abstracts), a validation set (150 abstracts), and a test set (227 abstracts), repeating the random split five times.
2.1 Comparison of the accuracy of different deep learning-based models for predicting brain region entities
We compared the results of five different named entity recognition models on the WhiteText dataset. Out of the 1,377 abstracts, we randomly selected 1,000 as the train set, 150 as the validation set, and the remaining 227 as the test set. We repeated this random split five times and used precision, recall, and F1 score as evaluation indicators. The detailed evaluation data can be found in the appendix. Box plots for the five models are shown in Fig.1.
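For illustration, the repeated random splits could be generated with a minimal Python sketch like the following; the loader function and the seed values are hypothetical, not part of the original pipeline:

```python
# Minimal sketch of the five repeated random splits (1,000/150/227 abstracts);
# load_whitetext_abstracts() is a hypothetical loader for the 1,377 abstracts.
import random

def split_whitetext(abstracts, seed):
    rng = random.Random(seed)
    shuffled = list(abstracts)
    rng.shuffle(shuffled)
    return shuffled[:1000], shuffled[1000:1150], shuffled[1150:]

# abstracts = load_whitetext_abstracts()  # 1,377 abstracts in total
# splits = [split_whitetext(abstracts, seed) for seed in range(5)]
```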
In addition, we examined specific sentences with predicted labels to compare the recognition results of the different deep learning models, as shown in Tab.1. The true labels and the labels predicted by each model are underlined. In this case, the Bi-LSTM-CRF and MTM-CW models failed to recognize all brain region entities accurately; both incorrectly included the preceding adjectives in the entity span. In contrast, the BERT, BioBERT, and BioBERT-CRF models recognized all entities accurately.
The first model is Bi-LSTM-CRF. With the exception of the third experiment, the precision, recall, and F1 score of Bi-LSTM-CRF on the validation set were higher than 81% and generally higher than the corresponding test set results, as shown in Tab.2. In the first and fifth experiments, its validation results were much higher than those on the test set, suggesting a certain degree of overfitting due to the limited dataset size. The overall F1 score on the test set fluctuated around 81%, indicating that the model is robust.
The second model is MTM-CW. Its results in Tab.3 were consistent with those of the Bi-LSTM-CRF model in Tab.2. On the validation and test sets, the model’s precision decreased by about 0.65%, recall increased by about 0.6%, and the F1 score remained essentially unchanged. Since the MTM-CW model shares training parameters across datasets, it is expected that precision would decrease, recall would increase, and the overall F1 score would remain unaffected compared to the benchmark Bi-LSTM-CRF model. The improved generalization ability makes the model better suited to extracting entities from multiple datasets; however, its overall training time was much longer than that of the single-dataset model.
The third model is BERT, pre-trained on Wikipedia and BooksCorpus and fine-tuned on the WhiteText dataset. Its performance was comparable to the previous models on both the validation set and the test set. Compared to the results of the MTM-CW model in Tab.3, the BERT model’s recall increased by about 0.6% and 0.91%, while its F1 score increased by about 0.1% and 0.65% on the validation set and test set (Tab.4), respectively. The higher recall suggests that the BERT model can effectively reduce false negatives.
The fourth model is BioBERT. Its average precision, recall, and F1 score on the validation set were 1.31%, 0.27%, and 0.84% higher than on the test set, as shown in Tab.5. The corresponding gaps for the Bi-LSTM-CRF benchmark model and the MTM-CW model were about 2%, 1%, and 1.5%, indicating that BioBERT overfits to a smaller degree. In addition, the various indicators of BioBERT showed less variance, with results between 80% and 85%. Compared to BERT, the average precision, recall, and F1 score of BioBERT on the test set improved by 0.13%, 0.48%, and 0.3%, respectively, indicating better overall performance.
The final model is BioBERT-CRF, a modified BioBERT model. Compared to the previous four models, the BioBERT-CRF model achieved the best F1 score on both the validation and the test set, likely due to the changes in its model structure, namely the change in self-attention heads and the addition of the CRF layer. The average precision of BioBERT-CRF is identical to that of BioBERT, while its average recall and F1 score on the test set increased by 0.81% and 0.49%, respectively, as shown in Tab.6.
Despite the effectiveness of the BioBERT-CRF model, there are still some error cases among the predicted entities, as shown in Tab.7. In case 1, the true brain region entity is the “ventral lateral geniculate nucleus”. The BioBERT-CRF model failed to recognize the correct left boundary, probably because of the directional phrases in front of it. In case 2, the word “rectum” is not a true brain region entity, but the BioBERT-CRF model recognized it as one, perhaps because the model has learned that entity words often follow species words. In case 3, “ependyma” and “choroid plexus” are both true brain region entities, but the BioBERT-CRF model did not recognize them, possibly because it had not seen these entities in the train set.
In summary, the BioBERT-CRF model demonstrated good performance in extracting brain region entities, but not without limitations. When recognizing a single brain region entity or multiple parallel brain region entities, it may fail to recognize them or recognize the wrong span. In addition, the model cannot detect boundaries correctly when directional words precede the entities.
We compared the results of the five named entity recognition models with those previously reported on the WhiteText dataset. The comparison results are shown in Tab.8.
The best model reported in the literature is the Bi-LSTM-CRF model of Shardlow et al., which achieved an F1 score of 81.8%. This result is equivalent to that of the benchmark Bi-LSTM-CRF model in this article; their higher precision may be due to better pre-processing and model parameters. In contrast, the MTM-CW model achieved higher recall due to its generalization ability, while the BERT and BioBERT models achieved both higher precision and higher recall. Among the four benchmark models, the BioBERT model demonstrated state-of-the-art performance. Its average recall across the five repeated experiments was as high as 83.2%, which is 4.4% higher than the traditional CRF model of Richardet et al. and 1.7% higher than the Bi-LSTM-CRF model of Shardlow et al. However, its average precision is lower than that of Richardet et al., likely because Richardet et al. combined several neuroanatomical lexicon features and a species feature with a linear-chain CRF, yielding higher precision at the cost of lower recall. The BioBERT model also improved the overall F1 score, and the BioBERT* model, trained with additional techniques, further improved precision, resulting in an F1 score of 82.6%. Compared to the previous models, the BioBERT-CRF model achieved the highest F1 score, which can be attributed to its beneficial structural changes, namely the change in self-attention heads and the addition of the CRF layer.
2.2 Brain region entity extraction in a large-scale PubMed abstract dataset
After comparing the accuracy of different deep learning-based models for predicting brain region entities, our results showed that the BioBERT-CRF model performed best in brain region entity extraction. Therefore, the trained BioBERT-CRF model was chosen to extract brain region entities from a large-scale PubMed abstract dataset. To ensure that the brain regions can be found in neuroscience-related literature, certain highly specific brain regions were removed from the Allen Brain Atlas (ABA) ontology [22], leaving 399 brain regions. A maximum of 200 abstracts were then randomly selected from PubMed for each brain region, yielding a total of 27,436 downloaded abstracts. The abstracts were pre-processed into the CoNLL-U format [2] and inputted into the trained BioBERT-CRF model, which predicted a total of 153,069 brain region entities.

All predicted entities were normalized according to three neuroscience dictionaries, specifically the ABA ontology, the BAMS ontology [23], and the NeuroNames lexicon [24]. This step solves the problems posed by synonyms and different naming systems. For example, the term “midbrain” has various abbreviations: in ABA and BAMS it is abbreviated as “MB”, but in NeuroNames it is abbreviated as “MBr”. Since the different abbreviations correspond to the same anatomical region, a rule-based method was used to match the brain region entities to their corresponding terms in the three neuroscience dictionaries. A total of 85.2% of the brain region entities were matched. For the remaining unmatched entities, all directional terms were removed and the entities were rematched. Finally, the number of times each brain region entity appeared in the 27,436 abstracts was counted. In Fig.2, different colors represent the different brain regions, gradients represent the different levels of each brain region in the mammalian sagittal plane, and different sizes represent the total number of times each brain region entity appeared in the literature. The results suggest that researchers are most interested in the cerebral cortex and brainstem, which reflects the current focus of neuroscience research.
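The rule-based normalization and counting step can be sketched in Python as follows; the merged synonym dictionary and the directional-word list are illustrative assumptions, not the exact resources used in this work:

```python
# Minimal sketch of rule-based normalization against the three dictionaries.
# `synonym_to_term` is assumed to map lower-cased synonyms/abbreviations
# (merged from the ABA, BAMS, and NeuroNames terms) to a canonical name.
from collections import Counter
from typing import Optional

DIRECTIONAL = {"ventral", "dorsal", "medial", "lateral", "anterior",
               "posterior", "rostral", "caudal", "left", "right"}

def normalize(entity: str, synonym_to_term: dict) -> Optional[str]:
    key = entity.lower().strip()
    if key in synonym_to_term:            # direct dictionary match
        return synonym_to_term[key]
    # Otherwise remove directional terms and try to rematch the remainder.
    stripped = " ".join(w for w in key.split() if w not in DIRECTIONAL)
    return synonym_to_term.get(stripped)

def count_regions(entities, synonym_to_term):
    # Count how often each normalized brain region appears in the abstracts.
    normalized = (normalize(e, synonym_to_term) for e in entities)
    return Counter(n for n in normalized if n is not None)
```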
The top 30 brain region entities from the 27,436 abstracts are shown in Tab.9. Results show the cerebrum as the anatomical region researchers are most concerned about, appearing a total of 103,149 times. We can also see that, within the cerebrum, the cerebral cortex is researched more often than the cerebral nuclei. In addition, researchers are also interested in the brainstem, somatomotor areas, and hindbrain, among other anatomical regions.
3 CONCLUSIONS
This study used one modified and four benchmark deep learning models, specifically the BioBERT-CRF, Bi-LSTM-CRF, MTM-CW, BERT, and BioBERT models, to extract brain region entities from the WhiteText dataset. The benchmark Bi-LSTM-CRF model achieved an 81.1% F1 score using the semantic information of the text to predict brain region entities. The MTM-CW model improved neuroscience brain region entity recognition thanks to its additional biomedical training datasets, reaching 81% recall on the recognition task. The BERT, BioBERT, and BioBERT-CRF models were trained in two steps: pre-training and fine-tuning. Unlike BERT, the BioBERT model was pre-trained on PubMed abstracts and PMC full texts, which resulted in a better recall of 83.2% and a better F1 score of 82.1%. On this basis, we added a CRF layer and changed the self-attention heads of the BioBERT model. The modified BioBERT-CRF model achieved the best recall and F1 score of 84.0% and 82.6%, respectively. Pre-training on biomedical literature helped the model learn the semantics of the field; fine-tuning on the WhiteText dataset helped it learn the semantics of brain region entities; and the transition matrix of the CRF layer helped reduce prediction errors. This combination achieved the best prediction results. However, the BioBERT-CRF model still has limitations. When recognizing a single-word brain region entity or multiple parallel entities, it may fail to recognize them or recognize them incorrectly, and boundary detection can be inaccurate when directional phrases precede the entity. The error examples are shown in the results section. These boundary detection and single-entity non-recognition issues should be addressed in future work.
We used the trained BioBERT-CRF model to predict brain region entities in the dataset of 27,436 PubMed abstracts and obtained a total of 153,069 entities. We then used a rule-based method to normalize all brain region entities according to three neuroscience dictionaries, specifically the ABA ontology, the BAMS ontology, and the NeuroNames lexicon; 85.2% of the brain region entities were matched to terms in these dictionaries. Although the normalization pipeline solved the problems of synonyms and different naming systems, the accuracy of standardization requires further improvement by introducing additional annotated terms into the dictionaries. The brain region entity recognition model and the rule-based normalization pipeline can then extract brain regions from a large set of neuroscience literature, such as the PubMed abstract dataset, and construct a brain region-to-abstract map in a context-sensitive manner to serve as the basis for a neuroscience-oriented search engine. In addition, the brain region entities were sorted by their frequency in the corpus, reflecting the anatomical regions researchers are most concerned about. In the future, we will extract the relationships between the different brain region entities recognized by the BioBERT-CRF model described in this work. Relation extraction will allow us to obtain further information from the literature: searching for two or more brain entities will provide a more accurate literature list, while extracting the relationships between different brain entities will help us learn about the neuroanatomical connectivity among brain structures from a bird’s-eye view.
4 MATERIALS AND METHODS
4.1 Dataset
We used the WhiteText dataset to evaluate our proposed framework. The WhiteText dataset is a gold standard dataset of brain region entities collected by French et al. in 2009. First, 1,377 abstracts concerning brain connectivity were selected from the JCN journal through keyword search. Then, an abbreviation expansion algorithm was used to replace the brain region abbreviations mentioned in the abstracts with their corresponding full names. Next, several neuroscientists were invited to manually label brain region entities based on neuroscience knowledge, reference brain atlases, and dictionaries [18]. The 1,377 abstracts contained 18,242 labeled brain entities, and the agreement rate among annotators was 90.7%. We converted the WhiteText dataset from XML to the most commonly used CoNLL-U format [2] and acquired 17,585 brain region entities. The segmented text and corresponding tags are represented line by line, separated by tabs, and sentences are separated by blank lines. The beginning of a document is marked with “-DOCSTART- -X- -X- -X- O”. The labels can follow either the BIO or the BIOES scheme, where “B” marks the beginning of an entity, “I” the inside of an entity, “E” the end of an entity, “O” a non-entity token, and “S” a single-token entity.
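For illustration, the following minimal Python sketch emits one sentence in the BIO-tagged, tab-separated layout described above; the sentence and the tag name B-/I-BrainRegion are our assumptions, not taken from the actual converted file:

```python
# Hypothetical example of writing one BIO-tagged sentence in CoNLL style.
tokens = ["Projections", "to", "the", "lateral", "geniculate", "nucleus", "."]
tags   = ["O", "O", "O", "B-BrainRegion", "I-BrainRegion", "I-BrainRegion", "O"]

print("-DOCSTART- -X- -X- -X- O\n")  # document start marker
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")         # one token-tag pair per line
```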
4.2 Word embeddings
In the Bi-LSTM-CRF and MTM-CW models, we initialized the word embedding matrix with pre-trained word2vec vectors obtained by Pyysalo et al. [25] from the Wikipedia corpus, PubMed abstracts, and PMC full texts. These word embeddings were trained with the skip-gram model described by Mikolov et al. [26]. In the BERT and BioBERT models, word2vec was not used to initialize the word embedding matrix, since these models learn WordPiece embeddings [27] from scratch during pre-training. WordPiece tokenization divides each word into a limited set of standard sub-word units, and a vector is learned for each sub-word unit. This pre-training step was also carried out on the Wikipedia corpus, PubMed abstracts, and PMC full texts. The embedding dictionary stores the learned vector for each unit.
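As a brief illustration, WordPiece tokenization can be reproduced with the Hugging Face transformers library; the checkpoint name below refers to the publicly released BioBERT weights, and the printed output is indicative only:

```python
# Minimal sketch of WordPiece tokenization with a BioBERT vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Out-of-vocabulary words are split into sub-word units prefixed with "##".
print(tokenizer.tokenize("ventral lateral geniculate nucleus"))
# e.g. ['ventral', 'lateral', 'gen', '##ic', '##ulate', 'nucleus']
```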
4.3 Evaluation
In NER, each predicted result can be a true positive (TP), false positive (FP), true negative (TN), or false negative (FN) with respect to the true label. The most common evaluation indicators for the prediction results are precision, recall, and F1 score. Precision is the ratio of the number of entities the model predicts correctly to the total number of entities predicted as positive, defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the ratio of the number of entities the model predicts correctly to the number of real entities annotated in the dataset, defined as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The F1 score is the harmonic mean of precision and recall, and is thus more representative. It is defined as follows:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
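A minimal Python sketch of these metrics, assuming entity-level counts of true positives, false positives, and false negatives (the example counts are made up):

```python
# Entity-level precision, recall, and F1 from TP/FP/FN counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 840 correct entities, 160 spurious, 150 missed.
print(precision_recall_f1(840, 160, 150))  # ≈ (0.840, 0.848, 0.844)
```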
4.4 Model
BioBERT-CRF: in natural language processing, transfer learning has been used to pre-train neural network language models on large amounts of unstructured text and then fine-tune them on the target task, compensating for the scarcity of gold standard data in the target domain. BERT is a representative pre-training model [16] that achieves better performance by pre-training on a large corpus and fine-tuning on a specific dataset. As a pre-training model for the biomedical field, BioBERT has achieved better results than the Bi-LSTM-CRF and BERT models in many natural language processing tasks [17]. Therefore, we applied the BioBERT model to entity extraction in the neuroscience field, fine-tuned it, and evaluated its extraction results on the neuroscience dataset. In addition, we added a CRF layer on top of the BioBERT model’s output to calculate label probabilities.
The BioBERT-CRF model first obtains the feature vector of the input text from three embeddings: WordPiece embedding [27], position embedding, and segment embedding. WordPiece tokenization divides each word into a limited set of standard sub-word units. Position embedding encodes the position of each token as a feature vector. Segment embedding distinguishes between two sentences: for sentence pairs, the feature value of the first sentence is set to 0 and that of the second sentence to 1. The feature vector is then fed into the bidirectional transformers. The transformer’s encoder, composed of multi-head attention and a fully connected layer, converts the input corpus into a feature vector; the decoder, composed of masked multi-head attention, multi-head attention, and a fully connected layer, takes the encoder output and produces the conditional probability of the final result. The model then obtains the predicted label sequence through a linear classifier. Next, the predicted label sequence is passed to the CRF layer, which multiplies the output of the BioBERT layer with a parameter matrix to obtain the state transition matrix $A$, where $A_{ij}$ represents the transition probability from the $i$-th label position to the $j$-th label position. The label score at each position of the sequence is calculated as the sum of the emission output $P$ of the BioBERT layer and the transition matrix $A$ of the CRF layer. The probability values are normalized by the softmax activation function, the loss is calculated with the maximum likelihood function, and the model parameters are estimated with the gradient descent algorithm. Finally, the parameters are updated layer by layer through backpropagation and the neural network is fine-tuned. The architecture of the model is shown in Fig.3.
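A minimal sketch of this architecture in PyTorch, assuming the Hugging Face transformers library and the pytorch-crf package; the checkpoint name and the three-tag BIO scheme are illustrative assumptions rather than the exact implementation used here:

```python
# Minimal sketch of a BioBERT encoder with a linear classifier and a CRF layer.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BioBertCrf(nn.Module):
    def __init__(self, num_tags: int = 3):  # e.g. B, I, O tags
        super().__init__()
        self.bert = BertModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
        self.dropout = nn.Dropout(0.1)
        # Linear classifier maps each token's hidden state to per-tag emission scores.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)
        # CRF layer learns the tag transition matrix A.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(emissions, mask=mask)

# Usage sketch:
# model = BioBertCrf(num_tags=3)
# loss = model(input_ids, attention_mask, labels)  # training
# tags = model(input_ids, attention_mask)          # inference (Viterbi)
```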
The parameters of the BioBERT-CRF model are as follows: the initial learning rate is , the number of epochs is 10, the batch size is 32, the dropout of the attention layer and hidden layer is set to 0.1, the optimization algorithm is Adam, there are 24 self-attention heads, and the remaining parameters follow the default BioBERT settings.