Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, together with instructions for its usage, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, are separated from the field of abstracts. Metadata are then saved as MetaData.R.
Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC; an illustrative code sketch of these steps follows the list.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters by a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words are substituted by "ztest", "wellknown" and "chisquare". Identification of such words is done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by a space.
6. Removing numbers: All digits that are not part of a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.
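The authors' actual pipeline is the R script LScD_Creation.R [2], which uses the 'tm' package. The following is only an illustrative Python sketch of the same Step 4 operations; the prefix and substitution lists are small hypothetical subsets standing in for "list_of_prefixes.csv" and "list_of_substitution.csv", and NLTK's Porter stemmer and stop-word list stand in for the tm resources.

```python
# Illustrative Python sketch of Step 4 (not the authors' R implementation).
import re
from nltk.corpus import stopwords          # stand-in for tm's 174-word English stop list
from nltk.stem import PorterStemmer        # stand-in for tm's stemmer

PREFIXES = ["pre", "non", "self"]                      # hypothetical subset of list_of_prefixes.csv
SUBSTITUTIONS = {"z-test": "ztest",                    # hypothetical subset of list_of_substitution.csv
                 "well-known": "wellknown",
                 "chi-square": "chisquare"}

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(abstract):
    text = re.sub(r"[^0-9A-Za-z-]", " ", abstract)     # 1. non-alphanumeric (except "-") -> space
    text = text.lower()                                # 2. lowercase
    for p in PREFIXES:                                 # 3. unite prefixes: "pre-processing" -> "preprocessing"
        text = re.sub(rf"\b{p}-", p, text)
    for old, new in SUBSTITUTIONS.items():             # 4. substitute listed "-" words
        text = text.replace(old, new)
    text = text.replace("-", " ")                      # 5. remaining "-" -> space
    text = re.sub(r"\b\d+\b", " ", text)               # 6. stand-alone numbers -> space
    # 8 & 7. stop-word removal and stemming (stop words are removed before stemming
    # here so that the unstemmed NLTK list matches the tokens)
    tokens = [t for t in text.split() if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The z-test and pre-processing of CO2 data in the 21st century (n = 100)"))
```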
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below.
Word: Contains the unique words from the corpus. All words are lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
Number of Documents Containing the Word: Here, a binary count is used: if a word exists in an abstract, it is counted as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: Contains how many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: replace them with the full path of the directory containing the source files and the full path of the directory to write output files.
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
Sequential pattern mining is the discovery of subsequences that are frequent in a set of sequences. The process is similar to frequent itemset mining, except that the input database is ordered. The output of a sequential pattern mining algorithm is the set of frequent sequential patterns, i.e. subsequences whose frequency in the database is greater than or equal to a user-specified minimum support.
Consider the data set shown in Table 1, where each tuple pairs events with their instants of occurrence.
[Table 1 (image): https://pasteboard.co/JRNB4rH.png]
We can note that, for a fixed minimum support threshold equal to 1, the pattern < A, B, C > is considered frequent because its support (its number of occurrences in the database) is equal to 2.
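As a quick illustration of how this support is computed, the following sketch counts the number of sequences containing a given pattern as a (not necessarily contiguous) subsequence. The sequence database used here is a hypothetical encoding; the actual Table 1 additionally attaches an instant of occurrence to each event.

```python
# Minimal sketch: counting the support of a sequential pattern.
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    return all(event in it for event in pattern)

database = [                      # hypothetical stand-in for the sequences of Table 1
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["B", "C"],
]

pattern = ["A", "B", "C"]
support = sum(contains(seq, pattern) for seq in database)
print(support)  # 2 -> the pattern is frequent for any minimum support threshold <= 2
```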
Consider again the example given in Table 1. < A, B, C > is a frequent sequential pattern: it shows that events A, B, and C frequently occur in a sequential manner, but it provides no additional information about the gaps between them. For instance, we do not know when B would happen, knowing that A already did. Therefore, we ask you to provide a richer pattern in which time constraints are considered. In our example data set, we can deduce that A, B, and C occur sequentially, that B occurs after A by at least 1 and at most 5 instants, and that C occurs after B within an interval of [2, 4] instants. We represent this pattern as A[1,5]B and B[2,4]C. It is a directed graph whose nodes are events and whose edges carry the instant intervals, denoted as time constraints, as shown in Figure 1.
[Figure 1 (image): https://pasteboard.co/JRNBWWL.png]
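A small sketch of how such interval constraints can be derived: for each pair of consecutive pattern events, collect the gaps observed in the database and keep their minimum and maximum. The timestamped sequences below are hypothetical stand-ins for Table 1, chosen so the output matches the pattern described above.

```python
# Sketch: deriving [min, max] gap constraints between consecutive pattern events.
database = [                                   # hypothetical (event, instant) sequences
    [("A", 1), ("B", 2), ("C", 6)],
    [("A", 3), ("B", 8), ("C", 10)],
]

def gap(sequence, first, second):
    """Time gap between the first occurrence of `first` and the next `second` after it."""
    t_first = next(t for e, t in sequence if e == first)
    t_second = next(t for e, t in sequence if e == second and t > t_first)
    return t_second - t_first

for a, b in [("A", "B"), ("B", "C")]:
    observed = [gap(seq, a, b) for seq in database]
    print(f"{a}[{min(observed)},{max(observed)}]{b}")
# Prints A[1,5]B and B[2,4]C for the hypothetical data above.
```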
Formally:
Definition (Event). An event is a couple (e, t) where e ∈ E is the type of the event and t ∈ T is its time.
Definition (Sequence). Let E be a set of event types and T a time domain such that T ⊆ R. E is assumed totally ordered and is denoted #
This dataset contains information about posts made on a famous cosmetic brand's Facebook page from 1 January to 31 December 2014. Each row represents a single post and includes the following attributes:
Citation: (Moro et al., 2016) S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, Elsevier, In press. Available at: http://dx.doi.org/10.1016/j.jbusres.2016.02.010
In the past, the majority of data analysis use cases were addressed by aggregating relational data. In recent years, a trend called "Big Data" has been evolving, with several implications for the field of data analysis. Compared to previous applications, much larger data sets are analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. At the same time, analysis applications include data sets with little or even no structure at all. This evolution has implications for the requirements on data processing systems. Due to the growing size of data sets and the increasing computational complexity of advanced analysis methods, data must be processed in a massively parallel fashion. The large number and diversity of data analysis techniques, as well as the lack of data structure, motivate the use of user-defined functions and data types. Many traditional database systems are not flexible enough to satisfy these requirements. Hence, there is a need for programming abstractions to define and efficiently execute complex parallel data analysis programs that support custom user-defined operations. The success of the SQL query language has shown the advantages of declarative query specification, such as potential for optimization and ease of use. Today, most relational database management systems feature a query optimizer that compiles declarative queries into physical execution plans. Cost-based optimizers choose, from billions of plan candidates, the plan with the least estimated cost. However, traditional optimization techniques cannot be readily integrated into systems that aim to support novel data analysis use cases. For example, the use of user-defined functions (UDFs) can significantly limit the optimization potential of data analysis programs. Furthermore, a lack of detailed data statistics is common when large amounts of unstructured data are analyzed. This leads to imprecise optimizer cost estimates, which can cause sub-optimal plan choices.
In this thesis we address three challenges that arise in the context of specifying and optimizing data analysis programs. First, we propose a parallel programming model with declarative properties to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models. Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions under which neighboring UDF operators can be reordered without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate relational optimizations such as filter and join reordering and holistic aggregation push-down, while not being limited to relational operators. Finally, we analyze the impact of changing execution conditions, such as varying predicate selectivities and memory budgets, on the performance of relational query plans. We identify plan patterns that cause significantly varying execution performance under changing execution conditions. Plans that include such risky patterns are prone to cause problems in the presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions computed from optimizer estimates.
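As a rough, purely conceptual illustration of the programming model described above (not the thesis' actual API), an operator pairs a system-provided second-order function, here a parallel map over partitions, with a user-defined first-order function that the system otherwise treats as a black box:

```python
# Conceptual sketch: second-order function composed with a user-defined first-order function.
from concurrent.futures import ThreadPoolExecutor

def map_operator(udf, partitions):
    """System-provided second-order function: applies the user-defined
    first-order function `udf` to every record, one task per partition."""
    def run_partition(partition):
        return [udf(record) for record in partition]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_partition, partitions))

# User-defined first-order function: opaque to the system, which is exactly
# what limits optimization unless its properties can be extracted.
def to_upper(record):
    return record.upper()

partitions = [["a", "b"], ["c"]]           # hypothetical partitioned input
print(map_operator(to_upper, partitions))  # [['A', 'B'], ['C']]
```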
1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic acquisition scheme for the literature to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. First, 55 materials science articles related to the NASICON system are collected through the Crystallographic Information File (CIF), which contains a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
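A minimal sketch of the PDF-to-text and regex cleanup steps described above, using pdfminer.six and Python's re module. The file name, the cleanup patterns, and the "<SYM>" placeholder token are illustrative assumptions, not the paper's actual rules or token.

```python
# Sketch: PDF -> plain text with PDFMiner, then regex-based cleanup.
import re
from pdfminer.high_level import extract_text   # pdfminer.six

text = extract_text("nasicon_paper.pdf")        # hypothetical input file

# Re-join words hyphenated at line ends and collapse layout-induced whitespace.
text = re.sub(r"-\n", "", text)
text = re.sub(r"\s+", " ", text)

# Replace unreadable symbols with a placeholder token instead of deleting them,
# since they may carry information such as chemical units ("<SYM>" is illustrative).
text = re.sub(r"[^\x00-\x7F]", " <SYM> ", text)

with open("nasicon_paper.txt", "w", encoding="utf-8") as f:
    f.write(text)
```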
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Collection:
Activity Categories:
Data Description:
Data Transformation:
Availability:
Dataset Name: WISDM Smartphone and Smartwatch Activity and Biometrics Dataset
Subjects and Tasks:
Data Collection Setup:
Sensor Characteristics:
Device Specifications:
| Information | Details |
|---|---|
| Number of subjects | 51 |
| Number of activities | 18 |
| Minutes collected per activity | 3 |
| Sensor polling rate | 20 Hz |
| Smartphone used | Google Nexus 5/5X or Samsung Galaxy S5 |
| Smartwatch used | LG G Watch |
| Number of raw measurements | 15,630,426 |
| Activity | Activity Code |
|---|---|
| Walking | A |
| Jogging | B |
| Stairs | C |
| Sitting | D |
| Standing | E |
| Typing | F |
| Brushing Teeth | G |
| Eating Soup | H |
| Eating Chips | I |
| Eating Pasta | J |
| Drinking from Cup | K |
| Eating Sandwich | L |
| Kicking (Soccer Ball) | M |
| Playing Catch w/Tennis Ball | O |
| Dribbling (Basketball) | P |
| Writing | Q |
| Clapping | R |
| Folding Clothes | S |
Non-hand-oriented activities:
Hand-oriented activities (General):
Hand-oriented activities (eating):
| Field Name | Description |
|---|---|
| Subject-id | Type: Symbolic numeric identifier. Uniquely identifies the subject. Range: 1600-1650. |
| Activity code | Type: Symbolic single letter. Range: A-S (no "N" value) |
| Time... |
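A possible way to load and summarize one of the raw sensor files, assuming a comma-separated layout following the field table above (subject id, activity code, timestamp, then the three sensor axes). The file name and the handling of a trailing ';' per line are assumptions about how the raw files are distributed, not facts stated in this description.

```python
# Sketch: loading one raw WISDM accelerometer file and counting samples per activity.
import pandas as pd

cols = ["subject_id", "activity_code", "timestamp", "x", "y", "z"]
df = pd.read_csv("data_1600_accel_phone.txt", header=None, names=cols)   # hypothetical file
df["z"] = df["z"].astype(str).str.rstrip(";").astype(float)              # drop trailing ';' if present

print(df["activity_code"].value_counts())      # ~3 minutes at 20 Hz -> roughly 3600 rows per activity
print(df.groupby("activity_code")[["x", "y", "z"]].mean())               # simple per-activity summary
```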
The file consists of the locations of 300 places in the US. Each location is a two-dimensional point that represents the longitude and latitude of the place. For example, "-112.1,33.5" means the longitude of the place is -112.1 and the latitude is 33.5. From the course Data Mining / Cluster Analysis by the University of Illinois at Urbana-Champaign.
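Since the file is intended for cluster analysis, a small sketch of clustering these points with k-means is shown below. The file name "places.txt" and the choice of k = 3 are assumptions for illustration only.

```python
# Sketch: k-means clustering of the 300 longitude/latitude points.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("places.txt", delimiter=",")   # each line: longitude,latitude e.g. -112.1,33.5
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

for label, center in enumerate(kmeans.cluster_centers_):
    print(f"cluster {label}: center at longitude {center[0]:.2f}, latitude {center[1]:.2f}")
```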
D2Hawkeye is a medical data mining company. The company receives claims data: data generated when an insured patient goes to a medical provider to receive a diagnosis, to have a procedure (for example, an x-ray), or to obtain drugs. Medical providers need to be compensated, so the claims data provide the means for them to be paid. An important question is whether we can assess the quality of health care given these claims data.
Why is assessing the quality of healthcare an important objective?
Critical decisions are often made by people with expert knowledge
Healthcare Quality Assessment
- Good quality care educates patients and controls costs
- Need to assess quality for proper medical interventions
- No single set of guidelines for defining quality of healthcare
- Health professionals are experts in quality of care assessment
- Experts are humans
Healthcare Quality Assessment
- Expert physicians can evaluate quality by examining a patient's records
- This process is time consuming and inefficient
- Physicians cannot assess quality for millions of patients
- Can we develop analytical tools that replicate expert assessment on a large scale?
- Learn from expert human judgment
- Develop a model, interpret results, and adjust the model
- Make predictions/evaluations on a large scale
Healthcare Quality Assessment: let's identify poor healthcare quality using analytics.
Building the Dataset
Claims Data
- Electronically available
- Standardized
- Not 100% accurate
- Under-reporting is common
- Claims for hospital visits can be vague
Creating the Dataset: Claims Sample
- Large health insurance claims database
- Randomly selected 131 diabetes patients
- Ages range from 35 to 55
- Costs $10,000 - $20,000
- September 1, 2003 - August 31, 2005
Creating the Dataset: Expert Review
- An expert physician reviewed claims and wrote descriptive notes: "Ongoing use of narcotics"; "Only on Avandia, not a good first choice drug"; "Had regular visits, mammogram and immunizations"; "Was given home testing supplies"
Creating the Dataset: Expert Assessment
- Rated quality on a two-point scale (poor/good): "I'd say care was poor - poorly treated diabetes"; "No eye care, but overall I'd say high quality"
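A minimal sketch of the kind of analytical tool described above: a logistic regression that learns to replicate the expert's poor/good labels from claims-derived variables. The file name, feature names, and label column here are hypothetical stand-ins, not the actual dataset schema.

```python
# Sketch: predicting the expert's poor/good quality label from claims-derived features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

claims = pd.read_csv("claims_sample.csv")            # hypothetical file with the 131 labelled patients
features = ["office_visits", "narcotics_claims"]     # hypothetical claims-derived features
X_train, X_test, y_train, y_test = train_test_split(
    claims[features], claims["poor_care"], test_size=0.25, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
print("coefficients:", dict(zip(features, model.coef_[0])))                # interpret the model
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```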