License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LScDC (Leicester Scientific Dictionary-Core Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary), Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the third versions of LScD and LScDC are summarized below.

| Dictionary | # of words |
| --- | --- |
| LScD (v3) | 972,060 |
| LScDC (v3) | 103,998 |

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2] Getting Started. This file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts. The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, the use of an enormous amount of text data challenges the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rarely occurring words are not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we applied the following process to the LScD: removing words that appear in no more than 10 documents (...)
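The sub-setting step described above (computing document frequencies and dropping words that occur in 10 or fewer documents) can be illustrated with a short sketch. This is not the authors' code; the file names and the corpus format (one pre-processed, lemmatized document per line) are assumptions for illustration.

```
from collections import Counter

# Assumed input: one pre-processed (lemmatized) document per line.
with open("lsc_corpus.txt", encoding="utf-8") as f:
    documents = [line.split() for line in f]

# Document frequency: number of documents containing each word.
doc_freq = Counter()
for doc in documents:
    doc_freq.update(set(doc))

# Keep words appearing in more than 10 documents, ordered by document frequency.
core = sorted(
    ((word, df) for word, df in doc_freq.items() if df > 10),
    key=lambda item: item[1],
    reverse=True,
)

with open("lscdc_core.csv", "w", encoding="utf-8") as out:
    for word, df in core:
        out.write(f"{word},{df}\n")
```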
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Benchmarking is a crucial step in evaluating virtual screening methods for drug discovery. One major issue that arises among benchmarking data sets is the lack of a standardized format for representing the protein and ligand structures used to benchmark the virtual screening method. To address this, we introduce the Directory of Useful Benchmarking Sets (DUBS) framework, a simple and flexible tool to rapidly create benchmarking sets using the Protein Data Bank. DUBS uses a simple text-based input format along with the Lemon data mining framework to efficiently access and organize data from the Protein Data Bank and output commonly used inputs for virtual screening software. The simple input format used by DUBS allows users to define their own benchmarking data sets and access the corresponding information directly from the software package. Currently, it takes DUBS less than 2 min to create a benchmark using this format. Since DUBS uses a simple Python script, users can easily modify it to create more complex benchmarks. We hope that DUBS will be a useful community resource providing a standardized representation for benchmarking data sets in virtual screening. The DUBS package is available on GitHub at https://github.com/chopralab/lemon/tree/master/dubs.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables: pages, items, link_annotated_text, and properties (the tables referenced by the query helpers below).
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    """Async helper for LIKE-based lookups against the portable SQLite DB."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages with their Wikidata items and link-annotated text.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # Further helpers are truncated in the original description, e.g.:
    # async def search_properties_by_label_or_desc...
```
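A minimal usage sketch for the class above; the database file name kensho_wikipedia.sqlite is an assumption, so point it at your local copy:

```
import asyncio

async def main():
    # Open the portable SQLite DB and run a LIKE search over page titles.
    async with KenshoDatasetQuery("kensho_wikipedia.sqlite") as db:
        rows = await db.search_pages_by_title("machine learning")
        for page_id, item_id, title, views, labels, description, sections in rows[:5]:
            print(page_id, title, views)

asyncio.run(main())
```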
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset has been collected in the frame of the Prac1 assignment of the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).
The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets, the first 30,000 books and then the remaining 22,478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30,000 records and in "Month Day Year" format for the rest (see the parsing sketch below).
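A minimal sketch for normalizing the two date formats described above; the CSV file name best_books_ever.csv, the exact format tokens, and the ordinal-suffix cleanup are assumptions, not part of the original retrieval code:

```
import re

import pandas as pd

df = pd.read_csv("best_books_ever.csv")  # assumed file name

def parse_mixed_date(value):
    """Handle both the mm/dd/yyyy chunk and the 'Month Day Year' chunk."""
    if pd.isna(value):
        return pd.NaT
    text = str(value).strip()
    # First 30,000 records: e.g. "09/14/2006"
    parsed = pd.to_datetime(text, format="%m/%d/%Y", errors="coerce")
    if pd.isna(parsed):
        # Remaining records: e.g. "September 14th 2006"; drop ordinal suffixes first.
        cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
        parsed = pd.to_datetime(cleaned, errors="coerce")
    return parsed

for col in ["publishDate", "firstPublishDate"]:
    df[col] = df[col].apply(parse_mixed_date)
```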
Book cover images can be optionally downloaded from the URL in the 'coverImg' field. Python code for doing so and an example can be found in the GitHub repo.
The 25 fields of the dataset are:
| Attribute | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field: percent of ratings above 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We release two datasets that are part of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data, co-located with the 19th International Semantic Web Conference (https://iswc2020.semanticweb.org/, 2-6 Nov 2020, Athens, Greece). The datasets belong to two shared tasks related to product data mining on the Web: (1) product matching (linking) and (2) product classification. This event is organised by The University of Sheffield, The University of Mannheim and Amazon, and is open to anyone. Systems that successfully beat the baseline of the respective task will be invited to write a paper describing their method and system and to present the method as a poster (and potentially also a short talk) at the ISWC2020 conference. Winners of each task will be awarded 500 euro as a prize (partly sponsored by Peak Indicators, https://www.peakindicators.com/).
The challenge comprises two tasks: product matching and product classification.
i) Product Matching deals with identifying product offers on different websites that refer to the same real-world product (e.g., the same iPhone X model offered under different names/offer titles as well as different descriptions on various websites). A multi-million product offer corpus (16M offers) containing product offer clusters is released for the generation of training data. A validation set containing 1.1K offer pairs and a test set of 600 offer pairs will also be released. The goal of this task is to classify whether the offer pairs in these datasets are a match (i.e., referring to the same product) or a non-match.
ii) Product Classification deals with assigning predefined product category labels (at multiple levels) to product instances (e.g., an iPhone X is a 'SmartPhone' and also 'Electronics'). A training dataset containing 10K product offers, a validation set of 3K product offers and a test set of 3K product offers will be released. Each dataset contains product offers with their metadata (e.g., name, description, URL) and three classification labels, each corresponding to a level in the GS1 Global Product Classification taxonomy. The goal is to classify these product offers into the predefined category labels.
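As an illustration of the classification task, here is a minimal baseline sketch. It is not the challenge's utility code; the file name train.json and the field names name, description, and lvl1 are assumptions about the released format:

```
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed format: one JSON object per line with "name", "description"
# and a top-level GS1 category label "lvl1".
with open("train.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

texts = [f"{r['name']} {r.get('description', '')}" for r in records]
labels = [r["lvl1"] for r in records]

# TF-IDF over offer text plus a linear classifier as a simple baseline.
model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["apple iphone x 64gb smartphone"]))
```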
All datasets are built based on structured data that was extracted from the Common Crawl (https://commoncrawl.org/) by the Web Data Commons project (http://webdatacommons.org/). Datasets can be found at: https://ir-ischool-uos.github.io/mwpd/
The challenge will also release utility code (in Python) for processing the above datasets and scoring system outputs, as well as the following language resources for product-related data mining tasks: a text corpus of 150 million product offer descriptions, and word embeddings trained on that corpus.
For details of the challenge please visit https://ir-ischool-uos.github.io/mwpd/
Dr Ziqi Zhang (Information School, The University of Sheffield), Prof. Christian Bizer (Institute of Computer Science and Business Informatics, University of Mannheim), Dr Haiping Lu (Department of Computer Science, The University of Sheffield), Dr Jun Ma (Amazon Inc., Seattle, US), Prof. Paul Clough (Information School, The University of Sheffield & Peak Indicators), Ms Anna Primpeli (Institute of Computer Science and Business Informatics, University of Mannheim), Mr Ralph Peeters (Institute of Computer Science and Business Informatics, University of Mannheim), Mr Abdulkareem Alqusair (Information School, The University of Sheffield)
To contact the organising committee please use the Google discussion group https://groups.google.com/forum/#!forum/mwpd2020
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📊 Calling all data aficionados! 🚀 Just stumbled upon some juicy data that might tickle your fancy! If you find it helpful, a little upvote would be most appreciated! 🙌 #DataIsKing #KaggleCommunity 📈
Data Collection:
Activity Categories:
Data Description:
Data Transformation:
Availability:
Dataset Name: WISDM Smartphone and Smartwatch Activity and Biometrics Dataset
Subjects and Tasks:
Data Collection Setup:
Sensor Characteristics:
Device Specifications:
| Information | Details |
|---|---|
| Number of subjects | 51 |
| Number of activities | 18 |
| Minutes collected per activity | 3 |
| Sensor polling rate | 20 Hz |
| Smartphone used | Google Nexus 5/5X or Samsung Galaxy S5 |
| Smartwatch used | LG G Watch |
| Number of raw measurements | 15,630,426 |
| Activity | Activity Code |
|---|---|
| Walking | A |
| Jogging | B |
| Stairs | C |
| Sitting | D |
| Standing | E |
| Typing | F |
| Brushing Teeth | G |
| Eating Soup | H |
| Eating Chips | I |
| Eating Pasta | J |
| Drinking from Cup | K |
| Eating Sandwich | L |
| Kicking (Soccer Ball) | M |
| Playing Catch w/Tennis Ball | O |
| Dribbling (Basketball) | P |
| Writing | Q |
| Clapping | R |
| Folding Clothes | S |
Non-hand-oriented activities:
Hand-oriented activities (General):
Hand-oriented activities (eating):
| Field Name | Description |
|---|---|
| Subject-id | Type: Symbolic numeric identifier. Uniquely identifies the subject. Range: 1600-1650. |
| Activity code | Type: Symbolic single letter. Range: A-S (no "N" value) |
| Time... |
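A minimal loading sketch under stated assumptions: the raw files are comma-separated with the fields above followed by a timestamp and x/y/z sensor readings, and each line ends with a semicolon; the file path below is hypothetical.

```
import pandas as pd

columns = ["subject_id", "activity_code", "timestamp", "x", "y", "z"]

# Assumed layout of a raw accelerometer file (path is hypothetical).
df = pd.read_csv(
    "wisdm/raw/phone/accel/data_1600_accel_phone.txt",
    header=None,
    names=columns,
)

# The last field carries a trailing ";" in the raw files, so clean it up.
df["z"] = df["z"].astype(str).str.rstrip(";").astype(float)

# Example: count raw measurements per activity for this subject.
print(df.groupby("activity_code").size())
```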
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We identified clusters of multiple dimensions of poverty according to the capability approach theory by applying data mining approaches to the Cuatro Santos Health and Demographic Surveillance database, Nicaragua. Four municipalities in northern Nicaragua constitute the Cuatro Santos area, with 25,893 inhabitants in 5,966 households (2014). A local process analyzing poverty-related problems and prioritizing suggested actions was initiated in 1997 and generated a community action plan for 2002–2015. Interventions were school breakfasts, environmental protection, water and sanitation, preventive healthcare, home gardening, microcredit, technical training, university education stipends, and use of the Internet. In 2004, a survey of basic health and demographic information was performed in the whole population, followed by surveillance updates in 2007, 2009, and 2014 linking households and individuals. Information included the house material (floor, walls) and services (water, sanitation, electricity) as well as demographic data (births, deaths, migration). Data on participation in interventions, food security, household assets, and women's self-rated health were collected in 2014. A K-means algorithm was used to cluster the household data (56 variables) into six clusters. The poverty ranking of household clusters using the unsatisfied basic needs index variables changed when including variables describing basic capabilities. The households in the fairly rich cluster with assets such as motorbikes and computers were described as modern. Those in the fairly poor cluster, having different degrees of food insecurity, were labeled vulnerable. Poor and poorest clusters of households were traditional, e.g., in using horses for transport. Results displayed a society transforming from traditional to modern, where the forerunners were not the richest but were educated, had more working members in the household, had fewer children, and were food secure. Those lagging were the poor, traditional, and food insecure. The approach may be useful for an improved understanding of poverty and to direct local policy and interventions.
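A minimal sketch of the clustering step described above (not the study's code); the file name and preprocessing are assumptions, and the 56 household variables are taken to be numeric columns of a CSV:

```
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed input: one row per household, 56 variables as columns.
households = pd.read_csv("cuatro_santos_households_2014.csv")

X = StandardScaler().fit_transform(households.select_dtypes("number").fillna(0))

# Six clusters, as in the analysis described above.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
households["cluster"] = kmeans.fit_predict(X)

print(households["cluster"].value_counts())
```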
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For each connectome, we list its number of non-isolated nodes, N, its number of directed edges, E, its density ρ = E/[N(N − 1)], the features of the most compressing model for the connectome, its compressibility ΔL*, the difference in codelengths between the best models with and without motifs, ΔLmotifs, and the reference to the original publication of the dataset. The absolute compressibility ΔL* measures the number of bits that the shortest-codelength model compresses compared to a simple Erdős-Rényi model (Eq (15)). The difference in compression with and without motifs, ΔLmotifs, quantifies the significance of the inferred motif sets as the number of bits gained by the motif-based encoding compared to the optimal motif-free, dyadic model. For datasets where no motifs are found, this column is marked as “N/A”. All datasets are available at https://gitlab.pasteur.fr/sincobe/brain-motifs/-/tree/master/data.
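A small sketch of the density computation mentioned above for a directed connectome given as an edge list; the file name is hypothetical, and ΔL* and ΔLmotifs come from the model fits in the original paper, so they are not reproduced here:

```
# Density of a directed graph, rho = E / (N * (N - 1)), over non-isolated
# nodes, computed from a two-column edge list (source, target).
edges = []
with open("connectome_edges.txt", encoding="utf-8") as f:
    for line in f:
        src, dst = line.split()
        edges.append((src, dst))

nodes = {node for edge in edges for node in edge}  # non-isolated nodes only
N, E = len(nodes), len(edges)
rho = E / (N * (N - 1))
print(f"N={N}, E={E}, density={rho:.4f}")
```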