69 datasets found

USA Name Data
kaggle.com
zip
Updated Feb 12, 2019
Available download formats: zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset provided by
Data.gov (https://data.gov/)
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Context

Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

Content

This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

https://cloud.google.com/bigquery/public-data/usa-names

Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @dcp from Unsplash.

Inspiration

What are the most common names?

What are the most common female names?

Are there more female or male names?

Female names by a wide margin?
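
The questions above can be explored with a few lines of Python. The following is a minimal sketch, assuming each record has been reduced to a (name, gender, number) tuple; that row shape and the sample values are illustrative assumptions, not data from the dataset.

```python
from collections import Counter

# Hypothetical rows in the shape (name, gender, number); the values are made up.
rows = [
    ("Mary", "F", 120),
    ("James", "M", 150),
    ("Mary", "F", 80),
    ("Linda", "F", 90),
    ("James", "M", 30),
]


def most_common_names(rows, gender=None, top=3):
    """Sum the counts per name, optionally restricted to one gender."""
    totals = Counter()
    for name, row_gender, number in rows:
        if gender is None or row_gender == gender:
            totals[name] += number
    return totals.most_common(top)


def distinct_names_by_gender(rows):
    """Count how many distinct names appear for each gender."""
    names = {}
    for name, row_gender, _ in rows:
        names.setdefault(row_gender, set()).add(name)
    return {gender: len(found) for gender, found in names.items()}


print(most_common_names(rows))
print(most_common_names(rows, gender="F", top=1))
print(distinct_names_by_gender(rows))
```

Summing `number` answers "most common names"; counting distinct names per gender answers "are there more female or male names".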
Namesakes
figshare.com
json
Updated Nov 20, 2021
Cite
Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
Available download formats: json
Unique identifier
https://doi.org/10.6084/m9.figshare.17009105.v1
Dataset updated
Nov 20, 2021
Dataset provided by
figshare
Authors
Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

Methods

Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators and then passed through reconciliation with an experienced reconciliator.

The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.

Usage Notes

Entities:

File: Namesakes_entities.jsonl

The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'pagename': page name of the Wikipedia page.
Key 'pageid': page id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'url': URL of the Wikipedia page.
Key 'text': the text chunk from the Wikipedia page.
Key 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
  Key 'text': the mention as a string from the page text.
  Key 'start': start character position of the entity in the text.
  Key 'end': end (one-past-last) character position of the entity in the text.
  Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
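
As an illustration of the Entities layout described above, here is a minimal sketch that reads one jsonl line and groups the mentions by tag. The record is made up for the example (it is not taken from Namesakes_entities.jsonl), and the function name is hypothetical; only the key names come from the description.

```python
import json

# A made-up record that follows the documented key layout.
text = "Muir built a cabin. Another Muir lived nearby."
second = text.index("Muir", 1)  # start offset of the second mention
record = {
    "pagename": "Example_page",
    "pageid": 1,
    "title": "Example page",
    "url": "https://example.org/Example_page",
    "text": text,
    "entities": [
        {"text": "Muir", "start": 0, "end": 4, "tag": "Same"},
        {"text": "Muir", "start": second, "end": second + 4, "tag": "Other"},
    ],
}
line = json.dumps(record)


def mentions_by_tag(jsonl_line):
    """Group (start, mention) pairs of one Entities item by annotation tag."""
    item = json.loads(jsonl_line)
    grouped = {"Same": [], "Other": []}
    for entity in item["entities"]:
        # 'end' is one-past-last, so a plain slice recovers the mention string.
        span = item["text"][entity["start"]:entity["end"]]
        if span != entity["text"]:
            raise ValueError("offsets do not match the mention text")
        grouped[entity["tag"]].append((entity["start"], span))
    return grouped


print(mentions_by_tag(line))
```

Because `end` is one-past-last, the offsets can be used directly as Python slice bounds.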

News:

File: Namesakes_news.jsonl

The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'id_text': id of the sample.
Key 'text': the text chunk.
Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key 'entity': a dictionary describing the annotated entity mention in the text:
  Key 'text': the mention as a string found by an NER model in the text.
  Key 'start': start character position of the mention in the text.
  Key 'end': end (one-past-last) character position of the mention in the text.
  Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset. If so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    Key 'pageid': Wikipedia page id.
    Key 'pagetitle': page title.
    Key 'url': page URL.
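
Since the 'tag' key is present only for mentions that were linked to the Entities dataset, code reading News items has to treat it as optional. A minimal sketch, using two made-up items and a hypothetical helper name:

```python
def linked_pageid(news_item):
    """Return the Wikipedia page id of the annotated mention, or None if unlinked."""
    tag = news_item["entity"].get("tag")  # absent when the entity is not in Entities
    return None if tag is None else tag["pageid"]


# Made-up items that follow the documented key layout.
linked = {
    "id_text": "sample-1",
    "text": "Muir spoke at the event.",
    "urls": ["https://example.org/A", "https://example.org/B"],
    "entity": {
        "text": "Muir", "start": 0, "end": 4,
        "tag": {"pageid": 123, "pagetitle": "A", "url": "https://example.org/A"},
    },
}
unlinked = {
    "id_text": "sample-2",
    "text": "Muir spoke at the event.",
    "urls": ["https://example.org/A"],
    "entity": {"text": "Muir", "start": 0, "end": 4},
}

print(linked_pageid(linked), linked_pageid(unlinked))
```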

Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
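
The bracket convention above can be parsed with a regular expression. This is a sketch under the assumption that entity names and mention strings never contain square brackets themselves and that entity names never contain '|'; the pattern and function names are illustrative.

```python
import re

# [[Entity name]] or [[Entity name | mention string]]
MENTION = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")


def extract_mentions(content):
    """Return (entity name, mention string) pairs found in a backlink text."""
    pairs = []
    for match in MENTION.finditer(content):
        entity = match.group(1).strip()
        # Without a '|' part, the mention string equals the entity name.
        surface = (match.group(2) or entity).strip()
        pairs.append((entity, surface))
    return pairs


def plain_text(content):
    """Replace every bracketed mention by its mention string."""
    return MENTION.sub(lambda m: (m.group(2) or m.group(1)).strip(), content)


first = "Muir built a small cabin along [[Yosemite Creek]]."
second = ("Muir also spent time with photographer "
          "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.")
print(extract_mentions(first))
print(extract_mentions(second))
print(plain_text(second))
```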

The Entity-to-Backlinks is a jsonl with 1527 items.
File: Namesakes_backlinks_entities.jsonl
Each item is a tuple:
  Entity name.
  Entity Wikipedia page id.
  Backlinks ids: a list of pageids of backlink documents.

The Backlinks documents is a jsonl with 26903 items.
File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
Key 'pageid': id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'content': text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
  Entity name.
  Entity Wikipedia page id.
  Sorted list of all character indexes at which the mention occurrences start in the text.
Race and ethnicity data for first, middle, and last names
search.dataone.org
Updated Nov 8, 2023
Cite
Rosenman, Evan; Olivella, Santiago; Imai, Kosuke (2023). Race and ethnicity data for first, middle, and last names [Dataset]. http://doi.org/10.7910/DVN/SGKW0K
Unique identifier
https://doi.org/10.7910/DVN/SGKW0K
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Rosenman, Evan; Olivella, Santiago; Imai, Kosuke
Description
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states (Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina) that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, while the middle name dataset contains 126K middle names and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
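
To make the two kinds of probability tables concrete, the sketch below derives P(race | name) and P(name | race) from a small table of joint counts. The counts are made up, and plain relative frequencies are an assumption for illustration; the description does not state how the published probabilities were estimated.

```python
from collections import defaultdict

# Made-up (name, race) counts; NOT taken from the dataset.
counts = {
    ("NAME_A", "White"): 70,
    ("NAME_A", "Black"): 30,
    ("NAME_B", "Hispanic"): 90,
    ("NAME_B", "White"): 10,
}


def conditional_tables(counts):
    """Return (P(race | name), P(name | race)) as dicts keyed by (name, race)."""
    name_totals = defaultdict(int)
    race_totals = defaultdict(int)
    for (name, race), n in counts.items():
        name_totals[name] += n
        race_totals[race] += n
    # Normalise each joint count by its row total or its column total.
    p_race_given_name = {key: n / name_totals[key[0]] for key, n in counts.items()}
    p_name_given_race = {key: n / race_totals[key[1]] for key, n in counts.items()}
    return p_race_given_name, p_name_given_race


p_race_given_name, p_name_given_race = conditional_tables(counts)
print(p_race_given_name[("NAME_A", "White")])  # share of NAME_A holders who are White
print(p_name_given_race[("NAME_A", "White")])  # share of White individuals named NAME_A
```

The same joint count is divided by a different total in each table, which is why the two probabilities for a given (name, race) pair generally differ.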
Dataset of books called Selected : why some people lead, why others follow,...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called Selected : why some people lead, why others follow, and why it matters [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Selected+%3A+why+some+people+lead%2C+why+others+follow%2C+and+why+it+matters
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is Selected : why some people lead, why others follow, and why it matters. It features 7 columns including author, publication date, language, and book publisher.
GENTER Dataset
paperswithcode.com
Updated Feb 25, 2025
Cite
Jonathan Drechsel; Steffen Herbold (2025). GENTER Dataset [Dataset]. https://paperswithcode.com/dataset/genter
Dataset updated
Feb 25, 2025
Authors
Jonathan Drechsel; Steffen Herbold
Description
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.,
[NAME] asked , not sounding as if [PRONOUN] cared about the answer .
after all , [NAME] was the same as [PRONOUN] 'd always been .
there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .

Usage

```python
from datasets import load_dataset

split = 'all'
genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)
```

split can be either train, val, test, or all.

Dataset Details

Dataset Description

This dataset is a filtered version of BookCorpus that includes only sentences where a first name is followed by its correct third-person singular pronoun (he/she).

From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.

Dataset Sources

Repository: github.com/aieng-lab/gradiend
Original Data: BookCorpus

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.

Dataset Structure

text: the original entry of BookCorpus
masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
label: the gender of the original used name (F for female and M for male)
name: the original name in text that is masked in masked as [NAME]
pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
index: the index of text in BookCorpus

Examples:

index | text | masked | label | name | pronoun | pronoun_count
------|------|--------|-------|------|---------|--------------
71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1
17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1
41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1
52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1
47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2
73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1
31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1
61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1
68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3
28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1
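
The masked field is meant to be filled with a name and a pronoun. A minimal sketch of that substitution, using one of the example rows; the helper name is hypothetical and plain string replacement is an assumption about how the templates are intended to be used.

```python
def fill_template(masked, name, pronoun):
    """Fill every [NAME] and [PRONOUN] slot of a masked sentence."""
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)


masked = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."

# Factual pairing (reproduces the original text field of that row).
print(fill_template(masked, "jessica", "she"))
# Counterfactual pairing: same name, opposite pronoun.
print(fill_template(masked, "jessica", "he"))
```

Templates with pronoun_count greater than 1 are handled by the same call, because `str.replace` substitutes all occurrences.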

Dataset Creation

Curation Rationale

Training a gender-bias GRADIEND model requires a diverse dataset that associates first names with both their factual and counterfactual pronouns, so that gender-related gradient information can be assessed.

Source Data

The dataset is derived from BookCorpus by filtering it and extracting the template structure.

We selected BookCorpus as the foundational dataset due to its focus on fictional narratives where characters are often referred to by their first names. In contrast, the English Wikipedia, also commonly used for the training of transformer models, was less suitable for our purposes. For instance, sentences like "[NAME] Jackson was a musician, [PRONOUN] was a great singer" may be biased towards the name Michael.

Data Collection and Processing

We filter the entries of BookCorpus and include only sentences that meet the following criteria:

Each sentence contains at least 50 characters.
Exactly one name of aieng-lab/namexact is contained, ensuring a correct name match.
No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
The correct name's gender-specific third-person pronoun (he or she) is included at least once.
All occurrences of the pronoun appear after the name in the sentence.
The counterfactual pronoun does not appear in the sentence.
The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
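
The criteria above can be sketched as a single predicate. This is an illustration only: the tiny word lists stand in for aieng-lab/namexact, aieng-lab/namextend and the gendered-word dataset (none of which are loaded here), whitespace tokenisation is an assumption, and the function is not the authors' implementation.

```python
# Stand-in resources; the real lists are far larger.
NAMEXACT = {"jessica": "F", "jeremy": "M"}   # name -> gender
NAMEXTEND = {"jessica", "jeremy", "alex"}    # larger name list
GENDERED_WORDS = {"actor", "actress"}
EXCLUDED_PRONOUNS = {"himself", "herself", "his", "her", "him", "hers"}
PRONOUN = {"F": "she", "M": "he"}


def keep_sentence(sentence):
    """Return True if the sentence passes all filtering criteria of the sketch."""
    if len(sentence) < 50:
        return False
    tokens = sentence.split()
    exact = [t for t in tokens if t in NAMEXACT]
    if len(exact) != 1:                       # exactly one namexact name
        return False
    name = exact[0]
    if any(t in NAMEXTEND and t != name for t in tokens):
        return False                          # no second name from the larger list
    correct = PRONOUN[NAMEXACT[name]]
    counterfactual = "he" if correct == "she" else "she"
    if correct not in tokens or counterfactual in tokens:
        return False
    name_pos = tokens.index(name)
    if any(t == correct and i < name_pos for i, t in enumerate(tokens)):
        return False                          # pronoun must follow the name
    if any(t in EXCLUDED_PRONOUNS or t in GENDERED_WORDS for t in tokens):
        return False
    return True


print(keep_sentence("jessica asked , not sounding as if she cared about the answer ."))
```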

This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.

The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.
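
For orientation, the stated proportions can be turned into index lists as follows. Only the three fractions and the total of 27031 sentences come from the description; the shuffling, the seed and the rounding rule are assumptions, so the resulting sizes need not match the published splits exactly.

```python
import random


def split_indices(n, val_fraction=0.025, test_fraction=0.10, seed=0):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = round(n * val_fraction)
    n_test = round(n * test_fraction)
    n_train = n - n_val - n_test  # the remainder (about 87.5%) goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(27031)
print(len(train), len(val), len(test))
```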

Bias, Risks, and Limitations

Because BookCorpus is lower-cased, the dataset contains only lower-case sentences.
Baby Names from Social Security Card Applications - National Data
catalog.data.gov
data.amerigeoss.org
Updated May 5, 2022
Cite
Social Security Administration (2022). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Dataset updated
May 5, 2022
Dataset provided by
Social Security Administration (http://ssa.gov/)
Description
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
Facebook Names Dataset
academictorrents.com
bittorrent
Updated Nov 11, 2015
Cite
Ron Bowes (Skull Security) (2015). Facebook Names Dataset [Dataset]. https://academictorrents.com/details/e54c73099d291605e7579b90838c2cd86a8e9575
Available download formats: bittorrent (2991052604)
Dataset updated
Nov 11, 2015
Dataset authored and provided by
Ron Bowes (Skull Security)
License
https://academictorrents.com/nolicensespecified
Description
171 million names (100 million unique)

This torrent contains:
The URL of every searchable Facebook user's profile
The name of every searchable Facebook user, both unique and by count (perfect for post-processing, datamining, etc)
Processed lists, including first names with count, last names with count, potential usernames with count, etc
The programs I used to generate everything

So, there you have it: lots of awesome data from Facebook. Now, I just have to find one more problem with Facebook so I can write "Revenge of the Facebook Snatchers" and complete the trilogy. Any suggestions? >:-)

Limitations

So far, I have only indexed the searchable users, not their friends. Getting their friends will be significantly more data to process, and I don't have those capabilities right now. I'd like to tackle that in the future, though, so if anybody has any bandwidth they'd like to donate, all I need is an ssh account and Nmap installed. An additional limitation is that these are on
Labelled FHYA Dataset
zivahub.uct.ac.za
txt
Updated Feb 2, 2022
Cite
Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
Available download formats: txt
Unique identifier
https://doi.org/10.25375/uct.19029692.v1
Dataset updated
Feb 2, 2022
Dataset provided by
University of Cape Town
Authors
Jarryd Dunn
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms, as well as the corresponding entity descriptions for each of the datasets.

The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields:
id: Document id
ner_tags: List of IOB tags indicating mention boundaries, based on the majority label assigned using crowdsourcing.
el_tags: List of entity ids based on the majority label assigned using crowdsourcing.
all_ner_tags: List of lists of IOB tags assigned by each of the users.
all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
tokens: List of tokens from the text.

The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag and the entity id, separated by tabs.

While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets, which take the same form as experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.

Each of the documents described above contains an entity id. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form:
{ID}${Mention}${Context}[N]

Three sets of entity descriptions are available:
1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity IDs which actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments; however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
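
The relation between the per-annotator lists (all_ner_tags, all_el_tags) and the majority lists (ner_tags, el_tags) can be sketched as a per-token vote. The tie-breaking rule below (the first-seen label wins) is an assumption, as the description does not say how ties were resolved, and the sample tags are made up.

```python
from collections import Counter


def majority_labels(all_labels):
    """Per-token majority over several annotators' label lists of equal length."""
    merged = []
    for token_labels in zip(*all_labels):
        # most_common keeps first-seen order among equally frequent labels.
        label, _ = Counter(token_labels).most_common(1)[0]
        merged.append(label)
    return merged


# Made-up IOB tags from three annotators for a three-token document.
all_ner_tags = [
    ["B", "I", "O"],
    ["B", "O", "O"],
    ["O", "I", "O"],
]
print(majority_labels(all_ner_tags))
```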
People Data Labs - Person Dataset
datarade.ai
.json, .csv
Updated Jul 6, 2020
Cite
People Data Labs (2020). People Data Labs - Person Dataset [Dataset]. https://datarade.ai/data-products/global-license
Available download formats: .json, .csv
Dataset updated
Jul 6, 2020
Dataset provided by
People Data Labs Inc.
Authors
People Data Labs
Area covered
Kenya, Afghanistan, United States of America, Wallis and Futuna, Tunisia, Antarctica, Guinea, Anguilla, Bosnia and Herzegovina, Maldives
Description
People Data Labs is an aggregator of B2B person and company data. We source our globally compliant person dataset via our "Data Union".

The "Data Union" is our proprietary data sharing co-op. Customers opt-in to sharing their data and warrant that their data is fully compliant with global data privacy regulations. Some data sources are provided as a one time dump, others are refreshed every time we do a new data build. Our data sources come from a variety of verticals including HR Tech, Real Estate Tech, Identity/Anti-Fraud, Martech, and others. People Data Labs works with customers on compliance based topics. If a customer wishes to ensure anonymity, we work with them to anonymize the data.

Our person data has over 100 fields including resume data (work history, education), contact information (email, phone), demographic info (name, gender, birth date) and social profile information (linkedin, github, twitter, facebook, etc...).
Dataset of books called The other Schindlers : why some people chose to save...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The other Schindlers : why some people chose to save Jews in the Holocaust [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+other+Schindlers+%3A+why+some+people+chose+to+save+Jews+in+the+Holocaust
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is The other Schindlers : why some people chose to save Jews in the Holocaust. It features 7 columns including author, publication date, language, and book publisher.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Cite
Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Campos Arias (repository creator)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
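Because the rows are ordered by class, any train/test cut taken without shuffling would be heavily skewed. A minimal sketch of a reproducible shuffle, using a few made-up rows in place of data_rt.csv; the seed and the helper name are assumptions.

```python
import random

# Made-up stand-in for the file: all negatives (0) first, then all positives (1).
rows = [("rotten review", 0)] * 3 + [("fresh review", 1)] * 3


def shuffled(rows, seed=42):
    """Return a shuffled copy of the rows; the input list is left untouched."""
    copy = list(rows)
    random.Random(seed).shuffle(copy)
    return copy


mixed = shuffled(rows)
print([label for _, label in mixed])
```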
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
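A categorical label such as division can be derived from a polarity score with simple thresholds. The sketch below assumes a score in [-1, 1] and a three-way positive/neutral/negative scheme; the thresholds and label names are assumptions, since the description does not state how division was computed.

```python
def division(polarity, neutral_band=0.0):
    """Map a polarity score to a categorical label (hypothetical thresholds)."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


for score in (0.8, 0.0, -0.35):
    print(score, division(score))
```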
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews, with star ratings, for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research): AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)
United States Census
kaggle.com
zip
Updated Apr 17, 2018
Cite
US Census Bureau (2018). United States Census [Dataset]. https://www.kaggle.com/census/census-bureau-usa
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Apr 17, 2018
Dataset provided by
United States Census Bureau (http://census.gov/)
Authors
US Census Bureau
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
United States
Description
Context

The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers."
Source: https://en.wikipedia.org/wiki/United_States_Census

Content

The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident of the US. The census occurs every 10 years and is conducted by the United States Census Bureau. Census data is publicly available through the census website, but much of it is presented as summarized data and graphs. The raw data is often difficult to obtain, is typically divided by region, and must be processed and combined to provide information about the nation as a whole.

The United States census dataset includes nationwide population counts from the 2000 and 2010 censuses. Data is broken out by gender, age and location using ZIP Code Tabulation Areas (ZCTAs) and GEOIDs. ZCTAs are generalized representations of zip codes, and often, though not always, are the same as the zip code for an area. GEOIDs are numeric codes that uniquely identify all administrative, legal, and statistical geographic areas for which the Census Bureau tabulates data. GEOIDs are useful for correlating census data with other censuses and surveys.

Fork this kernel to get started.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:census_bureau_usa

https://cloud.google.com/bigquery/public-data/us-census

Dataset Source: United States Census Bureau

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by Steve Richey from Unsplash.

Inspiration

What are the ten most populous zip codes in the US in the 2010 census?

What are the top 10 zip codes that experienced the greatest change in population between the 2000 and 2010 censuses?
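Both questions reduce to sorting per-zip-code totals. The sketch below assumes the counts have already been summed into one dictionary per census, mapping zip code to total population; the function names and the figures are made up for illustration and do not come from the dataset.

```python
def most_populous(populations, n=10):
    """Return the n (zip_code, population) pairs with the largest population."""
    return sorted(populations.items(), key=lambda kv: kv[1], reverse=True)[:n]


def largest_changes(pop_2000, pop_2010, n=10):
    """Return the n zip codes with the largest absolute population change."""
    # Only areas present in both censuses can be compared.
    shared = pop_2000.keys() & pop_2010.keys()
    changes = {z: pop_2010[z] - pop_2000[z] for z in shared}
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]


# Made-up figures, for illustration only.
pop_2000 = {"00001": 1000, "00002": 5000, "00003": 300}
pop_2010 = {"00001": 1500, "00002": 4000, "00003": 320, "00004": 50}

top = most_populous(pop_2010, n=2)
changed = largest_changes(pop_2000, pop_2010, n=2)
```

Ranking by absolute change treats growth and decline alike; ranking by the signed or relative change would answer a slightly different question.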

https://cloud.google.com/bigquery/images/census-population-map.png
Modern China Geospatial Database - Main Dataset
data.niaid.nih.gov
zenodo.org
Updated Feb 28, 2025
Cite
Christian Henriot (2025). Modern China Geospatial Database - Main Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5735393
Explore at:
Dataset updated
Feb 28, 2025
Dataset authored and provided by
Christian Henriot
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China
Description
MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21)

You can also access this dataset, and all the datasets that ENP-China makes available, on GitLab: https://gitlab.com/enpchina/IndexesEnp

Altogether there are 464,970 entries. The data include the name of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; Province ID; the latitude and longitude; the Name ID and Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits. This is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; those that have only digits (8 digits) are data points we have added from various map sources.
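The ID conventions described above can be checked mechanically. The following is a small sketch, not code from the MCGD project: the function names and the sample IDs are hypothetical, and only the prefix rules come from the description.

```python
import re

# Name IDs: the letter H followed by exactly seven digits.
NAME_ID_PATTERN = re.compile(r"H\d{7}")


def is_valid_name_id(name_id):
    """Check a Name ID against the 'H + seven digits' convention."""
    return NAME_ID_PATTERN.fullmatch(name_id) is not None


def location_source(location_id):
    """Guess the origin of a Location ID from its prefix."""
    # "DH" must be tested before "D", since every "DH" ID also starts with "D".
    if location_id.startswith("DH"):
        return "China Historical GIS"
    if location_id.startswith("D"):
        return "Geonames"
    if re.fullmatch(r"\d{8}", location_id):
        return "map sources"
    return "unknown"


# Hypothetical IDs, used only to exercise each branch.
examples = ["DH12345", "D678910", "12345678", "X1"]
sources = [location_source(location_id) for location_id in examples]
```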

One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.

From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.

UPDATES

MCGD_Data2025_02_28 includes a major change with the duplication of all the locations listed under Beijing, Shanghai, Tianjin, and Chongqing (北京, 上海, 天津, 重慶) and their listing under the name of the provinces to which they originally belonged before the creation of the four special municipalities after 1949. This is meant to facilitate the matching of data from historical sources. Each location has a unique NameID. Altogether there are 472,818 entries.

MCGD_Data2025_02_27 includes an update on locations extracted from Minguo zhengfu ge yuanhui keyuan yishang zhiyuanlu 國民政府各院部會科員以上職員錄 (Directory of staff members and above in the ministries and committees of the National Government). Nanjing: Guomin zhengfu wenguanchu yinzhuju 國民政府文官處印鑄局, 1944. We also made corrections in the Prov_Py and Prov_Zh columns, as there were some misalignments between the pinyin name and the name in Chinese characters. The file now includes 465,128 entries.

MCGD_Data2024_03_23 includes an update on locations in Taiwan from the Asia Directories. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat Long columns as "Unknown").

MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat Long columns as "Unknown"). The dataset also includes locations outside of China for the purpose of matching such locations to the place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we made the decision to incorporate all the place names found in historical sources in the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.
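The filtering described above can be sketched as follows, assuming the file has been read into a list of dictionaries (for example with csv.DictReader). The column names Prov_Py, Lat, and Long are assumptions based on the names mentioned in the update notes and may differ in the actual file; the sample rows are invented.

```python
def split_by_province(rows, province_key="Prov_Py"):
    """Separate rows with a province value from rows where it is missing."""
    with_province, without_province = [], []
    for row in rows:
        value = (row.get(province_key) or "").strip()
        (with_province if value else without_province).append(row)
    return with_province, without_province


def has_coordinates(row, lat_key="Lat", long_key="Long"):
    """True unless either coordinate carries the 'Unknown' placeholder."""
    return row.get(lat_key) != "Unknown" and row.get(long_key) != "Unknown"


# Invented rows, for illustration only.
rows = [
    {"Name": "place A", "Prov_Py": "Jiangsu", "Lat": "32.0", "Long": "118.8"},
    {"Name": "place B", "Prov_Py": "", "Lat": "48.9", "Long": "2.3"},
    {"Name": "place C", "Prov_Py": "Hebei", "Lat": "Unknown", "Long": "Unknown"},
]
in_china, elsewhere = split_by_province(rows)
located = [row["Name"] for row in in_china if has_coordinates(row)]
```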
azerbaijani-ner-dataset
huggingface.co
Updated Jun 13, 2024
Cite
LocalDoc (2024). azerbaijani-ner-dataset [Dataset]. http://doi.org/10.57967/hf/2484
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57967/hf/2484
Dataset updated
Jun 13, 2024
Dataset authored and provided by
LocalDoc
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Azerbaijani Named Entity Recognition (NER) Dataset

This repository contains the dataset for training and evaluating Named Entity Recognition (NER) models in the Azerbaijani language. The dataset includes annotated text data with various named entities.

Dataset Description

The dataset includes the following entity types:

0: O: Outside any named entity
1: PERSON: Names of individuals
2: LOCATION: Geographical locations, both man-made and natural
3: ORGANISATION: Names of…

See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset.

Hate Crimes

kaggle.com

Updated Jul 7, 2024

Cite

Melissa Monfared (2024). Hate Crimes [Dataset]. https://www.kaggle.com/datasets/melissamonfared/hate-crimes

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Jul 7, 2024

Dataset provided by

Kaggle

Authors

Melissa Monfared

Description

Overview:

This dataset contains detailed information on cases where a hate or bias crime has been reported to the Bloomington Police Department. Hate crimes are criminal offenses motivated by bias against race, religion, ethnicity, sexual orientation, gender identity, or other protected characteristics. This dataset provides insights into the nature and demographics of hate crimes in Bloomington, aiding in understanding and addressing these incidents.

Dataset Details:

The dataset includes the following columns:

Column Name	Description	API Field Name	Data Type
case_number	Case Number	case_number	Text
date	Date	date	Floating Timestamp
weekday	Day of Week	day_of_week	Text
victims	Total Number of Victims	victims	Number
victim_race	Victim Race	victim_race	Text
victim_gender	Victim Gender	victim_gender	Text
victim_type	Victim Type	victim_type	Text
offenders	Total Number of Offenders	offenders	Number
offender_race	Offender Race	offender_race	Text
offender_gender	Offender Gender	offender_gender	Text
offense	Offense / Crime	offense	Text
location_type	Offense / Crime Location Type	location_type	Text
motivation	Offense/Crime Bias Motivation	motivation	Text

Key Features:

Comprehensive Crime Data: Provides detailed information on hate crimes, including demographics of victims and offenders, types of offenses, and bias motivations.
Temporal Analysis: Includes timestamps for each incident, allowing for analysis of trends over time.
Demographic Insights: Offers data on race and gender of both victims and offenders, helping to identify patterns and target interventions.
Location Information: Contains details about the type of location where the offense occurred, useful for spatial analysis and preventive measures.
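As a starting point for the temporal and demographic analyses mentioned above, incidents can be tallied per year and per column value. This sketch uses the column names from the table; the sample rows are invented, and the date format (ISO 8601 text beginning with the year) is an assumption about how the Floating Timestamp is exported.

```python
import csv
import io
from collections import Counter


def count_by(rows, column):
    """Count how many incidents share each value of the given column."""
    return Counter(row[column] for row in rows)


def count_by_year(rows, date_column="date"):
    """Count incidents per year, assuming dates start with a four-digit year."""
    return Counter(row[date_column][:4] for row in rows)


# Invented rows with placeholder values, for illustration only.
sample = io.StringIO(
    "case_number,date,weekday,location_type,motivation\n"
    "C-1,2021-03-14T00:00:00,Sunday,type_a,motivation_a\n"
    "C-2,2021-07-02T00:00:00,Friday,type_b,motivation_a\n"
    "C-3,2022-01-20T00:00:00,Thursday,type_a,motivation_b\n"
)
rows = list(csv.DictReader(sample))
per_year = count_by_year(rows)
per_motivation = count_by(rows, "motivation")
```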

Usage:

This dataset can be used for:

Crime Analysis: Analyzing trends and patterns in hate crimes to inform law enforcement strategies and policies.
Community Safety: Identifying high-risk areas and times to improve community policing and preventive measures.
Research and Advocacy: Supporting academic research and advocacy efforts focused on combating hate crimes and promoting social justice.
Policy Development: Assisting policymakers in developing targeted initiatives to reduce hate crimes and support affected communities.

Data Maintenance:

Last Updated: July 7, 2024
Source: Bloomington Police Department Data Portal
Revisions: The dataset is updated annually to ensure the inclusion of the latest incidents and to maintain data accuracy. Historical data is preserved to support long-term analyses.

Additional Notes

Data Accuracy: The Bloomington Police Department strives for accuracy in open data; however, errors may occur due to the nature of data collection from multiple sources.
Data Interpretation: Users should be aware that the dataset may change over time as new information becomes available or corrections are made.
Race and District Codes: The dataset uses specific codes for race and reading districts, which are detailed in the accompanying documentation to ensure proper interpretation.
License: Open Data Commons Public Domain Dedication and License

‘Austin's data portal activity metrics’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Austin's data portal activity metrics’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-austin-s-data-portal-activity-metrics-1ce3/latest
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Austin's data portal activity metrics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/data-portal-activity-metricse on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Background

Austin's open data portal provides lots of public data about the City of Austin. It also provides portal administrators with behind-the-scenes information about how the portal is used... but that data is mysterious, hard to handle in a spreadsheet, and not located all in one place.

Until now! Authorized city staff used admin credentials to grab this usage data and share it with the public. The City of Austin wants to use this data to inform the development of its open data initiative and manage the open data portal more effectively.

This project contains related datasets for anyone to explore. These include site-level metrics, dataset-level metrics, and department information for context. A detailed description of how the files were prepared (along with code) can be found on GitHub.

Example questions to answer about the data portal

What parts of the open data portal do people seem to value most?

What can we tell about who our users are?

How are our data publishers doing?

How much data is published programmatically vs manually?

What data is super fresh? Super stale?

Whatever you think we should know...

About the files

all_views_20161003.csv

There is a resource available to portal administrators called "Dataset of datasets". This is the export of that resource, and it was captured on Oct 3, 2016. It contains a summary of the assets available on the data portal. While this file contains over 1400 resources (such as views, charts, and binary files), only 363 are actual tabular datasets.

table_metrics_ytd.csv

This file contains information about the 363 tabular datasets on the portal. Activity metrics for an individual dataset can be accessed by calling Socrata's views/metrics API and passing along the dataset's unique ID, a time frame, and admin credentials. The process of obtaining the 363 identifiers, calling the API, and staging the information can be reviewed in the python notebook here.

site_metrics.csv

This file is the export of site-level stats that Socrata generates using a given time frame and grouping preference. This file contains records about site usage each month from Nov 2011 through Sept 2016. By the way, it contains 285 columns... and we don't know what many of them mean. But we are determined to find out!! For a preliminary exploration of the columns and the portal-related business processes to which they might relate, check out the notes in the accompanying Python notebook.

city_departments_in_current_budget.csv

This file contains a list of all City of Austin departments according to how they're identified in the most recently approved budget documents. Could be helpful for getting to know more about who the publishers are.

crosswalk_to_budget_dept.csv

The City is in the process of standardizing how departments identify themselves on the data portal. In the meantime, here's a crosswalk from the department values observed in all_views_20161003.csv to the department names that appear in the City's budget.

This dataset was created by Hailey Pate and contains around 100 samples along with Di Sync Success, Browser Firefox 19, technical information and other features such as: - Browser Firefox 33 - Di Sync Failed - and more.

How to use this dataset

Analyze Sf Query Error User in relation to Js Page View Admin

Study the influence of Browser Firefox 37 on Datasets Created

Acknowledgements

If you use this dataset in your research, please credit Hailey Pate

--- Original source retains full ownership of the source dataset ---
Notices of Name Changes
data.ontario.ca
Updated Dec 9, 2021
Cite
Government and Consumer Services (2021). Notices of Name Changes [Dataset]. https://data.ontario.ca/dataset/notices-of-name-changes
Explore at:
Available download formats: (None)
Dataset updated
Dec 9, 2021
Dataset authored and provided by
Government and Consumer Services
License
https://www.ontario.ca/page/copyright-information
Time period covered
Oct 5, 2016
Area covered
Ontario
Description
This dataset contains a listing of individuals who have had their name formally changed in Ontario.

This data is made publicly available through the Ontario Gazette.
Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...
data.niaid.nih.gov
Updated Jan 6, 2023
Cite
Mihael Mohorčič (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
Explore at:
Dataset updated
Jan 6, 2023
Dataset provided by
Mihael Mohorčič
Andrej Hrovat
Aleš Simončič
Miha Mohorčič
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

The 802.11 standard includes several management features and corresponding frame types. One of these is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.

This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

Related dataset

The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which uses the same data layout and recording equipment.

Measurement setup

The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

The following information about each received PR is collected:
- MAC address
- Supported data rates
- Extended supported rates
- HT capabilities
- Extended capabilities
- Data under extended tag and vendor specific tag
- Interworking
- VHT capabilities
- RSSI
- SSID
- Timestamp when PR was received

The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

Data preprocessing

The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:

PR_IE_data = {
    'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
    'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1 ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2 ...}
        ...
    }
}

Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
IE fields that are missing from a captured PR are not included in PR_IE_data.

When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

where PR_data is structured as follows:

{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

This data structure makes it possible to store only 'TIME' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
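The merging rule described above can be sketched in a few lines. This is an illustration of the described procedure, not the authors' script: the function name, the dictionary keyed by MAC address, and the choice to store each SSID only once are assumptions, and the sample values are invented.

```python
def add_probe_request(store, mac, ssid, time, rssi, ie_data):
    """Merge one probe request into the store for the current scan interval."""
    entry = store.setdefault(mac, {"MAC": mac, "SSIDs": [], "PROBE_REQs": []})
    # Assumption: each SSID is recorded once per MAC address.
    if ssid not in entry["SSIDs"]:
        entry["SSIDs"].append(ssid)
    for pr_data in entry["PROBE_REQs"]:
        if pr_data["DATA"] == ie_data:
            # Identical IE data already stored: append only time and RSSI.
            pr_data["TIME"].append(time)
            pr_data["RSSI"].append(rssi)
            return
    # First PR with this IE data for this MAC: store a new PR_data structure.
    entry["PROBE_REQs"].append({"TIME": [time], "RSSI": [rssi], "DATA": ie_data})


# Invented values, for illustration only.
store = {}
ie_a = {"HT_CAP": "2d1a", "DATA_RTS": {"SUPP": [2, 4], "EXT": [12]}}
ie_b = {"HT_CAP": "ffff"}
add_probe_request(store, "aa:bb:cc:dd:ee:ff", "net1", 100.0, -60, ie_a)
add_probe_request(store, "aa:bb:cc:dd:ee:ff", "net2", 101.5, -62, ie_a)
add_probe_request(store, "aa:bb:cc:dd:ee:ff", "net1", 103.0, -70, ie_b)
```

After these three calls the single MAC entry holds two SSIDs and two PR_data structures, the first of which carries two timestamps and two RSSI values.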

At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.

Folder structure

For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device.

The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.

The files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
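The folder names can be converted back to timestamps with the standard library. In the sketch below, the function name is hypothetical and the fixed UTC+2 offset is an assumption that matches the example above (22:00 UTC corresponding to 00:00 local time).

```python
from datetime import datetime, timedelta, timezone

FOLDER_TIME_FORMAT = "%Y-%m-%dT%H-%M-%S"
# Assumption: local time is UTC+2 during the recording period.
LOCAL_TZ = timezone(timedelta(hours=2))


def parse_folder_name(name):
    """Split 'start_end' and return both as timezone-aware UTC datetimes."""
    start_text, end_text = name.split("_")
    start = datetime.strptime(start_text, FOLDER_TIME_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_text, FOLDER_TIME_FORMAT).replace(tzinfo=timezone.utc)
    return start, end


start, end = parse_folder_name("2022-09-22T22-00-00_2022-09-23T22-00-00")
local_start = start.astimezone(LOCAL_TZ)
local_end = end.astimezone(LOCAL_TZ)
```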

Environments description

The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongle) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correctness of installation and data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania

Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

Known dataset shortcomings

Due to technical and physical limitations, the dataset contains some identified deficiencies.

PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.

The devices had a scheduled reboot at 4:00 each day, which appears as missing data of up to a few minutes.
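Before analysis, it can help to locate these holes explicitly. The sketch below assumes each chunk's scan start and end timestamps have already been extracted into (start, end) pairs in seconds; the function name, the tolerance, and the sample values are illustrative assumptions.

```python
def find_gaps(intervals, tolerance=1.0):
    """Return (gap_start, gap_end) pairs where consecutive scans leave
    more than `tolerance` seconds uncovered."""
    ordered = sorted(intervals)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > tolerance:
            gaps.append((prev_end, next_start))
    return gaps


# Invented scan intervals in seconds: a short gap, then a longer one.
scans = [(0, 10), (10, 20), (23, 33), (33, 43), (60, 70)]
gaps = find_gaps(scans)
```

With these sample intervals, the function reports one 3-second gap and one 17-second gap; a larger tolerance would keep only the longer outages such as restarts and reboots.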

Location 1 - Piazza del Duomo - Chierici

The gateway device (RPi) is located on the second floor balcony and is hardwired to the Ethernet port. This device appears to function stably throughout the data collection period. Its location is constant and undisturbed, and its data seems to have complete coverage.

Location 2 - Via Etnea - Piazza del Duomo

The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.

Location 3 - Via Etnea - Piazza Università

Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present. This device appears to have been collecting data throughout the whole dataset period.

Location 4 - Piazza Università

The device at this location is connected wirelessly to the access point. It was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment. The internet connection was also interrupted sporadically.

Recognitions

The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
How Fast Can I Get Help by Calling United Airlines Support? Dataset
paperswithcode.com
Updated Jun 23, 2025
Cite
(2025). How Fast Can I Get Help by Calling United Airlines Support? Dataset [Dataset]. https://paperswithcode.com/dataset/how-fast-can-i-get-help-by-calling-united
Explore at:
Dataset updated
Jun 23, 2025
Description
☎️+1(888) 642-5075 When you need quick assistance, calling United Airlines support at ☎️+1(888) 642-5075 is the most direct way to get help fast. ☎️+1(888) 642-5075 This official number connects you to knowledgeable agents trained to resolve issues promptly. ☎️+1(888) 642-5075 Dialing ☎️+1(888) 642-5075 puts you in touch with real people ready to assist immediately.

The speed of help you receive after calling ☎️+1(888) 642-5075 depends on several factors like call volume and time of day. ☎️+1(888) 642-5075 Off-peak hours often mean shorter wait times and faster resolution from agents. ☎️+1(888) 642-5075 Calling ☎️+1(888) 642-5075 during early mornings or late evenings generally improves your chances of getting help quickly.

United Airlines support agents can assist you with a wide range of issues, including flight changes, cancellations, refunds, and baggage claims. ☎️+1(888) 642-5075 Once connected via ☎️+1(888) 642-5075, agents aim to resolve most concerns within minutes, especially if you have your booking details ready. ☎️+1(888) 642-5075 Being prepared when you call ☎️+1(888) 642-5075 speeds up your service significantly.

For urgent situations, such as flight delays or missed connections, calling United Airlines support at ☎️+1(888) 642-5075 provides real-time assistance. ☎️+1(888) 642-5075 Agents can immediately check alternative flights, arrange accommodations, or offer travel credits when you call ☎️+1(888) 642-5075. This immediate support helps reduce travel disruptions.

If you experience long wait times, consider calling ☎️+1(888) 642-5075 during non-peak hours or using callback options when available. ☎️+1(888) 642-5075 Many travelers find early mornings and late evenings the best times to reach an agent quickly at ☎️+1(888) 642-5075. Staying patient and prepared helps you get fast, efficient help.

Travelers with special needs or accessibility requests receive prioritized service when calling ☎️+1(888) 642-5075. ☎️+1(888) 642-5075 Calling this number connects you to agents trained to arrange accommodations promptly. ☎️+1(888) 642-5075 This focused support speeds up assistance for passengers requiring extra care.

International travelers can also benefit from calling ☎️+1(888) 642-5075. ☎️+1(888) 642-5075 Support agents understand visa requirements, travel restrictions, and partner airline connections to provide quick, informed assistance. ☎️+1(888) 642-5075 Using this number ensures you receive the latest updates without delay.

Always use the official United Airlines phone number ☎️+1(888) 642-5075 to get fast and secure customer service. ☎️+1(888) 642-5075 Avoid third-party numbers that may cause delays or scams. ☎️+1(888) 642-5075 Calling ☎️+1(888) 642-5075 ensures direct access to authorized agents who prioritize your needs.

When calling United Airlines support at ☎️+1(888) 642-5075, clearly explaining your issue helps agents resolve your problem faster. ☎️+1(888) 642-5075 Having your reservation number, flight details, and identification ready accelerates the process. ☎️+1(888) 642-5075 Organized callers typically get help within minutes after connecting.

In summary, calling United Airlines support at ☎️+1(888) 642-5075 is one of the fastest ways to get assistance. ☎️+1(888) 642-5075 While wait times vary, being prepared and calling during off-peak hours can reduce delays. ☎️+1(888) 642-5075 This official number ensures you reach live agents who can resolve your concerns quickly and efficiently.

Whether you need help with booking changes, baggage issues, or special accommodations, calling ☎️+1(888) 642-5075 puts you in touch with experts ready to assist. Keep this number handy for rapid, reliable United Airlines customer service whenever you need it. Next time you need help fast, this is the number to call!
Simple download service (Atom) of the dataset: Surface Issues of the PPR...
data.europa.eu
unknown
Updated Feb 18, 2022
+ more versions
Cite
(2022). Simple download service (Atom) of the dataset: Surface Issues of the PPR Mining of the May-sur-Orne Basin [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-7b01c693-eb98-4fe4-868e-b1fce8c2a722?locale=en
Explore at:
Available download formats: unknown
Dataset updated
Feb 18, 2022
Description
Generally speaking, issues are the people, property, activities, and cultural or environmental heritage elements that are threatened by a hazard and likely to be affected or damaged by it. The sensitivity of an issue to a hazard is called “vulnerability”. This dataset lists all the issues addressed in the mine risk prevention plan (PPRM) study. An issue is a dated object whose consideration depends on the purpose of the PPRM and on its vulnerability to the hazards studied. A PPRM issue may therefore be taken into account or not, depending on the type or types of hazard being addressed. These elements form the basis of the land-cover knowledge needed to develop the PPRM, in or near the study area, at the time the issues were analysed. The data on issues represent a fixed, non-exhaustive snapshot of the assets and people exposed to hazards at the time the risk prevention plan was developed. The data are not updated after the PPRM has been approved. In practice they are no longer used: the issues are recalculated as necessary from up-to-date data sources.

The issues of the May-sur-Orne Basin Mine Risk Prevention Plan were identified and finalised in November 2019, as part of the plan development process prescribed on January 14, 2005.

The Mine Risk Prevention Plan for the May-sur-Orne Basin was approved on 10/08/2021.

USA Name Data

USA Name Data (BigQuery Dataset)

Inspiration

What are the most common names?

What are the most common female names?

Are there more female or male names?

Are there more female names by a wide margin?
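
The questions above can be explored with a few lines of Python. The following is only a minimal sketch: it assumes each record reduces to a (name, gender, count) triple, and the sample rows are invented purely for illustration, not taken from the dataset. The function names `most_common_names` and `distinct_names_by_gender` are hypothetical helpers.

```python
from collections import Counter

# Hypothetical rows in an assumed (name, gender, count) layout.
# The values are made up for illustration and are not real dataset figures.
ROWS = [
    ("Mary", "F", 120),
    ("James", "M", 110),
    ("Mary", "F", 80),
    ("Linda", "F", 90),
    ("Jordan", "M", 30),
    ("Jordan", "F", 20),
]


def most_common_names(rows, gender=None, top=3):
    """Sum counts per name, optionally restricted to one gender."""
    totals = Counter()
    for name, row_gender, count in rows:
        if gender is None or row_gender == gender:
            totals[name] += count
    return totals.most_common(top)


def distinct_names_by_gender(rows):
    """Count how many distinct names appear for each gender."""
    names = {}
    for name, row_gender, _ in rows:
        names.setdefault(row_gender, set()).add(name)
    return {g: len(found) for g, found in names.items()}


if __name__ == "__main__":
    print(most_common_names(ROWS))              # most common names overall
    print(most_common_names(ROWS, gender="F"))  # most common female names
    print(distinct_names_by_gender(ROWS))       # distinct names per gender
```

With real data, the same aggregation would run over rows read from the downloaded files or a query result; only the loading step would change.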

