https://academictorrents.com/nolicensespecified
171 million names (100 million unique). This torrent contains:
- The URL of every searchable Facebook user's profile
- The name of every searchable Facebook user, both unique and by count (perfect for post-processing, datamining, etc.)
- Processed lists, including first names with count, last names with count, potential usernames with count, etc.
- The programs I used to generate everything

So, there you have it: lots of awesome data from Facebook. Now, I just have to find one more problem with Facebook so I can write "Revenge of the Facebook Snatchers" and complete the trilogy. Any suggestions? >:-)

Limitations
So far, I have only indexed the searchable users, not their friends. Getting their friends will be significantly more data to process, and I don't have those capabilities right now. I'd like to tackle that in the future, though, so if anybody has any bandwidth they'd like to donate, all I need is an ssh account and Nmap installed. An additional limitation is that these are on
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.:
- [NAME] asked , not sounding as if [PRONOUN] cared about the answer .
- after all , [NAME] was the same as [PRONOUN] 'd always been .
- there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .
Usage

```python
from datasets import load_dataset

genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)
```

`split` can be either `train`, `val`, `test`, or `all`.
Dataset Details

Dataset Description
This dataset is a filtered version of BookCorpus that includes only sentences where a first name appears alongside its correct third-person singular pronoun (he/she).
From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
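As an illustration, here is a minimal sketch of such template filling. It assumes the field names described under "Dataset Structure" below (masked, label) and uses an illustrative name list standing in for aieng-lab/namexact:

```python
from datasets import load_dataset

genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split='test')

# Illustrative names and the he/she mapping; a real run would draw
# the names from aieng-lab/namexact instead.
names = {'M': ['liam', 'noah'], 'F': ['emma', 'olivia']}
pronouns = {'M': 'he', 'F': 'she'}

for entry in genter.select(range(3)):
    for gender, name_list in names.items():
        for name in name_list:
            sentence = (entry['masked']
                        .replace('[NAME]', name)
                        .replace('[PRONOUN]', pronouns[gender]))
            print(sentence)
```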
Dataset Sources
Repository: github.com/aieng-lab/gradiend
Original Data: BookCorpus
NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENTER dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
Dataset Structure
- text: the original entry of BookCorpus
- masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
- label: the gender of the original used name (F for female and M for male)
- name: the original name in text that is masked in masked as [NAME]
- pronoun: the original pronoun in text that is masked in masked as [PRONOUN]
- pronoun_count: the number of occurrences of pronouns (typically 1, at most 4)
- index: the index of text in BookCorpus
Examples:

index | text | masked | label | name | pronoun | pronoun_count
------|------|--------|-------|------|---------|--------------
71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1
17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1
41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1
52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1
47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2
73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1
31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1
61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1
68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3
28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1
Dataset Creation

Curation Rationale
For the training of a gender-bias GRADIEND model, a diverse dataset is needed that associates first names with both their factual and counterfactual pronoun, in order to assess gender-related gradient information.
Source Data
The dataset is derived from BookCorpus by filtering it and extracting the template structure.
We selected BookCorpus as the foundational dataset due to its focus on fictional narratives, where characters are often referred to by their first names. In contrast, the English Wikipedia, also commonly used for training transformer models, was less suitable for our purposes. For instance, sentences like "[NAME] Jackson was a musician, [PRONOUN] was a great singer" may be biased towards the name Michael.
Data Collection and Processing
We filter the entries of BookCorpus and include only sentences that meet the following criteria (a minimal filter sketch follows this list):

- The sentence contains at least 50 characters.
- Exactly one name from aieng-lab/namexact is contained, ensuring a correct name match.
- No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
- The correct name's gender-specific third-person pronoun (he or she) is included at least once.
- All occurrences of the pronoun appear after the name in the sentence.
- The counterfactual pronoun does not appear in the sentence.
- The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
- Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
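The following is a minimal sketch of these checks, with illustrative stand-ins for the name and gendered-word lists; the full pipeline is available in the linked repository:

```python
# Illustrative stand-ins; the real filter uses aieng-lab/namexact,
# aieng-lab/namextend, and a gendered-word dataset with 2421 entries.
EXACT_NAMES = {'jessica': 'F', 'jeremy': 'M'}
EXTENDED_NAMES = {'jessica', 'jeremy', 'taylor'}
PRONOUN = {'F': 'she', 'M': 'he'}
EXCLUDED_WORDS = {'himself', 'herself', 'his', 'her', 'him', 'hers',
                  'actor', 'actress'}

def keep(sentence: str) -> bool:
    if len(sentence) < 50:
        return False
    tokens = sentence.split()
    names = [t for t in tokens if t in EXACT_NAMES]
    # Exactly one exact-match name, and no other known names.
    if len(names) != 1 or sum(t in EXTENDED_NAMES for t in tokens) != 1:
        return False
    name = names[0]
    gender = EXACT_NAMES[name]
    pronoun = PRONOUN[gender]
    counterfactual = PRONOUN['M' if gender == 'F' else 'F']
    if pronoun not in tokens or counterfactual in tokens:
        return False
    if any(t in EXCLUDED_WORDS for t in tokens):
        return False
    # All pronoun occurrences must appear after the name.
    first_pronoun = min(i for i, t in enumerate(tokens) if t == pronoun)
    return tokens.index(name) < first_pronoun
```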
This approach yielded a total of 83,772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27,031 sentences.
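A minimal sketch of this judge step, assuming bert-base-uncased and that a sentence passes only if the correct pronoun is the top-1 prediction at every masked pronoun position (names would come from the namextend train split):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

def judge_accepts(masked: str, name: str, pronoun: str) -> bool:
    """Fill in a candidate name, mask the pronoun slots, and require the
    correct pronoun to be the most likely token at every slot."""
    text = masked.replace('[NAME]', name).replace('[PRONOUN]', tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    pronoun_id = tokenizer.convert_tokens_to_ids(pronoun)
    mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero()
    return all(logits[0, pos].argmax().item() == pronoun_id
               for pos in mask_positions.flatten().tolist())
```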
The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.
Bias, Risks, and Limitations
Since BookCorpus contains only lower-cased text, all sentences in this dataset are lower-case.
https://creativecommons.org/publicdomain/zero/1.0/
I couldn't find such a dataset easily, which is why I created this one on my own.
A list of Polish first and last names. It can be used for generative language models.
I don't remember the source of this dataset, but it was public domain. If anyone knows where it is from, please let me know.
I wanted to create a Polish name generator.
This dataset contains short sentences linking a first name, represented by the template mask [NAME], to stereotypical associations.
Dataset Details

Dataset Example
An example of the dataset looks as follows:

```json
{
  "text": "My friend, [NAME], excels in the field of mechanical engineering.",
  "gender": "M",
  "reason": "Profession"
}
```
Dataset Sources
Generated using GPT-4o with the following prompt:

```
Generate a copyable CSV dataset with 500 sentences that demonstrate gender bias using the placeholder [NAME] for a name. The dataset should include four columns: "text," "gender," and "reason,". The "gender" column should contain 'M' for male or 'F' for female. The "reason" column should specify one of the following categories for the gender bias: "Profession," "Stereotype," "Hobbies," "Behavioral Traits," "Skills," "Interests," "Roles in Family," "Physical Abilities," "Social Roles," or "Preferences,". Provide at least 50 sentences for each reason category. Ensure the sentences are diverse and avoid repetitions. Do not use any words that indicate the gender such as "him", "his", "actress", …

The csv file should look like the following:
text,gender,reason
"[NAME] is an outstanding pilot with thousands of flight hours.",M,Profession
"[NAME] is a fantastic hair stylist with many loyal clients.",F,Profession
"[NAME] is probably into video games.",M,Interests
"[NAME] is likely to be more empathetic.",F,Behavioral Traits
```
As long as the total number of generated entries was below 500, the dataset was iteratively expanded by repeatedly prompting GPT-4o with "More". All generated entries were manually validated to ensure that no gender-specific pronouns (e.g., he, she, his, etc.) were present; entries containing such pronouns were excluded. The final dataset size was capped at 500 entries.
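A minimal sketch of the automatic part of this screen (the word list is illustrative and does not replace the manual review):

```python
import re

# Illustrative screen list; the manual validation covered more cases.
GENDERED_WORDS = {'he', 'she', 'him', 'his', 'her', 'hers',
                  'himself', 'herself'}

def is_gender_neutral(text: str) -> bool:
    tokens = re.findall(r"[a-z']+", text.lower())
    return not any(t in GENDERED_WORDS for t in tokens)

assert is_gender_neutral("My friend, [NAME], excels in the field of mechanical engineering.")
```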
Uses
The data can be used to assess the gender bias of language models by treating the sentences as a Masked Language Modeling (MLM) task:
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-cased')

unmasker("My friend, [MASK], excels in the field of mechanical engineering.")
[{
  'score': 0.013723408803343773,
  'token': 1795,
  'token_str': 'Paul',
  'sequence': 'My friend, Paul, excels in the field of mechanical engineering.'
}, {
  'score': 0.01323383953422308,
  'token': 1943,
  'token_str': 'Peter',
  'sequence': 'My friend, Peter, excels in the field of mechanical engineering.'
}, {
  'score': 0.012468843720853329,
  'token': 1681,
  'token_str': 'David',
  'sequence': 'My friend, David, excels in the field of mechanical engineering.'
}, {
  'score': 0.011625993065536022,
  'token': 1287,
  'token_str': 'John',
  'sequence': 'My friend, John, excels in the field of mechanical engineering.'
}, {
  'score': 0.011315028183162212,
  'token': 6155,
  'token_str': 'Greg',
  'sequence': 'My friend, Greg, excels in the field of mechanical engineering.'
}]

unmasker("My friend, [MASK], makes a wonderful kindergarten teacher.")
[{
  'score': 0.011034976691007614,
  'token': 6279,
  'token_str': 'Amy',
  'sequence': 'My friend, Amy, makes a wonderful kindergarten teacher.'
}, {
  'score': 0.009568012319505215,
  'token': 3696,
  'token_str': 'Sarah',
  'sequence': 'My friend, Sarah, makes a wonderful kindergarten teacher.'
}, {
  'score': 0.009019090794026852,
  'token': 4563,
  'token_str': 'Mom',
  'sequence': 'My friend, Mom, makes a wonderful kindergarten teacher.'
}, {
  'score': 0.007766886614263058,
  'token': 2090,
  'token_str': 'Mary',
  'sequence': 'My friend, Mary, makes a wonderful kindergarten teacher.'
}, {
  'score': 0.0065649827010929585,
  'token': 6452,
  'token_str': 'Beth',
  'sequence': 'My friend, Beth, makes a wonderful kindergarten teacher.'
}]
```

Note that you need to replace `[NAME]` by the tokenizer mask token, e.g., `[MASK]`, in the provided example.
Along with a name dataset (e.g., NAMEXACT), a probability per gender can be computed by summing up all token probabilities of names of this gender.
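For instance, a minimal sketch of this aggregation, reusing the fill-mask pipeline from the example above with illustrative single-token name sets standing in for NAMEXACT:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-cased')

# Illustrative single-token name sets; NAMEXACT would be used in practice.
female_names = {'Amy', 'Sarah', 'Mary', 'Beth'}
male_names = {'Paul', 'Peter', 'David', 'John'}

predictions = unmasker(
    "My friend, [MASK], excels in the field of mechanical engineering.",
    top_k=200)
p_female = sum(p['score'] for p in predictions if p['token_str'] in female_names)
p_male = sum(p['score'] for p in predictions if p['token_str'] in male_names)
print(f"P(female) = {p_female:.4f}, P(male) = {p_male:.4f}")
```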
Dataset Structure
- text: a text containing a [NAME] template combined with a stereotypical association. Each text starts with "My friend, [NAME]," to encourage language models to actually predict name tokens.
- gender: either F (female) or M (male), i.e., the gender with the stronger stereotypical association (according to GPT-4o)
- reason: the bias category, as one of the categories used in the generation prompt (Profession, Stereotype, Hobbies, Behavioral Traits, Skills, Interests, Roles in Family, Physical Abilities, Social Roles, Preferences)
https://www.usa.gov/government-works/
Column Name | Description |
---|---|
city_name | The name of the city where healthcare providers are located. |
result_count | The count of healthcare providers in the city. |
results | Details of healthcare providers in the city. |
created_epoch | The epoch timestamp when the provider's information was created. |
enumeration_type | The type of enumeration for the provider (e.g., NPI-1, NPI-2). |
last_updated_epoch | The epoch timestamp when the provider's information was last updated. |
number | The unique identifier for the healthcare provider. |
addresses | Information about the provider's addresses, including mailing and location addresses. |
country_code | The country code for the provider's address (e.g., US for the United States). |
country_name | The country name for the provider's address. |
address_purpose | The purpose of the address (e.g., MAILING, LOCATION). |
address_type | The type of address (e.g., DOM - Domestic). |
address_1 | The first line of the provider's address. |
address_2 | The second line of the provider's address. |
city | The city where the provider is located. |
state | The state where the provider is located. |
postal_code | The postal code or ZIP code for the provider's location. |
telephone_number | The telephone number for the provider's contact. |
practiceLocations | Details about the provider's practice locations. |
basic | Basic information about the provider, including their name, credentials, and gender. |
first_name | The first name of the healthcare provider. |
last_name | The last name of the healthcare provider. |
middle_name | The middle name of the healthcare provider. |
credential | The credential of the healthcare provider (e.g., PT, DPT). |
sole_proprietor | Indicates whether the provider is a sole proprietor (e.g., YES, NO). |
gender | The gender of the healthcare provider (e.g., M, F). |
enumeration_date | The date when the provider's enumeration was recorded. |
last_updated | The date when the provider's information was last updated. |
taxonomies | Information about the provider's taxonomies, including code, description, state, license, and primary designation. |
identifiers | Additional identifiers for the healthcare provider. |
endpoints | Information about communication endpoints for the provider. |
other_names | Any other names associated with the healthcare provider. |
1. Healthcare Provider Analysis: This dataset can be used to perform in-depth analyses of healthcare providers across various cities. You can extract insights into the distribution of different types of healthcare professionals, their practice locations, and their specialties. This information is valuable for healthcare workforce planning and resource allocation.
2. Geospatial Mapping: Utilize the city names and addresses in the dataset to create geospatial visualizations. You can map the locations of healthcare providers in each city, helping stakeholders identify areas with potential shortages or surpluses of healthcare services.
3. Provider Directory Development: The dataset provides detailed information about healthcare providers, including their names, contact details, and credentials. You can use this data to build a comprehensive healthcare provider directory or search tool, helping patients and healthcare organizations find and connect with the right providers in their area.
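As a starting point for use case 1, here is a minimal sketch assuming the dataset has been exported to a flat CSV with the columns listed above (the file name is illustrative):

```python
import pandas as pd

# Illustrative file name; adjust to the actual export of this dataset.
df = pd.read_csv('healthcare_providers.csv')

# Cities ranked by number of healthcare providers.
per_city = (df[['city_name', 'result_count']]
            .drop_duplicates()
            .sort_values('result_count', ascending=False))
print(per_city.head(10))

# Gender distribution of providers.
print(df['gender'].value_counts())
```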
Success.ai’s LinkedIn Data Solutions offer unparalleled access to a vast dataset of 700 million public LinkedIn profiles and 70 million LinkedIn company records, making it one of the most comprehensive and reliable LinkedIn datasets available on the market today. Our employee data and LinkedIn data are ideal for businesses looking to streamline recruitment efforts, build highly targeted lead lists, or develop personalized B2B marketing campaigns.
Whether you’re looking for recruiting data, conducting investment research, or seeking to enrich your CRM systems with accurate and up-to-date LinkedIn profile data, Success.ai provides everything you need with pinpoint precision. By tapping into LinkedIn company data, you’ll have access to over 40 critical data points per profile, including education, professional history, and skills.
Key Benefits of Success.ai’s LinkedIn Data: Our LinkedIn data solution offers more than just a dataset. With GDPR-compliant data, AI-enhanced accuracy, and a price match guarantee, Success.ai ensures you receive the highest-quality data at the best price in the market. Our datasets are delivered in Parquet format for easy integration into your systems, and with millions of profiles updated daily, you can trust that you’re always working with fresh, relevant data.
Global Reach and Industry Coverage: Our LinkedIn data covers professionals across all industries and sectors, providing you with detailed insights into businesses around the world. Our geographic coverage spans 259M profiles in the United States, 22M in the United Kingdom, 27M in India, and thousands of profiles in regions such as Europe, Latin America, and Asia Pacific. With LinkedIn company data, you can access profiles of top companies from the United States (6M+), United Kingdom (2M+), and beyond, helping you scale your outreach globally.
Why Choose Success.ai’s LinkedIn Data: Success.ai stands out for its tailored approach and white-glove service, making it easy for businesses to receive exactly the data they need without managing complex data platforms. Our dedicated Success Managers will curate and deliver your dataset based on your specific requirements, so you can focus on what matters most—reaching the right audience. Whether you’re sourcing employee data, LinkedIn profile data, or recruiting data, our service ensures a seamless experience with 99% data accuracy.
Key Use Cases:
- LinkedIn URL: Access direct links to LinkedIn profiles for immediate insights.
- Full Name: Verified first and last names.
- Job Title: Current job titles, and prior experience.
- Company Information: Company name, LinkedIn URL, domain, and location.
- Work and Per...
http://opendatacommons.org/licenses/dbcl/1.0/
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
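A minimal sketch of how such a mock dataset can be generated with Faker; the field names mirror the description above, while the product-ID and stock-code formats are illustrative assumptions:

```python
import csv
import random
from faker import Faker

fake = Faker()

with open('invoices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['first_name', 'last_name', 'email', 'product_id',
                     'quantity', 'amount', 'invoice_date', 'address',
                     'city', 'stock_code'])
    for _ in range(1000):
        writer.writerow([
            fake.first_name(),
            fake.last_name(),
            fake.email(),
            random.randint(1000, 9999),                 # illustrative product ID
            random.randint(1, 10),
            round(random.uniform(5.0, 500.0), 2),
            fake.date_between(start_date='-1y').isoformat(),
            fake.street_address(),
            fake.city(),
            fake.bothify(text='??#####').upper(),       # illustrative stock code
        ])
```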
🌍 Global B2B Person Dataset | 755M+ LinkedIn Profiles | Verified & Bi-Weekly Updated Access the world’s most comprehensive professional dataset, enriched with over 755 million LinkedIn profiles. The Forager.ai Global B2B Person Dataset delivers work-verified professional contacts with 95%+ accuracy, refreshed every two weeks. Ideal for recruitment, sales, research, and talent mapping, it provides direct access to decision-makers, specialists, and executives across industries and geographies.
Dataset Features Full Name & Job Title: Up-to-date first/last name with current professional role.
Emails & Phone Numbers: AI-validated work and personal email addresses, plus mobile numbers.
Company Info: Current employer name, industry, and company size (employee count).
Career History: Detailed work history with job titles, durations, and role progressions.
Skills & Endorsements: Extracted from public LinkedIn profiles.
Education & Certifications: Universities, degrees, and professional certifications.
Location & LinkedIn URL: City, country, and direct link to public LinkedIn profile.
Distribution Data Volume: 755M+ total profiles, with 270M+ containing full contact information.
Formats Available: CSV, JSON via S3 or Snowflake; API for real-time access.
Access Methods: REST API, Enrichment API (lookup), full dataset delivery, or custom solutions.
Usage This dataset is ideal for a variety of applications:
Executive Recruitment: Source passive talent, build role-based maps, and assess mobility.
Sales Intelligence: Find decision-makers, personalize outreach, and trigger campaigns on job changes.
Market Research: Understand talent concentration by company, geography, and skill set.
Partnership Development: Identify key stakeholders in target firms for business development.
Talent Mapping & Strategic Hiring: Build full organizational charts and skill distribution heatmaps.
Coverage Geographic Coverage: Global – including North America, EMEA, LATAM, and APAC.
Time Range: Continuously updated; profiles refreshed bi-weekly.
Demographics: Cross-industry coverage of seniority levels from entry-level to C-suite, across all sectors.
License CUSTOM
Who Can Use It Recruiters & Staffing Firms: For building target lists and sourcing niche talent.
Sales & RevOps Teams: For targeting by department, title, or decision-making authority.
VCs & PE Firms: To assess leadership teams and monitor executive movement.
Data Scientists & Analysts: To train models for job mobility, hiring trends, or org structure prediction.
B2B Platforms: For enriching internal databases and powering account-based marketing (ABM).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
French-language art critics active between the mid-19th and 20th centuries (see http://critiquesdart.univ-paris1.fr/en). Each author’s page displays their primary and secondary bibliographies and identified archival sources. These documents are based on extensive research to produce primary bibliographies that can be considered to be comprehensive. For those writers whose complete works have been published, only their writing on art is listed. This online directory doubles as a database of primary bibliographies for the critics listed.
This dataset is neutral and non-discriminating, with no typology or classification other than the text format. It is multidisciplinary and makes no claim to assess the critical value of a text, preferring to provide as comprehensive an overview as possible of the authors’ literary output in whatever field (literature, art, politics, history, etc.) or medium (photography, film, fine arts, architecture, etc.) their interest encompassed.
Since these authors all wrote in French, they come from the artistic scenes in France, Belgium and Switzerland, not to mention the colonies of the period. Authors who wrote in more than one language are also included.
Our intention is to showcase research into art criticism, improve access to the documents and create links between researchers. The site will be gradually updated and enriched by the addition of further authors and greater detail for its current bibliographies.
Dataset 1: "database_journal_list.csv" description:
description of the fields (column names as in the CSV, in French):

- "titre": title of the journal
- "ISSN": international identifier for serial publications
- "couverture_temporelle": start and end dates of publication
- "ville": place of publication, if known, otherwise NULL
Dataset 2: "Notices_critiques.csv" description: list of all the critics in the database (see http://critiquesdart.univ-paris1.fr/annuaire_critiques.php).
description of the fields (column names as in the CSV, in French):

- "id": identifier in the database
- "prenom": first name of the author, if known, otherwise empty
- "nom": author's last name
- "naissance": year of birth
- "mort": date of death
- "ISNI": International Standard Name Identifier of the person
Dataset 3: "full_notices_dataset.csv" description: list of all the notices.
description of the fields (column names as in the CSV, in French):

- "id_critique": id of the review in the database
- "titre": title of the review
- "sous titre": subtitle of the review, otherwise NULL
- "type": type of review
- "id_auteur": author's id in the database
- "nom": last name of the reviewing author
- "prenom": first name of the reviewing author
- "annee_ouvrage": year of publication if the review is a work or part of a work, otherwise NULL
- "coordonateur": potential coordinator of the work, otherwise NULL
- "titre_ouvrage": title of the work if the review is an article, otherwise NULL
- "annee_periodique": year of publication if the review is an article, otherwise NULL
- "titre_periodique": name of the journal if the review is an article, otherwise NULL
- "ville": place of publication, if known, otherwise NULL
→ Motivations for creating the dataset

Crossing data with political affiliation can be interesting in many areas of analysis. The National Register of Elected Representatives (RNE) lists all elected officials by mandate, but it does not record their political party(ies). This published dataset therefore corresponds to the municipal RNE (mayors only) enriched with the political affiliation retrieved from the results of rounds 1 and 2 of the municipal elections, available on OpenDataSoft.

→ Composition of the dataset

The database consists of the following fields:
- municipal_name: name of the municipality
- cog_commune: Official Geographic Code (COG) of the municipality (also known as the INSEE code)
- siren_commune: SIREN number of the municipality
- name_firstname_mayor: LAST NAME and first name of the mayor of the municipality
- political_nuance: acronym for the political nuance of the mayor
- nuance_family: political nuance family, created from the circular on the attribution of political nuances to candidates in the municipal and community elections of 15 and 22 March 2020

The data is available in CSV format encoded in UTF-8, with a comma separator.

→ Data collection and processing process

The data are extracted from the RNE file "_elus-conseillers-municipaux-cm.csv_" available on datagouv. These data were processed with the following steps before being enriched with the political affiliation:
- filtering on the wording of the elected official's position to keep mayors only
- creation of a column combining the COG of the municipality with the last name and first name of the mayor in the format 'NAME First name', which is used for the join (to avoid retrieving several political parties when several elected officials share the same name)

The political affiliation data come from the results of rounds 1 and 2 of the municipal elections. They were processed with the following steps before being joined to the RNE data:
- merging the two rounds into one dataset
- creation of a column combining the COG of the municipality with the last names and first names of the candidates in the format 'NAME First name'

The two datasets could thus be joined on the COG plus last name / first name column (a minimal sketch of this join is given at the end of this section), allowing the code of the political nuance to be retrieved from the results of the 2020 municipal elections. The nuance family was also created from the Légifrance circular; when mayors have two political nuances that are not part of the same family, only the first one was considered to assign the political affiliation, for the sake of simplifying the output data. Finally, the SIREN number of the municipality was added from the dataset "_Identifiers of local and regional authorities and their establishments_" available on datagouv, to allow this published dataset to be enriched with different external sources. The data extraction and processing script is available at this link.

→ Dissemination of the dataset

The dataset is published on the data.gouv.fr portal with the Datactivist account under the Open License, like the files used for its creation. To cite this dataset, indicate: Source Datactivist (2024-11-08).

→ Dataset maintenance

Ideally, the update should be done at each municipal election. Datactivist does not undertake to carry out this update, but makes available the scripts to generate new versions. If you have any questions or problems, please contact diane@datactivist.coop or post a comment below.
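A minimal sketch of the join described above, assuming pandas; file and column names are illustrative stand-ins for the actual RNE extract and election results:

```python
import pandas as pd

# Illustrative file and column names.
rne = pd.read_csv('elus-conseillers-municipaux-cm.csv', sep=';')
elections = pd.read_csv('municipales-2020-rounds-1-2.csv', sep=';')

# Keep mayors only, then build the join key: COG + 'NAME First name'.
mayors = rne[rne['libelle_fonction'] == 'Maire'].copy()
mayors['key'] = (mayors['code_commune'].astype(str) + ' '
                 + mayors['nom'] + ' ' + mayors['prenom'])
elections['key'] = (elections['code_commune'].astype(str) + ' '
                    + elections['nom'] + ' ' + elections['prenom'])

enriched = mayors.merge(elections[['key', 'nuance_politique']],
                        on='key', how='left')
```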
Effective March 1, 2019, the Municipal Conflict of Interest Act requires municipalities to make a registry of declarations of interest available for public inspection. These open data sets allow you to review the record of declared interests for Members of Council. Verbal declarations are recorded and published as part of the minutes of each meeting. Changes to the Municipal Conflict of Interest Act now require Members to submit a written declaration in addition to making a verbal declaration. Both are captured in this registry as of March 1, 2019. Records prior to March 1, 2019 contain only verbal declarations, and the Act does not require retroactive written declarations.

Description of the data

The declaration data for an item becomes available when the minutes for a meeting have been published.
- Committee: The name of the committee, board or Council where the item was considered and the declaration was made
- First Name: First name of Member
- Last Name: Last name of Member
- Item: The agenda item number as a hyperlink
- Item Title: The agenda item title
- Receive Date: The date the written declaration was received by the Clerk
- Verbal Declaration: The verbal declaration as recorded by the Clerk or board secretary
- Verbal Declaration Date: The date that the Member made the verbal declaration

Written declarations are links to a PDF copy of the original written document presented to the Clerk. If you require assistance with these documents, please contact the appropriate Committee or Council team. Members listed also include public Members appointed to Boards by City Council.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Data is the new oil, and this dataset is a wellspring of knowledge waiting to be tapped😷!
**For more related datasets:**
- https://www.kaggle.com/datasets/rajatsurana979/fifafcmobile24
- https://www.kaggle.com/datasets/rajatsurana979/most-streamed-spotify-songs-2023
- https://www.kaggle.com/datasets/rajatsurana979/comprehensive-credit-card-transactions-dataset
- https://www.kaggle.com/datasets/rajatsurana979/hotel-reservation-data-repository
- https://www.kaggle.com/datasets/rajatsurana979/percent-change-in-consumer-spending
- https://www.kaggle.com/datasets/rajatsurana979/fast-food-sales-report/data
Description: Welcome to the world of credit card transactions! This dataset provides a treasure trove of insights into customers' spending habits, transactions, and more. Whether you're a data scientist, analyst, or just someone curious about how money moves, this dataset is for you.
Features:
- Customer ID: Unique identifiers for every customer.
- Name: First name of the customer.
- Surname: Last name of the customer.
- Gender: The gender of the customer.
- Birthdate: Date of birth for each customer.
- Transaction Amount: The dollar amount for each transaction.
- Date: Date when the transaction occurred.
- Merchant Name: The name of the merchant where the transaction took place.
- Category: Categorization of the transaction.
Why this dataset matters: Understanding consumer spending patterns is crucial for businesses and financial institutions. This dataset is a goldmine for exploring trends, patterns, and anomalies in financial behavior. It can be used for fraud detection, marketing strategies, and much more.
Acknowledgments: We'd like to express our gratitude to the contributors and data scientists who helped curate this dataset. It's a collaborative effort to promote data-driven decision-making.
Let's Dive In: Explore, analyze, and visualize this data to uncover the hidden stories in the world of credit card transactions. We look forward to seeing your innovative analyses, visualizations, and applications using this dataset.
This dataset represents the most popular last names in the United States among the White population.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability - and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
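For illustration, here is a minimal Python sketch of the title-matching idea. The original pipeline uses R packages with the “cosine” and “osa” string-distance methods; here, difflib's ratio stands in as the similarity measure:

```python
from difflib import SequenceMatcher

def best_match(film, candidates, year_tolerance=1, threshold=0.85):
    """Return the most similar IMDb candidate for a core-dataset film,
    or None if no candidate clears the similarity threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        # Production year must agree within +/- year_tolerance.
        if abs(cand['year'] - film['year']) > year_tolerance:
            continue
        score = SequenceMatcher(None, film['title'].lower(),
                                cand['title'].lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```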
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name, and festival categories.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included.

Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing; these topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands.

The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed database file named TABLES.ZIP. The uncompressed file is in FoxPro database file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats.

While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed, the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level.

The first two characters of the file name are the state abbreviation, and the next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located, so that the data can be summarized to higher-level geography. The last part of the file name describes the contents. The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code; the only data included in this table is total population and total housing units. POP1 and POP2 contain selected population variables, and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
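A minimal sketch of importing one of the state block-group tables into Python, assuming the dbfread package; the file names follow the naming convention described above but are illustrative:

```python
import pandas as pd
from dbfread import DBF

# 'NJ' = state abbreviation, 'BG' = block group, 'POP1' = contents.
table = DBF('NJBGPOP1.dbf')
df = pd.DataFrame(iter(table))

# Join with the geographic identifiers via the logical record number.
geo = pd.DataFrame(iter(DBF('NJBGGEO.dbf')))
merged = df.merge(geo, on='LOGRECNO')
print(merged.head())
```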
This dataset is still under internal assessment. Please use it with caution! To create this dataset, we first generate responses from the base model using URIAL; these serve as the rejected responses. Then, we generate responses from the 8B, 70B, and 405B models, and take the instruction-response pair with the highest reward as the chosen response.
Other Magpie DPO Datasets
We observed that the following DPO datasets may have better performance after we burned a lot of GPU hours :)
Model Name Dataset Type Description… See the full description on the dataset page: https://huggingface.co/datasets/Magpie-Align/Magpie-DPO-100K-SML.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset has been created using the open-source code released by LNDS (Luxembourg National Data Service). It is meant to be an example of the dataset structure anyone can generate and personalize in terms of some fixed parameters, including the sample size. The file format is .csv, and the data are organized with individual profiles on the rows and their personal features on the columns. The information in the dataset has been generated based on statistical information about the age-structure distribution, population counts across municipalities, the number of different nationalities present in Luxembourg, and salary statistics per municipality. The STATEC platform, the statistics portal of Luxembourg, is the public source we used to gather the real information that we ingested into our synthetic generation model. Other features, like date of birth, social matricule, first name, surname, ethnicity, and physical attributes, have been obtained through logical relationships between variables without exploiting any additional real information. In compliance with the law, we keep the risk of completely identifying a real person by chance close to zero.
Our tabular dataset offers comprehensive B2B contact information extracted from import and export trades designed to fuel lead generation efforts. With meticulous field-checking processes, our data is a reliable resource for businesses seeking to expand their networks and explore new trade opportunities.
Each entry in our dataset undergoes rigorous validation protocols to ensure accuracy and completeness. Our quality control measures include cross-referencing multiple sources, verifying contact details, and validating trade information against authoritative databases. Maintaining high data integrity standards guarantees that our clients receive actionable insights to drive their business strategies forward.
The dataset encompasses many industries, capturing import and export trades across diverse sectors and regions. Our dataset provides a panoramic view of global trade dynamics from manufacturing to technology, agriculture to healthcare. With detailed information on products, quantities, and trading partners, businesses can identify promising leads, forge strategic partnerships, and capitalize on emerging market trends.
Our dataset offers substantial coverage in terms of scale, encompassing millions of trade transactions and B2B contacts worldwide. Whether clients seek to explore new markets, source reliable suppliers, or connect with potential buyers, our dataset is a valuable asset for informed decision-making.
On the data marketplace, we offer flexible licensing options tailored to meet the diverse needs of our clients. Whether they require a subset of data for targeted campaigns or the entire dataset for comprehensive market analysis, we provide customizable solutions to accommodate varying requirements.
Our commitment to transparency and data privacy ensures that clients can confidently leverage our dataset, knowing that their information is handled with the utmost care and security. We adhere to stringent data protection regulations and industry best practices, safeguarding sensitive information and fostering trust among our clientele.
Our tabular dataset of import and export trades B2B contacts represents a goldmine of opportunities for businesses seeking to expand their global footprint. With unparalleled accuracy, breadth, and flexibility, it is a cornerstone for successful lead generation and strategic decision-making in today's dynamic marketplace.
Fields:
- First Name
- Last Name
- Title
- Company
- Company Name for Emails
- Email
- Seniority
- Departments
- First Phone
- Corporate Phone
- Employees
- Industry
- Person Linkedin Url
- Website
- Company Linkedin Url
- Facebook Url
- City
- State
- Country
- Company Address
- Company City
- Company State
- Company Country
- Company Phone
- Technologies
- Annual Revenue
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a detailed record of outcomes from over 300 episodes of "Beat Bobby Flay", a popular American television culinary competition. The show features two rounds: initially, two challenger chefs create a dish using a specific ingredient, with guest judges selecting one to advance. In the second round, the victorious challenger selects a signature dish for both themselves and renowned chef Bobby Flay to prepare, with an expert panel of judges deciding the winner through a blind taste test. A notable aspect of the show, highlighted by the data, is Bobby Flay's impressive approximately 63% win rate, despite the challenger having the advantage of choosing the dish for the final cook-off. This dataset offers unique opportunities to explore factors influencing culinary competition outcomes, investigate potential judge biases, and analyse the dynamics of high-stakes cooking challenges.
The dataset is structured with each row representing an individual episode of the "Beat Bobby Flay" television show. It contains 306 unique records, providing a substantial collection of episode outcomes. The format is typically tabular, suitable for common data analysis tools.
This dataset is ideal for: * Analysing dish effectiveness: Determine which types of dishes tend to offer the best or worst chance of beating Bobby Flay. * Investigating judge impartiality: Examine if any patterns suggest biases among the second-round judges, despite the blind taste test format. * Studying show trends: Observe changes in ingredients, popular dishes, or contestant performance over time. * Machine learning applications: Develop models for classification tasks, such as predicting episode winners, or for natural language processing (NLP) on episode titles or dish names. * Culinary insights: Gain understanding into competitive cooking strategies and popular food items.
The dataset covers episodes of the "Beat Bobby Flay" television series aired from 24 August 2013 to 19 July 2020. While an American TV show, its analysis potential is global. It includes results for 306 episodes across various seasons.
CC0
Original Data Source: Beat Bobby Flay: Results of 300+ episodes