100+ datasets found
  1. Data Science Career Opportunities (USA)

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Data Science Career Opportunities (USA) [Dataset]. https://www.opendatabay.com/data/ai-ml/6d1c5965-8fb2-4749-a8bd-f1c40861b401
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics, United States
    Description

    This dataset provides valuable insights into the US data science job market, containing detailed job listings scraped from the Indeed web portal on 20th November 2022. It is ideal for those seeking to understand job trends, analyse salary expectations, or develop skills in data analysis, machine learning, and natural language processing. The dataset's purpose is to offer a snapshot of available positions across various data science roles, including data scientists, machine learning engineers, and business analysts. It serves as a rich resource for exploratory data analysis, feature engineering, and predictive modelling tasks.

    Columns

    • Title: The job title of the listed position.
    • Company: The hiring company posting the job.
    • Location: The geographic location of the job within the US.
    • Rating: The rating associated with the job or company.
    • Date: Indicates how long the job had been posted prior to 20th November 2022.
    • Salary: The salary information provided in US Dollars ($). Please note that many entries in this column may be missing as salary details are often not disclosed in job listings.
    • Description: A brief summary description of the job.
    • Links: The direct link to the original job posting on the Indeed platform.
    • Descriptions: The full-length description of the job, encompassing all details found in the complete job posting.

    Distribution

    This dataset is provided as a single data file, typically in CSV format. It comprises 1200 rows (records) and 9 distinct columns. The file name is data_science_jobs_indeed_us.csv.

    Usage

    This dataset is perfectly suited for a variety of analytical tasks and applications (a brief sketch follows the list):

    • Data Cleaning and Preparation: Practise handling missing values, especially in the 'Salary' column.
    • Exploratory Data Analysis (EDA): Discover trends in job titles, company types, and locations.
    • Feature Engineering: Extract new features from the 'Descriptions' column, such as required skills, education levels, or experience.
    • Classification and Clustering: Develop models for salary prediction, or perform skill clustering analysis to guide curriculum development.
    • Text Processing and Natural Language Processing (NLP): Analyse job descriptions to identify common skill demands or industry buzzwords.
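    A minimal sketch of the first two tasks, assuming the CSV named in the Distribution section sits in the working directory:

    ```python
    import pandas as pd

    # Load the job listings (file name from the Distribution section above).
    df = pd.read_csv("data_science_jobs_indeed_us.csv")

    # How sparse is the Salary column? Many listings omit salary details.
    missing_share = df["Salary"].isna().mean()
    print(f"Salary missing in {missing_share:.1%} of listings")

    # Simple EDA: the most common job titles and locations.
    print(df["Title"].value_counts().head(10))
    print(df["Location"].value_counts().head(10))
    ```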

    Coverage

    The dataset's geographic scope is limited to job postings within the United States. All data was collected on 20th November 2022, with the 'Date' column providing information on how long each job had been active before this date. The dataset covers a wide range of data science positions, including roles such as data scientist, machine learning engineer, data engineer, business analyst, and data science manager. It is important to note the presence of many missing entries in the 'Salary' column, reflecting common data availability challenges in job listings.

    License

    CC0

    Who Can Use It

    This dataset is an excellent resource for:

    • Aspiring Data Scientists and Machine Learning Engineers: To sharpen their data cleaning, EDA, and model deployment skills.
    • Educators and Curriculum Developers: To inform and guide the development of relevant data science and analytics courses based on real-world job market demands.
    • Job Seekers: To understand the current landscape of data science roles, required skills, and potential salary ranges.
    • Researchers and Analysts: To glean insights into labour market trends in the data science domain.
    • Human Resources Professionals: To benchmark job roles, skill requirements, and compensation within the industry.

    Dataset Name Suggestions

    • Indeed US Data Science Job Insights
    • US Data Science Job Market Analysis
    • Data Professional Job Postings (Indeed USA)
    • Data Science Career Opportunities (USA)

    Attributes

    Original Data Source: Data Science Job Postings (Indeed USA)

  2. Cleaning and Sanitation Job Postings on Indeed in the United States

    • fred.stlouisfed.org
    Updated Jul 15, 2025
    Cite
    (2025). Cleaning and Sanitation Job Postings on Indeed in the United States [Dataset]. https://fred.stlouisfed.org/series/IHLIDXUSTPCLSA
    Available download formats: JSON
    Dataset updated
    Jul 15, 2025
    License

    https://fred.stlouisfed.org/legal/#copyright-pre-approval

    Area covered
    United States
    Description

    Graph and download economic data for Cleaning and Sanitation Job Postings on Indeed in the United States (IHLIDXUSTPCLSA) from 2020-02-01 to 2025-07-11 about cleaning, jobs, and USA.
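    The series is also retrievable programmatically. A minimal sketch using the FRED API (this assumes you hold a free FRED API key, substituted below as YOUR_API_KEY; the same pattern applies to the France and Germany series further down this list):

    ```python
    import requests

    # Sketch: fetch the IHLIDXUSTPCLSA series as JSON from the FRED API.
    URL = "https://api.stlouisfed.org/fred/series/observations"
    params = {
        "series_id": "IHLIDXUSTPCLSA",
        "api_key": "YOUR_API_KEY",  # free key from fred.stlouisfed.org
        "file_type": "json",
    }
    resp = requests.get(URL, params=params, timeout=30)
    resp.raise_for_status()
    observations = resp.json()["observations"]
    # Each observation carries 'date' and 'value' ('value' is a string,
    # and '.' marks a missing observation).
    for obs in observations[:5]:
        print(obs["date"], obs["value"])
    ```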

  3. Lot Cleaning Dispositions (No Longer Maintained)

    • data.cityofnewyork.us
    • data.ny.gov
    Updated Mar 29, 2016
    Cite
    Department of Sanitation (DSNY) (2016). Lot Cleaning Dispositions (No Longer Maintained) [Dataset]. https://data.cityofnewyork.us/w/r4c5-ndkx/25te-f2tw?cur=5U13FyxulmR
    Available download formats: application/rssxml, xml, csv, tsv, json, application/rdfxml
    Dataset updated
    Mar 29, 2016
    Dataset authored and provided by
    Department of Sanitation (DSNY)
    Description

    This dataset is based on a legacy system that is no longer maintained by the NYC Department of Sanitation (DSNY). It contains information about the cleaning requests associated with different vacant lots.

    Some current data related to vacant lots can be found in the 311 Service Requests dataset (https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) by filtering the "Complaint Type" field for "Vacant Lot".
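    A minimal sketch of that filter against the Socrata endpoint behind the 311 dataset (the field name complaint_type is assumed to be the API-side name of "Complaint Type"; adjust if the schema differs):

    ```python
    import requests

    # Sketch: pull recent 311 records whose Complaint Type is "Vacant Lot"
    # via the Socrata API behind NYC Open Data.
    URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
    params = {"complaint_type": "Vacant Lot", "$limit": 100}
    rows = requests.get(URL, params=params, timeout=30).json()
    for row in rows[:5]:
        print(row.get("created_date"), row.get("descriptor"), row.get("borough"))
    ```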

  4. Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...

    • datarade.ai
    Cite
    Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Hungary, Guinea-Bissau, Guatemala, Guadeloupe, Chile, Saint Barthélemy, Namibia, Niue, Panama, Andorra
    Description

    This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select a suitable delivery frequency (quarterly, monthly, or weekly).

    Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  5. Data Cleansing Tools Report

    • datainsightsmarket.com
    Updated May 4, 2025
    Cite
    Data Insights Market (2025). Data Cleansing Tools Report [Dataset]. https://www.datainsightsmarket.com/reports/data-cleansing-tools-1398134
    Available download formats: pdf, doc, ppt
    Dataset updated
    May 4, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data cleansing tools market is experiencing robust growth, driven by the escalating volume and complexity of data across various sectors. The increasing need for accurate and reliable data for decision-making, coupled with stringent data privacy regulations (like GDPR and CCPA), fuels demand for sophisticated data cleansing solutions. Businesses, regardless of size, are recognizing the critical role of data quality in enhancing operational efficiency, improving customer experiences, and gaining a competitive edge. The market is segmented by application (agencies, large enterprises, SMEs, personal use), deployment type (cloud, SaaS, web, installed, API integration), and geography, reflecting the diverse needs and technological preferences of users. While the cloud and SaaS models are witnessing rapid adoption due to scalability and cost-effectiveness, on-premise solutions remain relevant for organizations with stringent security requirements. The historical period (2019-2024) showed substantial growth, and this trajectory is projected to continue throughout the forecast period (2025-2033). Specific growth rates will depend on technological advancements, economic conditions, and regulatory changes. Competition is fierce, with established players like IBM, SAS, and SAP alongside innovative startups continuously improving their offerings. The market's future depends on factors such as the evolution of AI and machine learning capabilities within data cleansing tools, the increasing demand for automated solutions, and the ongoing need to address emerging data privacy challenges.

    The projected Compound Annual Growth Rate (CAGR) suggests a healthy expansion of the market. While precise figures are not provided, a realistic estimate based on industry trends places the market size at approximately $15 billion in 2025, derived from a combination of existing market reports and an understanding of the growth of related fields (such as data analytics and business intelligence). This substantial market value is further segmented across the specified geographic regions. North America and Europe currently dominate, but the Asia-Pacific region is expected to exhibit significant growth potential driven by increasing digitalization and adoption of data-driven strategies.

    The restraints on market growth largely involve challenges related to data integration complexity, the cost of implementation for smaller businesses, and the skills gap in data management expertise. However, these are being countered by the emergence of user-friendly tools and increased investment in data literacy training.

  6. Cleaning and Sanitation Job Postings on Indeed in France

    • fred.stlouisfed.org
    Updated Jul 9, 2025
    Cite
    (2025). Cleaning and Sanitation Job Postings on Indeed in France [Dataset]. https://fred.stlouisfed.org/series/IHLIDXFRTPCLSA
    Available download formats: JSON
    Dataset updated
    Jul 9, 2025
    License

    https://fred.stlouisfed.org/legal/#copyright-pre-approval

    Area covered
    France
    Description

    Graph and download economic data for Cleaning and Sanitation Job Postings on Indeed in France (IHLIDXFRTPCLSA) from 2020-02-01 to 2025-07-04 about cleaning, jobs, France, and USA.

  7. Job Market Twitter Data

    • opendatabay.com
    Updated Jul 7, 2025
    Cite
    Datasimple (2025). Job Market Twitter Data [Dataset]. https://www.opendatabay.com/data/ai-ml/b4559931-b555-4af3-b339-ba6699b33fb2
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    This dataset consists of 50,000 tweets pertaining to job vacancies and hiring. The tweets were collected using specific keywords such as 'Job Vacancy,' 'We are Hiring,' and 'We're Hiring'. The primary aim of this dataset is to facilitate the exploration of text pre-processing techniques and to test Natural Language Processing (NLP) skills. It also serves as a valuable resource for deriving insights into the job market from actual job postings and for analysing company and role requirements. The tweets were gathered using the snscrape Python library.

    Columns

    • ID: The unique identifier for each individual tweet.
    • Timestamp: The precise date and time when the tweet was posted.
    • User: The Twitter handle of the user responsible for posting the tweet.
    • Text: The actual content of the tweet itself.
    • Hashtag: Any hashtags that were included within the tweet.
    • Retweets: The total count of times the tweet had been retweeted at the point of scraping.
    • Likes: The total count of likes the tweet had accrued at the point of scraping.
    • Replies: The total count of replies to the tweet at the point of scraping.
    • Source: The application or device used to post the tweet.
    • Location: The location specified on the user's Twitter profile, if available.
    • Verified_Account: A Boolean value indicating whether the user's Twitter account was verified.
    • Followers: The number of followers the user had at the point the tweet was scraped.
    • Following: The number of accounts the user was following at the point the tweet was scraped.

    Distribution

    The dataset is provided in a CSV format and comprises 50,000 individual tweets or records.

    Usage

    This dataset is ideally suited for the following (a short sketch follows the list):

    • Text Pre-processing Practice: Users can experiment with various text cleaning and normalisation techniques.
    • Natural Language Processing (NLP) Skill Development: It serves as an excellent resource for developing and testing NLP models.
    • Job Market Analysis: Gaining insights into job market trends, popular roles, and hiring patterns based on social media data.
    • Company/Role Requirement Analysis: Examining the specific requirements or characteristics mentioned in job postings.
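    A minimal pre-processing sketch for the 'Text' column; this is a common baseline recipe, not one prescribed by the dataset:

    ```python
    import re

    # Basic tweet cleaning: lowercase, strip URLs and @mentions, keep hashtag
    # words but drop the '#', remove remaining punctuation, collapse spaces.
    def clean_tweet(text: str) -> str:
        text = text.lower()
        text = re.sub(r"https?://\S+", " ", text)   # remove URLs
        text = re.sub(r"@\w+", " ", text)           # remove @mentions
        text = re.sub(r"#", " ", text)              # keep hashtag word, drop '#'
        text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation/emoji
        return re.sub(r"\s+", " ", text).strip()

    print(clean_tweet("We're Hiring! Apply now: https://example.com #JobVacancy @acme"))
    # -> "we re hiring apply now jobvacancy"
    ```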

    Coverage

    The tweets in this dataset were collected between 1 January 2019 and 10 April 2023. The dataset's geographic coverage is global, as indicated by its listing. There are no specific notes on data availability for particular demographic groups or years beyond the stated collection period.

    License

    CC0

    Who Can Use It

    This dataset is intended for a variety of users, including:

    • Data Scientists and Analysts: For conducting text analysis, building predictive models, or extracting actionable insights.
    • NLP Researchers: For developing and refining algorithms related to text classification, topic modelling, or sentiment analysis on social media data.
    • Human Resources (HR) Professionals: To understand job market dynamics, identify hiring trends, or benchmark their own job postings.
    • Students and Educators: As a practical resource for learning and teaching about data analysis, social media data, and NLP.

    Dataset Name Suggestions

    • Job Vacancy Tweets
    • Social Media Job Postings
    • Hiring Tweets Dataset
    • Employment Tweets
    • Job Market Twitter Data

    Attributes

    Original Data Source: Job Vacancy Tweets

  8. Data Centre Demand Profiles

    • ukpowernetworks.opendatasoft.com
    Updated Mar 26, 2025
    Cite
    (2025). Data Centre Demand Profiles [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-data-centre-demand-profiles/
    Dataset updated
    Mar 26, 2025
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset shows the half-hourly load profiles of identified data centres within UK Power Networks' licence areas.

    The loads have been determined using actual demand data from connected sites within UK Power Networks' licence areas, over calendar year 2024.

    Loads are expressed proportionally, by comparing the half-hourly observed import power seen across the site's meter point(s), against the meter's maximum import capacity. Units for both measures are apparent power, in kilovolt amperes (kVA).

    To protect the identity of the sites, data points have been anonymised and only the site's voltage level information has been provided.

    Methodological Approach

    Over 100 operational data centre sites (and at least 10 per voltage level) were identified through internal desktop exercises and corroboration with external sources.

    After identifying these sites, their addresses and their MPAN(s) (Meter Point Administration Number) were identified using internal systems.

    Meter information for each of these connected demand sites was retrieved through half-hourly smart meter data. This includes both half-hourly meter data and static data (such as the MPAN's maximum import capacity and voltage, the latter through the MPAN's Line Loss Factor Class Description).

    In cases where there are numerous meter points for a given data centre site, the observed import powers across all relevant meter points are summed, and compared against the sum total of maximum import capacity for the meters.

    The percentage utilisation for each half-hour for each data centre was determined via the following equation:

    % Utilisation (data centre site) = SUM(P_MPAN half-hourly observed import) / SUM(P_MPAN maximum import capacity)

    where P = active power in kilowatts (kW).

    To ensure the dataset includes only operational data centres, the dataset was then cleansed to exclude sites where utilisation was consistently at 0% across the year.

    Based on the MPAN's address and corroboration with other open data sources, a data centre type was derived: either enterprise (i.e. company-owned and operated), or co-located (i.e. one company owns the data centre, but other customers operate IT load in the premises as tenants).
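    A hedged pandas sketch of this calculation; the file and column names below are hypothetical, as the listing does not specify the published schema:

    ```python
    import pandas as pd

    # Hypothetical half-hourly meter table with columns: site_id, timestamp,
    # import_kw (observed import) and capacity_kw (max import capacity) per MPAN.
    meters = pd.read_csv("half_hourly_meter_data.csv")

    # Sum observed import and capacity across all MPANs of a site per half-hour,
    # then express utilisation as a percentage, mirroring the equation above.
    site_hh = meters.groupby(["site_id", "timestamp"]).agg(
        import_kw=("import_kw", "sum"),
        capacity_kw=("capacity_kw", "sum"),
    )
    site_hh["utilisation_pct"] = 100 * site_hh["import_kw"] / site_hh["capacity_kw"]

    # Cleanse: drop sites whose utilisation is 0% across the whole year.
    active = site_hh.groupby("site_id")["utilisation_pct"].transform("max") > 0
    site_hh = site_hh[active]
    ```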

    Each data centre site was then anonymised by removing any identifiers other than voltage level and UK Power Networks' view of the data centre type.

    Quality Control Statement

    The dataset is primarily built upon customer smart meter data for connected customer sites within the UK Power Networks' licence areas.

    The smart meter data that is used is sourced from external providers. While UK Power Networks does not control the quality of this data directly, these data have been incorporated into our models with careful validation and alignment.

    Any missing or bad data has been addressed through robust data cleaning methods, such as omission.

    Assurance Statement

    The dataset is generated through a manual process, conducted by the Distribution System Operator's Regional Development Team.

    The dataset will be reviewed quarterly - both in terms of the operational data centre sites identified, their maximum observed demands and their maximum import capacities - to assess any changes and determine if updates of demand specific profiles are necessary. Deriving the data centre type is a desktop-based process based on the MPAN's address and through corroboration with external, online sources. This process ensures that the dataset remains relevant and reflective of real-world data centre usage over time.

    There are sufficient data centre sites per voltage level to assure anonymity of data centre sites.

    Other

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/

    Download dataset information: Metadata (JSON)

  9. Household Survey on Information and Communications Technology 2023 - West...

    • pcbs.gov.ps
    Updated Feb 19, 2025
    Cite
    Palestinian Central Bureau of Statistics (2025). Household Survey on Information and Communications Technology 2023 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/733
    Dataset updated
    Feb 19, 2025
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (http://pcbs.gov.ps/)
    Time period covered
    2023 - 2024
    Area covered
    Gaza, West Bank, Palestine, Gaza Strip
    Description

    Abstract

    Access to information and communication technology tools is one of the main inputs for achieving social development and economic change in Palestinian society, given the impact of the ICT revolution that has become a feature of this era. Therefore, within the scope of its efforts to provide official Palestinian statistics on various areas of life, the Palestinian Central Bureau of Statistics (PCBS) implemented the household survey on information and communications technology for the year 2023. The main objective of this report is to present trends in accessing and using information and communication technology by households and individuals in Palestine, and to enrich the ICT database with indicators that meet national needs and are in line with international recommendations.

    Geographic coverage

    Palestine, West Bank, Gaza Strip

    Analysis unit

    Household, Individual

    Universe

    All Palestinian households and individuals (10 years and above) whose usual place of residence in 2023 was in the state of Palestine.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Frame: The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units, with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.

    Sample Size The sample size is 8,040 households.

    Sampling Design: The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages: Stage (1): selection of a stratified sample of 536 enumeration areas with the PPS method. Stage (2): selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage (3): selection of one person from the 10-years-and-above age group at random using Kish tables. (536 EAs × 15 households gives the 8,040-household sample size stated above.)
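    Stage (3) uses Kish tables, which deterministically select one eligible member from a household roster. As a simplified illustration (not PCBS's actual grid, which is a pre-printed lookup table), a pick keyed to the questionnaire serial number might look like this:

    ```python
    # Simplified stand-in for the Kish-table step: deterministically select
    # one household member aged 10+ keyed to the questionnaire serial number.
    # (Real Kish tables use a pre-printed lookup grid, not this modulo rule.)
    def select_respondent(members: list[dict], serial: int) -> dict | None:
        eligible = [m for m in members if m["age"] >= 10]
        if not eligible:
            return None
        return eligible[serial % len(eligible)]

    household = [
        {"name": "A", "age": 42},
        {"name": "B", "age": 38},
        {"name": "C", "age": 12},
        {"name": "D", "age": 4},
    ]
    print(select_respondent(household, serial=8040))  # -> {'name': 'A', 'age': 42}
    ```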

    Sample Strata: The population was divided by: 1) governorate (16 governorates, with Jerusalem considered as two statistical areas); 2) type of locality (urban, rural, camps).

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

    Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

    Section III: Data on Individuals (10 years and above) about computer use, access to the Internet, possession of a mobile phone, information threats, and E-commerce.

    Cleaning operations

    Field Editing and Supervising

    • Data collection and coordination were carried out in the field according to the pre-prepared plan, where instructions, models and tools were available for fieldwork.
    • Auditing on the PC-tablet was performed through automated rules built into the program to cover all the required controls according to the specified criteria.
    • For privacy reasons, Jerusalem (J1) data were collected in a paper questionnaire. The supervisor then verified the questionnaire in a formal and technical manner according to the pre-prepared audit rules.
    • Fieldwork visits were carried out by the project coordinator, supervisors and project management to check edited questionnaires and the performance of fieldworkers.

    Data Processing

    Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once before the conducting of the training course by the project management where the notes and modifications were reflected on the program by the Data Processing Department after ensuring that it was free of errors before going to the field.

    Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could withdraw the data at any time.

    In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.

    Data Cleaning: After the completion of the data entry and audit phase, the data were cleaned by conducting internal tests for outlier answers and applying comprehensive audit rules in the SPSS program to extract and correct errors and discrepancies, preparing clean and accurate data ready for tabulation and publishing.
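    PCBS ran these checks in SPSS; as a toy analogue with hypothetical fields, an outlier screen of the kind described might look like this:

    ```python
    import pandas as pd

    # Toy outlier checks (column names hypothetical): flag ages outside a
    # plausible range and weekly internet-use hours exceeding hours in a week.
    df = pd.DataFrame({
        "age": [25, 132, 41, 10],
        "weekly_internet_hours": [14, 20, 200, 7],
    })
    flags = pd.DataFrame({
        "age_out_of_range": ~df["age"].between(0, 110),
        "hours_impossible": df["weekly_internet_hours"] > 168,
    })
    print(df[flags.any(axis=1)])  # records needing review/correction
    ```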

    Response rate

    The response rate reached 83.7%.

    Sampling error estimates

    Sampling Errors: The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; there is no problem disseminating results at the national level or at the level of the West Bank and Gaza Strip.

    Non-Sampling Errors: Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.

    The implementation of the survey encountered non-response; the most common case was the household not being present at home during the fieldwork visit. The total non-response rate reached 16.3%.

  10. Cleaning and Sanitation Job Postings on Indeed in Germany

    • fred.stlouisfed.org
    Updated Jul 9, 2025
    Cite
    (2025). Cleaning and Sanitation Job Postings on Indeed in Germany [Dataset]. https://fred.stlouisfed.org/series/IHLIDXDETPCLSA
    Available download formats: JSON
    Dataset updated
    Jul 9, 2025
    License

    https://fred.stlouisfed.org/legal/#copyright-pre-approval

    Area covered
    Germany
    Description

    Graph and download economic data for Cleaning and Sanitation Job Postings on Indeed in Germany (IHLIDXDETPCLSA) from 2020-02-01 to 2025-07-04 about cleaning, jobs, Germany, and USA.

  11. clinical-field-mappings

    • huggingface.co
    Updated May 8, 2025
    Cite
    Tiago Silva (2025). clinical-field-mappings [Dataset]. https://huggingface.co/datasets/tsilva/clinical-field-mappings
    Dataset updated
    May 8, 2025
    Authors
    Tiago Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🚑 Clinical Field Mappings for Healthcare Systems

    This synthetic dataset provides a wide variety of alternative names for clinical database fields, mapping them to standardized targets for healthcare data normalization.

    Using LLMs, we generated and validated thousands of plausible variations, including misspellings, abbreviations, country-specific nuances, and common real-world typos.

    This dataset is perfect for training models that need to standardize, clean, or map heterogeneous healthcare data schemas into unified, normalized formats.

    Applications include (a baseline sketch follows the list):

    • Data cleaning and ETL pipelines for clinical databases
    • Fine-tuning LLMs for schema matching
    • Clinical data interoperability projects
    • Zero-shot field matching research
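    As a baseline for the schema-matching task this dataset supports (the field names below are hypothetical, not the dataset's actual contents), simple fuzzy matching illustrates both the approach and its limits:

    ```python
    from difflib import get_close_matches

    # Baseline fuzzy matcher for clinical field names; a model trained on
    # this dataset would replace get_close_matches.
    STANDARD_FIELDS = ["patient_id", "date_of_birth", "systolic_bp", "heart_rate"]

    def map_field(raw_name: str) -> str | None:
        normalized = raw_name.strip().lower().replace(" ", "_")
        match = get_close_matches(normalized, STANDARD_FIELDS, n=1, cutoff=0.6)
        return match[0] if match else None

    for raw in ["Patient ID", "dob", "systolicBP", "hr"]:
        print(raw, "->", map_field(raw))
    # 'Patient ID' and 'systolicBP' resolve; 'dob' and 'hr' return None.
    ```

    Short abbreviations such as 'dob' and 'hr' fall below the similarity cutoff, which is precisely the kind of variation a model trained on this dataset is meant to handle.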

    The dataset is machine-generated and validated with LLM feedback loops to ensure high-quality mappings.

  12. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Feb 25, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. http://doi.org/10.5061/dryad.wpzgmsbwj
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one paper...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Manuscript published in Scientific Data with DOI .

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format;

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.

    Description of the main data files

    The `Coauthorship_Networ...
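    The truncated section above describes the main data files. A minimal sketch for loading the GraphML network, assuming the networkx package (the repository itself does not prescribe a loader), and checking it against the stated size:

    ```python
    import networkx as nx

    # Load the co-authorship network and check it against the stated size
    # (38,220 nodes, 507,050 edges).
    G = nx.read_graphml("Coauthorship_Network_AGG.graphml")
    print(G.number_of_nodes(), G.number_of_edges())  # expect 38220, 507050

    # Nodes carry semantic attributes (gender, research fields, ...); the
    # attribute names are not specified in this listing, so inspect first.
    first_node = next(iter(G.nodes(data=True)))
    print(first_node)
    ```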

  13. Preparing a satellite-operator-year panel from UCS and Space-Track data

    • middlebury.figshare.com
    Updated Sep 2, 2023
    Cite
    Akhil Rao; Ethan Berner; Gordon Lewis (2023). Preparing a satellite-operator-year panel from UCS and Space-Track data [Dataset]. http://doi.org/10.57968/Middlebury.23982468.v1
    Available download formats: ZIP
    Dataset updated
    Sep 2, 2023
    Dataset provided by
    Middlebury
    Authors
    Akhil Rao; Ethan Berner; Gordon Lewis
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the data and R code needed to reproduce the dataset UCS-JSpOC-soy-panel-22.csv. This dataset combines and cleans data from the Union of Concerned Scientists and Space-Track.org to create a panel of satellites, operators, and years. This dataset is used in the paper "Oligopoly competition between satellite constellations will reduce economic welfare from orbit use". The final dataset can also be downloaded from the replication files for that paper: https://doi.org/10.57968/Middlebury.23816994.v1

    A "living" version of this repository can be found at: https://github.com/akhilrao/orbital-ownership-data

    Repository structure

    • /UCS data contains Excel and CSV data files from the Union of Concerned Scientists, as well as output files generated from data cleaning. The UCS Satellite Database is available at https://www.ucsusa.org/resources/satellite-database. Historical data was obtained from Dr. Teri Grimwood.
    • /Space-Track data contains JSON data from Space-Track.org, files to help identify operator names for harmonization in UCS_text_cleaner.R, and output generated from cleaning and merging data. API queries to generate the JSON files can be found in json_cleaned_script.R; these queries were run on January 1, 2023 to produce the data used in "Oligopoly competition between satellite constellations will reduce economic welfare from orbit use". One query fragment is restated here for convenience: 33999/OBJECT_TYPE/PAYLOAD/orderby/INTLDES asc/emptyresult/show
    • /Current R scripts contains R scripts to process the data:
      • combined_scripts.R loads and cleans UCS data. It takes the raw CSV files from /UCS data as input and produces UCS_Combined_Data.csv as output.
      • UCS_text_cleaner.R harmonizes various text fields in the UCS data, including operator and owner names. Best efforts were made to ensure correctness and completeness, but some gaps may remain.
      • json_cleaned_script.R loads and cleans Space-Track data, and merges it with the cleaned and combined UCS data.
      • panel_builder.R uses the cleaned and merged files to construct the satellite-operator-year panel dataset with annual satellite histories and operator information. The logic behind the dataset construction approach is described in this blog post: https://akhilrao.github.io/blog//data/2020/08/20/build_stencil_cut/
    • /Output_figures contains figures produced by the scripts. Some are diagnostic, some are just interesting.
    • /Output_data contains the final data outputs.
    • /data-cleaning-notes contains Excel and CSV files used to assist in harmonizing text fields in UCS_text_cleaner.R. They are included here for completeness.

    Creating the dataset

    To reproduce the UCS-JSpOC-soy-panel-22.csv dataset:

    1. Ensure R is installed along with the required packages.
    2. Run the scripts in /Current R scripts in the following order: combined_scripts.R (this will call UCS_text_cleaner.R), then json_cleaned_script.R, then panel_builder.R.
    3. The output file UCS-JSpOC-soy-panel-22.csv, along with several intermediate files used to create it, will be generated in /Output data.

  14. Data and code for "Plastic bag bans and fees reduce harmful bag litter on...

    • openicpsr.org
    Updated Apr 14, 2024
    Cite
    Anna Papp; Kimberly Oremus (2024). Data and code for "Plastic bag bans and fees reduce harmful bag litter on shorelines" [Dataset]. http://doi.org/10.3886/E200661V3
    Available download formats: delimited
    Dataset updated
    Apr 14, 2024
    Dataset provided by
    University of Delaware
    Columbia University
    Authors
    Anna Papp; Kimberly Oremus
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code and data for "Plastic bag bans and fees reduce harmful bag litter on shorelines" by Anna Papp and Kimberly Oremus.

    Please see the included README file for details. This folder includes code and data to fully replicate Figures 1-5. In addition, the folder also includes instructions to rerun data cleaning steps. Last modified: March 6, 2025. For any questions, please reach out to ap3907@columbia.edu.

    Code (replication/code)

    To replicate the main figures, run the file for each figure:
    • 1_figure1.R
    • 1_figure2.R
    • 1_figure3.R
    • 1_figure4.R
    • 1_figure5.R

    Update the home directory to match where the "replication" folder is saved in each file before running it. The code will require you to install packages (see note on versions below).

    To replicate the entire data cleaning pipeline:
    • First download all required data (explained in the Data section below).
    • Run the code in the code/0_setup folder (refer to the separate README file).

    R-Version and Package Versions

    The project was developed and executed using R version 4.0.0 (2024-04-24) on macOS 13.5. Code was developed and main figures were created using the following package versions: data.table 1.14.2, dplyr 1.1.4, readr 2.1.2, tidyr 1.2.0, broom 0.7.12, stringr 1.5.1, lubridate 1.7.9, raster 3.5.15, sf 1.0.7, readxl 1.4.0, cobalt 4.4.1.9002, spdep 1.2.3, ggplot2 3.4.4, PNWColors 0.1.0, grid 4.0.0, gridExtra 2.3, ggpubr 0.4.0, knitr 1.48, zoo 1.8.12, fixest 0.11.2, lfe 2.8.7.1, did 2.1.2, didimputation 0.3.0, DIDmultiplegt 0.1.0, DIDmultiplegtDYN 1.0.15, scales 1.2.1, usmap 0.6.1, tigris 2.0.1, dotwhisker 0.7.4.

    Data

    Processed data files are provided to replicate the main figures. To replicate from raw data, follow the instructions below.

    • Policies (needs to be recreated, or email for a version): Compiled from bagtheban.com/in-your-state/, rila.org/retail-compliance-center/consumer-bag-legislation, baglaws.com, nicholasinstitute.duke.edu/plastics-policy-inventory, and wikipedia.org/wiki/Plastic_bag_bans_in_the_United_States; and massgreen.org/plastic-bag-legislation.html and cawrecycles.org/list-of-local-bag-bans to confirm legislation in Massachusetts and California.
    • TIDES (needs to be downloaded for full replication): Download cleanup data for the United States from Ocean Conservancy (coastalcleanupdata.org/reports). Download files for 2000-2009, 2010-2014, and then each separate year from 2015 until 2023. Save files in the data/tides directory as year.csv (and 2000-2009.csv, 2010-2014.csv). Also download entanglement data for each year (2016-2023) separately into a folder called data/tides/entanglement (each file should be called 'entangled-animals-united-states_YEAR.csv').
    • Shapefiles (needs to be downloaded for full replication): Download shapefiles for processing cleanups and policies. Download county shapefiles from the US Census Bureau; save files in the data/shapefiles directory; the county shapefile should be in a folder called county (files called cb_2018_us_county_500k.shp). Download TIGER Zip Code tabulation areas from the US Census Bureau (through data.gov); save files in the data/shapefiles directory; the zip codes shapefile folder and files should be called tl_2019_us_zcta510.
    • Other: Helper files with US county and state FIPS codes, and lists of US counties and zip codes, are in the data/other directory; all are provided except as follows. Download the zip code list and 2020 IRS population data from United States zip codes and save as uszipcodes.csv in the data/other directory. Download demographic characteristics of zip codes from Social Explorer and save as raw_zip_characteristics.csv in the data/other directory.

    Refer to the .txt files in each data folder to ensure all necessary files are downloaded.

  15. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • paperswithcode.com
    Updated Jul 9, 2024
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
    Available download formats: application/rdfxml, tsv, csv, json, xml, application/rssxml
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented (a toy sketch follows the list):

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
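    A toy pandas sketch of the first two steps; the column names are hypothetical, as the actual case report form fields are not reproduced here:

    ```python
    import pandas as pd

    # Hypothetical case records: a yes/no question and a symptom onset date.
    df = pd.DataFrame({
        "hosp_yn": ["Yes", "", "No", ""],
        "onset_dt": ["2021-03-01", "2030-01-01", "2021-02-15", ""],
    })

    # Step 1: reclassify blank answers as Missing.
    df["hosp_yn"] = df["hosp_yn"].replace("", "Missing")

    # Step 2: logic check on dates; set implausible future onset dates to
    # null pending review with the reporting jurisdiction.
    df["onset_dt"] = pd.to_datetime(df["onset_dt"], errors="coerce")
    df.loc[df["onset_dt"] > pd.Timestamp.today(), "onset_dt"] = pd.NaT
    print(df)
    ```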

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These

  16. Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102...

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Jul 19, 2023
    Cite
    National Statistical Office (NSO) (2023). Integrated Household Panel Survey 2010-2013-2016-2019 (Long-Term Panel, 102 EAs) - Malawi [Dataset]. https://datacatalog.ihsn.org/catalog/8702
    Dataset updated
    Jul 19, 2023
    Dataset authored and provided by
    National Statistical Office (NSO)
    Time period covered
    2010 - 2019
    Area covered
    Malawi
    Description

    Abstract

    The 2016 Integrated Household Panel Survey (IHPS) was launched in April 2016 as part of the Malawi Fourth Integrated Household Survey fieldwork operation. The IHPS 2016 targeted 1,989 households that were interviewed in the IHPS 2013 and that could be traced back to half of the 204 enumeration areas that were originally sampled as part of the Third Integrated Household Survey (IHS3) 2010/11.

    The 2019 IHPS was launched in April 2019 as part of the Malawi Fifth Integrated Household Survey fieldwork operations, targeting the 2,508 households that were interviewed in 2016. The panel sample expanded each wave through the tracking of split-off individuals and the new households that they formed. Available as part of this project is the IHPS 2019 data and the IHPS 2016 data, as well as the rereleased IHPS 2010 and 2013 data including only the subsample of 102 EAs with updated panel weights.

    Additionally, the IHPS 2016 was the first survey that received complementary financial and technical support from the Living Standards Measurement Study - Plus (LSMS+) initiative, which has been established with grants from the Umbrella Facility for Gender Equality Trust Fund, the World Bank Trust Fund for Statistical Capacity Building, and the International Fund for Agricultural Development, and is implemented by the World Bank Living Standards Measurement Study (LSMS) team, in collaboration with the World Bank Gender Group and partner national statistical offices. The LSMS+ aims to improve the availability and quality of individual-disaggregated household survey data and is, at start, a direct response to the World Bank IDA18 commitment to support 6 IDA countries in collecting intra-household, sex-disaggregated household survey data on 1) ownership of and rights to selected physical and financial assets, 2) work and employment, and 3) entrepreneurship - following international best practices in questionnaire design and minimizing the use of proxy respondents while collecting personal information. This dataset is included here.

    Geographic coverage

    National coverage

    Analysis unit

    • Households
    • Individuals
    • Children under 5 years
    • Consumption expenditure commodities/items
    • Communities
    • Agricultural household/ Holder/ Crop

    Universe

    The IHPS 2016 and 2019 attempted to track all IHPS 2013 households stemming from 102 of the original 204 baseline panel enumeration areas, as well as individuals that moved away from the 2013 dwellings between 2013 and 2016, provided they were neither servants nor guests at the time of the IHPS 2013, were projected to be at least 12 years of age, and were known to be residing in mainland Malawi, excluding Likoma Island and institutions such as prisons, police compounds, and army barracks.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A sub-sample of IHS3 2010 sample enumeration areas (EAs) (i.e. 204 EAs out of 768 EAs) was selected prior to the start of the IHS3 fieldwork with the intention to (i) track and resurvey these households in 2013 in accordance with the IHS3 fieldwork timeline and as part of the Integrated Household Panel Survey (IHPS 2013) and (ii) visit a total of 3,246 households in these EAs twice to reduce recall error associated with different aspects of agricultural data collection. At baseline, the IHPS sample was selected to be representative at the national, regional, and urban/rural levels and for each of the following 6 strata: (i) Northern Region - Rural, (ii) Northern Region - Urban, (iii) Central Region - Rural, (iv) Central Region - Urban, (v) Southern Region - Rural, and (vi) Southern Region - Urban. The IHPS 2013 main fieldwork took place during the period of April-October 2013, with residual tracking operations in November-December 2013.

    Given budget and resource constraints, for the IHPS 2016 the number of sample EAs in the panel was reduced to 102 out of the 204 EAs. As a result, the domains of analysis are limited to the national, urban and rural areas. Although the results of the IHPS 2016 cannot be tabulated by region, the stratification of the IHPS by region, urban and rural strata was maintained. The IHPS 2019 tracked all individuals 12 years or older from the 2016 households.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Data Entry Platform: To ensure data quality and timely availability of data, the IHPS 2019 was implemented using the World Bank's Survey Solutions CAPI software. To carry out IHPS 2019, one laptop computer and a wireless internet router were assigned to each team supervisor, and each enumerator had an 8-inch GPS-enabled Lenovo tablet computer provided by the NSO. The use of Survey Solutions allowed for the real-time availability of data, as completed interviews were approved by the Supervisor and synced to the Headquarters server as frequently as possible. While administering the first module of the questionnaire, the enumerators also used their tablets to record the GPS coordinates of the dwelling units. Geo-referenced household locations from the tablets complemented the GPS measurements taken by the Garmin eTrex 30 handheld devices, and these were linked with publicly available geospatial databases to enable the inclusion of a number of geospatial variables - extensive measures of distance (i.e. distance to the nearest market), climatology, soil and terrain, and other environmental factors - in the analysis.

    Data Management: The IHPS 2019 Survey Solutions CAPI data entry application was designed to streamline the data collection process in the field. IHPS 2019 interviews were mainly collected in "sample" mode (assignments generated from headquarters), with a few in "census" mode (new interviews created by interviewers from a template) so that the NSO retained control over the sample. This hybrid approach was necessary to aid the tracking operations: an enumerator could quickly create a tracking assignment while working in areas with poor network connection, where tracking cases could not quickly be received from headquarters.

    The range and consistency checks built into the application were informed by the LSMS-ISA experience with the IHS3 2010/11, IHPS 2013 and IHPS 2016. Prior programming of the data entry application allowed a wide variety of range and consistency checks to be conducted and reported, and potential issues to be investigated and corrected before closing the assigned enumeration area. Headquarters (the NSO management) assigned work to the supervisors based on their regions of coverage, and the supervisors in turn made assignments to the enumerators linked to their supervisor accounts. Work assignments and the syncing of completed interviews took place over a Wi-Fi connection to the IHPS 2019 server. Because the data were available in real time, they were monitored closely throughout the entire data collection period; upon receipt at headquarters, the data were exported to Stata for further consistency checks, data cleaning, and analysis.

    Data Cleaning: The data cleaning process was done in several stages over the course of fieldwork and through preliminary analysis. The first stage was conducted in the field by the field teams, using the error messages generated by the Survey Solutions application whenever a response did not fit the rules for a particular question. For questions that flagged an error, enumerators were expected to record a comment within the questionnaire explaining the reason for the error to their supervisor and confirming that they had double-checked the response with the respondent. Supervisors were expected to sync the enumerator tablets as frequently as possible, both to avoid accumulating questionnaires on the tablets and to enable daily checks of questionnaires. Some supervisors preferred to review completed interviews on the tablets prior to syncing, while still recording notes in the supervisor account and rejecting questionnaires accordingly. The second stage of data cleaning was also done in the field and resulted from the additional error reports generated in Stata, which were sent to the field teams via email or Dropbox. The field supervisors collected the reports for their assignments and, in coordination with the enumerators, reviewed, investigated, and corrected the errors. Due to the quick turnaround in error reporting, call-backs could be conducted while the team was still operating in the EA when required. Corrections to the data were entered in the rejected questionnaires and sent back to headquarters.

    More specifically, the cleaning proceeded as follows. The first stage took place during the interview itself: because CAPI software was used, error messages appeared immediately when the information recorded did not match previously defined rules for that variable - for example, if the education level for a 12-year-old respondent was given as postgraduate. The second stage occurred during the review of the questionnaire by the field supervisor. The Survey Solutions software allows errors to remain in the data if the enumerator does not make a correction; the enumerator can instead write a comment to explain why the data appear to be incorrect - for example, if the previously mentioned 12-year-old was, in fact, a genius who had completed graduate studies. The next stage occurred when the data were transferred to headquarters, where NSO staff would again review the data for errors and verify the comments from the field teams.
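    As a minimal sketch of this kind of within-interview validation rule, using the age/education example above; the field names and age thresholds are illustrative, not the actual Survey Solutions conditions.

    ```python
    from typing import Optional

    # Minimum plausible ages per education level; illustrative values only.
    MIN_AGE = {"primary": 5, "secondary": 11, "undergraduate": 16, "postgraduate": 20}

    def education_age_check(age: int, education: str) -> Optional[str]:
        """Return an error message if the education level is implausible
        for the respondent's age, mimicking an in-interview range check."""
        threshold = MIN_AGE.get(education)
        if threshold is not None and age < threshold:
            return (f"Age {age} with education '{education}': flag for an "
                    "enumerator comment and supervisor review.")
        return None

    print(education_age_check(12, "postgraduate"))
    ```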

  17. Household Survey on Information and Communications Technology 2014 - West Bank and Gaza

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Oct 14, 2021
    Cite
    Palestinian Central Bureau of Statistics (2021). Household Survey on Information and Communications Technology 2014 - West Bank and Gaza [Dataset]. https://datacatalog.ihsn.org/catalog/9840
    Explore at:
    Dataset updated
    Oct 14, 2021
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (http://pcbs.gov.ps/)
    Time period covered
    2014
    Area covered
    Gaza, West Bank, Palestine, Gaza Strip
    Description

    Abstract

    Within the frame of PCBS' efforts to provide official Palestinian statistics on the different aspects of life in Palestinian society, and given the widespread use of computers, the Internet and mobile phones among the Palestinian people and the important role these technologies may play in spreading knowledge and culture and in shaping public opinion, PCBS conducted the Household Survey on Information and Communications Technology, 2014.

    The main objective of this survey is to provide statistical data on information and communications technology in Palestine, in addition to providing data on the following:
    - Prevalence of computers and access to the Internet.
    - Penetration and purpose of technology use.

    Geographic coverage

    Palestine (West Bank and Gaza Strip), type of locality (urban, rural, refugee camps) and governorate.

    Analysis unit

    • Household.
    • Persons 10 years and over.

    Universe

    All Palestinian households and individuals whose usual place of residence was in Palestine in 2014, with a focus on persons aged 10 years and over.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Frame: The sampling frame consists of a list of enumeration areas adopted in the Population, Housing and Establishments Census of 2007. Each enumeration area has an average size of about 124 households. These were used in the first phase as preliminary sampling units in the process of selecting the survey sample.

    Sample Size: The total sample size of the survey was 7,268 households, of which 6,000 responded.

    Sample Design: The sample is a stratified, clustered, systematic random sample. The design comprised three phases:

    Phase I: Random sample of 240 enumeration areas.
    Phase II: Selection of 25 households from each enumeration area selected in phase one, using systematic random selection.
    Phase III: Selection of one individual (aged 10 years or more) in the field from each selected household; Kish tables were used to ensure random selection.

    Sample Strata: Distribution of the sample was stratified by:
    1. Governorate (16 governorates, J1).
    2. Type of locality (urban, rural and camps).
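    As a sketch of the phase II step under this design, the snippet below draws 25 households from a synthetic EA listing by systematic random selection; the listing and seed are invented for illustration and do not reproduce the PCBS selection.

    ```python
    import random

    def systematic_sample(units, n):
        """Systematic random sampling: a random start in [0, k) followed by
        every k-th unit, where k = len(units) / n is the sampling interval."""
        k = len(units) / n
        start = random.uniform(0, k)
        return [units[int(start + i * k)] for i in range(n)]

    random.seed(2014)  # reproducibility for the example only
    ea_listing = [f"HH-{i:03d}" for i in range(1, 125)]  # ~124 households per EA
    households = systematic_sample(ea_listing, 25)
    print(len(households), households[:5])
    ```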

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The survey questionnaire consists of identification data, quality controls and three main sections. Section I: Data on household members, including identification fields and the demographic and social characteristics of household members, such as the relationship of individuals to the head of household, sex, date of birth and age.

    Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

    Section III: Data on persons (aged 10 years and over) about computer use, access to the Internet and possession of a mobile phone.

    Cleaning operations

    Preparation of Data Entry Program: This stage included preparation of the data entry programs using a Microsoft Access package, defining data entry control rules to avoid errors, and adding validation inquiries to examine the data after they had been captured electronically.

    Data Entry: The data entry process started on the 8th of May 2014 and ended on the 23rd of June 2014. The data entry took place at the main PCBS office and in field offices using 28 data clerks.

    Editing and Cleaning procedures: Several measures were taken to avoid non-sampling errors. These included editing of questionnaires before data entry to check field errors, using a data entry application that does not allow mistakes during the process of data entry, and then examining the data by using frequency and cross tables. This ensured that data were error free; cleaning and inspection of the anomalous values were conducted to ensure harmony between the different questions on the questionnaire.
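    A minimal sketch of the frequency and cross-table checks mentioned above, using pandas; the column names and records are invented for illustration (the actual entry and checking were done in the Access application).

    ```python
    import pandas as pd

    # Invented records mimicking entered questionnaires.
    df = pd.DataFrame({
        "age": [34, 12, 9, 45, 27],
        "uses_internet": ["yes", "yes", "yes", "no", "yes"],
    })

    # Frequency table: surface unexpected codes or missing values.
    print(df["uses_internet"].value_counts(dropna=False))

    # Cross table: respondents under 10 should not appear in the
    # persons (10 years and over) module, so any such row is an anomaly.
    print(pd.crosstab(df["age"] >= 10, df["uses_internet"]))
    ```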

    Response rate

    Response Rates: 79%

    Sampling error estimates

    There are many aspects to the concept of data quality, ranging from the initial planning of the survey to the dissemination of the results and how well users understand and use the data. Three components of the quality of statistics are considered here: accuracy, comparability, and quality control procedures.

    Checks on data accuracy cover many aspects of the survey and include statistical errors due to the use of a sample, non-statistical errors resulting from field workers or survey tools, and response rates and their effect on estimations. This section includes:

    Statistical Errors: Data of this survey may be affected by statistical errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences can be expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators.

    Variance calculations revealed that there is no problem in disseminating results nationally or regionally (the West Bank, Gaza Strip), but some indicators show high variance by governorate, as noted in the tables of the main report.
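    As a sketch of that kind of dissemination check, the snippet below computes an indicator's estimate and coefficient of variation by governorate and flags high-variance cells. The records, the simple-random-sampling standard error, and the 30% cut-off are all illustrative; the survey itself used design-based variance estimation.

    ```python
    import pandas as pd

    # Invented person-level records: governorate and a binary indicator.
    df = pd.DataFrame({
        "governorate": ["Ramallah"] * 400 + ["Jericho"] * 30,
        "uses_internet": [1, 0] * 200 + [1] * 6 + [0] * 24,
    })

    summary = df.groupby("governorate")["uses_internet"].agg(estimate="mean", n="count")
    # Simple-random-sampling standard error (illustrative only).
    summary["se"] = (summary["estimate"] * (1 - summary["estimate"]) / summary["n"]) ** 0.5
    summary["cv"] = summary["se"] / summary["estimate"]
    summary["high_variance"] = summary["cv"] > 0.30  # illustrative cut-off
    print(summary)
    ```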

    Non-Statistical Errors: Non-statistical errors are possible at all stages of the project, during data collection or processing. These are referred to as non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, with practical and theoretical training during the training course. Training manuals were provided for each section of the questionnaire, along with practical exercises in class and instructions on how to approach respondents to reduce refusals. Data entry staff were trained on the data entry program, which was tested before the data entry process started.


    The sources of non-statistical errors can be summarized as follows: 1. Some households were not at home and could not be interviewed, and some households refused to be interviewed. 2. In isolated cases, errors occurred due to the way interviewers asked the questions, or because respondents misunderstood some of the questions.

  18. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning was applied to the LSC abstracts of Version 1 (Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1). Details of the cleaning procedure are explained in Step 6.

    Getting Started

    This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the meaning of research texts, and is made available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: the list of authors of the paper.
    2. Title: the title of the paper.
    3. Abstract: the abstract of the paper.
    4. Categories: one or more categories from the list of categories [2]; the full list is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: one or more research areas from the list of research areas [3]; the full list is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4].
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4].

    The corpus was collected online in July 2018 and contains citation counts from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online. The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R. The LSC was collected as TXT files; all documents were imported into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category. As our research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts. Medicine-related publications in particular use 'structured abstracts', divided into sections with distinct headings such as introduction, aim, objective, method, result and conclusion. The tool used for extracting abstracts concatenates these section headings with the first word of the section, yielding words such as 'ConclusionHigher' and 'ConclusionsRT'. Such words were detected by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words; for instance, 'ConclusionHigher' was split into 'Conclusion' and 'Higher'. The section headings found in such abstracts are: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), and Implications for health and nursing policy.

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts. After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's word count [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1. Journals and conferences can place a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name, and the tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we removed copyright notices, conference names, journal names, authors' rights, licences and permission policies identified by sampling abstracts in LSC version 1.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts. The cleaning procedure in the previous step left some abstracts below the minimum length criterion (30 words); 474 such texts were removed.

    Step 8: Saving the Dataset into CSV Format. Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, with the abstract, title, list of authors, list of categories, list of research areas, and times cited recorded in fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
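    To make Steps 4 and 5 concrete, here is a small sketch (in Python, whereas the corpus itself was processed in R) that splits a known section heading off the word fused to it and applies the 30-500 word length criterion; the heading list is abbreviated and the sample text is invented.

    ```python
    import re

    # Subset of the section headings listed in Step 4 (longest first,
    # so "Conclusions" is tried before "Conclusion"); the full list is longer.
    HEADINGS = ["Background", "Methods", "Method", "Results", "Conclusions", "Conclusion"]

    # Match a heading fused to a following capitalised word, e.g. "ConclusionHigher".
    FUSED = re.compile(r"\b(" + "|".join(HEADINGS) + r")([A-Z][a-z]+)")

    def split_fused_headings(text: str) -> str:
        return FUSED.sub(r"\1 \2", text)

    def within_length_criteria(text: str, lo: int = 30, hi: int = 500) -> bool:
        """Whitespace word count, approximating the Microsoft Word rule."""
        return lo <= len(text.split()) <= hi

    abstract = "ConclusionHigher doses were associated with faster recovery."
    print(split_fused_headings(abstract))
    print(within_length_criteria(abstract))  # False: far shorter than 30 words
    ```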

  19. Drainage Gully Cleaning Programme DCC

    • data.smartdublin.ie
    • gimi9.com
    Updated Jun 15, 2023
    Cite
    (2023). Drainage Gully Cleaning Programme DCC [Dataset]. https://data.smartdublin.ie/dataset/drainage-gully-cleaning-programme
    Explore at:
    Dataset updated
    Jun 15, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Schedule and monitor of gully cleaning for Dublin City. These datasets show the gully cleaning statistics from 2004 to September 14th 2011. They consist of six attached Excel spreadsheets with the datasets from the Daily Returns section of the Gully Cleaning Application and one dataset from the Gully Repairs section of the gully application. They are divided into the five Dublin City Council administrative areas: Central, North Central, North West, South East, and South Central. There is also a dataset containing details of all gully repairs pending (all areas included).

    The datasets cover all Daily Returns since the gully cleaning programme commenced in 2004. Daily Returns are lists of the work that the gully cleaning crews carry out each day. All gullies on a street are cleaned where possible; a list of omissions is recorded where gullies could not be cleaned due to lack of access or other reasons, and gullies requiring repair are noted. The Daily Returns datasets record only the number of gullies requiring repair on a particular street, not the details of the repair.

    Information in the fields is as follows:

    • Road name: street name or laneway, denoted by the nearest house or lamp post etc. A road name followed by the letters 'PL' means that the road, or a section of it, has been placed on the priority list due to a history of flooding or a higher potential of the gully blocking due to location etc. A road name followed by zeros in the gullies inspected and gullies cleaned columns very probably means the road was travelled during heavy rain as part of the flood zones and no flooding was noted along this road at the time of travelling. A road name followed by lower-case road names denotes a road that falls into more than one gully cleaning area; the lower-case names give the starting and finishing points for the crews working in that area, e.g. 'Howth Road All Saints Rd-Fairview' denotes that the section of the Howth Road between All Saints Road and Fairview is within the area that the crew was asked to work in.
    • Gullies inspected: number of gullies inspected along the road/lane.
    • Gullies cleaned: number of gullies cleaned from the total inspected.
    • Gully omissions: number of gullies missed, i.e. unable to put boom or shovel into the gully pot due to parked cars, unable to lift grids, hoarding over gullies etc.
    • Gully repairs: number of repairs based on inspections; note that not all repairs prevent the gully from being cleaned.
    • Comments box: used to provide any additional information that may be of benefit; results of work carried out by the mini jet are recorded in this box.
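    Given those conventions, below is a small sketch of how the spreadsheets might be parsed to flag priority-list roads and probable flood-zone drive-bys; the column names follow the field descriptions above, but the rows are invented.

    ```python
    import pandas as pd

    # Invented rows following the described layout; the real data are in
    # six Excel spreadsheets, one per administrative area.
    df = pd.DataFrame({
        "road_name": ["Howth Road All Saints Rd-Fairview", "Main Street PL", "Oak Lane"],
        "gullies_inspected": [12, 8, 0],
        "gullies_cleaned": [11, 8, 0],
    })

    # A 'PL' suffix marks roads on the priority list (history of flooding etc.).
    df["priority_list"] = df["road_name"].str.strip().str.endswith(" PL")

    # All-zero counts most likely indicate a flood-zone check during heavy rain.
    df["flood_zone_check"] = (df["gullies_inspected"] == 0) & (df["gullies_cleaned"] == 0)
    print(df)
    ```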

  20. OpenCon Application Data

    • figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    OpenCon 2015; SPARC; Right to Research Coalition (2023). OpenCon Application Data [Dataset]. http://doi.org/10.6084/m9.figshare.1512496.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCon 2015; SPARC; Right to Research Coalition
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenCon 2015 Application Open Data

    The purpose of this document is to accompany the public release of data collected from OpenCon 2015 applications.

    Download & Technical Information

    The data can be downloaded in CSV format from GitHub here: https://github.com/RightToResearch/OpenCon-2015-Application-Data
    The file uses UTF-8 encoding, comma as field delimiter, quotation marks as text delimiter, and no byte order mark.
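    Given those file conventions, the CSV can be loaded as below; the filename is a placeholder for whatever the file is called in the GitHub repository.

    ```python
    import pandas as pd

    # Placeholder filename; download the CSV from the GitHub repository above.
    df = pd.read_csv(
        "opencon-2015-application-data.csv",
        encoding="utf-8",  # no byte order mark, per the release notes
        sep=",",           # comma as field delimiter
        quotechar='"',     # quotation marks as text delimiter
    )
    print(df.shape)
    ```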

    License and Requests

    This data is released to the public for free and open use under a CC0 1.0 license. We have a couple of requests for anyone who uses the data. First, we’d love it if you would let us know what you are doing with it, and share back anything you develop with the OpenCon community (#opencon / @open_con). Second, it would also be great if you would include a link to the OpenCon 2015 website (www.opencon2015.org) wherever the data is used. You are not obligated to do any of this, but we’d appreciate it!

    Data Fields

    Unique ID

    This is a unique ID assigned to each applicant. Numbers were assigned using a random number generator.

    Timestamp

    This was the timestamp recorded by Google Forms. Timestamps are in EDT (Eastern U.S. Daylight Time). Note that the application process officially began at 1:00pm EDT on June 1 and ended at 6:00am EDT on June 23. Some applications have timestamps later than this, due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form.

    Gender

    Mandatory. Choose one from list or fill-in other. Options provided: Male, Female, Other (fill in).

    Country of Nationality

    Mandatory. Choose one option from list.

    Country of Residence

    Mandatory. Choose one option from list.

    What is your primary occupation?

    Mandatory. Choose one from list or fill-in other. Options provided: Undergraduate student; Masters/professional student; PhD candidate; Faculty/teacher; Researcher (non-faculty); Librarian; Publisher; Professional advocate; Civil servant / government employee; Journalist; Doctor / medical professional; Lawyer; Other (fill in).

    Select the option below that best describes your field of study or expertise

    Mandatory. Choose one option from list.

    What is your primary area of interest within OpenCon’s program areas?

    Mandatory. Choose one option from list. Note: for the first approximately 24 hours the options were listed in this order: Open Access, Open Education, Open Data. After that point, we set the form to randomize the order, and noticed an immediate shift in the distribution of responses.

    Are you currently engaged in activities to advance Open Access, Open Education, and/or Open Data?

    Mandatory. Choose one option from list.

    Are you planning to participate in any of the following events this year?

    Optional. Choose all that apply from list. Multiple selections separated by semi-colon.

    Do you have any of the following skills or interests?

    Mandatory. Choose all that apply from list or fill-in other. Multiple selections separated by semi-colon. Options provided: Coding; Website Management / Design; Graphic Design; Video Editing; Community / Grassroots Organizing; Social Media Campaigns; Fundraising; Communications and Media; Blogging; Advocacy and Policy; Event Logistics; Volunteer Management; Research about OpenCon's Issue Areas; Other (fill-in).
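    Because multi-select answers are stored as semicolon-separated strings, analysis usually starts by expanding them to one row per selection; below is a sketch with invented sample values (only the field names follow the list above).

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "unique_id": [1, 2],
        "skills": ["Coding; Graphic Design", "Blogging; Advocacy and Policy; Coding"],
    })

    # One row per (applicant, skill) pair, with whitespace trimmed.
    skills_long = (
        df.assign(skill=df["skills"].str.split(";"))
          .explode("skill")
          .assign(skill=lambda d: d["skill"].str.strip())
          .loc[:, ["unique_id", "skill"]]
    )
    print(skills_long)
    ```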

    Data Collection & Cleaning

    This data consists of information collected from people who applied to attend OpenCon 2015. In the application form, questions that would be released as Open Data were marked with a caret (^) and applicants were asked to acknowledge before submitting the form that they understood that their responses to these questions would be released as such. The questions we released were selected to avoid any potentially sensitive personal information, and to minimize the chances that any individual applicant can be positively identified. Applications were formally collected during a 22 day period beginning on June 1, 2015 at 13:00 EDT and ending on June 23 at 06:00 EDT. Some applications have timestamps later than this date, and this is due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form. Applications were collected using a Google Form embedded at http://www.opencon2015.org/attend, and the shortened bit.ly link http://bit.ly/AppsAreOpen was promoted through social media. The primary work we did to clean the data focused on identifying and eliminating duplicates. We removed all duplicate applications that had matching e-mail addresses and first and last names. We also identified a handful of other duplicates that used different e-mail addresses but were otherwise identical. In cases where duplicate applications contained any different information, we kept the information from the version with the most recent timestamp. We made a few minor adjustments in the country field for cases where the entry was obviously an error (for example, electing a country listed alphabetically above or below the one indicated elsewhere in the application). We also removed one potentially offensive comment (which did not contain an answer to the question) from the Gender field and replaced it with “Other.”
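    A sketch of the duplicate-removal rule described above (match on e-mail address plus first and last name, keeping the most recent timestamp); the column names are assumptions for illustration, since the released file's exact headers aren't quoted here.

    ```python
    import pandas as pd

    # Assumed column names; invented records.
    df = pd.DataFrame({
        "email": ["a@example.org", "a@example.org", "b@example.org"],
        "first_name": ["Ada", "Ada", "Ben"],
        "last_name": ["Lovelace", "Lovelace", "Okafor"],
        "timestamp": pd.to_datetime(
            ["2015-06-02 10:00", "2015-06-20 09:30", "2015-06-05 14:00"]
        ),
    })

    deduped = (
        df.sort_values("timestamp")  # oldest first ...
          .drop_duplicates(subset=["email", "first_name", "last_name"], keep="last")
    )  # ... so keep="last" retains the most recent submission
    print(deduped)
    ```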

    About OpenCon

    OpenCon 2015 is the student and early career academic professional conference on Open Access, Open Education, and Open Data and will be held on November 14-16, 2015 in Brussels, Belgium. It is organized by the Right to Research Coalition, SPARC (The Scholarly Publishing and Academic Resources Coalition), and an Organizing Committee of students and early career researchers from around the world. The meeting will convene students and early career academic professionals from around the world and serve as a powerful catalyst for projects led by the next generation to advance OpenCon's three focus areas—Open Access, Open Education, and Open Data. A unique aspect of OpenCon is that attendance at the conference is by application only, and the majority of participants who apply are awarded travel scholarships to attend. This model creates a unique conference environment where the most dedicated and impactful advocates can attend, regardless of where in the world they live or their access to travel funding. The purpose of the application process is to conduct these selections fairly. This year we were overwhelmed by the quantity and quality of applications received, and we hope that by sharing this data, we can better understand the OpenCon community and the state of student and early career participation in the Open Access, Open Education, and Open Data movements.

    Questions

    For inquiries about the OpenCon 2015 Application data, please contact Nicole Allen at nicole@sparc.arl.org.

Data Science Career Opportunities (USA)


License

CC0

Who Can Use It

This dataset is an excellent resource for: * Aspiring Data Scientists and Machine Learning Engineers: To sharpen their data cleaning, EDA, and model deployment skills. * Educators and Curriculum Developers: To inform and guide the development of relevant data science and analytics courses based on real-world job market demands. * Job Seekers: To understand the current landscape of data science roles, required skills, and potential salary ranges. * Researchers and Analysts: To glean insights into labour market trends in the data science domain. * Human Resources Professionals: To benchmark job roles, skill requirements, and compensation within the industry.

Dataset Name Suggestions

  • Indeed US Data Science Job Insights
  • US Data Science Job Market Analysis
  • Data Professional Job Postings (Indeed USA)
  • Data Science Career Opportunities (USA)

Attributes

Original Data Source: Data Science Job Postings (Indeed USA)
