21 datasets found
  1. NYC Yellow Taxi Trip Data

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Elemento (2021). NYC Yellow Taxi Trip Data [Dataset]. https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data
    Explore at:
    Available download formats: zip (1915626894 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Elemento
    License

    https://www.usa.gov/government-works/

    Area covered
    New York
    Description

    Context

    New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs, and it is freely available to download from its official website. You can access it here. Now, the TLC primarily keeps and manages data for 4 different types of vehicles:
    • Yellow Taxi: Yellow Medallion Taxicabs: These are the famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
    • Green Taxi: Street Hail Livery: The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
    • For-Hire Vehicles (FHVs): FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

    Complementary Kernel

    • I have made a Kernel especially for this dataset that uses Clustering, Regression, and Time-Series techniques. You can check it out here.

    Important Points

    • This dataset covers only the Yellow Taxi data, for January 2015 and January to March 2016.
    • If you go to the NYC TLC website and download any of the CSV files, you will find that they use a different format. This is because the TLC regularly adds more data and updates the existing data.
    • One of the key changes is that, instead of providing pickup and dropoff coordinates, the TLC has divided NYC into indexed regions, and the CSV files now provide those region indices.
    • For this reason, I built this dataset from the previous version of the CSV files, which lets me practice clustering alongside time-series analysis.
    • If you want to leave out the clustering part, just go to their website and download the new CSV files.

    Attributes

    ...

    Field Name : Description
    • VendorID : A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies; 2 = VeriFone Inc.
    • tpep_pickup_datetime : The date and time when the meter was engaged.
    • tpep_dropoff_datetime : The date and time when the meter was disengaged.
    • Passenger_count : The number of passengers in the vehicle. This is a driver-entered value.
    • Trip_distance : The elapsed trip distance in miles reported by the taximeter.
    • Pickup_longitude : Longitude where the meter was engaged.
    • Pickup_latitude : Latitude where the meter was engaged.
    • RateCodeID : The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride
    • Store_and_fwd_flag : This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip
    • Dropoff_longitude : Longitude where the meter was disengaged.
    • Dropoff_latitude : Latitude where the meter was disengaged.
    • Payment_type : A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip
    • Fare_amount : The time-and-distance fare calculated by the meter.
    • Extra : Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
    • MTA_tax : $0.50 MTA tax that is automatically triggered based on the metered rate in use.
    • Improvement_surcharge : $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
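
    The numeric codes in the attribute list above (RateCodeID, Payment_type) are easier to read once mapped back to their labels. The following is a minimal sketch using only the Python standard library. The two sample rows are made up for illustration, and the header spelling is taken from the attribute list; the actual CSV files may use different casing.

```python
import csv
import io

# Code tables copied from the attribute descriptions.
RATE_CODES = {
    1: "Standard rate", 2: "JFK", 3: "Newark",
    4: "Nassau or Westchester", 5: "Negotiated fare", 6: "Group ride",
}
PAYMENT_TYPES = {
    1: "Credit card", 2: "Cash", 3: "No charge",
    4: "Dispute", 5: "Unknown", 6: "Voided trip",
}

# Hypothetical rows for illustration only; not taken from the dataset.
SAMPLE_CSV = """VendorID,RateCodeID,Payment_type,Fare_amount
1,1,2,9.5
2,2,1,52.0
"""


def decode_rows(csv_text):
    """Return each CSV row as a dict with readable rate and payment labels added."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["rate"] = RATE_CODES.get(int(row["RateCodeID"]), "Unrecognized")
        row["payment"] = PAYMENT_TYPES.get(int(row["Payment_type"]), "Unrecognized")
        rows.append(row)
    return rows


decoded = decode_rows(SAMPLE_CSV)
for row in decoded:
    print(row["Fare_amount"], row["rate"], row["payment"])
```

    Codes outside the documented range fall back to "Unrecognized" rather than raising, which is a design choice of this sketch and not something the dataset description specifies.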
  2. 2023 Yellow Taxi Trip Data

    • catalog.data.gov
    • data.cityofnewyork.us
    Updated Jul 20, 2024
    + more versions
    Cite
    data.cityofnewyork.us (2024). 2023 Yellow Taxi Trip Data [Dataset]. https://catalog.data.gov/dataset/2023-yellow-taxi-trip-data
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    data.cityofnewyork.us
    Description

    These records are generated from the trip record submissions made by yellow taxi Technology Service Providers (TSPs). Each row represents a single trip in a yellow taxi. The trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off taxi zone locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

  3. Taxi Trips (2013-2023)

    • data.cityofchicago.org
    • catalog.data.gov
    csv, xlsx, xml
    Updated Feb 7, 2024
    Cite
    City of Chicago (2024). Taxi Trips (2013-2023) [Dataset]. https://data.cityofchicago.org/Transportation/Taxi-Trips-2013-2023-/wrvz-psew
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    City of Chicago
    Description

    This dataset ends with 2023. Please see the Featured Content link below for the dataset that starts in 2024.

    Taxi trips from 2013 to 2023 reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes.

    Due to the data reporting process, not all trips are reported but the City believes that most are.
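
    The 15-minute rounding mentioned above limits how precisely trip start times and durations can be recovered. The sketch below shows one way such rounding could work. The description does not say how ties are broken, so rounding exact midpoints up is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

QUARTER_HOUR_S = 15 * 60


def round_to_quarter_hour(ts):
    """Round a datetime to the nearest 15-minute mark (assumed: midpoints round up)."""
    seconds_past_hour = ts.minute * 60 + ts.second
    rounded = ((seconds_past_hour + QUARTER_HOUR_S // 2) // QUARTER_HOUR_S) * QUARTER_HOUR_S
    hour_start = ts.replace(minute=0, second=0, microsecond=0)
    # timedelta handles the carry into the next hour or day.
    return hour_start + timedelta(seconds=rounded)


print(round_to_quarter_hour(datetime(2023, 5, 1, 10, 7, 29)))
print(round_to_quarter_hour(datetime(2023, 5, 1, 23, 53, 0)))
```

    Two trips that start several minutes apart can share the same rounded timestamp, so analyses of this dataset are better suited to aggregate counts than to fine-grained timing.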

  4. Uber NYC for-hire vehicles trip data (2021)

    • kaggle.com
    zip
    Updated Feb 2, 2023
    Cite
    shuheng_mo (2023). Uber NYC for-hire vehicles trip data (2021) [Dataset]. https://www.kaggle.com/datasets/shuhengmo/uber-nyc-forhire-vehicles-trip-data-2021
    Explore at:
    Available download formats: zip (4539471170 bytes)
    Dataset updated
    Feb 2, 2023
    Authors
    shuheng_mo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    In New York City, all taxi vehicles are managed by the TLC (Taxi and Limousine Commission). Here is a brief description of the TLC:

    The New York City Taxi and Limousine Commission (TLC), created in 1971, is the agency responsible for licensing and regulating New York City's Medallion (Yellow) taxi cabs, for-hire vehicles (community-based liveries, black cars and luxury limousines), commuter vans, and paratransit vehicles. The Commission's Board consists of nine members, eight of whom are unsalaried Commissioners. The salaried Chair/ Commissioner presides over regularly scheduled public commission meetings and is the head of the agency, which maintains a staff of approximately 600 TLC employees. Over 200,000 TLC licensees complete approximately 1,000,000 trips each day. To operate for hire, drivers must first undergo a background check, have a safe driving record, and complete 24 hours of driver training. TLC-licensed vehicles are inspected for safety and emissions at TLC's Woodside Inspection Facility.

    The NYC TLC has now released its Trip Record data to the public for research and study purposes. There are three main taxi types in NYC:
    • Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (street hail), but now they may also be hailed using an e-hail app like Curb or Arro. Yellow taxis are the only vehicles permitted to respond to a street hail from a passenger in all five boroughs.
    • Green taxis, also known as boro taxis and street-hail liveries, were introduced in August of 2013 to improve taxi service and availability in the boroughs. Green taxis may respond to street hails, but only in the areas indicated in green on the map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
    • FHV data includes trip data from high-volume for-hire vehicle bases (bases for companies dispatching 10,000+ trips per day, meaning Uber, Lyft, Via, and Juno), community livery bases, luxury limousine bases, and black car bases.

    Because Uber is one of the biggest ride-hailing service providers, its trip records are included in the High Volume For-Hire Vehicle Trip Records.

    Based on this dataset, there are some business goals we want to achieve to improve Uber's ride-hailing service:
    • Exploratory data analysis: study fhvhv_tripdata_2021 and identify the underlying trip patterns in 2021.
    • Based on fhvhv_tripdata_2021 and the weather data, build a model to predict peak footfall.
    • Explore Uber's user profile in NYC (which orders are urgent, and which users should be given higher priority?).

    Some useful tips about this dataset:
    • The for-hire vehicle trip data files are named like fhvhv_tripdata_2021-0X.parquet.
    • For column descriptions of the trip data, refer to data_dictionary_trip_records_hvfhs.pdf.
    • The taxi_zones folder contains the geospatial data of NYC taxi zones (geopandas would be helpful).
    • taxi_zone_lookup.csv stores the taxi zones' zip codes and other relevant information.
    • nyc 2021-01-01 to 2021-12-31.csv records the weather data for 2021.
    • taxi+_zone_lookup.csv stores the zone information for all taxis.
    • Data files ending in .parquet can be read with the pyarrow package and converted to a Pandas DataFrame.

    If you find this dataset helpful, please up-vote, and more high-quality datasets will be published in the future! ❤️

  5. Taxi Data Set

    • kaggle.com
    Updated Jul 28, 2023
    Cite
    Mick Hirsh (2023). Taxi Data Set [Dataset]. https://www.kaggle.com/datasets/mickhirsh/taxi-data-set
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mick Hirsh
    Description

    First, credit goes to Raviiloveyou, who created the original Taxi trip fare predictor data set. This version modifies that set to include taxi fares from Philadelphia, PA. The cost calculations in the dataset have been updated to use the following taxi fares:
    • First 1/10 mile (flag drop) or fraction thereof: $2.70
    • Each additional 1/10 mile or fraction thereof: $0.25
    • Each 37.6 seconds of wait time: $0.25
    The dataset also includes the speed of the taxis in KPH (kilometers per hour).

    The columns are the following:

    Trip Duration in seconds (part of the original data set)

    Trip Duration in minutes

    Trip Duration in hours

    Distance Traveled in kilometers (part of the original data set)

    KPH: speed of the taxi in kilometers per hour

    Wait Time Cost: $0.25 for each 37.6 seconds of wait time, where wait time is the taxi time used to get the person to the location

    Distance Cost: $0.25 for each additional 1/10 mile (0.1 mile = 0.160934 km) or fraction thereof

    Fare w Flag: the starting cost of $2.70 added to Wait Time Cost plus Distance Cost

    TIP: how much money the taxi driver received as a tip for the trip (part of the original data set)

    Miscellaneous fees: part of the original data set

    Total Fare New: the total cost of the trip

    Num of passengers: the number of passengers. Note that there is no additional cost per passenger for Philadelphia, PA taxis.

    surge applied: (part of the original data set)
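
    The fare rules listed in this description can be written as a short function. This is a sketch of the stated tariff only, not necessarily the calculation the dataset author used. It assumes that the flag drop covers the first 1/10 mile, that distance is charged per started 1/10 mile ("or fraction thereof"), and that wait time is charged per completed 37.6-second interval. Amounts are kept in integer cents to avoid floating-point drift, and the function name is hypothetical.

```python
import math

KM_PER_TENTH_MILE = 0.160934  # conversion quoted in the dataset description
FLAG_DROP_CENTS = 270         # first 1/10 mile or fraction thereof
UNIT_CENTS = 25               # per additional 1/10 mile, and per wait interval
WAIT_INTERVAL_S = 37.6


def philadelphia_fare_cents(distance_km, wait_seconds):
    """Fare in cents under the stated tariff (assumptions noted in the lead-in)."""
    tenths = math.ceil(distance_km / KM_PER_TENTH_MILE)   # started 1/10-mile units
    extra_tenths = max(0, tenths - 1)                     # first unit is in the flag drop
    wait_units = math.floor(wait_seconds / WAIT_INTERVAL_S)  # completed intervals only
    return FLAG_DROP_CENTS + UNIT_CENTS * (extra_tenths + wait_units)


# 1 km is about 6.2 tenths of a mile, so 7 started units, 6 of them extra;
# 100 s of waiting contains 2 completed 37.6 s intervals.
print(philadelphia_fare_cents(1.0, 100) / 100)
```

    If the dataset's Fare w Flag column differs from this result for a given row, the most likely causes are the rounding assumptions above.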

  6. Cab Services Drivers Info

    • kaggle.com
    zip
    Updated Mar 31, 2024
    Cite
    Akash (2024). Cab Services Drivers Info [Dataset]. https://www.kaggle.com/datasets/akashpawar10/cab-services-drivers-info
    Explore at:
    Available download formats: zip (193328 bytes)
    Dataset updated
    Mar 31, 2024
    Authors
    Akash
    Description

    Recruiting and retaining drivers is seen by industry watchers as a tough battle for XYZCab. Churn among drivers is high and it’s very easy for drivers to stop working for the service on the fly or jump to Uber depending on the rates.

    As the companies get bigger, the high churn could become a bigger problem. To find new drivers, XYZCab is casting a wide net, including people who don’t have cars for jobs. But this acquisition is really costly. Losing drivers frequently impacts the morale of the organization and acquiring new drivers is more expensive than retaining existing ones.

    You are working as a data scientist with the Analytics Department of XYZCab, focused on driver team attrition. You are provided with the monthly information for a segment of drivers for 2019 and 2020 and tasked to predict whether a driver will be leaving the company or not based on their attributes like
    • Demographics (city, age, gender etc.)
    • Tenure information (joining date, Last Date)
    • Historical data regarding the performance of the driver (Quarterly rating, Monthly business acquired, grade, Income)

    Column Profiling:
    1. MMMM-YY : Reporting Date (Monthly)
    2. Driver_ID : Unique id for drivers
    3. Age : Age of the driver
    4. Gender : Gender of the driver – Male : 0, Female: 1
    5. City : City Code of the driver
    6. Education_Level : Education level – 0 for 10+, 1 for 12+, 2 for graduate
    7. Income : Monthly average Income of the driver
    8. Date Of Joining : Joining date for the driver
    9. LastWorkingDate : Last date of working for the driver
    10. Joining Designation : Designation of the driver at the time of joining
    11. Grade : Grade of the driver at the time of reporting
    12. Total Business Value : The total business value acquired by the driver in a month (negative business indicates cancellation/refund or car EMI adjustments)
    13. Quarterly Rating : Quarterly rating of the driver: 1,2,3,4,5 (higher is better)

  7. Taxi Trips

    • data.sfgov.org
    • catalog.data.gov
    csv, xlsx, xml
    Updated Apr 8, 2025
    Cite
    (2025). Taxi Trips [Dataset]. https://data.sfgov.org/Transportation/Taxi-Trips/m8hk-2ipk
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Apr 8, 2025
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    A. SUMMARY This dataset contains information on taxi trips, including pickup location, destination, and fare. Additional fields have been integrated into the raw data through automated and manual procedures to make data analysis easier. Those fields are indicated in the column metadata.

    B. HOW THE DATASET IS CREATED As required by the Transportation Code, all taxi companies permitted to operate in the City and County of San Francisco transmit digital records of their fleet’s activity to SFMTA in real time through the SFMTA Taxi Application Programming Interface (API).

    C. UPDATE PROCESS This dataset will be updated monthly with new taxi trip information.

    D. HOW TO USE THIS DATASET This dataset is useful for tracking average daily taxi trip counts and monitoring the impact of the Taxi Upfront Pricing Pilot program on driver income.

    E. RELATED DATASETS

  8. Taxi Medallion Holders

  • New York City Taxi and Limousine project

    • kaggle.com
    zip
    Updated Apr 22, 2024
    + more versions
    Cite
    Ramin Huseyn (2024). New York City Taxi and Limousine project [Dataset]. https://www.kaggle.com/datasets/raminhuseyn/new-york-city-taxi-and-limousine-project
    Explore at:
    Available download formats: zip (1043839 bytes)
    Dataset updated
    Apr 22, 2024
    Authors
    Ramin Huseyn
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    The New York City Taxi and Limousine Commission (TLC) oversees the licensing and regulation of taxi cabs and for-hire vehicles in the city. The TLC gathers data from over 200,000 license holders, including taxi drivers and limousine operators, who collectively complete around one million trips each day.

    Note: The dataset used for this project was designed for educational purposes and may not accurately represent the behavior of taxi cab riders in New York City.

    Column name : Description
    • ID : Trip identification number
    • VendorID : A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
    • tpep_pickup_datetime : The date and time when the meter was engaged
    • tpep_dropoff_datetime : The date and time when the meter was disengaged
    • Passenger_count : The number of passengers in the vehicle. This is a driver-entered value
    • Trip_distance : The elapsed trip distance in miles reported by the taximeter
    • RateCodeID : The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride
    • Store_and_fwd_flag : This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip
    • PULocationID : TLC Taxi Zone in which the taximeter was engaged
    • DOLocationID : TLC Taxi Zone in which the taximeter was disengaged
    • Payment_type : A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip
    • Fare_amount : The time-and-distance fare calculated by the meter
    • Extra : Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges
    • MTA_tax : $0.50 MTA tax that is automatically triggered based on the metered rate in use
    • Tip_amount : Tip amount – This field is automatically populated for credit card tips. Cash tips are not included
    • Tolls_amount : Total amount of all tolls paid in trip
    • Improvement_surcharge : $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015
    • Total_amount : The total amount charged to passengers. Does not include cash tips
  • Taxi Trajectory Data

    • kaggle.com
    zip
    Updated Apr 12, 2018
    Cite
    Chris Cross (2018). Taxi Trajectory Data [Dataset]. https://www.kaggle.com/crailtap/taxi-trajectory
    Explore at:
    Available download formats: zip (540159049 bytes)
    Dataset updated
    Apr 12, 2018
    Authors
    Chris Cross
    Description

    Context

    Technology has many effects on the transportation industry.

    Content

    We have provided an accurate dataset describing a complete year (from 01/07/2013 to 30/06/2014) of the trajectories for all the 442 taxis running in the city of Porto, in Portugal (i.e. one CSV file named "train.csv"). These taxis operate through a taxi dispatch central, using mobile data terminals installed in the vehicles. We categorize each ride into three categories: A) taxi central based, B) stand-based or C) non-taxi central based. For the first, we provide an anonymized id, when such information is available from the telephone call. The last two categories refer to services that were requested directly from the taxi drivers, either B) at a taxi stand or C) on a random street.

    Each data sample corresponds to one completed trip. It contains a total of 9 (nine) features, described as follows:

    • TRIP_ID: (String) It contains a unique identifier for each trip;

    • CALL_TYPE: (char) It identifies how the service was requested. It may contain one of three possible values: ‘A’ if this trip was dispatched from the central; ‘B’ if this trip was requested directly from a taxi driver at a specific stand; ‘C’ otherwise (i.e. a trip requested on a random street);

    • ORIGIN_CALL: (integer) It contains a unique identifier for each phone number which was used to request at least one service. It identifies the trip’s customer if CALL_TYPE=’A’. Otherwise, it assumes a NULL value;

    • ORIGIN_STAND: (integer) It contains a unique identifier for the taxi stand. It identifies the starting point of the trip if CALL_TYPE=’B’. Otherwise, it assumes a NULL value;

    • TAXI_ID: (integer) It contains a unique identifier for the taxi driver that performed each trip;

    • TIMESTAMP: (integer) Unix Timestamp (in seconds). It identifies the trip’s start;

    • DAYTYPE: (char) It identifies the daytype of the trip’s start. It assumes one of three possible values: ‘B’ if this trip started on a holiday or any other special day (i.e. extending holidays, floating holidays, etc.); ‘C’ if the trip started on a day before a type-B day; ‘A’ otherwise (i.e. a normal day, workday or weekend);

    • MISSING_DATA: (Boolean) It is FALSE when the GPS data stream is complete and TRUE whenever one (or more) locations are missing;

    • POLYLINE: (String) It contains a list of GPS coordinates (i.e. WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e. [ and ], respectively). Each pair of coordinates is also identified by the same brackets as [LONGITUDE, LATITUDE]. This list contains one pair of coordinates for each 15 seconds of trip. The last list item corresponds to the trip’s destination while the first one represents its start;

    The total travel time of the trip (the prediction target of this competition) is defined as the (number of points-1) x 15 seconds. For example, a trip with 101 data points in POLYLINE has a length of (101-1) * 15 = 1500 seconds. Some trips have missing data points in POLYLINE, indicated by MISSING_DATA column, and it is part of the challenge how you utilize this knowledge.
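
    The travel-time definition above, (number of points - 1) x 15 seconds, can be computed directly from a POLYLINE value. The sketch below assumes the string is valid JSON, which matches the bracketed [LONGITUDE, LATITUDE] layout described in the feature list; the function name and the sample coordinates are hypothetical.

```python
import json

SAMPLE_INTERVAL_S = 15  # one GPS point per 15 seconds of trip


def travel_time_seconds(polyline):
    """Return (number of points - 1) * 15 for a POLYLINE string; 0 if it is empty."""
    points = json.loads(polyline)
    return max(len(points) - 1, 0) * SAMPLE_INTERVAL_S


# A made-up trip with 101 identical points, matching the worked example in the text.
sample = json.dumps([[-8.61, 41.14]] * 101)
print(travel_time_seconds(sample))  # 1500
```

    For rows where MISSING_DATA is TRUE, this value understates the real duration, because the missing points are not counted.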

    Acknowledgements

    Data from ECML/PKDD 15: Taxi Trip Time Prediction (II) Competition

    Inspiration

    Added this dataset because competition datasets do not appear in the dataset search and this dataset could help learn basic methods in the area of geo-spatial analysis and trajectory handling

  • Replication Data for: Heat Causes Large Earnings Losses for Informal-Sector...

    • dataverse.harvard.edu
    Updated Oct 18, 2024
    Cite
    E. Somanathan; Saudamini Das (2024). Replication Data for: Heat Causes Large Earnings Losses for Informal-Sector Workers in India [Dataset]. http://doi.org/10.7910/DVN/1Q5HZD
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    E. Somanathan; Saudamini Das
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    India
    Description

    This is a dataset of 400 workers collected daily in the months of May and June of 2019 in Delhi. These workers were working as launderers, construction workers, painters, coolies (manual laborers in transport or other sectors), cycle rickshaw drivers, electric rickshaw drivers, auto (three-wheeled taxi) drivers, taxi drivers, food vendors, street vendors, rag pickers, petty traders, fruit sellers, waste and scrap dealers, roadside barbers, cobblers, roadside cycle/auto mechanics, and others. We collect data on their earnings, expenditure and health. The data was merged with temperature data from the meteorological station at Delhi Airport.

  • Taxi Licences - Dataset - York Open Data

    • data.yorkopendata.org
    Updated May 22, 2017
    + more versions
    Cite
    (2017). Taxi Licences - Dataset - York Open Data [Dataset]. https://data.yorkopendata.org/dataset/taxi-licenses
    Explore at:
    Dataset updated
    May 22, 2017
    License

    Open Government Licence 2.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/)
    License information was derived automatically

    Area covered
    York
    Description

    • Hackney carriage vehicles on 1st of June
    • Private hire vehicles on 1st of November

    A list of all Hackney Carriage and Private Hire vehicle licences issued by City of York Council. The list can be filtered to identify Wheelchair Accessible Vehicles as per Section 167 of the Equality Act 2010. For further information please visit City of York Council's website.

  • Newyork Yellow Taxi Trip Data

    • kaggle.com
    zip
    Updated Jul 25, 2021
    + more versions
    Cite
    Sripathi Mohanasundaram (2021). Newyork Yellow Taxi Trip Data [Dataset]. https://www.kaggle.com/microize/newyork-yellow-taxi-trip-data-2020-2019
    Explore at:
    Available download formats: zip (1938408118 bytes)
    Dataset updated
    Jul 25, 2021
    Authors
    Sripathi Mohanasundaram
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Area covered
    New York
    Description

    Context

    The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).

    Content

    Column Description

    • VendorID : A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
    • tpep_pickup_datetime : The date and time when the meter was engaged.
    • tpep_dropoff_datetime : The date and time when the meter was disengaged.
    • Passenger_count : The number of passengers in the vehicle. This is a driver-entered value.
    • Trip_distance : The elapsed trip distance in miles reported by the taximeter.
    • PULocationID : TLC Taxi Zone in which the taximeter was engaged
    • DOLocationID : TLC Taxi Zone in which the taximeter was disengaged
    • RateCodeID : The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride
    • Store_and_fwd_flag : This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip
    • Payment_type : A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip
    • Fare_amount : The time-and-distance fare calculated by the meter.
    • Extra : Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
    • MTA_tax : $0.50 MTA tax that is automatically triggered based on the metered rate in use.
    • Improvement_surcharge : $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
    • Tip_amount : Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
    • Tolls_amount : Total amount of all tolls paid in trip.
    • Total_amount : The total amount charged to passengers. Does not include cash tips.

    Acknowledgements

    Data is obtained from the NYC Taxi & Limousine Commission website: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

  • Taxi Trip Fare Prediction Challenge

    • kaggle.com
    zip
    Updated Nov 17, 2022
    Cite
    Gaurav Dutta (2022). Taxi Trip Fare Prediction Challenge [Dataset]. https://www.kaggle.com/gauravduttakiit/taxi-trip-fare-prediction-challenge
    Explore at:
    Available download formats: zip (1082094 bytes)
    Dataset updated
    Nov 17, 2022
    Authors
    Gaurav Dutta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Through a real-world challenge, this hackathon aims to enhance competitors' data science and innovative analytical thinking abilities. Get an opportunity to work on a remarkable data science technology by competing with the best brains in this area at this point in time, where artificial intelligence and machine learning are at the forefront of attention, and find out how you stack up!

    This hackathon will try to address the challenges faced by taxi operators in quoting the right fare to customers before starting the trip. Although the trip details are shared with taxi drivers or operators, they find it difficult to quote the right fare because of uncertainties and calculation complexities. Passengers face the same issue because of inaccurate or irrelevant fares quoted. To find a solution, this hackathon provides participants with a historical dataset that includes records of taxi trip details and the fares of those trips. Using this dataset, the participants need to build machine learning models that predict the trip fare from the other useful features of the trip.

    Overall, it involves using a dataset, finding the best set of features from the dataset, building a machine learning model to predict trip fare based on other trip features and evaluating the predictions using mean squared error and finally submitting the predictions in the given template.
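
    The evaluation metric named above, mean squared error, is the average of the squared differences between actual and predicted fares. A minimal standard-library sketch follows; the function name and the sample fares are illustrative and do not come from the hackathon materials.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up fares: errors of -2 and +3 give (4 + 9) / 2.
print(mean_squared_error([10.0, 20.0], [12.0, 17.0]))  # 6.5
```

    Because the errors are squared, a few badly mispredicted long trips can dominate the score, which is worth keeping in mind when choosing features.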

    Data description:

    • Trip_distance: The elapsed trip distance in miles reported by the taximeter.
    • Rate_code: The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride
    • Storeandfwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending it to the vendor and determines if the trip was stored in the server and forwarded to the vendor. Y = store and forward trip; N = not a store and forward trip
    • Payment_type: A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip
    • Fare_amount: The time-and-distance fare calculated by the meter.
    • Extra: Miscellaneous extras and surcharges.
    • Mta_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
    • Tip_amount: Tip amount credited to the driver for credit card transactions.
    • Tolls_amount: Total amount of all tolls paid in the trip.
    • Imp_surcharge: $0.30 extra charge added automatically to all rides.
    • Total_amount: The total amount charged to passengers. Does not include cash tips.
    • Pickuplocationid: TLC Taxi Zone in which the taximeter was engaged.
    • Dropofflocationid: TLC Taxi Zone in which the taximeter was disengaged.
    • Year: The year in which the taxi trip was taken.
    • Month: The month in which the taxi trip was taken.
    • Day: The day on which the taxi trip was taken.
    • Day_of_week: The day of the week on which the taxi trip was taken.
    • Hour_of_day: The hour of the day, in 24-hour format.
    • Trip_duration: The total duration of the trip in seconds.
    • calculated_total_amount: The total amount the customer has to pay for the taxi.

  • NYC Yellow Taxi Trip Records

    • kaggle.com
    zip
    Updated Jun 18, 2023
    Cite
    psv (2023). NYC Yellow Taxi Trip Records [Dataset]. https://www.kaggle.com/datasets/psvishnu/nyc-yellow-taxi-trip-records
    Explore at:
    zip(29733373878 bytes)Available download formats
    Dataset updated
    Jun 18, 2023
    Authors
    psv
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    About TLC Trip Record Data

    Yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemised fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorised under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

    For-Hire Vehicle (“FHV”) trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. Note: The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness. Therefore, this may not represent the total amount of trips dispatched by all TLC-licensed bases. The TLC performs routine reviews of the records and takes enforcement actions when necessary to ensure, to the extent possible, complete and accurate information.

    Data Source: TLC

    https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

    Data dictionary

    https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

    Sr no. Field Name: Description
    1. VendorID: A code indicating the TPEP provider that provided the record.
        1 = Creative Mobile Technologies, LLC
        2 = VeriFone Inc.
    2. tpep_pickup_datetime: The date and time when the meter was engaged.
    3. tpep_dropoff_datetime: The date and time when the meter was disengaged.
    4. Passenger_count: The number of passengers in the vehicle. (Driver-entered value)
    5. Trip_distance: The elapsed trip distance in miles reported by the taximeter.
    6. PULocationID: TLC Taxi Zone in which the taximeter was engaged.
    7. DOLocationID: TLC Taxi Zone in which the taximeter was disengaged.
    8. RateCodeID: The final rate code in effect at the end of the trip.
        1 = Standard rate
        2 = JFK
        3 = Newark
        4 = Nassau or Westchester
        5 = Negotiated fare
        6 = Group ride
    9. Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor.
        Y = store and forward trip
        N = not a store and forward trip
    10. Payment_type: A numeric code signifying how the passenger paid for the trip.
        1 = Credit card
        2 = Cash
        3 = No charge
        4 = Dispute
        5 = Unknown
        6 = Voided trip
    11. Fare_amount: The time-and-distance fare calculated by the meter.
    12. Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
    13. MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
    14. Improvement_surcharge: $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
    15. Tip_amount: Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
    16. Tolls_amount: Total amount of all tolls paid in trip.
    17. Total_amount: The total amount charged to passengers. Does not include cash tips.
    18. Congestion_Surcharge: Total amount collected in trip for NYS congestion surcharge.
    19. Airport_fee: $1.25 for pick up only at LaGuardia and John F. Kennedy Airports.
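
    The coded fields in the dictionary above can be turned into readable labels with simple lookups. The sketch below copies the code values from the dictionary; the function name `decode_trip` is hypothetical, and it assumes each record is a dict keyed by the field names exactly as written above (the column names in the actual files may differ in case).

```python
# Code values copied from the data dictionary above.
RATE_CODES = {
    1: "Standard rate",
    2: "JFK",
    3: "Newark",
    4: "Nassau or Westchester",
    5: "Negotiated fare",
    6: "Group ride",
}
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}
STORE_AND_FWD = {
    "Y": "store and forward trip",
    "N": "not a store and forward trip",
}


def decode_trip(record):
    """Return a copy of a trip record with coded fields replaced by labels."""
    decoded = dict(record)
    decoded["RateCodeID"] = RATE_CODES.get(int(record["RateCodeID"]), "Unrecognised code")
    decoded["Payment_type"] = PAYMENT_TYPES.get(int(record["Payment_type"]), "Unrecognised code")
    decoded["Store_and_fwd_flag"] = STORE_AND_FWD.get(
        record["Store_and_fwd_flag"], "Unrecognised code"
    )
    return decoded


# Hypothetical record, for illustration only.
print(decode_trip({"RateCodeID": 2, "Payment_type": "1", "Store_and_fwd_flag": "N"}))
```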

    Photo by Mourad Saadi on Unsplash

  • Gett Taxi Interview Assignment

    • kaggle.com
    zip
    Updated Nov 2, 2024
    Cite
    Abilash Reddy (2024). Gett Taxi Interview Assignment [Dataset]. https://www.kaggle.com/datasets/datadoodler/gett-taxi-interview-assignment
    Explore at:
    zip(3196714 bytes)Available download formats
    Dataset updated
    Nov 2, 2024
    Authors
    Abilash Reddy
    Description

    Gett, previously known as GetTaxi, is an Israeli-developed technology platform solely focused on corporate Ground Transportation Management (GTM). They have an application where clients can order taxis, and drivers can accept their rides (offers). At the moment, when the client clicks the Order button in the application, the matching system searches for the most relevant drivers and offers them the order. In this task, we would like to investigate some matching metrics for orders that were not completed successfully, i.e., the customer didn't end up getting a car.

    Assignment

    Please complete the following tasks.

    1. Build up the distribution of orders according to reasons for failure: cancellations before and after driver assignment, and reasons for order rejection. Analyse the resulting plot. Which category has the highest number of orders?
    2. Plot the distribution of failed orders by hours. Is there a trend that certain hours have an abnormally high proportion of one category or another? Which hours have the most failures? How can this be explained?
    3. Plot the average time to cancellation with and without driver, by the hour. If there are any outliers in the data, it would be better to remove them. Can we draw any conclusions from this plot?
    4. Plot the distribution of average ETA by hours. How can this plot be explained?
    5. BONUS Hexagons. Using the h3 and folium packages, calculate how many size 8 hexes contain 80% of all orders from the original data sets and visualise the hexes, colouring them by the number of fails on the map.

    We have two data sets: data_orders and data_offers, both stored in CSV format. The data_orders data set contains the following columns:

    1. order_datetime - time of the order
    2. origin_longitude - longitude of the order
    3. origin_latitude - latitude of the order
    4. m_order_eta - time before order arrival
    5. order_gk - order number
    6. order_status_key - status, an enumeration with the following mapping: 4 - cancelled by client, 9 - cancelled by system, i.e., a reject
    7. is_driver_assigned_key - whether a driver has been assigned
    8. cancellation_time_in_seconds - how many seconds passed before cancellation

    The data_offers data set is a simple map with 2 columns:

    1. order_gk - order number, associated with the same column from the data_orders data set
    2. offer_id - ID of an offer
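
    As a starting point for the first task, failed orders can be grouped by status and driver assignment using the columns listed above. This is a rough sketch using only the standard library; it assumes is_driver_assigned_key holds 1 when a driver was assigned and 0 otherwise, and the function names and sample rows are hypothetical.

```python
from collections import Counter

# Mapping taken from the order_status_key description above.
STATUS_LABELS = {4: "cancelled by client", 9: "cancelled by system"}


def failure_category(order):
    """Label one order by its status and whether a driver was assigned."""
    status = STATUS_LABELS.get(int(order["order_status_key"]), "unknown status")
    # Assumption: 1 means a driver was assigned, 0 means none was.
    if int(order["is_driver_assigned_key"]) == 1:
        assigned = "driver assigned"
    else:
        assigned = "no driver assigned"
    return f"{status}, {assigned}"


def failure_distribution(orders):
    """Count orders per failure category."""
    return Counter(failure_category(order) for order in orders)


# Hypothetical rows shaped like csv.DictReader output for data_orders.
sample_orders = [
    {"order_status_key": "4", "is_driver_assigned_key": "1"},
    {"order_status_key": "4", "is_driver_assigned_key": "0"},
    {"order_status_key": "9", "is_driver_assigned_key": "0"},
    {"order_status_key": "4", "is_driver_assigned_key": "0"},
]
for category, count in failure_distribution(sample_orders).most_common():
    print(category, count)
```

    The resulting counts can be passed to any plotting library to produce the distribution the task asks for.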

  • Taxi trip data NYC

    • kaggle.com
    zip
    Updated Jun 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anandaram Ganapathi (2022). Taxi trip data NYC [Dataset]. https://www.kaggle.com/datasets/anandaramg/taxi-trip-data-nyc/discussion?sort=undefined
    Explore at:
    zip(1710447 bytes)Available download formats
    Dataset updated
    Jun 11, 2022
    Authors
    Anandaram Ganapathi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    NYC Cabs

    If you live in a mid-to-large-sized city and take taxis, you have probably already tried Uber. What you may not know is that the transportation app has different rates in each city. New York City is arguably the taxi capital of America and home to the classic yellow taxicab.

    They do have some similarities—both conventional taxis and Uber charge fares based on a combination of time and distance. Both also charge passengers for any bridge or road tolls in addition to the fare. However, there are also significant differences between Uber and taxis in New York City. Which is the quickest and most economical ride in New York City, and what are the differences between Uber and Yellow Cab?

    Key Takeaways

    • Both conventional taxis and Uber charge fares based on a combination of time and distance.
    • Taxis do not have surge pricing, but riders might have to wait longer when demand exceeds supply.
    • Uber does not differentiate between cruising and stop-and-go traffic, while taxis do charge different rates based on speed.

    Uber does not differentiate between cruising and stop-and-go traffic, while taxis do charge different rates based on speed. In addition, Uber has price hikes during times of high demand, while taxis have extra rush hour fees. Uber does provide fare estimates within the Uber app, but it does not guarantee the final fare because road conditions can change during the ride.

    The service is only accessible through an up-to-date smartphone. If you do not own a smartphone, your smartphone is not up to date, or you forgot your phone, you will not be able to use Uber. New York City regulations prohibit street hails for private ride services (also called livery services).

    Yellow Cabs

    Getting into a taxi in an unfamiliar city can be nerve-wracking. You have no idea how much the trip should cost or if the driver is taking the most direct route. In New York City, taxi riders cannot get an advance estimate for taxi fares. The NYC Taxi and Limousine Commission’s official stance is that “it is impossible to pre-calculate a fare because the meter rate depends on traffic, construction, weather, and route to the destination.”

    Yellow cabs accept street hails anywhere in New York City. Green Boro Taxis, which operate in the outer boroughs and parts of Manhattan north of certain streets, can either be prearranged or hailed on the street.

    Uber Cabs

    Uber has something called surge pricing, which refers to the higher fares it imposes during times of high rider demand. Surge pricing can take effect during rush hour, during a natural disaster, or during a random spike of requests on a Saturday afternoon. Uber claims these price increases are meant to encourage more Uber drivers to get out on the road, and that prices revert to normal when supply and demand even out—capitalism at its finest. The Uber app notifies users of surge pricing when they request a ride.

    Uber used to offer a $60 flat rate between Manhattan and JFK but dropped that option. Rates are now calculated based on time and distance.

    Taxis do not have surge pricing, but riders might have to wait longer when demand exceeds supply. Taxis do, however, add a $0.50 surcharge in the evening (8:00 p.m. to 6:00 a.m.) and a $1 surcharge during rush hour (4:00 p.m. to 8:00 p.m.), Monday through Friday. If Uber’s surge pricing is in effect, you will probably pay a lot less by taking a cab, if you can get one. Surge pricing will at least double your usual fare, and Uber has reported charging customers as much as $39 per mile. A New York City councilman introduced a bill in January 2015 proposing to limit surge pricing to twice the usual rate.
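
    The taxi surcharge rules quoted above can be written as a small function. This is only a sketch of the rules as stated in this description: it assumes the $1 rush hour surcharge applies Monday through Friday, the $0.50 evening surcharge applies every day, and the end of each window is exclusive. The function name is hypothetical.

```python
from datetime import datetime


def time_surcharge(pickup):
    """Return the time-based surcharge in dollars for a pickup datetime."""
    hour = pickup.hour
    is_weekday = pickup.weekday() < 5  # Monday is 0, Friday is 4
    if is_weekday and 16 <= hour < 20:
        return 1.00  # rush hour, 4:00 p.m. to 8:00 p.m.
    if hour >= 20 or hour < 6:
        return 0.50  # evening, 8:00 p.m. to 6:00 a.m.
    return 0.00


print(time_surcharge(datetime(2015, 1, 5, 17, 0)))  # Monday, 5 p.m.
print(time_surcharge(datetime(2015, 1, 3, 22, 0)))  # Saturday, 10 p.m.
```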

    Yellow cabs have regulated fares to and from the Newark International and John F. Kennedy International airports. For trips between Newark International Airport and New York City, the price is the regular metered fare, plus a $17.50 surcharge, plus tolls. For trips between John F. Kennedy International Airport and Manhattan, it is a flat fare of $52 plus tolls. The regular metered fare applies to all trips to and from LaGuardia International Airport.
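
    The airport rules above amount to three cases. The sketch below encodes them using the figures quoted in this description; the airport keys ("JFK", "EWR", "LGA") and the function name are hypothetical labels, the JFK case assumes a trip to or from Manhattan, and tolls are added to metered trips on the assumption that they are always charged on top of the fare.

```python
def airport_fare(airport, metered_fare, tolls=0.0):
    """Estimate a yellow cab airport fare in dollars from the rules above."""
    if airport == "JFK":
        # Flat fare between JFK and Manhattan, plus tolls.
        return 52.00 + tolls
    if airport == "EWR":
        # Newark: regular metered fare plus a $17.50 surcharge, plus tolls.
        return metered_fare + 17.50 + tolls
    if airport == "LGA":
        # LaGuardia: regular metered fare applies.
        return metered_fare + tolls
    raise ValueError(f"unknown airport key: {airport}")


# Hypothetical metered fares and tolls, for illustration only.
print(airport_fare("JFK", metered_fare=40.0, tolls=5.5))
print(airport_fare("EWR", metered_fare=30.0, tolls=12.5))
```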

    Payments and Tipping

    Before you can call an Uber, you must download the app onto your smartphone and register a credit card or PayPal account to your Uber account. Uber automatically charges your account at the end of the ride. When you take a cab, you can pay with cash, credit card, or a payment app on your phone, like Apple Pay.

    Tipping is different with each service, too. Uber allows riders to tip their driver through the app after they have rated their ride, once complete. You have 30 days to add a tip once your ride is complete.

    NYC cab drivers are required to accept MasterCard, Visa, Discover, and American Express credit cards and MasterCard and Visa debit cards with no minimum fare requirement. Passengers pay for rides by swiping their card through a card reader a...

  • Encoded shortest path sequences for NYC taxi trip

    • kaggle.com
    zip
    Updated Sep 8, 2017
    Cite
    Lem (2017). Encoded shortest path sequences for NYC taxi trip [Dataset]. https://www.kaggle.com/tongjiyiming/encoded-shortest-path-sequences-for-nyc-taxi-trip
    Explore at:
    zip(140239784 bytes)Available download formats
    Dataset updated
    Sep 8, 2017
    Authors
    Lem
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    New York
    Description

    Get a close approximation of the real trip trace

    NYC taxi trip records contain only start and end coordinates, which makes it hard to explore how conditions vary from road to road. This dataset takes OSM road data and breaks it into small directed segments. Each segment runs from one intersection (node) to an adjacent intersection (node). Because segments are directed, a two-way road yields two segments and a one-way road yields one.

    What you get

    SciPy's .npz format

    141505 columns: each column encodes one small segment. Its value is just an indicator: 1 means the taxi would travel through this segment, 0 means it would not. As you can see, this results in a very sparse matrix.
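
    The encoding described above can be illustrated on a toy road network. This sketch is not the author's pipeline: the node names, the road list, and both helper functions are hypothetical, and it only shows how directed segments map to columns and how a path becomes a sparse indicator row.

```python
def directed_segments(roads):
    """Expand (node_a, node_b, oneway) roads into directed segments.

    A two-way road yields two segments, a one-way road yields one.
    """
    segments = []
    for node_a, node_b, oneway in roads:
        segments.append((node_a, node_b))
        if not oneway:
            segments.append((node_b, node_a))
    return segments


def encode_path(path_nodes, segment_index):
    """Return the sorted column indices of the segments a path travels through."""
    return sorted(
        segment_index[(start, end)]
        for start, end in zip(path_nodes, path_nodes[1:])
    )


# Toy network: A-B is two-way, B-C is one-way.
roads = [("A", "B", False), ("B", "C", True)]
segments = directed_segments(roads)
segment_index = {segment: column for column, segment in enumerate(segments)}

columns = encode_path(["A", "B", "C"], segment_index)
dense_row = [1 if column in columns else 0 for column in range(len(segments))]
print(columns)    # indices of the non-zero columns
print(dense_row)  # the same trip as a 0/1 indicator row
```

    Storing only the non-zero column indices is what keeps a matrix with this many columns manageable.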

    Some insights

    This is inspired by ECML/PKDD 15: Taxi Trajectory Prediction. With a more accurate trajectory for each trip, we create a space in which information from one trip can be shared with many others. If we only have start and end points, the similarity of two trips depends solely on a clustering of those points, which we can only hope approximates real similarity (and which depends heavily on how many clusters you define). With path sequences, however, we can see that two quite different trips share some common but important stretches of road, such as motorways. This is closer to real life. More importantly, we can now learn the condition of each road segment from many different trips, as long as we have a suitable machine learning algorithm. As with the winners of ECML/PKDD 15, this dataset allows deep learning to be applied.

    The original road data is from OSM. The osmnx and networkx libraries are used to store the road graph. Speed limit data comes primarily from NYC's DOT. A Java shortest-path library developed by Arizona State University computes the shortest paths using Dijkstra's algorithm, and Pyjnius is used to call the Java library from Python. Some multithreaded code in both Python and Java speeds up the whole execution.

    The initial idea was actually to get the top K paths, which would provide probabilistic information about the routes a taxi driver might take. But Yen's top-K algorithm was too slow when I ran it.

    Time-dependent linkage might also help. However, linkage between different segments is not considered, since I have no idea how to map that information to a useful feature space.

    Note that this data uses exactly the same information as New York City Taxi with OSRM. The difference is that that dataset gives only the names of roads, whereas this dataset encodes each small segment. However, the total time from that dataset has also proved to be useful. Unfortunately, my code did not record the trip time. We will see if anyone asks.

    So, have fun with this dataset, Kagglers!

  • Namma Yatri Cab Bookings Bangalore Open Data

    • kaggle.com
    zip
    Updated Jan 12, 2024
    Cite
    Nishant Singhal (2024). Namma Yatri Cab Bookings Bangalore Open Data [Dataset]. https://www.kaggle.com/datasets/stacknishant/namma-yatri-cab-bookings-bangalore-open-data
    Explore at:
    zip(17393 bytes)Available download formats
    Dataset updated
    Jan 12, 2024
    Authors
    Nishant Singhal
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Bengaluru
    Description

    The cab bookings data comes from the Namma Yatri ride-hailing service in the Bangalore region. It was downloaded from nammayatri.in

    Namma Yatri has become Bengaluru's most loved auto app, since its formal launch in January 2023. It is a Direct-to-Driver app. There is no commission or middle-men. What one pays goes 100% to the Driver and his family.

    Here is the github page: https://github.com/nammayatri

  • Rides sample

    • kaggle.com
    zip
    Updated Oct 21, 2019
    Cite
    Easy (2019). Rides sample [Dataset]. https://www.kaggle.com/datasets/easytaxi/week-18-rides-sample
    Explore at:
    zip(93924512 bytes)Available download formats
    Dataset updated
    Oct 21, 2019
    Dataset authored and provided by
    Easy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Easy (Taxi) is a mobile E-hailing application available in many countries in Latin America. The app allows users to book a taxi and track it in real time.

    This dataset contains a sample of rides that were requested by Easy's passengers.

    This dataset is being used in the Easy selection process for the Data Engineering Team. We will evaluate the following topics from your solution:

    • Code legibility and understandability;
    • Code and solution structure, with the perspective of future maintenance and evolution;
    • Solution compatibility with a Big Data stack.

    Content

    This data is a sample collected in the 18th week of 2018 and anonymized for privacy purposes.

    You are going to find the following schema in the rides.csv file (inside the .zip file):

    • ride_id: Unique ride identifier
    • city_code: The IATA code representing where the ride was requested.
    • country_code: The associated ISO-3166 code for the city_code.
    • passenger_id: Unique passenger identifier
    • requested_at: The timestamp of the ride request event
    • payment_created_at: The timestamp of payment date
    • boarded_at: This is filled when the passenger boards the requested ride
    • driver_id: Unique driver identifier
    • payment_final_value: This is populated at the end of the ride, with the monetary value paid by the passenger in local currency.
    • rating_stars: The stars value given by the passenger after the end of the ride.

    Acknowledgements

    This dataset was collected from Easy's data and never distributed before.

    Inspiration

    By looking to the past, we can better understand how urban mobility happens in cities and then make better plans for the future.

    We hope you can answer at least one of the following questions:

    • What is the average ride payment value?
    • How many rides were done in the period?
    • What is the ride conversion rate? The ride conversion rate is the ratio between the number of rides that were requested and the number of rides that were effectively done.
    • Who are the 10 best drivers (the best rated)?
    • What is the average number of rides that a driver does?
    • What is our capacity for serving rides, considering all drivers available in the given sample?
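
    Two of the questions above can be sketched with the schema fields listed earlier. The sketch makes two assumptions that the description does not confirm: a ride counts as effectively done when boarded_at is filled, and the conversion rate is computed as done rides divided by requested rides. The function names and sample rows are hypothetical.

```python
def conversion_rate(rides):
    """Share of requested rides that were effectively done."""
    if not rides:
        return 0.0
    # Assumption: a non-empty boarded_at marks a ride as done.
    done = sum(1 for ride in rides if ride.get("boarded_at"))
    return done / len(rides)


def average_payment(rides):
    """Mean payment_final_value over rides where it is populated."""
    values = [
        float(ride["payment_final_value"])
        for ride in rides
        if ride.get("payment_final_value")
    ]
    return sum(values) / len(values) if values else None


# Hypothetical rows shaped like csv.DictReader output for rides.csv.
sample_rides = [
    {"ride_id": "1", "boarded_at": "2018-05-01T10:00:00", "payment_final_value": "10.0"},
    {"ride_id": "2", "boarded_at": "", "payment_final_value": ""},
    {"ride_id": "3", "boarded_at": "2018-05-01T11:00:00", "payment_final_value": "20.0"},
    {"ride_id": "4", "boarded_at": "", "payment_final_value": ""},
]
print(conversion_rate(sample_rides))
print(average_payment(sample_rides))
```
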
  • NLU-NLG Dataset

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Cite
    Kaushik T.D. Roy (2025). NLU-NLG Dataset [Dataset]. https://www.kaggle.com/datasets/kaushiktdroy/nlu-nlg-dataset
    Explore at:
    zip(9079410 bytes)Available download formats
    Dataset updated
    Jul 7, 2025
    Authors
    Kaushik T.D. Roy
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Customer Service Natural Language Generation and Understanding Dataset

    Overview

    This dataset contains cleaned and processed customer service conversations designed for both Natural Language Generation (NLG) and Natural Language Understanding (NLU) tasks. The data focuses on customer inquiries across various service categories including refunds, bookings, and cancellations, with corresponding human agent responses and detailed annotations.

    Dataset Structure

    NLG Component

    The Natural Language Generation portion contains instruction-following examples for customer service response generation:

    Fields:

    • instruction: Task description specifying the customer query type and emotional state
    • context: The actual customer message/inquiry
    • response: The appropriate agent response

    Format Example (JSON):

        {
          "instruction": "A customer has a query about refund. They are feeling NEGATIVE. Draft a helpful response.",
          "context": "who do i send my taxi receipt to for reimbursement please? (hersham to walton at 12.20)",
          "response": "what day did you travel please?"
        }

    NLU Component

    The Natural Language Understanding portion provides comprehensive annotations for customer messages:

    Fields:

    • text: The customer's original message
    • intents: List of identified intents/purposes (e.g., "refund", "booking", "cancellation")
    • sentiment: Sentiment classification (e.g., "NEGATIVE", "POSITIVE")
    • entities: Named entities extracted from the text (currently empty arrays, indicating entity extraction preprocessing)

    Format Example (JSON):

        {
          "text": "who do i send my taxi receipt to for reimbursement please? (hersham to walton at 12.20)",
          "intents": ["refund"],
          "sentiment": ["NEGATIVE"],
          "entities": []
        }
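
    The two example records above share the same customer message, which suggests how an NLU record relates to an NLG instruction. The sketch below parses the NLU example with the standard json module and rebuilds the instruction string; the template is inferred from the single NLG example shown and may not hold for every record, and both function names are hypothetical.

```python
import json

REQUIRED_FIELDS = {"text", "intents", "sentiment", "entities"}

# The NLU example record from the description, as a JSON string.
NLU_EXAMPLE = (
    '{"text": "who do i send my taxi receipt to for reimbursement please? '
    '(hersham to walton at 12.20)", "intents": ["refund"], '
    '"sentiment": ["NEGATIVE"], "entities": []}'
)


def parse_nlu_record(raw):
    """Parse one NLU record and check that the documented fields are present."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def build_instruction(record):
    """Rebuild an NLG-style instruction from the first intent and sentiment."""
    intent = record["intents"][0]
    sentiment = record["sentiment"][0]
    return (
        f"A customer has a query about {intent}. "
        f"They are feeling {sentiment}. Draft a helpful response."
    )


record = parse_nlu_record(NLU_EXAMPLE)
print(build_instruction(record))
```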

    Data Characteristics

    Intent Categories

    • Refund: Customer inquiries about reimbursements, receipt submissions, and payment issues
    • Booking: Reservation-related queries, modifications, and booking assistance
    • Cancellation: Service cancellation requests and related issues

    Sentiment Distribution

    • NEGATIVE: Customer frustration, complaints, or urgent requests
    • POSITIVE: Appreciation, thanks, or satisfied interactions

    Domain Context

    The dataset appears to focus on transportation/travel services, with references to:

    • Taxi receipts and reimbursements
    • Flight bookings and upgrades
    • Location-based services (e.g., "hersham to walton", "man-eus")

    Use Cases

    NLG Applications

    • Customer service chatbot response generation
    • Automated agent assistance tools
    • Response quality evaluation and training
    • Instruction-following model fine-tuning

    NLU Applications

    • Intent classification systems
    • Sentiment analysis in customer service
    • Multi-label classification tasks
    • Customer query routing and prioritization

    Data Quality

    • Cleaned: Processed to remove inconsistencies and formatting issues
    • Anonymized: Personal details appear to be removed or genericized
    • Balanced: Includes both positive and negative sentiment examples
    • Realistic: Contains authentic customer service language patterns

    Technical Notes

    • All text is in English
    • JSON format for easy integration with ML pipelines
    • Consistent schema across all records
    • Ready for immediate use in training/evaluation workflows

    Potential Applications

    • Training conversational AI systems
    • Benchmarking NLU models
    • Customer service automation research
    • Sentiment analysis in business contexts
    • Multi-task learning experiments combining NLG and NLU

    Limitations

    • Limited entity annotations (entities field is empty)
    • Moderate dataset size
    • Specific to customer service domain
    • May require additional preprocessing for certain applications

    This dataset serves as a valuable resource for researchers and practitioners working on customer service automation, conversational AI, and natural language processing applications in business contexts.

  • Cite
    Elemento (2021). NYC Yellow Taxi Trip Data [Dataset]. https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data
    NYC Yellow Taxi Trip Data

    Practice your ML skills on this Time-Series Dataset!

    Explore at:
    zip(1915626894 bytes)Available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Elemento
    License

    https://www.usa.gov/government-works/

    Area covered
    New York
    Description

    Context

    New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs, and it is freely available to download from its official website. You can access it here. Now, the TLC primarily keeps and manages data for 4 different types of vehicles:

    • Yellow Taxi: Yellow Medallion Taxicabs: These are the famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
    • Green Taxi: Street Hail Livery: The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
    • For-Hire Vehicles (FHVs): FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

    Complementary Kernel

    • I have made a Kernel especially for this dataset, which uses Clustering, Regression, and Time-Series techniques for this dataset. You can check it out here.

    Important Points

    • In this dataset, we are considering only the Yellow Taxis Data, for the months of Jan 2015 & Jan-mar 2016.
    • If you go to the NYC TLC website and download any of the CSV files, you will find that they use a different format. This is because the TLC regularly adds more data and updates the existing data.
    • One of the key changes they have made is that, instead of providing the pickup and dropoff coordinates, they have divided NYC into regions, indexed those regions, and provided these indices in the CSV files.
    • For this reason, I built this dataset from the previous version of the CSV files. This lets me practice my clustering knowledge alongside my time-series knowledge.
    • If you want to leave out the clustering part, just go to their website and download the new CSV files.

    Attributes

    ...

    Field Name: Description
    VendorID: A code indicating the TPEP provider that provided the record.
        1. Creative Mobile Technologies
        2. VeriFone Inc.
    tpep_pickup_datetime: The date and time when the meter was engaged.
    tpep_dropoff_datetime: The date and time when the meter was disengaged.
    Passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
    Trip_distance: The elapsed trip distance in miles reported by the taximeter.
    Pickup_longitude: Longitude where the meter was engaged.
    Pickup_latitude: Latitude where the meter was engaged.
    RateCodeID: The final rate code in effect at the end of the trip.
        1. Standard rate
        2. JFK
        3. Newark
        4. Nassau or Westchester
        5. Negotiated fare
        6. Group ride
    Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
        Y = store and forward trip
        N = not a store and forward trip
    Dropoff_longitude: Longitude where the meter was disengaged.
    Dropoff_latitude: Latitude where the meter was disengaged.
    Payment_type: A numeric code signifying how the passenger paid for the trip.
        1. Credit card
        2. Cash
        3. No charge
        4. Dispute
        5. Unknown
        6. Voided trip
    Fare_amount: The time-and-distance fare calculated by the meter.
    Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
    MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
    Improvement_surcharge: $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.