{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting things up\n", "\n", "## About this notebook\n", "\n", "In this notebook, we preprocess the California Housing Prices dataset with the ``ContinuousCarver`` pipeline. ``ContinuousCarver`` is a Python tool that performs association-maximizing discretization of both quantitative and qualitative features. Our goal is to prepare the dataset for a continuous regression task: predicting median house values.\n", "\n", "The California Housing Prices dataset describes housing districts through features such as median income, house age, average number of rooms and bedrooms, population, and average occupancy. All of the features carved in this notebook are quantitative; ``ContinuousCarver`` discretizes each of them into a small number of modalities chosen for their association with the target.\n", "\n", "Throughout this notebook, we walk through the steps of ``ContinuousCarver``'s discretization pipeline and inspect the resulting modalities for each feature, on both the train and dev samples.\n", "\n", "Join us in this exploration as we use ``ContinuousCarver`` to preprocess the California Housing Prices dataset. Through feature engineering and discretization, we aim to build a dataset that captures the relationships between housing features and prices, as a basis for continuous regression models.\n", "\n", "Let's dive in.\n", "\n", "\n", "## Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %pip install AutoCarver[jupyter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## California Housing Prices Data\n", "\n", "In this example notebook, we will use the California Housing Prices dataset.\n", "\n", "The California Housing Prices dataset is a well-known dataset in machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling.\n", "\n", "It comprises housing-related metrics for each district, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more, which makes it well suited to exploring the relationships between features and predicting the median house value (continuous regression)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 5088 | 0.9809 | 19.0 | 3.187726 | 1.129964 | 726.0 | 2.620939 | 33.98 | -118.28 | 1.214 |
| 17096 | 4.2232 | 33.0 | 6.189696 | 1.086651 | 1015.0 | 2.377049 | 37.46 | -122.23 | 3.637 |
| 5617 | 3.5488 | 42.0 | 4.821577 | 1.095436 | 1044.0 | 4.331950 | 33.79 | -118.26 | 2.056 |
| 20060 | 1.6469 | 24.0 | 4.274194 | 1.048387 | 1686.0 | 4.532258 | 35.87 | -119.26 | 0.476 |
| 895 | 3.9909 | 14.0 | 4.608303 | 1.089350 | 2738.0 | 2.471119 | 37.54 | -121.96 | 2.360 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 1.60e+00 | 1.1102 | 0.0500 | 692 |
| 1.60e+00 < x <= 1.91e+00 | 1.1285 | 0.0500 | 691 |
| 1.91e+00 < x <= 2.15e+00 | 1.2198 | 0.0500 | 692 |
| 2.15e+00 < x <= 2.35e+00 | 1.3171 | 0.0500 | 691 |
| 2.35e+00 < x <= 2.57e+00 | 1.3817 | 0.0500 | 691 |
| 2.57e+00 < x <= 2.74e+00 | 1.5409 | 0.0500 | 692 |
| 2.74e+00 < x <= 2.98e+00 | 1.6159 | 0.0500 | 692 |
| 2.98e+00 < x <= 3.14e+00 | 1.6906 | 0.0499 | 690 |
| 3.14e+00 < x <= 3.32e+00 | 1.8232 | 0.0500 | 692 |
| 3.32e+00 < x <= 3.54e+00 | 1.9059 | 0.0500 | 691 |
| 3.54e+00 < x <= 3.73e+00 | 2.0076 | 0.0502 | 694 |
| 3.73e+00 < x <= 3.97e+00 | 2.0271 | 0.0498 | 689 |
| 3.97e+00 < x <= 4.18e+00 | 2.1456 | 0.0500 | 691 |
| 4.18e+00 < x <= 4.46e+00 | 2.2433 | 0.0500 | 691 |
| 4.46e+00 < x <= 4.76e+00 | 2.3621 | 0.0501 | 693 |
| 4.76e+00 < x <= 5.12e+00 | 2.3986 | 0.0499 | 690 |
| 5.12e+00 < x <= 5.54e+00 | 2.6438 | 0.0500 | 691 |
| 5.54e+00 < x <= 6.16e+00 | 2.9324 | 0.0500 | 692 |
| 6.16e+00 < x <= 7.32e+00 | 3.4592 | 0.0500 | 691 |
| 7.32e+00 < x | 4.3784 | 0.0500 | 692 |

| target_mean | frequency | count |
|---|---|---|
| 1.1017 | 0.0509 | 347 |
| 1.0410 | 0.0502 | 342 |
| 1.2407 | 0.0501 | 341 |
| 1.2919 | 0.0506 | 345 |
| 1.4676 | 0.0536 | 365 |
| 1.5605 | 0.0417 | 284 |
| 1.6280 | 0.0584 | 398 |
| 1.7519 | 0.0471 | 321 |
| 1.8443 | 0.0504 | 343 |
| 1.8500 | 0.0498 | 339 |
| 2.0040 | 0.0533 | 363 |
| 2.0890 | 0.0502 | 342 |
| 2.1641 | 0.0505 | 344 |
| 2.2700 | 0.0540 | 368 |
| 2.3768 | 0.0439 | 299 |
| 2.5087 | 0.0479 | 326 |
| 2.6814 | 0.0483 | 329 |
| 2.9805 | 0.0479 | 326 |
| 3.3748 | 0.0530 | 361 |
| 4.3748 | 0.0483 | 329 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 2.57e+00 | 1.2314 | 0.2500 | 3457 |
| 2.57e+00 < x <= 3.97e+00 | 1.8016 | 0.3500 | 4840 |
| 3.97e+00 < x <= 5.54e+00 | 2.3587 | 0.2499 | 3456 |
| 5.54e+00 < x | 3.5900 | 0.1501 | 2075 |

| target_mean | frequency | count |
|---|---|---|
| 1.2315 | 0.2554 | 1740 |
| 1.8222 | 0.3509 | 2390 |
| 2.3953 | 0.2446 | 1666 |
| 3.5721 | 0.1491 | 1016 |

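
The tables above report, for each interval of a feature, the mean of the target, the share of rows, and the row count. A minimal sketch of how such a summary can be computed with plain pandas follows. It uses synthetic data and `pandas.qcut`; the helper name `bin_summary` and the generated data are assumptions made for illustration, and this is not AutoCarver's implementation.

```python
import numpy as np
import pandas as pd


def bin_summary(x: pd.Series, y: pd.Series, q: int = 20) -> pd.DataFrame:
    """Quantile-bin x (hypothetical helper) and summarize the target y per bin."""
    # Cut x into q quantile intervals; ties in the edges are dropped if any.
    bins = pd.qcut(x, q=q, duplicates="drop")
    grouped = y.groupby(bins, observed=True)
    summary = pd.DataFrame(
        {"target_mean": grouped.mean(), "count": grouped.size()}
    )
    # Share of rows falling in each interval.
    summary["frequency"] = summary["count"] / len(x)
    return summary[["target_mean", "frequency", "count"]]


# Synthetic stand-in for a feature and a continuous target.
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=1.2, sigma=0.5, size=2000), name="feature")
y = pd.Series(0.5 * x + rng.normal(scale=0.5, size=2000), name="target")

summary = bin_summary(x, y)
print(summary.round(4))
```

Applying the same bin edges to a held-out sample, rather than recomputing quantiles on it, is what makes train and dev summaries comparable row by row.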
| | target_mean | frequency | count |
|---|---|---|---|
| x <= 8.00e+00 | 2.1158 | 0.0537 | 742 |
| 8.00e+00 < x <= 1.20e+01 | 1.8220 | 0.0477 | 659 |
| 1.20e+01 < x <= 1.50e+01 | 1.8590 | 0.0613 | 847 |
| 1.50e+01 < x <= 1.80e+01 | 1.9547 | 0.0989 | 1367 |
| 1.80e+01 < x <= 2.20e+01 | 1.9739 | 0.0871 | 1205 |
| 2.20e+01 < x <= 2.50e+01 | 2.1055 | 0.0705 | 975 |
| 2.50e+01 < x <= 2.80e+01 | 2.0512 | 0.0775 | 1072 |
| 2.80e+01 < x <= 3.10e+01 | 2.0439 | 0.0682 | 943 |
| 3.10e+01 < x <= 3.30e+01 | 2.0275 | 0.0575 | 795 |
| 3.30e+01 < x <= 3.50e+01 | 2.0651 | 0.0722 | 999 |
| 3.50e+01 < x <= 3.70e+01 | 2.0750 | 0.0687 | 950 |
| 3.70e+01 < x <= 4.20e+01 | 2.0102 | 0.0811 | 1121 |
| 4.20e+01 < x <= 4.50e+01 | 2.1301 | 0.0485 | 670 |
| 4.50e+01 < x | 2.4785 | 0.1072 | 1483 |

| target_mean | frequency | count |
|---|---|---|
| 2.0205 | 0.0526 | 358 |
| 1.7827 | 0.0443 | 302 |
| 1.8780 | 0.0556 | 379 |
| 1.9391 | 0.0986 | 672 |
| 2.0285 | 0.0891 | 607 |
| 2.1179 | 0.0759 | 517 |
| 2.1634 | 0.0743 | 506 |
| 1.9546 | 0.0664 | 452 |
| 2.0512 | 0.0565 | 385 |
| 2.1862 | 0.0755 | 514 |
| 2.0747 | 0.0659 | 449 |
| 2.0174 | 0.0895 | 610 |
| 2.0015 | 0.0489 | 333 |
| 2.4651 | 0.1069 | 728 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 2.20e+01 | 1.9494 | 0.3486 | 4820 |
| 2.20e+01 < x <= 3.70e+01 | 2.0623 | 0.4147 | 5734 |
| 3.70e+01 < x <= 4.50e+01 | 2.0550 | 0.1295 | 1791 |
| 4.50e+01 < x | 2.4785 | 0.1072 | 1483 |

| target_mean | frequency | count |
|---|---|---|
| 1.9447 | 0.3403 | 2318 |
| 2.0964 | 0.4144 | 2823 |
| 2.0118 | 0.1384 | 943 |
| 2.4651 | 0.1069 | 728 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 3.44e+00 | 1.9126 | 0.0500 | 692 |
| 3.44e+00 < x <= 3.79e+00 | 1.8286 | 0.0500 | 691 |
| 3.79e+00 < x <= 4.06e+00 | 1.8169 | 0.0500 | 692 |
| 4.06e+00 < x <= 4.28e+00 | 1.8418 | 0.0500 | 691 |
| 4.28e+00 < x <= 4.46e+00 | 1.7529 | 0.0500 | 691 |
| 4.46e+00 < x <= 4.62e+00 | 1.7915 | 0.0500 | 692 |
| 4.62e+00 < x <= 4.79e+00 | 1.8214 | 0.0500 | 691 |
| 4.79e+00 < x <= 4.94e+00 | 1.7685 | 0.0500 | 691 |
| 4.94e+00 < x <= 5.09e+00 | 1.7466 | 0.0500 | 692 |
| 5.09e+00 < x <= 5.23e+00 | 1.7717 | 0.0500 | 691 |
| 5.23e+00 < x <= 5.38e+00 | 1.8664 | 0.0500 | 691 |
| 5.38e+00 < x <= 5.53e+00 | 1.8472 | 0.0500 | 692 |
| 5.53e+00 < x <= 5.69e+00 | 1.9199 | 0.0500 | 691 |
| 5.69e+00 < x <= 5.86e+00 | 1.9910 | 0.0500 | 691 |
| 5.86e+00 < x <= 6.06e+00 | 2.0870 | 0.0500 | 692 |
| 6.06e+00 < x <= 6.27e+00 | 2.1908 | 0.0500 | 691 |
| 6.27e+00 < x <= 6.54e+00 | 2.4050 | 0.0500 | 691 |
| 6.54e+00 < x <= 6.95e+00 | 2.6874 | 0.0500 | 692 |
| 6.95e+00 < x <= 7.65e+00 | 3.1129 | 0.0500 | 691 |
| 7.65e+00 < x | 3.1718 | 0.0500 | 692 |

| target_mean | frequency | count |
|---|---|---|
| 1.8659 | 0.0518 | 353 |
| 1.8728 | 0.0505 | 344 |
| 1.7627 | 0.0524 | 357 |
| 1.8020 | 0.0543 | 370 |
| 1.7223 | 0.0552 | 376 |
| 1.6802 | 0.0452 | 308 |
| 1.7707 | 0.0530 | 361 |
| 1.8030 | 0.0443 | 302 |
| 1.8209 | 0.0523 | 356 |
| 1.8326 | 0.0437 | 298 |
| 1.7923 | 0.0550 | 375 |
| 1.9388 | 0.0514 | 350 |
| 1.9465 | 0.0501 | 341 |
| 2.0248 | 0.0468 | 319 |
| 2.1049 | 0.0483 | 329 |
| 2.2239 | 0.0490 | 334 |
| 2.4339 | 0.0467 | 318 |
| 2.7667 | 0.0468 | 319 |
| 3.1001 | 0.0548 | 373 |
| 3.2429 | 0.0483 | 329 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 5.69e+00 | 1.8220 | 0.6500 | 8988 |
| 5.69e+00 < x <= 6.27e+00 | 2.0896 | 0.1500 | 2074 |
| 6.27e+00 < x <= 6.95e+00 | 2.5463 | 0.1000 | 1383 |
| 6.95e+00 < x | 3.1424 | 0.1000 | 1383 |

| target_mean | frequency | count |
|---|---|---|
| 1.8162 | 0.6593 | 4491 |
| 2.1194 | 0.1442 | 982 |
| 2.6006 | 0.0935 | 637 |
| 3.1670 | 0.1031 | 702 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 9.4000e-01 | 2.0684 | 0.0500 | 692 |
| 9.4000e-01 < x <= 9.6724e-01 | 2.0735 | 0.0500 | 691 |
| 9.6724e-01 < x <= 9.8319e-01 | 2.2167 | 0.0501 | 693 |
| 9.8319e-01 < x <= 9.9576e-01 | 2.1706 | 0.0499 | 690 |
| 9.9576e-01 < x <= 1.0066e+00 | 2.1310 | 0.0500 | 692 |
| 1.0066e+00 < x <= 1.0154e+00 | 2.2358 | 0.0500 | 691 |
| 1.0154e+00 < x <= 1.0247e+00 | 2.1668 | 0.0500 | 691 |
| 1.0247e+00 < x <= 1.0331e+00 | 2.2102 | 0.0500 | 692 |
| 1.0331e+00 < x <= 1.0412e+00 | 2.1295 | 0.0500 | 691 |
| 1.0412e+00 < x <= 1.0495e+00 | 2.1548 | 0.0500 | 691 |
| 1.0495e+00 < x <= 1.0576e+00 | 2.1238 | 0.0500 | 692 |
| 1.0576e+00 < x <= 1.0665e+00 | 2.1025 | 0.0500 | 691 |
| 1.0665e+00 < x <= 1.0768e+00 | 2.0704 | 0.0500 | 691 |
| 1.0768e+00 < x <= 1.0878e+00 | 2.0664 | 0.0501 | 693 |
| 1.0878e+00 < x <= 1.1003e+00 | 2.1118 | 0.0499 | 690 |
| 1.1003e+00 < x <= 1.1161e+00 | 1.9937 | 0.0500 | 691 |
| 1.1161e+00 < x <= 1.1382e+00 | 1.9405 | 0.0500 | 691 |
| 1.1382e+00 < x <= 1.1738e+00 | 1.7990 | 0.0500 | 692 |
| 1.1738e+00 < x <= 1.2732e+00 | 1.9162 | 0.0500 | 691 |
| 1.2732e+00 < x | 1.6515 | 0.0500 | 692 |

| target_mean | frequency | count |
|---|---|---|
| 2.0416 | 0.0539 | 367 |
| 2.2043 | 0.0527 | 359 |
| 2.0997 | 0.0482 | 328 |
| 2.1835 | 0.0487 | 332 |
| 2.2628 | 0.0552 | 376 |
| 2.1619 | 0.0480 | 327 |
| 2.2295 | 0.0567 | 386 |
| 2.1690 | 0.0493 | 336 |
| 2.1581 | 0.0528 | 360 |
| 2.1202 | 0.0476 | 324 |
| 2.1039 | 0.0452 | 308 |
| 2.1595 | 0.0509 | 347 |
| 2.1037 | 0.0521 | 355 |
| 2.0662 | 0.0484 | 330 |
| 2.0487 | 0.0489 | 333 |
| 1.9543 | 0.0467 | 318 |
| 1.8871 | 0.0484 | 330 |
| 1.8680 | 0.0499 | 340 |
| 1.8371 | 0.0465 | 317 |
| 1.7182 | 0.0498 | 339 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 1.058e+00 | 2.1528 | 0.5500 | 7606 |
| 1.058e+00 < x <= 1.100e+00 | 2.0878 | 0.2000 | 2765 |
| 1.100e+00 < x <= 1.138e+00 | 1.9671 | 0.0999 | 1382 |
| 1.138e+00 < x | 1.7888 | 0.1501 | 2075 |

| target_mean | frequency | count |
|---|---|---|
| 2.1597 | 0.5583 | 3803 |
| 2.0954 | 0.2004 | 1365 |
| 1.9201 | 0.0951 | 648 |
| 1.8072 | 0.1462 | 996 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 3.53e+02 | 1.9859 | 0.0501 | 693 |
| 3.53e+02 < x <= 5.14e+02 | 2.1616 | 0.0501 | 693 |
| 5.14e+02 < x <= 6.27e+02 | 2.1117 | 0.0501 | 693 |
| 6.27e+02 < x <= 7.15e+02 | 2.2819 | 0.0497 | 687 |
| 7.15e+02 < x <= 7.93e+02 | 2.0335 | 0.0509 | 704 |
| 7.93e+02 < x <= 8.64e+02 | 2.2113 | 0.0492 | 681 |
| 8.64e+02 < x <= 9.38e+02 | 2.0772 | 0.0498 | 689 |
| 9.38e+02 < x <= 1.02e+03 | 2.1386 | 0.0500 | 692 |
| 1.02e+03 < x <= 1.09e+03 | 2.0430 | 0.0503 | 696 |
| 1.09e+03 < x <= 1.17e+03 | 2.0506 | 0.0496 | 686 |
| 1.17e+03 < x <= 1.26e+03 | 2.0870 | 0.0505 | 698 |
| 1.26e+03 < x <= 1.35e+03 | 2.0195 | 0.0497 | 687 |
| 1.35e+03 < x <= 1.46e+03 | 2.0004 | 0.0502 | 694 |
| 1.46e+03 < x <= 1.58e+03 | 2.1102 | 0.0498 | 688 |
| 1.58e+03 < x <= 1.73e+03 | 2.0346 | 0.0500 | 691 |
| 1.73e+03 < x <= 1.91e+03 | 1.9139 | 0.0499 | 690 |
| 1.91e+03 < x <= 2.15e+03 | 2.0006 | 0.0500 | 691 |
| 2.15e+03 < x <= 2.56e+03 | 2.0707 | 0.0500 | 692 |
| 2.56e+03 < x <= 3.30e+03 | 1.9614 | 0.0500 | 691 |
| 3.30e+03 < x | 2.0428 | 0.0500 | 692 |

| target_mean | frequency | count |
|---|---|---|
| 1.9012 | 0.0530 | 361 |
| 2.1915 | 0.0520 | 354 |
| 2.1706 | 0.0523 | 356 |
| 2.1062 | 0.0514 | 350 |
| 2.2019 | 0.0531 | 362 |
| 2.1765 | 0.0490 | 334 |
| 2.2025 | 0.0506 | 345 |
| 2.1329 | 0.0553 | 377 |
| 2.1744 | 0.0437 | 298 |
| 2.1319 | 0.0480 | 327 |
| 1.9939 | 0.0534 | 364 |
| 2.0096 | 0.0465 | 317 |
| 1.9569 | 0.0465 | 317 |
| 1.9756 | 0.0504 | 343 |
| 2.0815 | 0.0496 | 338 |
| 2.0272 | 0.0461 | 314 |
| 1.9789 | 0.0487 | 332 |
| 1.9355 | 0.0496 | 338 |
| 2.0714 | 0.0518 | 353 |
| 2.0157 | 0.0487 | 332 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 6.27e+02 | 2.0864 | 0.1503 | 2079 |
| 6.27e+02 < x <= 8.64e+02 | 2.1743 | 0.1498 | 2072 |
| 8.64e+02 < x <= 2.15e+03 | 2.0433 | 0.5498 | 7602 |
| 2.15e+03 < x | 2.0250 | 0.1501 | 2075 |

| target_mean | frequency | count |
|---|---|---|
| 2.0867 | 0.1572 | 1071 |
| 2.1618 | 0.1536 | 1046 |
| 2.0607 | 0.5390 | 3672 |
| 2.0084 | 0.1502 | 1023 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 1.870e+00 | 2.7122 | 0.0500 | 692 |
| 1.870e+00 < x <= 2.067e+00 | 2.6633 | 0.0500 | 691 |
| 2.067e+00 < x <= 2.225e+00 | 2.3373 | 0.0500 | 692 |
| 2.225e+00 < x <= 2.338e+00 | 2.3080 | 0.0500 | 691 |
| 2.338e+00 < x <= 2.432e+00 | 2.1976 | 0.0500 | 691 |
| 2.432e+00 < x <= 2.513e+00 | 2.2064 | 0.0500 | 692 |
| 2.513e+00 < x <= 2.595e+00 | 2.1736 | 0.0500 | 691 |
| 2.595e+00 < x <= 2.668e+00 | 2.1862 | 0.0500 | 691 |
| 2.668e+00 < x <= 2.743e+00 | 2.1378 | 0.0500 | 692 |
| 2.743e+00 < x <= 2.820e+00 | 2.1902 | 0.0500 | 691 |
| 2.820e+00 < x <= 2.898e+00 | 2.1824 | 0.0500 | 691 |
| 2.898e+00 < x <= 2.984e+00 | 2.0741 | 0.0500 | 692 |
| 2.984e+00 < x <= 3.073e+00 | 2.0255 | 0.0501 | 693 |
| 3.073e+00 < x <= 3.171e+00 | 1.9914 | 0.0498 | 689 |
| 3.171e+00 < x <= 3.282e+00 | 1.8992 | 0.0500 | 692 |
| 3.282e+00 < x <= 3.425e+00 | 1.8926 | 0.0500 | 691 |
| 3.425e+00 < x <= 3.607e+00 | 1.7085 | 0.0500 | 691 |
| 3.607e+00 < x <= 3.877e+00 | 1.5666 | 0.0500 | 692 |
| 3.877e+00 < x <= 4.325e+00 | 1.4505 | 0.0500 | 691 |
| 4.325e+00 < x | 1.4294 | 0.0500 | 692 |

| target_mean | frequency | count |
|---|---|---|
| 2.7684 | 0.0484 | 330 |
| 2.5334 | 0.0435 | 296 |
| 2.3989 | 0.0542 | 369 |
| 2.3641 | 0.0533 | 363 |
| 2.2272 | 0.0546 | 372 |
| 2.2969 | 0.0489 | 333 |
| 2.3179 | 0.0508 | 346 |
| 2.0793 | 0.0467 | 318 |
| 2.1847 | 0.0521 | 355 |
| 2.1752 | 0.0504 | 343 |
| 2.0762 | 0.0533 | 363 |
| 2.0535 | 0.0501 | 341 |
| 2.0535 | 0.0528 | 360 |
| 1.9477 | 0.0458 | 312 |
| 1.8397 | 0.0449 | 306 |
| 1.8861 | 0.0514 | 350 |
| 1.7301 | 0.0448 | 305 |
| 1.6200 | 0.0499 | 340 |
| 1.4423 | 0.0527 | 359 |
| 1.4596 | 0.0515 | 351 |

| | target_mean | frequency | count |
|---|---|---|---|
| x <= 2.22e+00 | 2.5709 | 0.1501 | 2075 |
| 2.22e+00 < x <= 3.07e+00 | 2.1681 | 0.5001 | 6915 |
| 3.07e+00 < x <= 3.61e+00 | 1.8729 | 0.1998 | 2763 |
| 3.61e+00 < x | 1.4822 | 0.1501 | 2075 |

| target_mean | frequency | count |
|---|---|---|
| 2.5615 | 0.1461 | 995 |
| 2.1836 | 0.5129 | 3494 |
| 1.8527 | 0.1869 | 1273 |
| 1.5056 | 0.1541 | 1050 |

| feature | count | kruskal | n_mod | label | content | target_mean | frequency | dropped | dropped_reason |
|---|---|---|---|---|---|---|---|---|---|
| Quantitative('MedInc') | 3457.0 | 6037.182135 | 4 | 0 | x <= 2.57e+00 | 1.231421 | 0.250000 | False | None |
| | 4840.0 | 6037.182135 | 4 | 1 | 2.57e+00 < x <= 3.97e+00 | 1.801562 | 0.350014 | False | None |
| | 3456.0 | 6037.182135 | 4 | 2 | 3.97e+00 < x <= 5.54e+00 | 2.358660 | 0.249928 | False | None |
| | 2075.0 | 6037.182135 | 4 | 3 | 5.54e+00 < x | 3.590040 | 0.150058 | False | None |
| Quantitative('HouseAge') | 4820.0 | 160.599610 | 4 | 0 | x <= 2.20e+01 | 1.949361 | 0.348568 | False | None |
| | 5734.0 | 160.599610 | 4 | 1 | 2.20e+01 < x <= 3.70e+01 | 2.062306 | 0.414666 | False | None |
| | 1791.0 | 160.599610 | 4 | 2 | 3.70e+01 < x <= 4.50e+01 | 2.055043 | 0.129520 | False | None |
| | 1483.0 | 160.599610 | 4 | 3 | 4.50e+01 < x | 2.478542 | 0.107246 | False | None |
| Quantitative('AveRooms') | 8988.0 | 1401.052572 | 4 | 0 | x <= 5.69e+00 | 1.821999 | 0.649986 | False | None |
| | 2074.0 | 1401.052572 | 4 | 1 | 5.69e+00 < x <= 6.27e+00 | 2.089595 | 0.149986 | False | None |
| | 1383.0 | 1401.052572 | 4 | 2 | 6.27e+00 < x <= 6.95e+00 | 2.546315 | 0.100014 | False | None |
| | 1383.0 | 1401.052572 | 4 | 3 | 6.95e+00 < x | 3.142406 | 0.100014 | False | None |
| Quantitative('AveBedrms') | 7606.0 | 320.789845 | 4 | 0 | x <= 1.058e+00 | 2.152832 | 0.550043 | False | None |
| | 2765.0 | 320.789845 | 4 | 1 | 1.058e+00 < x <= 1.100e+00 | 2.087773 | 0.199957 | False | None |
| | 1382.0 | 320.789845 | 4 | 2 | 1.100e+00 < x <= 1.138e+00 | 1.967066 | 0.099942 | False | None |
| | 2075.0 | 320.789845 | 4 | 3 | 1.138e+00 < x | 1.788831 | 0.150058 | False | None |
| Quantitative('Population') | 2079.0 | 16.109709 | 4 | 0 | x <= 6.27e+02 | 2.086394 | 0.150347 | False | None |
| | 2072.0 | 16.109709 | 4 | 1 | 6.27e+02 < x <= 8.64e+02 | 2.174297 | 0.149841 | False | None |
| | 7602.0 | 16.109709 | 4 | 2 | 8.64e+02 < x <= 2.15e+03 | 2.043255 | 0.549754 | False | None |
| | 2075.0 | 16.109709 | 4 | 3 | 2.15e+03 < x | 2.024995 | 0.150058 | False | None |
| Quantitative('AveOccup') | 2075.0 | 991.408301 | 4 | 0 | x <= 2.22e+00 | 2.570888 | 0.150058 | False | None |
| | 6915.0 | 991.408301 | 4 | 1 | 2.22e+00 < x <= 3.07e+00 | 2.168126 | 0.500072 | False | None |
| | 2763.0 | 991.408301 | 4 | 2 | 3.07e+00 < x <= 3.61e+00 | 1.872867 | 0.199812 | False | None |
| | 2075.0 | 991.408301 | 4 | 3 | 3.61e+00 < x | 1.482183 | 0.150058 | False | None |

| | info | kruskal | combination | n_mod | dropna | train | viable | dev |
|---|---|---|---|---|---|---|---|---|
| 0 | Raw distribution (n_mod=20>max_n_mod=4) | 1062.072498 | {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | 20 | False | NaN | NaN | NaN |
| 1 | Not viable | 994.514410 | {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | 4 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Non-representative ... |
| 2 | Not viable | 994.504665 | {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | 4 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Non-representative ... |
| 3 | Not viable | 991.504255 | {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | 4 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Non-representative ... |
| 4 | Best for kruskal and max_n_mod=4 | 991.408301 | {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | 4 | False | {'viable': True, 'info': ''} | True | {'viable': True, 'info': ''} |

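
The `kruskal` column above ranks candidate groupings of modalities. Assuming it reports a Kruskal-Wallis H statistic of the target across modalities (an assumption based on the column name, not on AutoCarver's source), the comparison between a fine binning and a merged one can be sketched with `scipy.stats.kruskal`. The data is synthetic and the helper `kruskal_h` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal


def kruskal_h(y: pd.Series, groups: pd.Series) -> float:
    """Kruskal-Wallis H statistic of y across the levels of groups (hypothetical helper)."""
    samples = [values.to_numpy() for _, values in y.groupby(groups)]
    return float(kruskal(*samples).statistic)


# Synthetic feature and target with a monotonic relationship plus noise.
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 10, size=3000))
y = pd.Series(0.3 * x + rng.normal(size=3000))

# 20 quantile bins, then consecutive bins merged five by five into 4 modalities.
fine = pd.Series(pd.qcut(x, q=20, labels=False))
coarse = fine // 5

h_fine = kruskal_h(y, fine)
h_coarse = kruskal_h(y, coarse)
print(f"H with 20 bins: {h_fine:.1f}, H with 4 merged bins: {h_coarse:.1f}")
```

A carving procedure of this kind would evaluate many such merges of consecutive bins and keep the one with the strongest statistic that also holds up on the dev sample.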
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup |
|---|---|---|---|---|---|---|
| 0.0 | 0.255432 | 0.340282 | 0.659278 | 0.558280 | 0.157223 | 0.146066 |
| 1.0 | 0.350851 | 0.414416 | 0.144157 | 0.200382 | 0.153553 | 0.512918 |
| 2.0 | 0.244568 | 0.138432 | 0.093511 | 0.095126 | 0.539049 | 0.186876 |
| 3.0 | 0.149149 | 0.106870 | 0.103053 | 0.146213 | 0.150176 | 0.154140 |

| | feature | Nan | Mode | KruskalEtaSquaredMeasure | KruskalEtaSquaredRank | TschuprowtFilter | TschuprowtWith |
|---|---|---|---|---|---|---|---|
| 0 | Quantitative('MedInc') | 0.0000 | 0.3500 | 0.4365 | 0.0000 | 0.0000 | itself |
| 2 | Quantitative('AveRooms') | 0.0000 | 0.6500 | 0.1011 | 1.0000 | 0.3854 | MedInc |
| 5 | Quantitative('AveOccup') | 0.0000 | 0.5001 | 0.0715 | 2.0000 | 0.1620 | AveRooms |
| 3 | Quantitative('AveBedrms') | 0.0000 | 0.5500 | 0.0230 | 3.0000 | 0.1395 | MedInc |
| 1 | Quantitative('HouseAge') | 0.0000 | 0.4147 | 0.0114 | 4.0000 | 0.1345 | AveRooms |
| 4 | Quantitative('Population') | 0.0000 | 0.5498 | 0.0009 | 5.0000 | 0.1464 | HouseAge |

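
The `KruskalEtaSquaredMeasure` values above are numerically consistent with the usual eta-squared effect size derived from a Kruskal-Wallis statistic, (H - k + 1) / (n - k), where H is the `kruskal` value of the carved feature, k its number of modalities, and n the number of training rows. This is an inference from the numbers displayed in this notebook, not something confirmed from AutoCarver's source; the function name below is hypothetical.

```python
def kruskal_eta_squared(h: float, n_groups: int, n_obs: int) -> float:
    """Eta-squared effect size from a Kruskal-Wallis H statistic (hypothetical helper)."""
    return (h - n_groups + 1) / (n_obs - n_groups)


# Training rows: sum of the four MedInc modality counts shown earlier.
n_obs = 3457 + 4840 + 3456 + 2075

# H statistics taken from the summary table of carved features.
eta_medinc = kruskal_eta_squared(6037.182135, n_groups=4, n_obs=n_obs)
eta_averooms = kruskal_eta_squared(1401.052572, n_groups=4, n_obs=n_obs)

print(round(eta_medinc, 4))    # 0.4365, as reported for MedInc
print(round(eta_averooms, 4))  # 0.1011, as reported for AveRooms
```

Unlike the raw H statistic, this measure is normalized by sample size, which makes it easier to compare features or samples of different sizes.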
| | MedInc | AveRooms | AveOccup | AveBedrms | HouseAge | Population |
|---|---|---|---|---|---|---|
| 5088 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 |
| 17096 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 5617 | 1.0 | 0.0 | 3.0 | 1.0 | 2.0 | 2.0 |
| 20060 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | 2.0 |
| 895 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 |

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=None, num_parallel_tree=None, ...)