Data Sampling Problem

Обмеження: 10 сек., 1024 МіБ

Scoring function for this problem has been updated. Please refer to the Scoring section of the statement to see changes.

MR (Measurement Report) data is one of the most important data for wireless network measurement. It is generated by UEs (User Equipment) periodically, every 10 or 5 seconds, depending on system configurations. To prevent congestion, the BS (Base Station) only randomly selects \(k\) UEs to report their measurements. Typically, \(k\) equals 10, which means that there will be at most 10 UEs uploading their measurements in each period. The measurement reports from a specific user have the same CallID. If the UE’s call ends, the BS will reselect UEs.

Each MR contains time, CellID and GridID, which specify the serving cell and the location of the UE at that time. Meanwhile, each MR contains the RSRP (reference signal received power), which specifies the signal strength for a specific location. After accumulating a day’s worth of data, we can generate a RSRP map for a cell to measure the overall coverage. Specifically, we calculate the average RSRP for each grid, the overall weak coverage ratio for the cell, etc.

The network management devices usually manage tens of thousands of BSs. A large number of MRs have to be processed in real-time. To reduce computing resources, we want to reduce the number of MRs by downsampling while maintaining management accuracy. There are two core issues that need to be addressed to implement such a reduction:

Sampling rate estimation (the sampling rate could be adaptive or fixed-size).
Sampling algorithm design.

Issue 1 directly defines the sampling results, both in terms of accuracy and the data reduction level. One must verify that the amount of sampled data is sufficient for business service purposes while keeping the sampling rate at its minimum. Issue 2 determines the sampling quality, i.e., it specifies which elements are to be dropped from the data while applying a particular sampling rate. This directly affects the subsequent business service KPI calculation.

In this problem, your task is to write a sampler algorithm. Given a set of MRs you have to select a subset of the reports that represent the original set of MRs in the best possible way. For more clarifications, please refer to the Scoring section of the statement.

Вхідні дані

The first line contains a single integer, \(n\) — the number of MRs. Each of the following \(n\) lines describe a single MR in the following format:

\(Time \ CellId \ CallId \ GridId_x \ GridId_y \ RSRP\)

Where:

\(Time\) is the time of the MR given in an integer number of milliseconds that passed from the beginning of the day of the observation.
\(CellId\) and \(CallId\) are both integer parameters of the MR;
\(GridId_x\) and \(GridId_y\) are the \(x\) and \(y\) integer coordinates of the MR. \(GridId\) of the MR is simply a pair of those two integers.
\(RSRP\) is an integer number, the RSRP value of the MR.

Integers are separated by single spaces.

Вихідні дані

In the first line, print a single positive integer, \(m\) — the number of samples that you selected. In the following line, print \(m\) integers — indexes of the selected samples in the input file. The numeration starts with 1.

Обмеження

\(1 \le n \le 2 \cdot 10^6\),

\(0 \le Time \le 10^8\),

\(0 \le CellId \le 10^3\),

There are no more than 10 different \(CellId\) values in a single test,

\(0 \le CallId \le 10^9\),

\(0 \le GridId_x, GtidId_y \le 5 \cdot 10^6\),

\(-200 \le RSRP \le 0\),

For each MR there are at least 20 other MRs with the same pair \((GridId, CellId)\),

There are no more than 5000 different \((GridId, CellId)\) pairs in a single test.

Приклади

Вхідні дані (stdin)	Вихідні дані (stdout)
7 56 112 2579 334245 40009 -69 76 112 2579 334245 40009 -107 232 112 4380 6155 168169 -79 888 112 2579 334245 40009 -150 1418 112 4380 6155 168169 -82 5058 112 4380 6155 168169 -88 1968 112 2579 334245 40009 -78	4 1 2 4 7

Примітки

Sample input violates some of the problem constraints and is provided just to clarify the input and output specifications. All of the tests in provisional and final sets are going to be valid tests that satisfy all of the constraints of this problem.

Scoring

Your answer for the test case is considered invalid if your output doesn’t match the format given in the Output section of the problem. In this case, the score for the test case is 0.

If your answer is correct, then the score is calculated in the following way:

Compression Ratio

Compression Ratio is calculated as:

\[CR = \frac{n}{m}.\]

Where \(n\) is the total number of MRs in the input, and \(m\) is the number of samples you selected.

RSRP Score

To calculate your Grid-level RSRP Absolute Error we need to group all the MRs with the same pairs \((GridId, CellId)\) and for each group, calculate the average RSRP. The RSRP score indicates how much those average values change compared to the original data and the sampled one. More formally:

Let \(RSRP_{gid, cid}\) be a set of \(RSRP\) for each MR with specific \(GridId\) and \(CellId\):

\[RSRP_{gid, cid} = \{mr.RSRP | mr.GridId = gid, mr.CellId = cid\}.\]

The average RSRP is the average value of the RSRP for a group:

\[\overline{RSRP_{gid, cid}} = \frac{1}{|RSRP_{gid, cid}|} \cdot \sum_{r \in RSRP_{gid, cid}} r.\]

Based on it, we can calculate the absolute RSRP error for each grid-cell.

\[RsrpErr_{gid, cid} = |\overline{RSRP_{gid, cid}^{or}} - \overline{RSRP_{gid, cid}^{sa}}|,\]

where \(\overline{RSRP_{gid, cid}^{or}}\) is calculated for the data from the input, and \(\overline{RSRP_{gid, cid}^{sa}}\) is calculated for the sampled data.

The RSRP score for each grid-cell is calculated as:

\[RsrpScore_{gid, cid} = \begin{cases} 1 - \frac{RsrpErr_{gid, cid}}{10}, & RsrpErr_{gid, cid} < 10,\\ 0, & RsrpErr_{gid, cid} \ge 10. \end{cases}\]

Finally, your RSRP score is calculated as an average RSRP score for each grid-cell:

\[RsrpScore = \frac{1}{C} \sum_{gid, cid} RsrpScore_{gid, cid},\]

Where \(C\) is the total number of grid-cell pairs in the input.

Weak Coverage Ratio Score

A weak coverage score indicates how well you represented the data with a small \(RSRP\) (less than -105) for each grid-cell. More formally:

Let \(W_{gid, cid}\) be the set of all RSRPs for each MR with specific \(GridId\) and \(CelId\) that have a small value of RSRP (less than -105):

\[W_{gid, cid} = \{mr.RSRP | mr.GridId = gid, mr.CellId = cid, mr.RSRP < -105 \}.\]

The weak coverage ratio for the grid-cell is a ratio between the number of MRs with a small value of RSRP and the total number of MRs in the group:

\[WCRatio_{gid, cid} = \frac{|W_{gid, cid}|}{|RSRP_{gid, cid}|}.\]

Based on it, we can calculate the weak coverage ratio error for each grid-cell:

\[WCRatioError_{gid, cid} = |WCRatio_{gid, cid}^{or} - WCRatio_{gid, cid}^{sa}|,\]

Where \(WCRatio_{gid, cid}^{or}\) is calculated for the data from the input and \(WCRatio_{gid, cid}^{sa}\) is calculated for the sampled data.

The weak coverage ratio score for a grid-cell is calculated as:

\[WCRatioScore_{gid, cid} = \begin{cases} 1 - \frac{WCRatioError_{gid, cid}}{0.2}, & WCRatioError_{gid, cid} < 0.2,\\ 0, & WCRatioError_{gid, cid} \ge 0.2. \end{cases}\]

Your weak coverage ratio score is the average weak coverage ratio score for each grid-cell:

\[WCRatioScore = \frac{1}{C} \sum_{gid, cid} WCRatioScore_{gid, cid},\]

Where \(C\) is the total number of grid-cell pairs in the input.

Final Score

Your score for the test case is calculated as:

\[Score = \left\lfloor \frac {min(CR, 20) \cdot (RsrpScore + WCRatioScore)}{2 \cdot 20} \cdot 10^7 \right\rfloor.\]

Please note, that this formula has been changed since the start of the competition.

Please notice that if for some \((gid, cid)\) grid-cell pair, you didn’t select any of the samples, the \(RsrpScore_{gid, cid}\) and \(WCRatioScore_{gid, cid}\) for this pair both are equal to 0.

Some of the MR properties are not used while calculating your score. For example \(CallId\) property is completely ignored. However, you can use all of the data given in the input for data analysis and potential additional optimizations.

Submissions

The execution time limit is 10 seconds per test case, and the memory limit is 1024 mebibytes.
The code size limit is 64 kibibytes.
The compilation time limit is 1 minute.
There are 40 provisional test cases. Your submissions will be evaluated on the provisional set during the submission phase.
The first 20 provisional tests are available for you to download for manual testing using the following link: https://drive.google.com/file/d/1Gqvv3m0byiDYCFLcEYwdTlczMOrMrdio/view.
You can submit your code once every 30 minutes, and you will get feedback with your score for each of the provisional tests.
There will be 360 test cases in the final testing after the submission phase is over. Please note that provisional tests are not included in the final testing. The final results will be announced in one week.

Quick start

Check the sample solution, which reads the data and selects every 10-th MR from the list. The source code is available for some of the contest programming languages:

Tests

All tests, including provisional and final sets, are generated based on the data from real base stations that manage measure reports for a city’s cellular network. Each test contains data for a roughly 24-hour period from a single base station, starting at midnight (please note that the period of observation for a single test might be slightly longer than 24 hours).

20 of the provisional tests are available for you to download using the link above for manual testing and exploring the data. Neither the provisional tests nor the tests that are available for download were hand-picked. They were all randomly selected from the pool of 400 tests.

Element Type	Створено	Хто	Задача	Компілятор	Результат	Час (сек.)	Пам'ять (МіБ)	№	Дії

Element Type	Is Unofficial	Місце	Хто	Група	Group Id	Розширена група	Group Ex Id	Бали	Штраф	Results	Is Current User	№	Дії	2023 - R2

2023 - R2. Data Sampling Problem | HOT • Practice