Conversation
|
I need to retrieve tweets over 2 weeks 4x in 2018. I have the academic access. I only want to get a random sample of 10k tweets per day. All the solutions I see are about sampling based on user ID, I want to sample in terms of tweet count per day so that I get tweets from random times during each day. Is this possible? Thank you! |
|
Yes, maybe - my current prototype implementation for this is still forthcoming unfortunately. |
I see, in the meantime, I am using a loop that iterates over days, hours, and minutes for chunks of 5 seconds - in case that's helpful to other folks trying to get a sample of tweets. I look forward to seeing your implementation! |
|
Hi @digi686. Would you happen to have the code available for your "randomising" loop? I too have academic access and I'm looking to take a sample based on a hashtag search over a 10-year period for sentiment analysis. |
|
@igorbrigadir I wonder if it makes sense to release your prototype as a plugin while it is in development? |
Hi @troyneilson, sure! Here's my loop. I put it inside the main function, before defining my query. Since my last comment, I opted for chunks of 2 seconds to reduce the volume of tweets. Hope it helps. |
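
For readers who want a starting point, here is a minimal sketch of that kind of loop. It is not the original code from this comment. It assumes the start time falls on a minute boundary and that each window is the first `chunk_seconds` of every minute. The function name `sample_windows` and its parameters are hypothetical. Each yielded pair would be passed as the start and end time of one search request.

```python
from datetime import datetime, timedelta, timezone


def sample_windows(start, end, chunk_seconds=2):
    """Yield (window_start, window_end) pairs, one short window per minute.

    Assumption: `start` is aligned to a minute boundary. Stepping minute by
    minute covers the same ground as nested day/hour/minute loops.
    """
    minute = start
    while minute < end:
        yield minute, minute + timedelta(seconds=chunk_seconds)
        minute += timedelta(minutes=1)


# One day in 2018, split into 2-second windows at the top of each minute.
day_start = datetime(2018, 1, 1, tzinfo=timezone.utc)
day_end = day_start + timedelta(days=1)
windows = list(sample_windows(day_start, day_end))
```

Shrinking `chunk_seconds` reduces the volume per window, which is how the move from 5 seconds to 2 seconds cuts the number of tweets collected.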
|
Thanks heaps for that, really appreciated.
|
For #453, second attempt at #459
Twitter "sample" stream is based on selecting tweets with ids where the millisecond timestamp matches a defined range.
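
The id arithmetic behind this can be sketched as follows. This is an illustration, not the code from this pull request. It relies on the snowflake layout in which the bits above the low 22 hold milliseconds since the Twitter epoch (1288834974657). The helper names are hypothetical, and the exclusive handling of both id bounds reflects my understanding of the v2 search parameters and is an assumption.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch: 2010-11-04T01:42:54.657Z
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_ms(dt):
    """Convert an aware datetime to integer Unix milliseconds (no float rounding)."""
    return (dt - UNIX_EPOCH) // timedelta(milliseconds=1)


def snowflake_to_ms(tweet_id):
    """Extract the Unix millisecond timestamp encoded in a snowflake id."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS


def ms_to_snowflake(timestamp_ms):
    """Return the smallest snowflake id that carries the given millisecond."""
    return (timestamp_ms - TWITTER_EPOCH_MS) << 22


def id_bounds(start, end):
    """Return a (lower, upper) id pair covering tweets created in [start, end).

    Assumption: both bounds are treated as exclusive by the endpoint, so the
    lower bound is one less than the first id of the start millisecond.
    """
    lower = ms_to_snowflake(datetime_to_ms(start)) - 1
    upper = ms_to_snowflake(datetime_to_ms(end))
    return lower, upper


# A 100 ms window at the start of 2018.
window_start = datetime(2018, 1, 1, tzinfo=timezone.utc)
lower_id, upper_id = id_bounds(window_start, window_start + timedelta(milliseconds=100))
```

Because the timestamp sits in the high bits, ids sort by creation time, so a pair of id bounds selects a time window with millisecond precision.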
Use `since_id` and `until_id` parameters and snowflake id tricks to simulate a `sample:` operator that samples tweets based on millisecond time windows. The `--sample` command line option can apply to any endpoint that has a since / until id option.
The idea is to accept an integer between 1 and 100 to get a sample of n% of tweets, or `--sample gardenhose` or `--sample spritzer` or `--sample v1` (alias for `--sample 1` and `--sample spritzer`) or `--sample v2`, which is also a 1% sample but with different sampling windows as far as I can tell.
I still have to make sure my assumptions are correct, but so far the millisecond ranges are like this: